
Internship - Perception and Spatial AI
Job Description
Here at Humanoid, we believe in a future where robots amplify human potential. That’s why we’ve set out on a mission to build the world’s most capable, commercially scalable, and safe humanoid robots. We’re bringing that mission to life with HMND‑01 Alpha - our rapidly developed humanoid platform now running in real industrial pilots - and we’re growing the team to take it even further.
Our Mission
We’re building software systems that enable robots to operate effectively in the real world, expanding human capability and redefining how work gets done.
The Opportunity
We’re looking for interns who are curious, proactive, and excited to work on real-world robotic systems.
This is an open-ended internship: you won’t be confined to a single component, but will work across perception, navigation, and multimodal systems, collaborating closely with the team to find where you can have the most impact.
You may work anywhere along the stack, from camera systems (timestamping, synchronization, validation), through perception and scene understanding, to navigation and integration with locomotion. The scope is intentionally broad. We’re looking for people who are excited to dive into unfamiliar areas and learn quickly.
This is a full-time internship (5 days per week) over the summer (mid-June to mid-September), based in our London Paddington office, where you’ll contribute to real systems from early on with guidance and support from experienced researchers and engineers.
Duration: 12 weeks | Start date: June | Compensation: Competitive pay + we'll keep you fed (seriously, the food is good)
What you might work on
Develop perception systems for robot navigation and interaction in real-world environments
Work on focused problems within Vision-Language(-Action) or multimodal models (components, datasets, evaluation)
Run and analyse experiments using existing pipelines
Improve data quality through curation and labeling
Explore scene understanding, 3D perception, or navigation methods and apply them to real systems
Prototype ideas and iterate quickly with guidance
Collaborate on integrating models into robotic platforms
What we’re looking for
Pursuing a degree in Computer Science, Machine Learning, Robotics, or a related field.
Strong foundations in machine learning and/or computer vision.
Hands-on experience with PyTorch and training ML models.
Experience running experiments and interpreting results.
Interest in multimodal models, 3D vision, spatial reasoning, navigation or embodied AI.
Ability to take ownership and iterate with guidance.
Strong problem-solving skills and attention to detail.
Fast learner, comfortable in a research-driven, fast-moving environment.
How to apply
Complete the challenge below and submit your solution as a public GitHub repository. Include a README with instructions to run your system, example outputs, and a short note on your design choices. You can provide your GitHub repository URL in the application form, alongside your name and CV.
We’re not looking for standard solutions; we’re looking for how you think. The strongest submissions are creative, original, and push beyond the obvious.
Intern Challenge: From Video to 3D Reconstruction
Build a system that takes a short video (e.g. captured on a phone) of a small indoor area, such as a small room, and reconstructs a 3D scene.
The core goal is geometric reconstruction from video. Semantic understanding is welcome, but optional.
At a minimum, your system should:
Generate a 3D representation of the scene from video input
Produce a reconstruction that is geometrically coherent and consistent
Optional extensions:
Assign semantic labels in 3D (e.g. tables, chairs)
Ensure any semantic predictions are aligned with the underlying geometry
There are no constraints on real-time performance. We’re intentionally leaving the approach open: use any tools, models, frameworks, or agentic workflows you find effective.
What to submit
A working codebase
Clear instructions on how to run your system
Example input(s) and output(s)
(Optional) A short note explaining your design choices and tradeoffs
What we care about
Simplicity and usability of your solution
Creativity in approach
Quality of 3D scene reconstruction
Clear, compelling presentation of results
Coherence between geometry and semantics (where semantic labels are included)
Make something you’re proud of.