Humanoid Robots: The Long Road Ahead
Answers to all your questions about humanoids, AI, and robotics
Today we’re doing an anonymous Q&A with KL Divergence, a robotics PhD currently in industry working on humanoid robots. For an introduction to humanoid robotics in China, see our article here, and for a deeper look into who’s leading China’s humanoid market, see our latest translated interview with the CEO of Unitree Robotics.
This Q&A covers:
When and how AI-driven robotics will reach a tipping point in viability,
Challenges and solutions for collecting data to build a robotics AI model,
Successful strategies for companies to compete in humanoid robotics.
When will AI + robotics reach a tipping point in viability?
This is extremely difficult to predict.
Here's my non-answer: whenever the world achieves a data flywheel for robotics, i.e. accumulates a dataset large enough (and the algorithms to use it) to let some robots perform a diverse set of somewhat useful tasks, reliably enough that people allow those robots to operate in their factories, logistics centers, homes, offices, etc.
Once a robot has a “reason for being” in a space, and works well enough, the data flywheel will spin and the robots will get better and better. This is the same process we are seeing play out in self-driving cars, and why Waymo’s early advantage in deployment is such a big deal, pushing Tesla to accelerate its Robotaxi plans. I don't think we will see this happen all at once, in all application domains.
Today, this has already arguably happened for a specific application: robotic picking/packing in e-commerce logistics. Amazon, Dexterity, Covariant, Berkshire Grey, and Ocado all have massive robotics datasets for this basic task, and already use them to create their own “flywheel.” This is short of what we want though, because that data is only for one task, and is specific to those companies' unique robots.
What’s the pathway to viability? How will AI + robotics diffuse through different industries?
So the next stage will likely be doing many different tasks (tens to hundreds) in a structured environment; I would guess logistics centers and manufacturing. I think this could reasonably be achieved as a research matter in 2-3 years, and become commercially commonplace in 5-7. Toward the tail end of that period, you might see these robots start to appear in retail, hospitality, and food-service back-of-house. Think: robots doing laundry or restocking shelves. Next, offices. And last, homes.
Will we be seeing AI-driven robots in homes?
We’re 10+ years away even as a question of research, if we ever get there.
Homes (and to some extent offices) are much more difficult than commercial/industrial spaces because of three factors: lack of structure and wide variation, safety, and cost.
Structure and variation: Homes are the ultimate “unstructured” environment. They come in infinite variations and change from moment to moment as people and stuff move around. One day you might decide to put the cucumbers in the top vegetable drawer; the next, you might move them into the bottom drawer. Multiply that by everything a home robot might ever have to interact with, and the amount of variation becomes mind-boggling. It is impossible to create a system which quantifies and anticipates it all explicitly. This realization has been the impetus for the move toward learned, rather than programmed, robotics AI systems over the last 10 years.
Safety: It’s an engineering achievement to make a robot that can complete tasks and weigh only as much as a smallish human. If that thing falls in a house (dead battery, malfunction, etc.), the stakes are high: it might fall on a pet, break a glass table, or knock over a candle. Contrast with a controlled commercial environment, where people working near the robots can be specifically trained, the environment is arranged so that failures don’t lead to catastrophic danger, and the robots might even be cordoned off behind a cage to minimize the impact of accidents.
Cost: Most proposals for robots in the home have them providing typical domestic labor: cooking, cleaning, tidying up, etc. People already pay for these services in their homes, and it is invariably some of the lowest-paid work in any economy. A humanoid robot has a similar part count and manufacturing complexity to an electric car, so it's intuitive that the most optimistic cost estimates land the price of these machines at similar numbers: $15-50k, depending on the source. How much would a home robot have to do for a family to justify a $25k price tag with a 5-year lifespan, assuming no recurring service or subscription costs?
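To make that concrete, here's the back-of-envelope math as a short sketch; the $20/hr figure for equivalent domestic labor is my assumption for illustration:

```python
# Break-even sketch for the $25k home robot above.
# The $20/hr wage for equivalent domestic labor is an assumed figure.

price_usd = 25_000        # upfront cost; no subscription assumed
lifespan_years = 5
assumed_wage = 20         # USD/hr for the human labor it would replace

cost_per_week = price_usd / (lifespan_years * 52)
breakeven_hours = cost_per_week / assumed_wage

print(f"${cost_per_week:.0f}/week -> {breakeven_hours:.1f} hrs/week to break even")
# ~$96/week, i.e. roughly 5 hours of chores every week for 5 years
```

That's about five hours of reliably completed housework a week, every week, for five years, just to match what it would cost to hire the work out.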
So why do we see some players in the robotics/AI space, humanoid or otherwise, proudly touting their goal of putting robots in homes? My best guess is that it’s a more compelling narrative to attract investment and inspire talent — and it’s not too hard to pivot back to industrial robots anyway.
What are the challenges of getting good training/testing data for AI-driven robots?
You already answered this well in your piece. We need to get robots into the real world to collect lots of good data, but they currently don't work well, can be unsafe, take up space, etc. They have no economic reason to occupy the spaces, and command the human attention, where you would want them to collect data.
How are proposed solutions addressing these data challenges?
There are a few ways of addressing this:
1. Simulation (but this has flaws, as you mentioned).
2. Spend a lot of up-front capital to collect robot data directly, in hopes of collecting enough to bootstrap a useful robotics foundation model (i.e. a vision-language-action model or VLA; see the sketch after this list).
3. Find a way to re-use data from the internet, e.g. watching human cooking or furniture-assembly videos from YouTube (this is very active research, but so far the results have been disappointing).
4. Master one task at a time (using a combination of 1 and 2, plus old-fashioned engineering), and hope you collect a diverse-enough dataset before you run out of money (if the dataset is not diverse in tasks, you will have a robot which can do a handful of tasks but is expensive to train to do new ones).
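For a sense of what a VLA model actually is, here's a minimal sketch of the interface such a model exposes. The class, method names, and shapes below are hypothetical stand-ins, not any real system's API:

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical sketch of the interface a vision-language-action (VLA)
# model exposes; names and shapes are illustrative, not any real API.

@dataclass
class Observation:
    rgb: np.ndarray              # (H, W, 3) camera image
    joint_positions: np.ndarray  # (num_joints,) proprioception

class VLAPolicy:
    """Maps (camera image, proprioception, instruction) -> action chunk."""
    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # A real model encodes the image and instruction with a large
        # pretrained backbone, then decodes a short chunk of future
        # actions. This placeholder returns zeros of the right shape.
        horizon, action_dim = 8, 7   # e.g. 8 steps of 7-DoF arm commands
        return np.zeros((horizon, action_dim))

policy = VLAPolicy()
obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                  joint_positions=np.zeros(7))
actions = policy.act(obs, "slot the battery cell into the tote")
print(actions.shape)  # (8, 7)
```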
What's it like on the ground for a factory collecting this data?
This can vary greatly, but typically you have a task in a factory which has already been defined for a human worker, and it's fairly repetitive. E.g., Tesla's first application for Optimus is to grab battery cells rejected onto a slide coming out of a battery quality-control machine, slot them into a grid on a purpose-shaped tote, then walk the totes to a different area of the factory when full. It's very simple and repetitive, and today it's done by humans across dozens of machines. You can imagine other scenarios, for example sorting packages into bins bound for different geographies in a logistics center: well-defined tasks, with lots of existing automation and built environment (e.g. screens, conveyor belts, well-placed bins and racks) to help the humans doing the job.
What does it look like to collect the data? That depends on the approach. The most straightforward is teleoperation: a human dons a (usually) VR headset, special gloves for capturing finger movements, and other hardware you'd see in a VR space, and uses them to control the robot directly to do the job. This is “robot-in-the-loop.” It’s slow (the human can't move the robot as fast as they would themselves), and costly (it's actually more expensive than having a human do the job, because it's slower).
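In code, the teleoperation collection loop is conceptually simple. Everything below (the Headset and Robot classes and their method names) is a hypothetical stand-in for real device drivers:

```python
import time
import numpy as np

# Sketch of a teleoperation ("robot-in-the-loop") data-collection loop.
# Headset and Robot are hypothetical stubs so the sketch runs end to end.

class Headset:
    def read_hand_pose(self):
        return np.zeros(6)                      # x, y, z + orientation

class Robot:
    def get_observation(self):
        return {"rgb": np.zeros((224, 224, 3)), "joints": np.zeros(7)}
    def solve_ik(self, pose):
        return np.zeros(7)                      # joint targets for that pose
    def apply(self, action):
        pass                                    # send command to the motors

def collect_episode(headset, robot, hz=10, steps=100):
    """Record (observation, action) pairs while a human drives the robot."""
    episode = []
    for _ in range(steps):
        obs = robot.get_observation()
        action = robot.solve_ik(headset.read_hand_pose())
        robot.apply(action)
        episode.append({"obs": obs, "action": action})
        time.sleep(1.0 / hz)                    # capped at human teleop speed
    return episode

data = collect_episode(Headset(), Robot(), steps=5)
print(len(data))  # 5
```

The slowness and cost mentioned above show up directly in that loop: it runs only as fast as the human operator can sustain.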
Another approach is motion capture: via various methods (camera systems in the work area, body-worn suits, even lightweight worn exoskeletons), we can capture the motion of humans who are already doing the job. This is more speculative, as it’s a difficult research problem to turn these motion recordings into instructions for the robot to achieve the task later.
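As a toy illustration of why retargeting is the hard part, here's what the simplest version, scaling a recorded human wrist path into a robot's workspace, might look like; real systems must also handle mismatched kinematics, occlusion, and contact:

```python
import numpy as np

# Toy motion-retargeting sketch: scale a human wrist trajectory
# (recorded relative to the shoulder) into a robot arm's workspace.
# The reach values are illustrative.

def retarget(human_wrist_xyz, human_reach=0.7, robot_reach=0.9):
    """Map human wrist positions to robot end-effector targets (meters)."""
    scale = robot_reach / human_reach
    return np.asarray(human_wrist_xyz) * scale

# a 3-frame recorded trajectory, relative to the shoulder
human_path = [[0.3, 0.0, 0.2], [0.4, 0.1, 0.2], [0.5, 0.1, 0.1]]
print(retarget(human_path).round(2))
```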
The last major approach is simulation: usually with the help of a skilled human artist or engineer, you create a detailed and functional 3D graphics simulation of the real environment in which the robot is supposed to perform. This allows us to use teleoperation, programmed routines, and reinforcement learning to control the robot in simulation and collect data on its successes and failures. The weakness of this approach is that the model usually cannot be used immediately on the real robot, because it's extremely difficult to capture all of the important behavior of a real work task, even a very small one, in a simulation. Roboticists refer to this problem as the simulation-to-reality (sim2real) gap.
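One common mitigation (not mentioned above, but standard in the field) is domain randomization: vary the simulator's physics and visuals every episode so the policy can't overfit to any single, inevitably imperfect, simulation. A minimal sketch with a stub simulator and made-up parameter names:

```python
import numpy as np

# Domain randomization sketch. The Sim class and parameter names are
# invented for illustration; real simulators expose analogous knobs.

class Sim:
    def __init__(self):
        self.params = {}
    def set(self, name, value):
        self.params[name] = value

rng = np.random.default_rng(0)

def randomize(sim):
    sim.set("friction", rng.uniform(0.5, 1.5))
    sim.set("object_mass_scale", rng.uniform(0.8, 1.2))
    sim.set("motor_delay_s", rng.uniform(0.0, 0.05))
    sim.set("light_intensity", rng.uniform(0.3, 1.0))

sim = Sim()
for episode in range(3):
    randomize(sim)   # a new, slightly different world each episode
    # ...run the policy here and collect data in the randomized world...
```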
On the research horizon, there are a variety of approaches that may allow us to generate or make use of data without actual or simulated robots. A “holy grail” of robot learning for the past decade or more has been to create a robot learning system which can “learn from watching YouTube videos.” What all of these approaches have in common is that they seek to lower the cost of data for robotics models by finding ways to make use of lower-quality data (i.e. weaker supervision). The key missing technical piece in most of these approaches is a way to map from non-robot behavior in one environment to the actions a robot would take to do the same task in a new environment.
What would indicate a successful humanoid robotics strategy?
How many robots does the competitor have in the real world doing tasks and collecting data, and (importantly) how diverse is that set of tasks? Humanoids are a very expensive way to automate just one thing, so the investment needs to be amortized across many different jobs.
What strategies are robotics firms taking to compete in the market? What will determine who succeeds?
Boy, this is a big question. I won't try to answer the whole thing, but I'll give you a framework.
There are a few fundamental assets to look at here: technical talent (people), chips (compute), robots (how much do they cost? what are their capabilities?), data, and distribution (customer relationships, pilots). Any robotics+AI company or partnership effort needs to assemble all of these ingredients to be successful.
Resources which are less scarce:
Robots: you have a whole article about how China is commodifying robots. However, not everyone agrees that *good* robots will be so plentiful (perhaps because of protectionism), and others (e.g. Figure, Boston Dynamics) believe they can create an edge by having the *best* hardware.
Customer relationships: tech-demo deals like the Figure-BMW, Apptronik-Mercedes, and Agility-Amazon partnerships are very low-risk for the larger company and easy to make. CEOs at humanoid companies tell me they have no problem getting hundreds of leads.
So a successful strategist will try to gain an edge in the scarcer resources: talent, chips, data, and (for those who believe hardware is an edge) robots.
Chips: notably, every major North American effort has decided it needs to team up with a giant foundation-model provider to have the chips and frontier models to compete: Figure with OpenAI, Tesla with xAI, NVIDIA in-house, and Boston Dynamics with TRI's foundation model team.
Robots: most are attempting to make their own; however, Skild, NVIDIA, and Physical Intelligence have all taken a partnering or purchasing approach instead. Whether a competitor sees robots as a competitive advantage or an expense is a major dividing line in strategy in this area.
Talent: immensely cut-throat. Until very recently, robotics+AI was a very niche field. An investor told me he believes there are ~25 people in the world who could lead one of these companies well. Even below leadership, the number of people with any training at all in this subfield is in the low hundreds. The best-paying outfits in the world with the best reputations take 6 months to hire someone, and are often just waiting for new PhDs to graduate to fill positions. Talent with <1 year of professional experience but relevant education (usually a PhD) can fetch $500k-1M/yr in this field, and/or significant equity, depending on the size of the company.
And finally, data is the most strategic asset these companies seek to accumulate long-term. At the end of the day, the firm that has the best data (or the best strategy for getting it) wins the game.