The Human Company

Vincent Liu, Ademi Adeniji

We exist at a precipice in human history. The last decade has seen the birth of a new class of intelligence that rivals our own cognitive capabilities as humans. One of the greatest promises of artificial general intelligence is its potential to reason about and automate tasks in the physical world, and this future of general-purpose robotics is now within reach. Our mission is to accelerate this path to abundance.

Generalization of deep neural networks scales with Internet-scale data. Both data quality and quantity are needed to produce behaviors that are correct and robust [1]. Teleoperation, by far the most popular approach, has some unforgivable limitations: it requires trained human teleoperators and working robots, and it is constrained by the set of deployable workspaces and tasks. It will be difficult to reach even the 2-trillion-token scale of Llama 2 this way [2], not to mention that data collected on one robot is incompatible with other morphologies. By starting from specific tasks and collecting data retroactively, teleoperation also inherently fails to reflect the true distribution of human labor.

The philosophical question should not be “how do we collect data for this task?” but rather “how do we collect data that represents the collective human experience?” Today’s powerful artificial intelligences only exist because of the Internet, which is a real-time digital reflection of the human experience. As a result, deep neural networks trained continually on the Internet will understand human culture and its evolution over time. Human data is the renewable energy of artificial intelligence and robotics.

Human foundation models trained on human data at global scale will be the only class of models that can exhibit physical intelligence at the scale of humanity. Although the human and robot embodiments are different today, the discrepancy is shrinking rapidly as humanoid hardware is becoming more commercially viable. We believe that the key to crossing this distribution gap is to bootstrap the world knowledge distilled by language and vision foundation models. We have seen a strong emergence of this trend in vision-language models [3], world models [4], and video models [5]. As deep learning continues to scale, the emergent reasoning capabilities in human foundation models will produce general-purpose physical intelligence.

We have spent the past year developing this vision. Without raising any capital, we have developed breakthroughs in robot learning from human behavior. We have shown how to transfer in-the-wild human data from smart glasses into robot policies [6] and tactile human data into robot policies with a sense of touch [7]. Our results promise orders-of-magnitude gains in both sample efficiency and data collection speed over the current approaches of Physical Intelligence, Figure, Tesla, and Google DeepMind. Our method will scale as a human foundation model akin to the robot vision-language-action model [8], while retaining native compatibility with human data. Thus, human foundation models will scale at the rate of human data. Robot foundation models may never scale because robot data is expensive, and learning cross-embodiment invariances is even more expensive. The human foundation model still requires algorithmic innovation, but trends in other domains indicate that open-source academic research and world models will converge to an effective solution relatively quickly. In the next 2-3 years, we expect to see native human foundation models exhibit near-perfect transfer to humanoid robots.

The human-centric philosophy has advantages at both the data and model layers, but neither should be monetized. First, the model layer will be commoditized. LLMs have shown that algorithms are expensive to develop, easily copied, and trending towards $0/token rates. Second, the data layer is important but encounters several headwinds as a business model. Data monetization creates a misalignment of incentives between the laborers paid to collect data and the robotics companies buying the data. The low technical barrier to start such a company also means that competition will squeeze margins on both sides. Furthermore, the “Scale AI of robotics” is a misconception: in robotics, data and labels are inseparable, and the real cost lies in data collection rather than cheap, a posteriori labeling. Unlike LLMs, where the data representation is standardized and static, robotics data collection and model training are tightly coupled and evolving.

And so, the one true robotics business model is to sell automation. The largest value capture will occur at the frontier of GDP growth. Monetizing automation also aligns incentives across the entire stack. People who buy automation (factory robots, house robots) will be incentivized to contribute their own data to improve their own experiences. The human foundation model will be decoupled from any robot hardware and learn continually from human data. The product will be monotonically improving automation driven by a flywheel of indefinite human data [9]. And as goods and hardware become cheaper [10], economic output and margins will exponentiate.

We end this essay on a more philosophical note. Without human intelligence, there would never be silicon intelligence. We exist in a world built by humans for humans (physically, digitally, economically), so an artificial intelligence system succeeds best when it aligns with our incentives. This bleeds into all layers of the stack: data, models, product, monetization. Human data is the path to a singularity, after which robots can autonomously explore and overcome Moravec’s paradox [11]. We are not just creating a system to scale general-purpose robots. We are building the behavioral structures that will usher in humanity’s transition to abundance.

This essay was written by humans without any AI.

Footnotes.

  1. Brad Porter. This Business of Robotics Foundation Models.

  2. Chris Paxton. How can we get enough data to train a robot GPT?

  3. OpenAI. Hello GPT-4o.

  4. Meta. Introducing V-JEPA 2: A self-supervised foundation world model.

  5. Google DeepMind. Veo: Our state-of-the-art video generation model.

  6. Vincent Liu and Ademi Adeniji. EgoZero: Robot Learning from Smart Glasses.

  7. Ademi Adeniji and Vincent Liu. Feel-the-Force: Contact-Driven Learning from Humans.

  8. Physical Intelligence. π0: Our First Generalist Policy.

  9. Tesla. Full Self-Driving (Supervised).

  10. Bain & Company. The Hardware Paradox: Machinery Must Expand beyond Machines.

  11. Ege Erdil, Epoch AI. Moravec’s paradox and its implications.