# The Data Wall

## Every major AI breakthrough was a data breakthrough first

GPT didn't happen because someone invented transformers. It happened because the internet had already generated trillions of words of text sitting there waiting to be scraped. Self-driving didn't happen because someone figured out lidar. It happened because Tesla put cameras on millions of cars and Waymo drove billions of real-world miles.

Robotics is no different. Foundation models like [π₀ from Physical Intelligence](https://x.com/physical_int/status/2003161637734518985) can wash pans, clean windows, and make sandwiches. [Generalist AI's GEN-0](https://x.com/GeneralistAI/status/1985742083806937218) just proved that scaling laws hold for robotics: bigger models + more data = strictly better robots. Google's RT-X models trained on data from [22 different robots across 21 institutions](https://robotics-transformer-x.github.io/) show 50% better performance than any single-robot approach.

***

## The numbers are bleak

Here's the gap we're dealing with:

| Domain             | Training data available          | Status                                  |
| ------------------ | -------------------------------- | --------------------------------------- |
| Language (LLMs)    | Trillions of tokens              | ✅ Scraped from the internet             |
| Vision             | Billions of images               | ✅ Scraped from the internet             |
| Autonomous driving | Billions of miles                | ✅ Collected from camera-equipped fleets |
| **Robotics**       | **\~100K hours total worldwide** | ❌ Collected by hand, one demo at a time |

> *"Robot interaction data amounts to just one in a hundred thousand of what LLMs process."* - [Survey of Embodied AI Data Engineering, CUHK (2025)](https://airs.cuhk.edu.cn/sites/default/files/2025-06/Survey_Arxiv.pdf)

> *"Unlike text or images, which are readily available online, robotics requires real-world physical interactions. You can't just search the web for millions of examples."* - Ben Levin, GM of Physical AI at Scale AI, [via IBM Think](https://www.ibm.com/think/news/the-data-gap-holding-back-robotics)

The Open X-Embodiment dataset, the largest open-source robot dataset ever assembled, pools about 1M trajectories from 34 labs worldwide. GPT-4, by comparison, trained on roughly 13 trillion tokens, a gap of **seven orders of magnitude**.
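A back-of-the-envelope check of that gap, using the figures above (trajectories and tokens are not directly comparable units, so this is an order-of-magnitude illustration, not a precise measurement):

```python
import math

# Figures cited in the text
open_x_trajectories = 1e6   # ~1M trajectories (Open X-Embodiment)
gpt4_tokens = 13e12         # ~13 trillion tokens (GPT-4)

gap = gpt4_tokens / open_x_trajectories
orders_of_magnitude = math.floor(math.log10(gap))

print(f"ratio: {gap:.1e}")                            # 1.3e+07
print(f"orders of magnitude: {orders_of_magnitude}")  # 7
```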

***

## Why this matters right now

The humanoid robotics industry is in full sprint:

**The funding explosion:**

* 2025 saw **$6.1 billion** invested in humanoid robotics across 139 deals, a [300%+ increase from 2024](https://pitchbook.com/news/articles/apptronik-raises-520m-as-vc-funding-for-humanoid-robotics-explodes-300), per PitchBook
* Figure AI raised [$1B at a $39B valuation](https://x.com/adcock_brett) in September 2025 and Unitree is preparing for an IPO → capital allocators are hungry to deploy

**The cost collapse:**

* Goldman Sachs estimates humanoid BOM costs dropped [\~40% in a single year](https://x.com/GrishinRobotics/status/1998347017370648594): from $50K-$250K (2024) to $30K-$150K (2025)
* With Chinese supply chains, Morgan Stanley puts the number at [$46K today vs. $131K non-Chinese](https://x.com/azeem/status/1919708745745227917), projected to hit $16K by 2034
* iCapital projects BOM to reach \~$40K by 2026, comparable to an average U.S. worker's annual salary, and [\~$10K by 2040](https://icapital.com/insights/investment-market-strategy/icapital-market-pulse-ai-gets-a-body-the-coming-rise-of-humanoids/)
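The Morgan Stanley figures above imply a steady annual rate of decline. A minimal sketch, assuming "today" means 2025 (nine years of decline to 2034); this is an illustrative compound rate derived from the cited endpoints, not an independent forecast:

```python
# Implied annual cost decline: $46K today -> $16K by 2034
start_cost, end_cost = 46_000, 16_000
years = 2034 - 2025  # assumes a 2025 baseline

annual_decline = 1 - (end_cost / start_cost) ** (1 / years)
print(f"implied decline: {annual_decline:.1%}/year")  # ~11.1%/year
```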

**The scaling proof:**

* Generalist AI trained GEN-0 on [270,000+ hours of real-world manipulation data](https://x.com/EmbodiedAIRead/status/1987740960894173256) and proved scaling laws hold for robotics, adding 10,000 hours/week and accelerating
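Extrapolating the GEN-0 figures above gives a rough sense of the pace. A sketch assuming a constant 10,000 hours/week; since the source says the rate is accelerating, this is a lower bound:

```python
# Projected data accumulation from the cited figures
current_hours = 270_000   # hours collected so far
hours_per_week = 10_000   # current collection rate (accelerating)

one_year_out = current_hours + 52 * hours_per_week
print(f"hours after one more year: {one_year_out:,}")  # 790,000
```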

{% hint style="warning" %}
**The bottom line:** The hardware is getting cheap. The algorithms are getting good. The only thing standing between here and mass-deployed humanoid robots is data - specifically, the kind of rich, physical-world demonstration data that you cannot simulate, cannot scrape, and cannot shortcut.
{% endhint %}

***

## Why human data specifically?

There's a tempting shortcut: just simulate everything. Build a physics engine, generate synthetic data, and transfer it to the real world.

While simulation is genuinely useful for certain tasks (locomotion, navigation, basic grasping), the "sim-to-real gap" is still massive for anything involving:

* Deformable objects (folding clothes, handling food)
* Tool use (turning a wrench, using scissors)
* Multi-step reasoning (cooking a meal, organizing a shelf)
* Contact-rich tasks (inserting a key, threading a needle)

This is why every leading robotics lab is converging on the same conclusion: **you need real-world human demonstration data.**

The research backs this up across the board:

* **Cross-embodiment transfer works.** UC Berkeley's Sergey Levine showed that models trained on diverse data from multiple robot types [outperform specialists by \~50%](https://robotics-transformer-x.github.io/). The diversity of the data matters more than the robot it was collected on.
* **Egocentric human data transfers to robots.** Research from Stanford (EgoMimic), CMU (EgoZero), and Physical Intelligence all shows that you can train on [first-person human video and transfer those skills to robots](https://x.com/EmbodiedAIRead/status/2001544945770049786). The human hand becomes a "universal manipulator" that robots can learn from.
* **Scale beats curation.** The GEN-0 results show this definitively. Massive, diverse, somewhat messy human data produces better robots than small, perfectly curated robot data.

***

→ [Next: Why Crowdsource](https://whitepaper.caspius.ai/the-thesis/crowdsourced-network)
