# The Data Wall

## Every major AI breakthrough was a data breakthrough first

GPT didn't happen because someone invented transformers. It happened because the internet had already generated trillions of words of text sitting there waiting to be scraped. Self-driving didn't happen because someone figured out lidar. It happened because Tesla put cameras on millions of cars and Waymo drove billions of real-world miles.

Robotics is no different. Foundation models like [π₀ from Physical Intelligence](https://x.com/physical_int/status/2003161637734518985) can wash pans, clean windows, and make sandwiches. [Generalist AI's GEN-0](https://x.com/GeneralistAI/status/1985742083806937218) just proved that scaling laws hold for robotics: bigger models + more data = strictly better robots. Google's RT-X models trained on data from [22 different robots across 21 institutions](https://robotics-transformer-x.github.io/) show 50% better performance than any single-robot approach.

***

## The numbers are bleak

Here's the gap we're dealing with:

| Domain             | Training data available          | Status                                  |
| ------------------ | -------------------------------- | --------------------------------------- |
| Language (LLMs)    | Trillions of tokens              | ✅ Scraped from the internet             |
| Vision             | Billions of images               | ✅ Scraped from the internet             |
| Autonomous driving | Billions of miles                | ✅ Collected from camera-equipped fleets |
| **Robotics**       | **\~100K hours total worldwide** | ❌ Collected by hand, one demo at a time |

> *"Robot interaction data amounts to just one in a hundred thousand of what LLMs process."* - [Survey of Embodied AI Data Engineering, CUHK (2025)](https://airs.cuhk.edu.cn/sites/default/files/2025-06/Survey_Arxiv.pdf)

> *"Unlike text or images, which are readily available online, robotics requires real-world physical interactions. You can't just search the web for millions of examples."* - Ben Levin, GM of Physical AI at Scale AI, [via IBM Think](https://www.ibm.com/think/news/the-data-gap-holding-back-robotics)

The Open X-Embodiment dataset, the largest open-source robot dataset ever assembled, pools about 1M trajectories from 34 labs worldwide. GPT-4, by comparison, trained on roughly 13 trillion tokens, a gap of **seven orders of magnitude**.
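A back-of-the-envelope check of that gap, using the figures above (trajectories and tokens are not directly comparable units, so this is an order-of-magnitude illustration, not a precise measurement):

```python
import math

# Figures cited in the text
open_x_trajectories = 1e6   # ~1M trajectories (Open X-Embodiment)
gpt4_tokens = 13e12         # ~13 trillion tokens (GPT-4)

gap = gpt4_tokens / open_x_trajectories
orders_of_magnitude = math.floor(math.log10(gap))

print(f"ratio: {gap:.1e}")                            # 1.3e+07
print(f"orders of magnitude: {orders_of_magnitude}")  # 7
```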

***

## Why this matters right now

The humanoid robotics industry is in full sprint:

**The funding explosion:**

* 2025 saw **$6.1 billion** invested in humanoid robotics across 139 deals, a [300%+ increase from 2024](https://pitchbook.com/news/articles/apptronik-raises-520m-as-vc-funding-for-humanoid-robotics-explodes-300), per PitchBook
* Figure AI raised [$1B at a $39B valuation](https://x.com/adcock_brett) in September 2025 and Unitree is preparing for an IPO → capital allocators are hungry to deploy

**The cost collapse:**

* Goldman Sachs estimates humanoid BOM costs dropped [\~40% in a single year](https://x.com/GrishinRobotics/status/1998347017370648594): from $50K-$250K (2024) to $30K-$150K (2025)
* With Chinese supply chains, Morgan Stanley puts the number at [$46K today vs. $131K non-Chinese](https://x.com/azeem/status/1919708745745227917), projected to hit $16K by 2034
* iCapital projects BOM to reach \~$40K by 2026, comparable to an average U.S. worker's annual salary, and [\~$10K by 2040](https://icapital.com/insights/investment-market-strategy/icapital-market-pulse-ai-gets-a-body-the-coming-rise-of-humanoids/)
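The Morgan Stanley figures above imply a steady annual rate of decline. A minimal sketch, assuming "today" means 2025 (nine years of decline to 2034); this is an illustrative compound rate derived from the cited endpoints, not an independent forecast:

```python
# Implied annual cost decline: $46K today -> $16K by 2034
start_cost, end_cost = 46_000, 16_000
years = 2034 - 2025  # assumes a 2025 baseline

annual_decline = 1 - (end_cost / start_cost) ** (1 / years)
print(f"implied decline: {annual_decline:.1%}/year")  # ~11.1%/year
```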

**The scaling proof:**

* Generalist AI trained GEN-0 on [270,000+ hours of real-world manipulation data](https://x.com/EmbodiedAIRead/status/1987740960894173256) and proved scaling laws hold for robotics, adding 10,000 hours/week and accelerating
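Extrapolating the GEN-0 figures above gives a rough sense of the pace. A sketch assuming a constant 10,000 hours/week; since the source says the rate is accelerating, this is a lower bound:

```python
# Projected data accumulation from the cited figures
current_hours = 270_000   # hours collected so far
hours_per_week = 10_000   # current collection rate (accelerating)

one_year_out = current_hours + 52 * hours_per_week
print(f"hours after one more year: {one_year_out:,}")  # 790,000
```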

{% hint style="warning" %}
**The bottom line:** The hardware is getting cheap. The algorithms are getting good. The only thing standing between here and mass-deployed humanoid robots is data - specifically, the kind of rich, physical-world demonstration data that you cannot simulate, cannot scrape, and cannot shortcut.
{% endhint %}

***

## Why human data specifically?

There's a tempting shortcut: just simulate everything. Build a physics engine, generate synthetic data, and transfer it to the real world.

While simulation is genuinely useful for certain tasks (locomotion, navigation, basic grasping), the "sim-to-real gap" is still massive for anything involving:

* Deformable objects (folding clothes, handling food)
* Tool use (turning a wrench, using scissors)
* Multi-step reasoning (cooking a meal, organizing a shelf)
* Contact-rich tasks (inserting a key, threading a needle)

This is why every leading robotics lab is converging on the same conclusion: **you need real-world human demonstration data.**

The research backs this up across the board:

* **Cross-embodiment transfer works.** UC Berkeley's Sergey Levine showed that models trained on diverse data from multiple robot types [outperform specialists by \~50%](https://robotics-transformer-x.github.io/). The diversity of the data matters more than the robot it was collected on.
* **Egocentric human data transfers to robots.** Research from Stanford (EgoMimic), CMU (EgoZero), and Physical Intelligence all shows that you can train on [first-person human video and transfer those skills to robots](https://x.com/EmbodiedAIRead/status/2001544945770049786). The human hand becomes a "universal manipulator" that robots can learn from.
* **Scale beats curation.** The GEN-0 results show this definitively. Massive, diverse, somewhat messy human data produces better robots than small, perfectly curated robot data.

***

→ [Next: Why Crowdsource](https://whitepaper.caspius.ai/the-thesis/crowdsourced-network)
