
When TSMC Writes Humanoid Robots in Its Financial Report: The "Open Card" on the Chip Side and the "Stealth Battle" on the Data Side

IoT Think Tank (物联网智库) · 2026-05-07 19:52
The next moat is very likely to be built on the data side.

At the end of April, at TSMC's 2026 North America Technology Symposium, this semiconductor industry giant spent a great deal of time depicting an industry that has not yet been fully defined: Humanoid Robots.

TSMC provided an extremely precise industry definition:

Humanoid Robots = Agentic AI + Physical AI

This indirectly confirms a grand trend: AI is undergoing a historic leap, moving from "understanding the world" to "participating in the world."

Previously, TSMC systematically broke humanoid robots down into four technical quadrants: Brain, Sensing, Movement, and Power. Each quadrant corresponds to a specific set of chip systems: AP, connectivity chips, sensors, MCU, PMIC... Together, they piece together a complete silicon-based roadmap.

TSMC plans to triple the production capacity related to humanoid robot chips in the next three years. This means that embodied intelligence will, for the first time, translate into real money on the financial reports of chip giants.

So far, the story on the hardware side is clear. The roadmap has been drawn, the financial reports have been released, and tripling production in three years is a commitment written into the board's resolutions.

However, there is a fatal question that no one has answered directly so far: Who will "feed" these chips?

As 2.5 million robot-grade chips are scheduled for production at TSMC every year globally, and as each AP, MCU, and PMIC waits to be given a "soul," the industry suddenly realizes a harsh reality: the hardware is advancing at an exponential rate, while the data is still stuck where it was two years ago.

This is the biggest "gap" in the embodied intelligence industry in 2026.

The situation on the hardware side is basically settled. However, what truly determines the outcome of the industry from 2026 to 2030 is a hidden card that few companies dare to claim they have fully understood: Data.

The training corpus for large text models is measured in tens of billions of hours, while the current industry-wide stock of high-quality embodied intelligence data is only about 500,000 hours. Expanding from 500,000 hours to tens of billions of hours represents a 20,000-fold increase. This is not an ordinary market opportunity; it is a national-level race.
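That multiple is simple arithmetic; a minimal sketch, taking "tens of billions" as 10 billion hours (an assumption for illustration only):

```python
# Sanity-check the data-gap multiple cited above.
# Assumption: "tens of billions of hours" is taken as 10 billion (1e10).
text_corpus_hours = 10_000_000_000    # assumed text-model corpus scale
embodied_stock_hours = 500_000        # industry-wide high-quality embodied data stock

gap_ratio = text_corpus_hours / embodied_stock_hours
print(f"required expansion: {gap_ratio:,.0f}x")  # -> required expansion: 20,000x
```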

Behind this hidden card, the most interesting story is unfolding. A large-scale, unprecedented battle for data infrastructure, involving the National Data Administration, local governments, industry giants, and venture capital firms, has quietly begun.

Hardware is an "Open Card": The Data Gap Map Inferred from the Four-Quadrant Framework

In the past, discussions about humanoid robots were almost entirely dominated by obsessions with scenarios: Can it do a backflip? Can it climb stairs? Can it perform martial arts? It seems a robot only attracts buyers when it moves.

However, if we shift our focus from the industry's hotspots back to TSMC's chip map, we will find a counterintuitive fact: The most-discussed quadrant among the four is precisely the one with the relatively lowest technical barrier.

Brain: Corresponds to AP and AI accelerators, requiring data on intention understanding and long-term planning. Currently, it is almost blank.

Sensing: Corresponds to CIS, MEMS, six-axis force/torque sensors, and tactile sensors, requiring multi-dimensional fusion data across vision, hearing, force sense, touch, and proprioception. However, 90% of the industry's efforts focus only on RGB vision.

Movement: Corresponds to MCU and servo, requiring trajectory and force feedback data. This is the absolute center of current discussions.

Power: Corresponds to PMIC and BMS, requiring data that couples energy consumption with action. It, too, is almost blank.

Among the four quadrants, three have not yet established a systematic data collection system.

In particular, the sensing quadrant deserves scrutiny. TSMC's sensor list implies that a robot's sensory data is at least six-dimensional. But how many dimensions is the industry actually collecting today? Two to three. Most companies are still at the primary stage of "RGB video + action labels." A few leading players have introduced VLA (Vision-Language-Action) models, but there is still a gap of at least two orders of magnitude from true high-dimensional multimodality.

This gap is not an engineering problem but a cognitive problem.

Internet giants are entering the field of embodied intelligence with the "muscle memory" of building large models. They are good at collecting videos and processing images. However, when humans perform delicate actions, such as screwing a screw, peeling an egg, or threading a needle... most of the key information does not come from the eyes but from the pressure on the fingertips, the torque of the wrist, and the proprioception of the entire arm. Once this information is missing, even the smartest VLA model is just "performing" rather than "working."

Therefore, the "robot data" collected by Internet giants only solves 10% of the problems in the "sensing" quadrant.

For the remaining 90%, force sense, touch, proprioception, and energy-action coupling data, there is no ready-made path to large-scale collection.

This is not a question of "how much" data but "whether the data is correct."

In China, a small group of players have begun to quietly play these three hidden cards. Their story starts with five cognitive misunderstandings currently plaguing the industry.

Data Fog: Five Cognitive Misunderstandings Are Misleading the Entire Industry

Misunderstanding 1: Equating Embodied Intelligence Data with Video Data.

This is the most popular lazy answer: Since large models are built on corpora, robots can continue to rely on videos. However, Internet videos are from an "observer's perspective," while embodied intelligence requires "first-person + multimodal action data." Text large models only need "knowledge," while embodied large models need "experience," and experience must include the process data of "making mistakes and being corrected." Watching ten thousand hours of cooking videos will not teach an AI the optimal force for holding a knife when cutting vegetables.

Misunderstanding 2: Believing that Simulation + World Model Can Completely Replace Reality.

The gap between simulation and reality is not an engineering problem but one of physical nature. A slight voltage fluctuation, a small change in the ground friction coefficient, or a subtle texture on a leather surface can cause an algorithm that runs perfectly in simulation to fail immediately in the real world. The optimal ratio of real, simulated, and video data will become the most closely guarded secret of model companies in the coming years.

Misunderstanding 3: Equating Data Collection Costs with Equipment Costs.

A research institution once offered a statistic: a trainer works 8 hours a day, but only two to three hours of usable data come out of it. A robot needs thousands of hours of accumulated data to learn a single action like "picking up a cup." This puts the industry's current data yield rate at roughly 25% to 37.5%. The real cost lies not in the collection equipment but in annotation, verification, and skill abstraction. A player who can raise the yield rate from 30% to 70% gains a direct 2-to-3x cost advantage, a greatly underestimated opportunity.
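The yield-rate and cost figures follow directly from the quoted numbers; a sketch of the arithmetic (the 30%-to-70% improvement is the scenario named in the text):

```python
# Yield rate: usable hours out of an 8-hour trainer shift.
hours_worked = 8.0
usable_low, usable_high = 2.0, 3.0

yield_low = usable_low / hours_worked     # 0.25
yield_high = usable_high / hours_worked   # 0.375
print(f"yield rate: {yield_low:.0%} to {yield_high:.1%}")

# Cost per usable hour scales inversely with yield, so raising the
# yield from 30% to 70% cuts the cost per usable hour accordingly.
cost_advantage = 0.70 / 0.30
print(f"cost advantage: {cost_advantage:.2f}x")  # -> cost advantage: 2.33x
```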

Misunderstanding 4: Equating Dexterous Hand Data with Ordinary Robot Data.

Ordinary robot data emphasizes scene breadth and navigation/obstacle avoidance, while dexterous hand data requires high-dimensional, multimodal, strongly sequential information that integrates posture, force sense, touch, and more. The current supply of high-quality dexterous hand data is less than 10% of actual industrialization demand; this is not a capacity problem but a structural shortage. Some 60% of a humanoid robot's commercial value depends on whether its hands can work. Collecting dexterous hand data costs 5 to 10 times as much as leg data, yet many companies are busy collecting mobility data to "occupy territory," leaving the true bottleneck without systematic investment.

Misunderstanding 5 (the Most Hidden Misunderstanding): Treating Data as a Consumable Rather Than an Asset.

In most companies' financial models, data collection is a one-time expense in the training phase: once the model is trained, the data is archived and forgotten. Yet what truly determines long-term competitiveness is whether that data can be consolidated into reusable, circulable, standardized high-quality datasets that repeatedly train next-generation models, migrate to new embodiments, and are shared with ecosystem partners. Once the industry starts treating data as an "asset," the underlying logic changes completely, because assets require infrastructure to carry them, confirm rights over them, and circulate them.

And this is the fundamental reason for the "national team" to enter the field.

The State Enters the Scene: The Evolution from "Market Failure" to "Infrastructure Building"

Once data is elevated to an "asset," it necessarily calls for a very solid underlying infrastructure to carry it: rights confirmation, pricing, circulation, security, and standards are all essential. This grand systematic project is destined to exceed the carrying capacity of any single enterprise, and the state must lay the "highway."

On April 28, the Ministry of Industry and Information Technology and the National Data Administration jointly issued the "Notice on the Joint Implementation of the 2026 'Model-Data Resonance' Action." However, if we trace the origin, the starting gun for this "Model-Data Resonance" was quietly fired as early as December 2024.

At that time, the National Development and Reform Commission, the National Data Administration, and other departments jointly issued a document that, for the first time, put the strategic position of "high-quality datasets" on the table. After more than a year of groundwork, the policy side is now sending a strong "get practical" signal: in 2026, more than 30 national standards in the data field will be introduced in quick succession, with in-depth layouts in frontier areas such as agents and embodied intelligence.

Following this is the joint action plan of the Ministry of Industry and Information Technology and the National Data Administration: by the end of 2026, a virtuous cycle of "data-model-scenario application" will be basically in place, and "Model-Data Resonance" spaces will be encouraged to interconnect with the national data infrastructure.

The term "Model-Data Resonance" is worth pondering: it means the state now officially lists "models" and "data" as the two cornerstones of new infrastructure.

The highest-level endorsement comes from the "15th Five-Year Plan Outline": build high-quality datasets, and cultivate and develop future industries such as embodied intelligence, brain-computer interfaces, and 6G. Here, embodied intelligence is no longer a supporting role but a national-level future industry on par with 6G.

If policies are the superstructure, physical infrastructure has been springing up in more than 20 cities.

The National-Local Joint Embodied Intelligence Robot Innovation Center has been established in Yizhuang, Beijing, where more than 300 robotics and intelligent-manufacturing ecosystem enterprises have gathered; the "Implementation Plan for the Development of the Embodied Intelligence Industry" issued by Shanghai in August 2025 states that by 2027 the core industry will exceed 50 billion yuan in scale; Zhangjiang has built the country's first heterogeneous humanoid robot training ground, aiming to accumulate 10 million high-quality embodied data entries within the year; and the Tianjin Pacini Super Data Factory covers 12,000 square meters and produces nearly 200 million high-dimensional training data points annually.

A data collection army jointly built by state-owned assets and local governments is mining the "crude oil" of the embodied intelligence era in a standardized, large-scale manner.

The special feature of embodied data is that, unlike government data, which is a stock asset, it requires real-time, incremental collection and dynamic interaction. This means the most likely relationship between enterprises and the state is not substitution but division of labor: the state builds the foundation of general-knowledge datasets, while enterprises create differentiated, specialized-knowledge datasets. The state lays the foundation; enterprises build the superstructure.

This is the most likely evolutionary path for China's embodied data industry.

Global Coordinate System: The Divergence in Development Philosophy Behind the Path Choices of China and the United States

If we look across the Pacific Ocean, we will find that China and the United States have chosen two completely different paths for embodied data. This is not simply a dispute over technical routes but a split in the underlying development philosophy.

The United States follows the "Simulation First" route.

The core logic of this approach is to "create" data with computing power. NVIDIA's GR00T Blueprint can generate 780,000 synthetic trajectories in 11 hours, equivalent to 6,500 hours of human demonstration data. When real-world data collection is too expensive, computing power replaces collection, and that computing power is, conveniently, NVIDIA's own product.
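The leverage implied by those reported figures can be made explicit with a back-of-envelope calculation, using only the numbers quoted above:

```python
# Back-of-envelope on the reported GR00T Blueprint figures.
synthetic_trajectories = 780_000
generation_hours = 11
equivalent_demo_hours = 6_500   # reported human-demonstration equivalent

# Time compression: demonstration hours replaced per hour of generation.
time_compression = equivalent_demo_hours / generation_hours
print(f"time compression vs. human demos: ~{time_compression:.0f}x")  # ~591x

# Implied density: synthetic trajectories per equivalent demo hour.
per_demo_hour = synthetic_trajectories / equivalent_demo_hours
print(f"trajectories per demo-hour: {per_demo_hour:.0f}")  # -> 120
```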

China follows the "Reality First" route.

From Yizhuang in Beijing to Zhangjiang in Shanghai, from the Pacini Super Data Factory in Tianjin to the Wuhan Innovation Center, physical data collection factories are springing up. The Pacini Super Data Factory has built the world's first VTLA (Vision-Touch-Language-Action) embodied intelligence model. When top-level computing power is limited, it uses real-world scenarios to obtain data and builds factories to collect it.

The underlying assumptions of the two methods are completely opposite.

The United States believes that "computing power will ultimately overcome reality." As long as there are enough GPUs and a good enough world model, bits can generate atomic-level reality.

China believes that "real - world scenarios will ultimately overcome computing power." As long as there are enough factories and a full range of modalities, data such as touch, force sense, and proprioception, which cannot be perfectly simulated, will become the ultimate moat.

This divergence is not accidental but an industrial adaptation forced by external constraints. GPU export controls prevent China from conducting simulation synthesis on the same scale. However, this forced shift has given rise to a greatly underestimated asymmetric advantage: VTLA.

From VLA to VTLA, the addition