Behind the trillion-yuan data industry: the office workers trapped by AI

Baobian 2026-06-09 17:53
Data problem-solvers

「Core Tip」

The development of AI has brought about new positions such as data annotation and data collection. However, career bottlenecks and salary limitations have restricted the inflow of talent into these positions, which in turn affects the ceiling of AI capabilities.

The development of AI is reshaping the division of labor: humans handle the upper-level "judgment and decision-making" and the lower-level "labeling and organization", while the middle-level "analysis and summarization", the mental labor traditionally done by analysts, consultants, and secretaries, is being taken over by various AI tools.

The good news is that new positions have emerged at both the upper and lower levels, such as data annotation, data construction, and data collection. These positions are pouring into the job market at an unprecedented speed. According to a Maimai report, the number of AI positions in the 2026 spring recruitment season increased 8.7 times year-on-year.

Data collection is closely related to embodied intelligence: collectors need to wear motion capture equipment to record multi-modal data such as tactile, visual, and mechanical data, helping robots learn actions such as grasping, walking, and obstacle avoidance.

Data construction is a process of "removing impurities" from data: public data or enterprise databases often have chaotic formats and errors, which require manual screening and organization.

Data annotation is the "referee" for the content output by AI: it tells the large-scale model what kind of output is "good" and helps AI form a positive feedback loop that improves the quality of the model's output.

Are these new jobs a long-term trend or just a flash in the pan? Are they a "broad road for liberal arts students" or just the "new generation's dead-end jobs"? To find out, "Baobian" interviewed people doing this work to reconstruct the real situation behind the new positions spawned by AI.

The True Face of "Data Problem-Solvers"

Jing Li works as an outsourced data annotator at an Internet giant in Beijing. Her job is to improve the output quality of AI cultural and creative tools. She majored in drama and film literature in college.

Jing Li told "Baobian", "The categories I've annotated include speeches, novels, and papers. Now, most of my work is annotating scripts for comic dramas or AI short dramas."

The data annotation industry also hires a large number of part-time workers. Wen Qi, a college student in Chengdu, found a remote part-time annotation job that involves annotating English speech-to-text content.

Their work process is generally as follows: several outputs from the AI are displayed on the computer. The data annotator selects the best result, which is then judged again by colleagues in charge of quality inspection, spot-checked by the person in charge, and finally checked by the client. Based on this best result, AI can gradually "understand" human evaluation criteria, thereby improving its output quality.
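The selection step described above can be pictured as producing preference records. The following is a minimal sketch under stated assumptions: the article does not say how the chosen result is stored or fed into training, and the `PreferencePair` structure and `selection_to_pairs` function are hypothetical names used only for illustration. It shows one common way a "pick the best of several outputs" judgment might be expanded into chosen/rejected pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One 'this output beats that output' record (hypothetical format)."""
    prompt: str
    chosen: str
    rejected: str


def selection_to_pairs(prompt: str, candidates: list[str], best_index: int) -> list[PreferencePair]:
    """Expand a single best-of-N selection into N-1 pairwise preferences."""
    if not 0 <= best_index < len(candidates):
        raise ValueError("best_index must point at one of the candidates")
    chosen = candidates[best_index]
    # Pair the annotator's pick against every other candidate.
    return [
        PreferencePair(prompt, chosen, other)
        for i, other in enumerate(candidates)
        if i != best_index
    ]


# Illustrative data only: three AI drafts, the annotator picks the second.
pairs = selection_to_pairs(
    "Write the opening line of a short drama.",
    ["Draft A", "Draft B", "Draft C"],
    best_index=1,
)
for pair in pairs:
    print(f"{pair.chosen!r} preferred over {pair.rejected!r}")
```

In a real pipeline, the later review stages (quality inspection, spot checks, client checks) would presumably confirm or overturn `best_index` before any record is kept; that part is not modeled here.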

Some of Jing Li's outsourced colleagues have a background in mathematics or computer science. They take on part of the data construction work: crawling public data, cleaning and organizing it in a specified way, and finally supplying it for the annotation and training of large-scale models. In terms of division of labor, data construction is upstream of annotation.

The industry jokingly refers to construction and annotation work as "problem-solving". Without these "data problem-solvers", there would be no AI tools of any kind.

According to estimates by the National Data Development Research Institute, the output value of professional data products (including high-quality data sets for artificial intelligence training) will exceed 2.3 trillion yuan in 2025.

In March 2025, data from the National Data Bureau showed that the seven major data annotation bases in cities such as Chengdu, Shenyang, and Hefei had created jobs for 58,000 people, with related output value exceeding 8.3 billion yuan.

The market is large, and the salaries of these positions vary. Jing Li and her colleagues can get a fixed monthly salary of about 12,000 to 18,000 yuan, and a few people can get additional bonuses; Wen Qi can also get a fixed monthly salary of nearly 10,000 yuan from her part-time job.

However, the salaries for data annotation are not so attractive outside first-tier cities. Jing Li said that in some northern provincial capitals, the salary for the same position is about half of that in Beijing.

In some small cities, the salaries are even lower, and staff turnover is very high. "New employees are swiping the BOSS Zhipin app to look for jobs while waiting for the elevator after work," a newly recruited data annotator in a small city told "Baobian". His first month's salary was 1,500 yuan.

The differences come not only from the city but also from the company's position in the industry. Before the emergence of data annotation, the company where Jing Li works was a well-known outsourcing company in the industry, with clients including several domestic Internet giants.

This also determines their recruitment requirements. Jing Li's position requires experience in screenwriting and literary creation. A few years ago, fresh graduates only needed a bachelor's degree; now they must come from 985/211 universities and have majored in literature. Wen Qi's part-time job is related to English and requires an English major with a TEM-8 score of at least "good".

AI Needs "Referees", "Translators", and "Nannies"

Why does AI need these jobs?

Because AI lacks the judgment that comes from practice. At present, mainstream AI has learned all the public information on the Internet. However, in various industry segments, there is still a large amount of "underwater information": implicit knowledge and experience-based judgment within the industry, and even second-hand information in the market needs to be screened. Data annotation is the "information referee" that helps AI understand human evaluation criteria.

Take the legal field as an example. AI can memorize all the laws and regulations, but when facing the analysis of the evidence chain of a specific case, it needs to understand the judge's judgment tendency in a specific region and the probability of certain evidence being accepted in practice, which will not appear on the judgment document website.

In the script track where Jing Li works, the output quality of AI before annotation is difficult to satisfy humans. "From the perspective of drama creation, there are many obvious problems in the content generated by AI. The standards for dealing with these problems are relatively simple and objective. Sometimes, none of the alternatives given by AI are very good, and it's even difficult to find the optimal one."

If data annotation is the information referee, then data collection for embodied intelligence is the translator between AI and the physical world. There is a vast amount of physical information in the real world. The nervous systems of humans and animals can adapt autonomously, but robots have to rely on humans to "tell" them the real situation.

Previously, some industry insiders said that the training corpus of the large language model GPT-5 is equivalent to about 10 billion hours, while the high-quality embodied data gathered by the entire industry is only about 500,000 hours, a gap of tens of thousands of times.

The large gap in data collection has also fueled investor enthusiasm. Currently, the leading startups in the industry, Guanglun Intelligence and Pasini Perception, both have valuations in the tens of billions.

In 2025, Pasini Perception put into production the world's largest embodied intelligence data collection factory, the Super EID Factory in Tianjin, deploying more than 150 standardized collection units and producing 200 million high-quality training data items annually. In 2026, it built 4 super factories in Suqian, Jiangsu; Wuhan, Hubei; Zigong, Sichuan; and Ganzhou, Jiangxi.

What's complex is not only the physical world but also the enterprise's database. A person working in the manufacturing industry told "Baobian" that there is a development gap between personal and enterprise-level AI Agents because AI is essentially a probabilistic model and struggles to complete some "precise and complex" work in enterprises, such as data management.

An AI product manager said, "For our current data management intelligent agents, data cleaning before formal operation still needs to be done manually. If AI wants to be applied in traditional manufacturing, the requirements for data quality are very high."

The reason is that most manufacturing industries do not use databases in a unified format. Different departments use different data standards, and the same set of data has different field names in different tables. There is also a large amount of redundant information and errors in the data. Since AI may have hallucinations to some extent and cannot accurately digest this "dirty data", it must be cleaned, aligned, and completed.
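The "cleaned, aligned, and completed" step can be illustrated with a small sketch. It is only an illustration under stated assumptions: the field names, the `FIELD_ALIASES` mapping, and the sample rows are invented, and the article does not describe any specific tool. Rather than guessing missing values, this sketch sets incomplete rows aside for a person to fill in, which matches the manual work the interviewees describe.

```python
# Hypothetical mapping from department-specific column names to one shared name.
FIELD_ALIASES = {
    "part_no": "part_id",
    "PartNumber": "part_id",
    "qty": "quantity",
    "Quantity": "quantity",
}
REQUIRED_FIELDS = ("part_id", "quantity")


def align_fields(row: dict) -> dict:
    """Rename known aliases to the shared field names; keep other keys as-is."""
    return {FIELD_ALIASES.get(key, key): value for key, value in row.items()}


def clean(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (clean_rows, rows_needing_manual_completion)."""
    seen = set()
    cleaned, needs_review = [], []
    for raw in rows:
        row = align_fields(raw)
        # Rows with a missing required value go to a human instead of being guessed.
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            needs_review.append(row)
            continue
        # Drop exact duplicates that only appear once field names are aligned.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned, needs_review


# Invented sample: the same record stored under two naming schemes, plus a gap.
rows = [
    {"part_no": "A-100", "qty": 5},
    {"PartNumber": "A-100", "Quantity": 5},
    {"part_no": "B-200", "qty": None},
]
cleaned, needs_review = clean(rows)
print(cleaned)
print(needs_review)
```

Real enterprise data would also need unit conversion, type checks, and conflict resolution between tables, none of which is attempted here.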

As a result, AI tools need someone to act as their "nanny" before they can work in enterprises. Most current enterprise-level AI Agents are deployed in the manufacturing industry as integrated service solutions that cover bringing data online, cleaning the data, and finally the specific application of the AI Agents.

The "Troubles" of Humans and AI

Managers at AI giants, not just those in traditional manufacturing, also hope to improve the efficiency of daily operations through AI. In reality, however, management often expects AI to reduce costs and increase efficiency while underestimating the role of grassroots employees in decision-making.

Some employees of large companies told "Baobian" that the company's strong promotion of AI actually increases work pressure because employees have to "clean up the mess" for AI's work output. Employees are required to complete more tasks with the assistance of AI, but the results output by AI need to be repeatedly checked and corrected manually.

This is also consistent with some public research results.

The employee behavior analysis platform ActivTrak analyzed digital work-behavior data covering more than a thousand enterprises and 443 million hours from 2023 to 2025 and concluded that as AI is implemented in the workplace, the workload of practitioners has not decreased. Instead, weekend overtime and fragmentation of work have increased. The time employees spent on collaborative communication increased by 34%, and multitasking time increased by 12%.

Of course, this pressure generally does not fall on data outsourcing workers. "I go to work at 10 a.m. and get off work at 7 p.m., working 8 to 9 hours a day, and I can take breaks from time to time during the day," Jing Li told "Baobian".

Although she thinks the job offers decent value for the effort, Jing Li is still considering other directions. "My goal is to become a short-drama screenwriter. This current job is very mechanical, and doing it for a long time is not helpful for career development." Most of her colleagues, however, think that a job with a light workload and close to home is hard to find now, so they are staying put for the time being.

The difference in outlook may be related to age. Jing Li has just started working, while most of her colleagues are over 30 years old. At Internet giants, that counts as a relatively old group.

Wen Qi also clearly stated that she only does the part-time data annotation job to earn some extra money and will not look for data annotation jobs in campus recruitment. Most of the people in Wen Qi's part-time job group are students or other people who need to earn quick money.

This may mean that people engaged in data annotation have to face long-term career bottlenecks.

This situation, where there is no participation of industry veterans and limited room for improvement, also limits the capabilities of AI. Some leading data annotation companies have also tried to find professionals, but overall, they have not been successful. A senior lawyer revealed to "Baobian" that a data annotation company approached him, but he refused because the offered price was too low. "Even if you offer me 8,000 yuan per hour, I still have to think about whether to take the risk of losing my job, let alone only 200 yuan per hour?"

The more complex the field that requires judgment, the higher the cost of data annotation, but many annotation companies are not willing to pay a high enough premium. As a result, there is a long-term data gap in these fields, and the performance of models in vertical scenarios is difficult to break through.

Embodied intelligence also faces a similar data price bottleneck, and the consequence is a widening gap between enterprises. Real-machine remote control operation is recognized in the industry as the highest-quality data collection solution. The cost of one hour of effective data can be as high as several thousand yuan. Leading robot companies have the richest real-machine data accumulation thanks to their capital advantages.

Many companies, however, are limited by their capital scale and can only use the public data or simulation data of leading robot companies to train their models. But simulation data deviates from the real physical environment, and a "Sim2Real Gap" (the gap from simulation to reality) often appears when migrating to real machines.

In the long run, the data cost will eventually be diluted with scale. However, AI always has to face the problem of "who is responsible if something goes wrong".

Behind the responsibility is the legal and social recognition of the "personified subject". However, AI is not a legal subject and cannot bear civil liability. If an enterprise uses AI to replace professionals to complete these tasks, once something goes wrong, the responsibility chain will become unclear.

This is another reason why many jobs cannot be replaced by AI. These jobs are not only the cornerstone of AI development but also proof of AI's limitations. As long as AI is still learning human knowledge, as long as the physical world needs to be "translated" into digital language, and as long as society needs a clear responsible subject, these jobs will continue to exist.

(All names in this article are aliases at the request of the interviewees.)

This article is from the WeChat official account "Baobian" (ID: baobiannews), written by Zhang Jingwei and published by 36Kr with authorization.