How do humans "sell" themselves to nurture smarter AIs?
Lin Zhixia does the same thing every day: teaching AI to be more human.
She listens again and again to the Cantonese speech the model generates, judging where it sounds unnatural, where it sounds mechanical, and where it does not sound like a native Cantonese speaker. Sometimes she catches a slightly off nasal sound in a single character; sometimes she spots a subtly swallowed syllable.
Over the past two years, she has passed this experience on to the AI bit by bit, and she has watched the AI become more and more "human".
By the end of 2025, the voice model she was responsible for training was able to fluently complete most Cantonese scenarios. The problems that used to require repeated corrections occurred less and less often.
As the model improved, a subtle unease began to emerge: she found it increasingly difficult to tell whether she was training an AI or something that might one day replace her.
This contradiction is not Lin Zhixia's alone. From data strategists at Internet giants to doctoral students who write Rubrics (scoring criteria) part-time, from product image reviewers to voice model evaluators, a new group of workers is doing the same thing: breaking their knowledge, experience, and judgment down into a form that machines can learn.
They are AI trainers. And they may also be the first group of people to participate in creating their own replacements with their own hands.
Seen from a long-term perspective, this is not just a story of occupational change. It is more like the first large-scale transfer of judgment to machines in human history.
From Drawing Boxes Around Cats to Teaching AI to Think
AI trainers are not a new profession that emerged in the ChatGPT era.
As early as around 2010, with the rise of deep learning, a large number of data annotators appeared in the artificial intelligence industry chain. They drew boxes around cars and traffic lights in pictures, marked pronunciations for voice data, and supplemented road condition information for map data.
At that time, the industry generally believed that "data is the oil of the new era."
The ImageNet competition in 2012 became a key turning point in the development of deep learning. Over the following decade and more, technology companies around the world collected data at a frantic pace. A number of specialized data annotation enterprises were also born in China, forming a huge data annotation industry in Guizhou, Henan, Shanxi and other places.
At that time, annotators were more like assembly-line workers. If the model couldn't recognize a cat, humans would tell it what a cat was; if the model couldn't recognize a car, humans would draw boxes around cars one by one.
The task of AI trainers was to provide answers to machines.
When Lin Zhixia first joined the iFlytek AI Research Institute, many of her tasks also had this "assembly-line" flavor.
Every day she searched platforms such as Bilibili and Himalaya for language material, screened for videos with a single, clean human voice and no background noise, and then organized them into datasets for training. "In the beginning, it wasn't that profound," she said. "It was more about data preparation."
But soon, she found that things were changing. When she first took over the project in 2024, the Cantonese voice model trained by the team still seemed clumsy. The machine would stutter when speaking, the speed would be erratic, the tone was not stable enough, and many sentences still sounded very mechanical. "You could tell it was a machine as soon as you heard it."
At that time, many domestic voice models were still in the catching-up stage. "It's unrealistic to catch up with what the United States has been doing for 20 years in five or six years," Lin Zhixia said.
But the progress of AI far exceeded many people's expectations. More than a year later, when she left the project, the same model was able to fluently complete most Cantonese expressions. The intonation, pauses, and rhythms were getting closer and closer to those of a real person, and it could even imitate the accent characteristics of different regions. "It's really becoming more and more human."
A similar change also occurred at JD.com. Chen Ruoning joined JD.com in 2025 and was responsible for the annotation work related to product image generation. When she first joined the company, the team's requirements for AI image generation were not high. "As long as it could cut out the product and change the background, we thought it was good enough."
But just half a year later, the situation was completely different. Google's Nano Banana model changed everything. Scenarios that used to require a lot of manual design and post-processing can now be generated automatically by the model. Given a washing machine, it can generate a scene of a user opening the washing machine door; given a piece of clothing, it can automatically pair it with a human model, lighting, and a display environment.
More importantly, the model began to understand the meaning behind the pictures. In the past, large models handled Chinese poorly, and the text in generated product pictures often came out garbled. Many e-commerce teams even made "don't let the model write" their default.
Now, the model can not only recognize the text on product pictures but also understand the selling points behind the product information. After recognizing an enamel cup, it will generate descriptions such as "durable" and "not easy to break"; after recognizing baby products, it will also automatically adjust the copywriting style.
The changes are happening so fast that many training rules are constantly becoming invalid.
Outsourced annotator Meng Lin felt this deeply. When he first entered the industry in 2025, he was responsible for a large number of multiple-choice question training tasks. The rule sets written at that time almost always included one rule: "The answer must not exceed the given option range." The reason was that the model often invented a fourth answer outside the three options.
But by the beginning of 2026, this rule was cancelled. "The quality inspectors told us directly not to write it anymore," he said. "Because now the model doesn't make such low-level mistakes anymore."
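The retired rule amounts to a simple validation check. The sketch below is only an illustration of that idea; the function name and the normalization step are assumptions, not the platform's actual tooling.

```python
def answer_within_options(answer: str, options: list[str]) -> bool:
    """Return True if the model's answer is one of the given options.

    Hypothetical helper: comparison ignores case and surrounding spaces.
    """
    normalized = {option.strip().lower() for option in options}
    return answer.strip().lower() in normalized


# A model that invents a fourth answer fails the check.
options = ["A", "B", "C"]
in_range = answer_within_options(" b ", options)
out_of_range = answer_within_options("D", options)
```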
The model is overcoming more and more mistakes that used to require manual correction. And this also means that trainers must find new problems. The question has changed from "what is the correct answer" to "what is a better answer".
Behind this change is a transformation that the entire large-model industry is going through.
Handing Over One's Judgment to Large Models
If the pre-training era taught AI knowledge, then the post-training era teaches AI how to use knowledge.
In this production chain that makes AI more "intelligent", two types of people are the key nodes. One type is the "test-takers", who face the tasks directly and produce data according to the rules. The other type is the "question-setters", who design tasks, break down fields, write rules, and set standards.
What these two types of people jointly accomplish is the same thing: structuring human judgment.
Zhou Yiheng is responsible for data strategy at ByteDance and is a "question-setter" in the chain. In his opinion, ordinary users see that AI is getting better at chatting and writing articles, but what is really changing is the internal structure of the model's abilities.
"The base model actually just predicts the next word," he said. "It has learned a lot of knowledge, but it doesn't know how to connect the knowledge."
In other words, the model knows many facts but doesn't know when to call on them. This is exactly the problem that post-training needs to solve.
For example, suppose a user asks: "It's half past twelve now, and I haven't had lunch yet. Please recommend a nearby Japanese restaurant with an average cost per person of less than 40 yuan." For a human, this is a simple request.
But the model has to complete a series of complex actions. It first needs to understand what the user really wants; then call the geographical location tool to obtain the coordinates; convert the coordinates into business district information; then call the local life tool to screen for restaurants that meet the conditions; and finally organize the results into natural language and return them to the user.
Throughout the process, the model not only needs to understand language but also learn to plan, reason, and make decisions. These abilities cannot be directly learned from Internet web pages.
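The chain of steps in the restaurant example can be sketched in code. This is a minimal illustration, not how any real assistant is built: every function, name, and restaurant below is a hypothetical stand-in, and the intent parsing that a model would do is hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    name: str
    cuisine: str
    avg_price_yuan: float


def parse_request(text: str) -> dict:
    # Step 1: understand the request. A real model would extract this;
    # here the intent is hard-coded for illustration.
    return {"cuisine": "Japanese", "max_price_yuan": 40}


def get_coordinates() -> tuple[float, float]:
    # Step 2: hypothetical location tool returning made-up coordinates.
    return (0.0, 0.0)


def coordinates_to_district(coords: tuple[float, float]) -> str:
    # Step 3: hypothetical conversion of coordinates to a business district.
    return "District A"


def search_restaurants(district: str) -> list[Restaurant]:
    # Step 4: hypothetical local-life tool with invented sample data.
    return [
        Restaurant("Sushi Corner", "Japanese", 35.0),
        Restaurant("Ramen Palace", "Japanese", 68.0),
        Restaurant("Noodle Stand", "Cantonese", 20.0),
    ]


def recommend(text: str) -> str:
    intent = parse_request(text)
    district = coordinates_to_district(get_coordinates())
    matches = [
        r for r in search_restaurants(district)
        if r.cuisine == intent["cuisine"]
        and r.avg_price_yuan < intent["max_price_yuan"]
    ]
    # Step 5: turn the filtered results into a natural-language reply.
    if not matches:
        return f"No matching restaurants found in {district}."
    names = ", ".join(r.name for r in matches)
    return f"Options in {district} under {intent['max_price_yuan']} yuan: {names}"


reply = recommend("Recommend a nearby Japanese restaurant under 40 yuan.")
```

The point of the sketch is the ordering: each step depends on the output of the one before it, which is why the model must plan the sequence rather than simply recall a fact.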
Over the past decade, the AI industry has put its faith in scaling: more parameters, richer data, and greater computing power. But around 2024, an increasingly obvious problem began to emerge: high-quality Internet data was approaching its ceiling.
The high-quality content in public web pages, forums, encyclopedias, and news is not infinite. When almost all large-model companies train on similar data, it becomes increasingly difficult to achieve breakthroughs in ability just by expanding the scale.
The industry began to look for new resources. This time, the resource is not web pages but the "judgment" that is so hard to extract from the human brain. How doctors diagnose diseases, how lawyers construct arguments, how researchers read papers, how native speakers sense what sounds right... This knowledge, which originally existed only in experience, has begun to become the most important training material of the post-training era.
What AI needs to learn is no longer knowledge itself but why humans think in this way. In the past, the work of AI trainers was to tell the model what the answer was; now, they need to tell the model why this is the answer.
This change has made the entire profession shift from "data workers" to "knowledge workers".
Meng Lin, the "test-taker", felt this deeply. This doctoral student in the humanities and social sciences began to participate in large-model training projects part-time in 2025. When he first entered the industry, most of the tasks he encountered were relatively standardized: judging whether the answer was correct, comparing which of the two answers was better, and supplementing the citation source.
But soon, the task difficulty began to increase rapidly. Now, he needs to write a response of hundreds of words around a humanities and social sciences question and attach more than twenty Rubrics at the same time.
A Rubric is essentially a set of scoring criteria. Each criterion needs to state clearly which paper is cited, why this paper is cited, how this paper supports the current view, whether the citation logic is sufficient, and what score should be given in the end.
This means that he not only needs to give the answer but also needs to lay out his entire thinking process.
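One way to picture a single Rubric criterion is as a structured record. The sketch below is an assumption about how such a record might look; the field names and the completeness check are illustrative and do not reflect the platform's actual format.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    cited_source: str       # which paper is cited
    why_cited: str          # why this paper is cited
    how_it_supports: str    # how the paper supports the current view
    logic_sufficient: bool  # whether the citation logic is sufficient
    score: int              # the score given in the end


def missing_fields(item: RubricItem) -> list[str]:
    """Return the names of the explanation fields that were left blank."""
    text_fields = ("cited_source", "why_cited", "how_it_supports")
    return [name for name in text_fields if not getattr(item, name).strip()]


# Placeholder values; no real paper is referenced here.
draft = RubricItem(
    cited_source="(placeholder paper title)",
    why_cited="",
    how_it_supports="Provides evidence for the main claim.",
    logic_sufficient=True,
    score=2,
)
gaps = missing_fields(draft)
```

A record with an empty "why" field would be flagged as incomplete, which mirrors the requirement that every citation come with its reasoning.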
"In the past, ten citations might be enough, but now it has to be more than twenty," he said. "And each one has to be explained why." Now, it takes him three or four hours to write one.
To ensure data quality, the platform requires all citations to come from papers, official documents, or the websites of authoritative institutions, and it even uses screen recording and multi-model cross-comparison to prevent the direct use of AI-generated content. "If the content you submit has a similar logic to an AI's answer, it will be detected," Meng Lin said.
To some extent, what the platform buys is not the answer but the process by which humans form the answer.
Meng Lin gradually realized that what he really provides to the model is not knowledge itself but the connection between knowledge. Why is this paper more important than another paper? Why can this view support the current conclusion? Why can two seemingly unrelated research results be connected?
These "whys" are exactly the parts that large models lack the most and are also the most valuable parts of humans.
The same thing happened to Lin Zhixia. As the model's ability continued to improve, her work shifted from finding language material to listening tests. A listening test is not simply a matter of judging whether the sound is correct but of judging whether it sounds enough like a real person.
Whether the lateral and nasal sounds of a single character are accurate, whether the stress position of a sentence is natural, whether there are subtle differences between the Cantonese of one region and that of another region... These problems are difficult to write as standard answers, and even many native speakers themselves can't explain clearly.
"Many people can't tell the difference between Hong Kong Cantonese and Guangzhou Cantonese," Lin Zhixia said. "But someone in the project must be able to tell."
This ability does not come from textbooks but from long-term immersion in the language environment. It is closer to an intuition.
The work of AI trainers is to take this intuition apart: to break it down into rules, tags, and scoring criteria, and finally turn it into data that the model can learn.
After leaving the job, Lin Zhixia is still occasionally invited back by the original project team to participate in evaluations. "Sometimes when they have meetings and discussions, they still pull me in," she said with a smile, because only she can tell the difference.
But she also knows that this irreplaceability is constantly shrinking. Every evaluation, every correction, and every piece of feedback essentially helps the model narrow the gap with her.
On the other hand, Chen Ruoning is also experiencing a similar change. She is responsible for product image generation. In the past, the team only needed to judge whether the pictures were in violation or had obvious errors. Now, the model can generate a complete enough product scene. The new question becomes: Is it good enough?
This may seem simple, but it is much more difficult than judging right or wrong. What kind of background counts as high-class? What kind of lighting better fits the brand's tone? What kind of model posture is more natural? What kind of composition is more likely to drive sales? There are no standard answers to these questions.
Therefore, trainers have to convert vague aesthetic feelings into specific rules. The business side says it hopes the pictures have a "high-class feeling". The training team has to break that down: Does the high-class feeling come from white space or color? From light and shadow or material? From scene design or the state of the people in the picture?
The judgment that originally existed in experience is gradually translated into a language that machines