
AI training data is exhausted. Why has this data annotation company skyrocketed? | Kr-Insight · Hard Tech

Geng Chenfei · 2025-04-01 15:58
Innodata's stock price has risen by 432% in a year.

Author | Geng Chenfei

Editor | Song Wanxin

"Data annotation" is an important part of the industrial chain born along with the development of AI. Especially after the emergence of large models, the scale of the data annotation industry has expanded rapidly. However, with the iteration of large models, as a labor-intensive industry, data annotation has been continuously re-evaluated by the market.

Innodata, a leading data annotation company in the US stock market, is a typical example of this process.

In the past year, Innodata's stock price has soared by 432%. The latest financial report shows that in 2024, Innodata's annual revenue increased by 96.44% year-on-year, and among its eight major customers, five are from the "Magnificent Seven" in the US stock market.

However, solid fundamentals have not stopped market expectations from shifting. After the release of DeepSeek, the market began to doubt how much public data would still be needed for training, and Innodata's stock price turned volatile. In March in particular, the stock fell by more than 30%.

Market opinion on the company is now sharply divided.

Bearish investors argue that Innodata has turned a profit only twice in the past decade, so nothing justifies the surge in its stock price. Bullish investors counter that large models have changed the picture, and that Innodata has shifted its business model toward data cleaning for large models.

01 Re-evaluation of Value

The data annotation industry's first boom came with the development of autonomous driving. A Deloitte report showed that in 2022, before large models took off, autonomous driving accounted for 38% of annotation demand across all downstream AI applications.

Large models have raised the demand for data annotation to another level.

"If large models had not emerged, even Scale AI, the leading company in data annotation in the autonomous driving industry, would have only had an annual revenue of 100 million to 200 million US dollars before 2023. By 2024, Scale AI's annual ARR is expected to be between 1.2 billion and 1.4 billion US dollars, about seven times that of 2022," an investor said.

The scaling law in the large model industry holds that model performance depends on the number of model parameters, the amount of training data, and the computing resources used. Take GPT-4 as an example: its parameter count reportedly grew from about 175 billion in GPT-3 to about 1.8 trillion, and its training dataset expanded from several hundred billion tokens in GPT-3 to 13 trillion tokens.

Innodata, whose business is concentrated in data engineering, has benefited substantially as a "shovel seller" to the large model industry.
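
One commonly cited way to write this relationship down is the parametric loss form fitted in DeepMind's 2022 "Chinchilla" study. It is shown here only to illustrate the idea. The article does not say which formulation any company relies on, and none of the symbols come from the original text: L is the model's loss (lower is better), N is the number of parameters, D is the number of training tokens, and E, A, B, α and β are constants fitted to experiments.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because N and D each appear in a denominator, the loss keeps falling only if the training data grows along with the model, which is why a ceiling on available data matters.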

The latest financial report shows that Innodata's largest customer has awarded the company an additional contract worth about $24 million, bringing the total annualized operating revenue from this customer to about $135 million.

In addition to this largest customer, the revenue from Innodata's other seven large technology company customers increased by 159% quarter-on-quarter in the fourth quarter.

Judging from its recent performance, Innodata's revenue growth has accelerated significantly. From the first to the fourth quarter of 2024, the year-on-year growth rates of the company's revenue were 40.7%, 65.6%, 135.6%, and 126.6% respectively. Moreover, Innodata expects that its revenue growth will exceed 40% in 2025.

However, now that the large model industry has passed its expansion phase, a contradiction in the data annotation industry has begun to surface: data that is close to being exhausted can hardly support the training needs created by model iteration and by the deployment of large models.

Epoch AI estimates that the amount of data used to train large language models has grown 100-fold since 2020, and that AI training datasets double in size every year. The available content on the Internet, however, grows by less than 10% a year. On that trajectory, AI training data is likely to be exhausted by 2028.

In fact, development bottlenecks caused by data shortages are common across the industry. In November 2024, The Information reported that improvement in OpenAI's next-generation flagship model, Orion, had slowed significantly, and that one of the main reasons was a shortage of high-quality training data.
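
The gap between these two growth rates can be made concrete with a little compounding arithmetic. The sketch below is illustrative only. It takes the two rates quoted above (demand doubling each year, supply growing 10% a year), but the function name and the starting ratio of available data to training demand are hypothetical. They are not Epoch AI figures, and this is not Epoch AI's methodology.

```python
import math


def years_until_demand_exceeds_stock(stock_to_demand_ratio,
                                     demand_growth=2.0,
                                     supply_growth=1.10):
    """Years until yearly training demand catches up with the data stock.

    Model (an assumption made for illustration):
        demand(t) = d0 * demand_growth ** t
        stock(t)  = s0 * supply_growth ** t
    Setting demand(t) = stock(t) gives
        t = ln(s0 / d0) / ln(demand_growth / supply_growth)
    """
    if demand_growth <= supply_growth:
        raise ValueError("demand must grow faster than supply to catch up")
    if stock_to_demand_ratio <= 1:
        return 0.0  # demand already matches or exceeds the stock
    return math.log(stock_to_demand_ratio) / math.log(demand_growth / supply_growth)


if __name__ == "__main__":
    # Hypothetical starting ratios: how many times larger the stock is today.
    for ratio in (10, 100, 1000):
        years = years_until_demand_exceeds_stock(ratio)
        print(f"stock = {ratio}x demand -> caught up in about {years:.1f} years")
```

Because demand compounds so much faster than supply, even a very large head start buys only a few extra years, which is the intuition behind an exhaustion date that sits close to the present.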

There is an industry consensus that currently, the supply of general data is approaching saturation, and vertical data will be the key to the differentiation of future AI models.

02 Will DeepSeek Eliminate Data Annotation?

As the only listed AI data annotation company in the US stock market, Innodata has faced widespread questions about how much AI its business really contains.

As early as 2019, Innodata said it had begun implementing artificial intelligence and machine learning processes, and it classified itself as an artificial intelligence company. In February 2024, however, a report from Wolfpack Research alleged that Innodata was using AI to hype its stock price, and that its core business still relied on cheap overseas labor for basic data annotation rather than on self-developed AI technology.

The report quoted a former employee as saying that the service the company provided to Silicon Valley customers was essentially "keyboard labor."

"Innodata's business model is based on data annotation through labor outsourcing, making money through hard work. The only difference from its peers is that it has been in the business for the longest time and has the largest scale," an investor commented. "Technology can only make data annotation faster. To make data annotation better, it still depends on humans for now."

According to a report by Zhiyan Consulting, although some data annotation companies have developed semi-automated tools, the ratio of machine annotation to manual annotation is still about 3:7.

Innodata's financial data indirectly confirms this. In the second quarter of 2024 alone, Innodata spent $3.6 million on recruitment agency fees, a sign that the company still relies heavily on human labor.

Industry insiders told 36Kr that this is mainly due to the complexity and diversity of data annotation and the different data annotation requirements in different fields. In addition, automated annotation technology still has certain limitations at this stage, such as low recognition accuracy for certain types of data and limited processing capabilities for complex scenarios.

However, DeepSeek has rewritten the logic of data demand to a certain extent.

From a technical perspective, put simply, the reinforcement learning (RL) approach adopted by DeepSeek means that a large model no longer needs to be continuously fed new data from outside and can train itself using only the data it already contains.

On the one hand, this reduces the volume of data that large model developers need. On the other hand, An Guangyong, an expert from the Credit Management Committee of the All-China Federation of Mergers and Acquisitions, believes that companies may turn to low-cost synthetic data to cut costs. This, too, will affect data annotation companies such as Innodata to some extent.

Asked about DeepSeek's impact on the earnings call, Innodata's management said they believe pre-training data and fine-tuning data are irreplaceable for the development of AGI.

In their view, DeepSeek's reliance on existing model data to train new models will greatly compress the data, ultimately leading to model collapse.

Judging from the market's doubts, the uncertainty over Innodata's sustainable growth comes from two questions. One is whether demand for data annotation will continue to grow, and the other is whether annotation work will remain largely unautomated.

On the first question, Zhou Di, a national science and technology expert from the Ministry of Science and Technology, told 36Kr that synthetic data has a limited scope: it is better suited to generating new data for model training, while manual annotation is better suited to the in-depth understanding and interpretation of existing data.

Although synthetic data can provide more consistent and controllable data, in fields such as sentiment analysis and text generation that require in-depth semantic understanding, manually annotated data is still irreplaceable.

Another investor argued that as DeepSeek significantly lowers the cost of deploying and running models, more and more application-layer companies will deploy their own large models, which will create additional demand for data annotation. The emergence of DeepSeek, therefore, is at least not a negative for Innodata.

On the second question, however, the issue has become a "chicken or egg" paradox. While investors question how little AI there is in Innodata's business, a very likely future is that the automation of data annotation by AI will first eliminate the data annotation companies themselves.