
Two tech giants have invested in a 3D generative large-model startup | WAVES

Muqiu · 2025-01-29 11:56
Meituan and ByteDance invested at the same time.

Author | Shi Jiaxiang

Editor | Liu Jing

In October 2023, when a financing round that had consumed half a year and nearly all of his energy fell through, Wu Di, founder of Yingmou Technology, was stunned.

With no time to dwell on it, Yingmou Technology carried out the first large-scale staff restructuring since its founding. Wu Di had originally hoped to quickly raise a small amount of money to keep the company running, but the environment was as bad as it could be. The failed financing only strengthened their resolve to expand 3D asset generation to every category.

At the time, some teams on the market had already launched 3D generation products built on the 2D lifting path, the mainstream approach in academia.

But they saw the bottleneck of the 2D lifting path: a 2D image records only one side of a real object, and even infinitely many multi-view images cannot fully describe 3D content.

The only way out was to use native 3D data from the start. It was almost an all-in bet: even the artists originally assigned to film projects were reassigned to model labeling. The 3D generation engine Rodin, built on CLAY, launched in June last year. CLAY is a native-3D Diffusion Transformer generative model developed jointly by Yingmou and ShanghaiTech University; the research earned them a Best Paper Honorable Mention at SIGGRAPH 2024.

45 days later, Rodin reached $1 million in ARR. Wu Di says this is the main reason the big companies later took notice of them.

Waves has learned that Yingmou Technology has completed a new Series A round worth tens of millions of US dollars, led by Meituan Longzhu and ByteDance, with existing shareholders Sequoia China Seed Fund and MiraclePlus following on.

Yingmou has long carried the label of a "student startup"; even some core members are still pursuing master's or doctoral degrees in the lab. But four years on, CTO Zhang Qixuan says the "little geniuses" have gradually learned to put commercialization and product usability first.

Wu Di still remembers that when he first arrived at ShanghaiTech University, the school was still a construction site. He didn't even know whether this construction site would really become the modern campus in the renderings. But fresh from the college entrance exam, he didn't care. Compared with the conventional path of studying, going abroad for further education, and returning to join a big company, this nearly blank slate was far more attractive to him.

"WAVES" is a new column from Waves. Here we present the stories and spirit of a new generation of entrepreneurs and investors.

The following is Wu Di's and CTO Zhang Qixuan's review of their entrepreneurial journey, along with their views on the future of the 3D track, as edited by Waves:

On Entrepreneurship: A Choice

1. Yingmou was born out of a hard problem in the lab: how to put people and objects into the virtual world. To that end, we built our first face-scanning system in 2020, which captures a human face under different environmental lighting in order to synthesize how the face would appear under new lighting.

2. But this technology repeatedly ran into trouble in practice. We joined the face-replacement project for "The Wandering Earth 2", but the collaboration ultimately fell through. The reason: the first-generation dome light field focuses on lighting, stitching together how a person appears under different lights, while the camera viewpoint is fixed and the model cannot move. In the end it could only be used for specific shots, such as completely static ones. The light field could only capture geometric information and could not identify materials, and it was powerless against dynamic details such as facial wrinkles.

3. That was when I realized there is a huge gap between academic research and what industry needs. What industry needs are 3D models with clean wireframes and regular UVs that can be rendered, can change expressions, and can be driven in real time in a game. While waiting for the next generation of dome light fields, we wanted to attempt something based on generative network technology.

4. Yingmou built two products at the time, one of them called Wand. The app was very simple: users sketch on a canvas, and Wand generates a photorealistic avatar. Development took only two weeks. The first generation of realistic avatars made no waves, so we switched the output from real people to anime-style characters. Wand then topped the App Store's Graphics & Design chart, with more than 1.6 million registered users "drawing anime waifus" on Wand.

5. But Wand was just a simple tool, and users did not stick around. We couldn't come up with a good monetization model and couldn't balance user growth against computing costs. The options were either to go deeper technically and build out more features, or to turn it into an anime community. But we didn't believe in 2D technology, and our all-engineer team of eight had no one skilled at community operations. In the end we admitted we couldn't handle that traffic and cut the entire 2D business line.

6. Looking back, Wand fulfilled its historical mission: it earned us our first revenue, even if only 6,000, and more importantly it helped us close our angel round. We still believe the next generation of display devices and interaction will operate in three dimensions.

On Choosing a Direction and the Future of 3D Generation: Wavering and Determination

7. After that financing, the metaverse boom was in full swing, and we rode the wave of digital humans and the metaverse to a second round. Our thinking then was that existing digital humans would eventually evolve into ID-style avatars, standard equipment for anyone entering the virtual world. So at the end of 2022 we launched DreamFace and ChatAvatar, a 3D character generator built on that framework, which could already produce at least supporting-character-level models with skeletal rigging.

8. But we entered at the tail end of the metaverse wave; commercialization stalled and it was hard to move forward. That year I graduated and moved the office out of the ShanghaiTech University lab, only to be hit by the pandemic lockdown, and half a year's rent went to waste.

9. By 2023 I had been negotiating a new round for six months, only for the lead investor to pull out overnight. I was stunned. I had originally wanted to raise just one or two million US dollars to survive, but the environment was simply as bad as it could be. I had finance show me the account balance twice a week, keeping a close eye on cash flow and barely breaking even. That was when I realized that until we hit a new milestone, Yingmou would not be able to raise money.

10. We had already put broad generative 3D on the agenda, but at the same time we faced a key technical choice. 3D generation routes fall roughly into two camps: 2D lifting and native 3D. The former trains on massive amounts of 2D image data, but because that data carries no true 3D information, models built this way always suffer from the "multi-face" problem. A product built on this path might quickly win a round of financing, but it would have an unbridgeable gap to being "production-ready". And we were not sure the native 3D path was even achievable.

11. In the end we agreed unanimously: to compete in the 3D industry, we could only train on native 3D data. The usual objection to this approach is the lack of high-quality data. But in fact, the bottleneck of 3D generation is not the volume of model data, but finding the right 3D representation and parameter scale. The key is to minimize information loss from the dataset all the way to the final output.

12. Rodin launched in June last year, the last to ship among its batch of 3D generation startups. But I believe its generation quality and usability at the time were a generation ahead of comparable products. Rodin Gen-1.5, released on the last day of 2024, filled the gap in 3D generation's ability to produce sharp edges, giving it a clear advantage for CAD-style industrial models and hard-surface models.

[Image: 3D model]

13. But even so, AI-generated models are still far from directly usable. What distinguishes 3D from video, images, and other content forms is that 3D is industrial-grade content, not consumer-grade, which means there are hard industry standards. Until problems such as topology, geometric accuracy, materials, and UV unwrapping are solved, a big gap remains between AI-generated 3D and assets directly usable in games and films.

14. Moreover, giving ordinary users creative ability in the 3D world does not mean the consumer era of 3D has arrived. More preconditions are needed, such as Vision Pro or Quest 3 becoming as ubiquitous as the iPhone. The earlier metaverse craze was largely B-side players entertaining themselves. And in terms of improving efficiency in the game industry, what 3D generation can do today is far less than Midjourney. Back in the lab, we thought technology equals product equals company; in fact, technology equals neither product nor company.

15. Rodin cannot yet generate industrial-grade 3D assets for games or films. Perhaps 3D generation will someday appear as core gameplay in games and in film and television, but for now the opportunity for native 3D technology lies in the existing market.

16. So this time, Yingmou is targeting "game outsourcing" for commercialization. In the game modeling pipeline, from concept art to finished model, there is a string of "discarded drafts" that may be reworked several times. Now, once the three-view concept drawings are done, Rodin can generate a first modeling draft for the modeler to adjust in detail, cutting costs at the initial mid-poly or preview stage, or it can be applied to unimportant peripheral assets.

17. When I first arrived at ShanghaiTech University, the school was a construction site and the lab was brand new. We practically witnessed the entire process of ShanghaiTech rising from bare ground to finished buildings. In a sense, the emergence of ShanghaiTech from nothing is, as our supervisor put it, also a "great entrepreneurship". And Yingmou Technology's four years are a footnote to that "entrepreneurship".