
This is Jinqiu Fund, an aggressive investor in AI, and the twenty-five key insights it brought back from Silicon Valley.

Yu Lili, 2025-01-26 14:16
Some AI trend judgments from Silicon Valley.

In early 2025, Jinqiu Fund, one of the most active AI investment institutions in China, organized a Scale With AI event in Silicon Valley.

Over five days and four nights, key AI companies such as OpenAI, xAI, Anthropic, Google, Meta, Perplexity, Luma, and Runway, along with many Silicon Valley investment institutions, gave intensive briefings on the latest AI progress in Silicon Valley and shared many of their predictions and judgments on 2025 trends.

As the organizer of this event, Jinqiu Fund has not only invested in several active North American AI funds to connect with the global AI market, but also launched the Soil Seed Program, which backs early-stage AI entrepreneurs with aggressive, fast, and flexible decision-making. Over 2024, Jinqiu invested in many projects, including the AI talent marketing platform Aha Lab and the AI content platform Dream Dimension.

The following are twenty-five key insights about AI progress distilled from this Silicon Valley trip, divided into four parts: large models, video models, AI applications, and embodied intelligence.

About Large Models: Has the Scaling Law Hit a Bottleneck, and the Source of Silicon Valley Innovation

1. For LLMs, the era of pre-training is largely over, but post-training still holds many opportunities. The pullback from pre-training is driven less by diminishing returns than by limited resources: with a fixed budget, the marginal benefit of post-training is higher.

2. Pre-training comes first; RL in post-training follows. The model needs basic capabilities before RL can be applied in a targeted way. RL does not change the model's intelligence so much as its mode of thinking. Pre-training is imitation and can only achieve imitation; RL is creation and can do genuinely new things.

3. Some predictions that may become consensus next year: the model architecture may change, and the gap between closed source and open source will narrow significantly. On synthetic data: having a large model generate data and then training a small model on it is feasible; the reverse is harder. The difference between synthetic and real data is mainly one of quality. Stitching together and synthesizing varied data also works well, and is usable in the pre-training stage, where data-quality requirements are lower; each company stitches its data differently. Training a large model with a small model may become viable in the near future. Essentially, it all comes down to data sources.
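The large-model-to-small-model pipeline described above typically starts with a distillation dataset: the big model answers a pool of prompts, and the pairs are written out in an SFT-ready format. A minimal sketch, in which `teacher_generate` is a hypothetical stub standing in for a real large-model API call:

```python
import io
import json

def teacher_generate(prompt: str) -> str:
    # Hypothetical stub for a large-model API call; in practice this
    # would be the big model producing an answer to the prompt.
    return f"Answer to: {prompt}"

def build_synthetic_dataset(prompts, out) -> int:
    # Write (prompt, completion) pairs as JSONL, the format most
    # small-model SFT pipelines accept. Returns the record count.
    n = 0
    for p in prompts:
        record = {"prompt": p, "completion": teacher_generate(p)}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        n += 1
    return n

buf = io.StringIO()
count = build_synthetic_dataset(["What is RL?", "Define SFT."], buf)
```

In a real pipeline the quality issue the article mentions shows up here: the teacher's outputs are usually filtered or scored before the small model ever sees them.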

4. For building a post-training team, five people are enough in theory (not necessarily full-time): for example, one builds the pipeline (infrastructure), one manages the data (data effectiveness), one owns SFT of the model itself, one owns the product-side evaluation of model behavior and collects user data, and so on.

5. One important factor behind Silicon Valley's innovation is that its companies readily form flat organizational cultures. At OpenAI, for example, there is no single designated decision-maker; everyone is autonomous, and cooperation across teams is free-flowing. Even the long-established Google is quietly thinning out its middle layer, moving many former managers back to the front line.

About Video Models: The Bottleneck of the Scaling Law Is Still Far Away

6. Video generation is still at the GPT-1/GPT-2 stage. Current video quality is close to SD 1.4, and an open-source video model with comparable commercial-grade performance will eventually appear. The current bottleneck is the dataset: because of copyright and other issues, there is no public video dataset at that scale. How each company acquires, processes, and cleans data creates large differences in model capability, and determines how hard an open-source version will be.

7. One of the harder problems in the DiT approach is improving adherence to physical laws rather than mere statistical likelihood. Generation efficiency is the sticking point: inference still takes a long time even on high-end GPUs, which blocks commercialization and is a direction academia is exploring. As with LLMs, model iteration is slowing but applications are not. From a product perspective, pure text-to-video is not a good direction; editing- and creativity-oriented products will keep emerging, so there is no short-term bottleneck.

8. The DiT technical route will take one to two years to saturate, and many parts of it can still be optimized. A more efficient model architecture matters most. Take LLMs as an example: at first everyone made models bigger, but after adding MoE and optimizing the data distribution, it turned out such large models were unnecessary. More research investment is needed, because naively scaling up DiT is very inefficient. If YouTube and TikTok are counted, the volume of video data is enormous, and it is impossible to use all of it for training.

9. A scaling law exists for video to some extent, but it is far from the LLM level. The largest model today is only about 30B parameters; scaling has been shown to work up to 30B, but there are no success cases at the 300B scale. In current practice the differences lie mainly in the data, including the data mix; other aspects differ little.

10. When Sora first appeared, everyone assumed the field would converge on DiT, but many technical paths are still being pursued: GAN-based approaches; real-time autoregressive generation, such as the recently popular Oasis project; and combinations of CG and CV for better consistency and control. Each company chooses differently, and picking different technical stacks for different scenarios will become a trend.

11. Techniques for speeding up long-video generation reveal where the limits of DiT capability lie. The larger the model and the better the data, the higher the resolution, the longer the duration, and the higher the generation success rate. There is no answer yet to how far DiT can scale; if a bottleneck appears at some size, a new model architecture may emerge. Algorithmically, DiT needs new inference algorithms to support speed; the harder part is incorporating these during training.

12. There is actually plenty of training data for the video modality; efficiently selecting high-quality data matters more. How much is usable depends on one's reading of copyright. Compute is also a bottleneck: even with all that data, there may not be the compute to process it, especially high-definition footage, so teams sometimes work backward from the compute at hand to the high-quality dataset they need. High-quality data has always been scarce, and even where data exists, a big problem is that no one knows what a correct image description looks like or which keywords it should contain.

13. Realism in video generation depends mainly on base-model capability; aesthetic improvement happens mainly in post-training. Conch, for example, uses large amounts of film and television data. The visual modality may not be the best path to AGI: text is a shortcut to intelligence, and the efficiency gap between video and text is hundreds of times.

14. Multimodal models are still very early. Predicting the next five seconds of video from the previous one second is already hard; adding text makes it harder still. In theory, jointly training video and text is best, but doing so end to end is very difficult. Multimodality does not currently improve intelligence, though it may in the future.

About AI Applications: The Trends in Silicon Valley Are Different from Those in China

15. Silicon Valley VCs tend to believe 2025 is a great window for application investing. One of their criteria for AI products is to focus on a single direction that competitors find it hard to replicate, plus some network effect: an insight that is hard to replicate, a technical edge that is hard to replicate, or access to capital that others cannot obtain. Otherwise it is hard to call it a startup; it is more like a business. Moreover, the United States has essentially no universal killer apps; people are used to different apps for different scenarios, so the key is to make the user experience as frictionless as possible.

16. Silicon Valley VCs see AI product companies as a new species, very different from earlier SaaS: once they find product-market fit (PMF), revenue grows extremely fast. The real value creation happens at the seed stage, before the hype. Large-model companies focus on pre-training, while application companies focus more on inference. Each industry has entrenched ways of framing and solving problems; the newly emerging AI Agent adds a cognitive architecture on top of the LLM.

17. A minority view among VCs is that Chinese entrepreneurs are investable under certain conditions: the new generation of Chinese founders is energetic and capable of building very good business models, but the premise is being based in the United States. China and Chinese entrepreneurs are making many new attempts that international investors do not yet understand, which also makes them undervalued.

18. Silicon Valley VCs are each trying to establish their own investment strategy. Soma Capital's strategy is to connect with the best people, let them introduce their friends, and build lifelong friendships, inspiring, supporting, and connecting them along the way; it maintains a panoramic map of market segments and projects for data-driven investing, and invests from seed through Series C to observe samples of success and failure. Leonis Capital is a research-driven fund, mainly writing first checks. OldFriendship Capital's approach is "work first, invest later": it works with founders before investing, runs customer interviews with structured interview guides, and helps debug the product, much like consulting; with Chinese projects, working together reveals whether the founder can operate with US customers.

19. Storm Venture likes "unlocking growth" and prefers Series A companies with PMF, typically earning $1-2M in revenue, and then judges whether unlocking growth can carry them to $20M. Inference Venture believes barriers should be built on interpersonal relationships and domain knowledge.

20. Leonis Capital, founded by OpenAI researchers, has several AI predictions for 2025: an AI programming application will break out; model providers will begin controlling costs, so entrepreneurs need to pick a model/agent niche and create unique supply; data centers will strain the electricity supply, possibly prompting new architectures; new frameworks will appear and models will get smaller; and multi-agent systems will become more mainstream.

21. A possible model-training playbook for AI coding companies: start with a stronger model provider's API to get the best results, even at higher cost. As customer usage data accumulates, continuously train your own small models for narrow scenarios, gradually replacing the API in those scenarios to achieve comparable results at lower cost.
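This playbook amounts to a cost-aware router: serve a request from the in-house small model when it is confident, and fall back to the expensive API otherwise. A minimal sketch, with stub functions (`small_model`, `big_model_api`) and a confidence lookup that are purely hypothetical stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    source: str  # "small" (in-house model) or "api" (frontier model)

def small_model(prompt: str):
    # Hypothetical in-house small model: returns (text, confidence).
    # A lookup stands in for the narrow scenarios it was fine-tuned on.
    known = {"fix the typo": ("typo fixed", 0.95)}
    return known.get(prompt, ("", 0.10))

def big_model_api(prompt: str) -> str:
    # Stub for the expensive frontier-model API call.
    return f"[frontier] {prompt}"

def route(prompt: str, threshold: float = 0.8) -> Completion:
    # Serve from the cheap small model when it is confident enough,
    # otherwise fall back to the expensive API.
    text, confidence = small_model(prompt)
    if confidence >= threshold:
        return Completion(text, "small")
    return Completion(big_model_api(prompt), "api")
```

The flywheel in the article is the feedback edge: API responses for prompts the small model cannot yet handle become training data, so the fallback share shrinks over time.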

22. An important trend in AI coding is inference-time reasoning enhancement, along the lines of o1 or o3. This approach can significantly improve the overall effectiveness of a code agent: although it currently costs 10 to 100 times more, it can cut the error rate in half or even to a quarter. As language models improve, these costs are expected to fall quickly, which may make this a standard technical route.

About Embodied Intelligence: Robots with Fully Human-Level Generalization May Not Arrive in Our Generation

23. Some in Silicon Valley believe embodied robotics has not yet had its ChatGPT moment. A core reason is that robots must complete tasks in the physical world, not merely generate text in a virtual one. A breakthrough in robot intelligence requires solving the core problem of embodied intelligence: completing tasks in dynamic, complex physical environments. A ChatGPT moment for robots requires several conditions: universality, adapting to different tasks and environments; reliability, achieving a high success rate in the real world; and scalability, continuously iterating and improving through data and tasks.

24. Closing the data loop for robots is difficult because the field lacks a landmark dataset like ImageNet, so research lacks a unified evaluation standard. Data collection is also expensive, especially real-world interaction data: gathering multimodal data such as touch, vision, and dynamics requires complex hardware and environment support. Simulators are considered an important tool for closing the data loop, but the sim-to-real gap between simulation and the real world remains significant.

25. Embodied intelligence faces a conflict between general and task-specific models. General models need strong generalization to adapt to diverse tasks and environments, which usually demands vast data and compute. Task-specific models are easier to commercialize, but their capabilities are limited and hard to extend to other fields. Future robot intelligence must find a balance between generality and specificity, for example through modular design: a general model as the base, with fine-tuning on specific tasks for rapid adaptation.