The Double Game of Silicon Valley AI That Cannot Be Ignored

Openly repairing the plank roads while secretly marching through Chencang.

The public statements of Silicon Valley elites in the field of Artificial Intelligence (AI) diverge systematically from their actual assessments and resource allocations. There is a strategic consensus among them to suppress Chinese AI: On the surface, they invoke a threat narrative to expand policy measures such as export controls, distillation restrictions, and cloud rental restrictions. In reality, they exploit technological gaps to secure a genuine generational lead.

It has been almost a year and a half since DeepSeek emerged at the beginning of last year. The Chinese public has been thoroughly taken in by surface appearances, and no one wants to step out of their cognitive comfort zone.

01 A Large Language Model that Keeps the Finance Ministers of Japan and Canada Awake at Night

Senior officials from the financial systems of Japan, Canada, and the UK have recently issued warnings after experiencing the astonishing capabilities of a new American model. This model is called Mythos and is currently the flagship model of Anthropic. Its training parameter size is about 100 billion, and the cost of a single training run is about $10 billion.

Sergey Brin, co-founder of Google and the driving force behind the revival of Gemini, said after trying the Mythos model: "This is a Large Language Model at the AGI level." This is the first time that a person of the stature of a Silicon Valley founding father has admitted that a particular model has reached the AGI level.

In China, however, the model is unknown to most users. Even among AI experts who may know the name, the actual strength of the model is unclear.

Mythos is not publicly accessible. That is, the model is not offered via a public API, is not listed on LMArena, and is not compared with other models on public rankings. Instead, access is granted in a controlled manner through a program called "Project Glasswing".

Glasswing butterflies are native to South America. Their wings look thin and transparent, yet they can withstand a load of 40 times the butterfly's body weight. Anthropic's concept of a secure AI protection system is likewise not a massive "Iron Wall" but a long-term system with high flexibility.

Image caption: The glasswing butterfly, found in the Americas, is named for its transparent wings.

Twelve founding partners have been granted access, including AWS, Apple, Google, Microsoft, NVIDIA, CrowdStrike, the Linux Foundation, and other important infrastructure companies. In addition, about 40 other organizations that maintain "critical software infrastructure" have restricted access (the UK AI Security Institute has tested Mythos). However, the national financial supervision and security agencies of allies such as Canada, Japan, and the EU are generally not on the list. This means that they cannot independently assess the capabilities of Mythos or check its potential impact on their own critical national systems.

This is why the finance ministers of Japan and Canada had to jointly discuss defense concepts at the IMF conference. Since Mythos appeared in April 2026, more security vulnerabilities have been fixed in Mozilla Firefox in a single month than in the whole of 2025, which is 20 times the monthly average. If someone were to use the model for attacks, it could systematically detect the weaknesses of another country's critical infrastructure within a few hours.

Interestingly, the finance officials did not react with "I also want this model" but with "I must defend against it". This is a first in the history of AI. It is surely related to the estrangement between the US and its traditional allies, but the real cause is the sheer strength of the model, which alarms the rather conservative finance officials. Mythos has almost the same effect as a nuclear weapon: it is a strategic asset that is not to be proliferated.

This brings us back to the actual starting point of the question: If a country manages the "truly strongest AI model" and the "publicly evaluated AI model" separately, any "difference analysis" based on public rankings loses its meaning.

In the Chinese public, the "AI Index Report 2026" of Stanford HAI from April has been repeatedly referred to in recent months. It states that the performance gap between Chinese and American Large Language Models has shrunk to 2.7%. Many Chinese experts, investors, politicians, and even the general public have become very confident because of this.

Headlines such as "Historical turning point: No more gap between Chinese and American Large Language Models" and "Sensational change: China closes the gap to 2.7% and secures victory" have circulated widely.

02 Revealing the Truth Behind the Rankings

Mythos is the "unpublished" model of the US. But an evaluation at the end of April by CAISI, a center under the US National Institute of Standards and Technology (NIST), shows that the gap is also much larger than expected among the models that have already been published.

CAISI tested DeepSeek V4 Pro on an unpublished benchmark.

[Table: CAISI evaluation results for DeepSeek V4 Pro; not reproduced in this text.]

Table caption: CAISI belongs to an official US agency (the Department of Commerce) and cannot be bribed to manipulate scores.

The table shows that the gap between the latest published models of both sides is not 2.7% but more than 30 percentage points, and in the most sensitive dimension, cybersecurity, it even reaches 39 percentage points.

CAISI's conclusion is that the real capabilities of DeepSeek V4 Pro correspond to those of GPT-5 eight months ago, and that the gap is widening. In fact, in its technology blog about V4, DeepSeek compares itself only with GPT-5.4 and not with version 5.5, which was published at almost the same time.

Why don't CAISI's conclusions match the Stanford report? The reason lies in the different methods.

The fundamental problem with public benchmarks (such as MMLU, HumanEval, HLE, etc.) is that the longer they are publicly accessible, the more providers optimize specifically for them. Once the question types have been analyzed, model teams can raise their scores through data augmentation and reinforcement learning. After a certain period, the scores on public rankings no longer reflect real capabilities but the "test-taking abilities" of the providers. The gap that CAISI measured with an unpublished benchmark is the real gap.
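The argument above can be illustrated with a minimal sketch. It compares how a lead measured on a public benchmark changes when the same models are scored on a held-out benchmark. The model names (`model_a`, `model_b`) and every score below are invented purely for illustration; they are not CAISI or Stanford figures.

```python
# Hypothetical scores in percent. All numbers are made up for illustration only.
PUBLIC = {"model_a": 90.0, "model_b": 88.0}    # long-published benchmark
HELDOUT = {"model_a": 85.0, "model_b": 60.0}   # unpublished benchmark


def score_drop(public, heldout):
    """Percentage points each model loses when moving to the held-out benchmark."""
    return {name: public[name] - heldout[name] for name in public}


def lead(scores, leader, follower):
    """Lead of one model over another, in percentage points."""
    return scores[leader] - scores[follower]


if __name__ == "__main__":
    print("drop per model:", score_drop(PUBLIC, HELDOUT))
    print("public lead:", lead(PUBLIC, "model_a", "model_b"))
    print("held-out lead:", lead(HELDOUT, "model_a", "model_b"))
```

In this toy setup a small lead on the public benchmark turns into a much larger one on the held-out benchmark, because one model loses far more points once the questions are unfamiliar. A large per-model drop is the kind of signal that would suggest tuning toward the public test set.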

DeepSeek claims that its capabilities are close to those of Claude Opus 4.6. This is true on public benchmarks, but on CAISI's unpublished benchmark its capabilities are closer to those of GPT-5 at the time of Opus 4.4. In between lie two version numbers, eight months, and the iteration cycles of three of the strongest companies.

The test results show that GPT 5.5 is the strongest in all evaluations, followed by Claude Opus 4.6, and finally DeepSeek V4.

Of course, Mythos is not included in this evaluation. There is no public answer as to how large the real gap is when Mythos is considered, but it is almost certain that it will be even larger.

03 The Key: The Exponential Computing Power Gap between China and the US

Where is the root of the gap? The answer is computing power.

The data comes from public sources. The table was created by the author.

A rather cruel fact is that Meta's AI capital expenditures in 2026 are already almost equal to the total of all leading Chinese AI companies. DeepSeek V4 Pro has a total of 160 billion parameters, which is more than six times less than the 100 billion parameters of American models.

The "brutal" approach of xAI, in which "seven models are trained simultaneously", is only possible when computing power is extremely abundant. Essentially, computing power is used to buy certainty: whichever model works is the one that gets used. But even so, xAI has not won. Recently, Elon Musk merged it into SpaceX, and it has thus lost the possibility of operating independently.

In contrast to Silicon Valley companies, which can afford "brutal selection methods" in their algorithms, top Chinese model companies such as DeepSeek have to economize in theirs. It may be argued that this reflects the engineering skills of Chinese companies, but a more realistic explanation is that it is a necessity. After all, everyone understands that 80% of the capabilities can be achieved with a frugal approach, but the last 20%, the generational gap, can only be bridged by sufficient computing power.

It is true that the production of Chinese chips is increasing rapidly. But even if the production increases, one cannot be sure whether the same results as with NVIDIA chips can be achieved. This is another challenge that is greatly underestimated.

A study published in November 2025 by the University of Auckland, the Hong Kong Polytechnic University, Lingnan University, the Harbin Institute of Technology, and other institutions examined five enterprise-grade AI accelerators (NVIDIA H200, AMD MI300X, Intel Max 1100, Huawei Ascend 910B, and Apple Mac M4 Pro) in a large-scale test for the first time. More than 100,000 variants of 4,000 real PyTorch models were tested individually.

The results are shocking:

Operator support: The NVIDIA H200 supports 488 operators, while the Huawei Ascend 910B only supports 407, which is 17.5% less. What Huawei is missing is precisely what Large Language Models rely on the most: quantization inference, sparse operations, flash_attention, NLP embeddings, fused training operations, and advanced linear algebra.

Output matching rate: When the same model is run with the same data, the outputs of AMD and NVIDIA match in 99.8% of cases. For Huawei, the matching rate is 95%, which means a 5% probability of different results. For Mac, the matching rate is 86%.

Platform errors: NVIDIA has one error, AMD has four, and Huawei has 13. The frequency of memory crashes is ten times higher for Huawei than for other platforms, and the number of unsupported operators is eighty times higher than for NVIDIA.

A 5% output mismatch rate is unacceptable in high-reliability scenarios such as finance, medicine, and autonomous driving. In Large Language Model pre-training, this deviation can compound into systematic errors over thousands of iterations.
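To make the output matching rate discussed above concrete, here is a minimal sketch of how such a rate could be computed from paired outputs of a reference backend and a candidate backend. It is not the study's actual methodology: the function names, the tolerances, and the sample data are assumptions chosen for illustration, and the compounding estimate additionally assumes that mismatches are independent across runs.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """True if two flat output vectors agree element-wise within tolerance.

    The tolerances are illustrative assumptions, not values from the study.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def match_rate(reference_runs, candidate_runs):
    """Fraction of paired runs whose outputs match within tolerance."""
    if not reference_runs or len(reference_runs) != len(candidate_runs):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        outputs_match(r, c) for r, c in zip(reference_runs, candidate_runs)
    )
    return matches / len(reference_runs)


def prob_all_match(rate, n_runs):
    """Chance that n_runs all match, assuming independent mismatches."""
    return rate ** n_runs


if __name__ == "__main__":
    # Made-up outputs for four model variants on two hypothetical backends.
    reference = [[1.0, 2.0], [0.5, 0.25], [3.0, 4.0], [1.0, 1.0]]
    candidate = [[1.0, 2.0], [0.5, 0.25], [3.0, 4.5], [1.0, 1.0000001]]
    print("match rate:", match_rate(reference, candidate))
    print("P(100 runs all match at 95%):", prob_all_match(0.95, 100))
```

Under the independence assumption, a per-run matching rate of 95% leaves well under a 1% chance that 100 consecutive runs all agree, which is one way to read the concern about deviations accumulating over many iterations.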

So much for the academic study. Finally, there is a real-life example. Shortly after the Google I/O conference, David Holz, the founder of the AI video-generation company Midjourney in San Francisco, publicly complained that using Google's TPU chips had set his model back by a year and praised NVIDIA chips.

Midjourney was once the absolute leader in AI image generation, but now it seems rather mediocre and has been overtaken by many products. This case clearly shows that high-quality NVIDIA chips are currently essential for most...

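The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.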
