
Another large model has been released, claiming to be on par with Fable 5 and Mythos

Zhidongxi 2026-06-23 09:05
Sakana AI proposes a new approach to large model training.

Zhidongxi reported on June 22 that Japanese AI unicorn Sakana AI had that day released the Sakana Fugu series of orchestrator models, which comprises two models: Fugu Ultra and Fugu. On engineering, science, and reasoning benchmarks, Fugu Ultra approaches or surpasses top models such as Fable 5 and Mythos Preview.

Unlike traditional large language models, Sakana Fugu does not answer questions on its own. Instead, it calls on various models from around the world to complete tasks. Simply put, Sakana Fugu acts as a “commander-in-chief”, selecting the best model for each task according to the situation.

Fugu means pufferfish in Japanese. As the official animation shows, Sakana Fugu aims to combine multiple “small fish” into one “big pufferfish”, a prized delicacy.

Sakana AI is a Japanese AI unicorn founded in 2023; its co-founders include Llion Jones, the fifth author of the Transformer paper. The company previously used an “evolutionary” approach to combine small models into systems with capabilities comparable to large models. Now, in the Sakana Fugu technical report, it proposes a new idea for training models: let one model learn to schedule multiple models, organizing large models with different specialties into a “collective intelligence”.

Sakana AI stated in its blog that orchestrator models will surpass traditional large models and become the new frontier. In its view, AI progress over the past few years has relied on brute-force computing power and data, but complex real-world tasks require expertise far beyond the capabilities of any single model. Getting the best performance out of models requires collective intelligence: knowing which model to use when, when to delegate tasks, and how to combine models that excel in different fields.

Orchestration of this kind is not only a technological advance but also a product of geopolitics. Sakana AI drew a lesson from the recent export controls imposed on Anthropic models: relying on a single supplier can lead to a sudden loss of access. Fugu's underlying model pool, by contrast, is completely replaceable; if one supplier cuts off supply, the system can switch to another. Sakana AI calls this “the real-world blueprint for AI sovereignty”.
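The replaceable-pool idea can be illustrated with a simple failover loop. This is a minimal sketch, not Sakana AI's implementation: the vendor names, the `ProviderUnavailable` exception, and the `call_with_failover` helper are all hypothetical, and the workers are stand-in functions rather than real API clients.

```python
from typing import Callable, Dict, List


class ProviderUnavailable(Exception):
    """Hypothetical error raised when a supplier cuts off access."""


def call_with_failover(prompt: str,
                       providers: Dict[str, Callable[[str], str]],
                       order: List[str]) -> str:
    """Try each provider in order and return the first successful answer."""
    errors = []
    for name in order:
        try:
            return providers[name](prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in workers; a real pool would wrap vendor API clients.
def blocked_vendor(prompt: str) -> str:
    raise ProviderUnavailable("access revoked")


def backup_vendor(prompt: str) -> str:
    return f"backup answer to: {prompt}"


pool = {"vendor_a": blocked_vendor, "vendor_b": backup_vendor}
answer = call_with_failover("hello", pool, ["vendor_a", "vendor_b"])
print(answer)
```

Because the caller only depends on the shared function signature, swapping a supplier means changing an entry in `pool`, not the calling code.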

Sakana AI stated in its blog that Fugu itself is a language model specifically designed to understand when to delegate tasks, how agents communicate with each other, and how to integrate their work into a reliable answer. This technical route is based on the team's previous research on learning model orchestration, including the papers Trinity and Conductor published at ICLR 2026.

Technical report:

https://github.com/SakanaAI/fugu/blob/main/Fugu_technical_report.pdf

Try it:

https://sakana.ai/fugu

01.

Surpassing Mythos Preview and Fable 5

Scheduling the Strongest Models to Complete Tasks

The technical report lists the Fugu series' results on eight benchmarks covering four dimensions: programming, reasoning, science, and agent capabilities. According to the report, the Fugu series matches or approaches cutting-edge models in these evaluations.

The technical report shows that through intelligent scheduling alone, the Fugu model surpasses the capabilities of Mythos Preview and Fable 5 in three benchmarks.

In terms of cross-domain adaptability, on the Terminal Bench test, most of the model calls made by Fugu and Fugu Ultra go to GPT-5.5, which is a top performer on this test. On the GPQA Diamond test, where Gemini-3.1-Pro is the leading model, both Fugu models center their scheduling on Gemini.

Fugu earns its high scores in a completely different way from traditional models. Rather than training a stronger base model to solve problems, it decides which model to assign a problem to, how to break down the task, and how to verify and check the results. As a result, the quality of the combined answer exceeds that of multiple single models answering independently.

This is exactly the core positioning repeatedly emphasized in the technical report: The technical value of Fugu doesn't lie in replacing models like GPT, Claude, and Gemini, but in combining their capabilities. Among current large models, some are good at mathematical reasoning, some at code engineering, and some at security analysis. As different models develop their own specialties, orchestration ability itself is becoming an independent competitive advantage.

02.

Four Mechanisms Enable Fugu to Command the Model Legion

The report describes four basic mechanisms of Fugu:

First, identify the problem type. Fugu determines whether the user's problem concerns code, mathematics, reasoning, information retrieval, scientific analysis, or multimodal tasks. This step sets the starting point for the subsequent task-assignment logic.

Second, select a suitable worker model. Different models perform very differently on different tasks, and one of Fugu's training goals is to learn which model to call for which kind of problem. The report notes that even within a single type of task, such as competitive programming, different models may excel at direct implementation, at formulating problem-solving plans, or at combining multiple algorithmic ideas. Fugu needs to take these subtle differences into account in its decision-making.

Third, design the agent workflow. For complex problems, Fugu Ultra generates a complete agentic workflow, including task decomposition, sub-task assignment, context-sharing strategies, and final answer synthesis, all of which can be expressed in natural language within the model.

Fourth, optimize according to feedback. Fugu's training includes not only supervised fine-tuning but also evolutionary algorithms and reinforcement learning. It uses the results of real tasks to optimize the orchestration strategy in turn, which teaches it to assign the right tasks to the right models.
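The first, second, and fourth mechanisms can be sketched as a classify, select, and update loop. This is a toy illustration under stated assumptions, not the method in the report: the keyword rules, the `Router` class, the worker names, and the simple score update are all hypothetical stand-ins for what Fugu learns through training, and workflow design (the third mechanism) is left out.

```python
from collections import defaultdict

# Hypothetical keyword rules standing in for a learned task classifier.
KEYWORDS = {
    "code": ("function", "bug", "compile"),
    "math": ("prove", "integral", "equation"),
}


def classify(question: str) -> str:
    """Step 1: guess the problem type from keywords (default: 'general')."""
    text = question.lower()
    for task_type, words in KEYWORDS.items():
        if any(w in text for w in words):
            return task_type
    return "general"


class Router:
    """Steps 2 and 4: pick the best-scoring worker, then learn from feedback."""

    def __init__(self, workers, learning_rate=0.5):
        self.workers = list(workers)
        self.lr = learning_rate
        # scores[task_type][worker] starts at 0.0 for every pair.
        self.scores = defaultdict(lambda: {w: 0.0 for w in self.workers})

    def select(self, task_type: str) -> str:
        """Return the worker with the highest stored score for this task type."""
        table = self.scores[task_type]
        return max(self.workers, key=lambda w: table[w])

    def update(self, task_type: str, worker: str, reward: float) -> None:
        """Move the stored score toward the observed reward."""
        table = self.scores[task_type]
        table[worker] += self.lr * (reward - table[worker])


router = Router(["worker_a", "worker_b"])
task = classify("Fix the bug in this function")
router.update(task, "worker_a", reward=0.2)   # worker_a did poorly on code
router.update(task, "worker_b", reward=0.9)   # worker_b did well on code
print(task, router.select(task))
```

The sketch always picks the current best scorer; a practical system would also need some way to keep trying other workers so that early scores do not lock in a poor choice.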

The Sakana Fugu model comes in two versions, Fugu and Fugu Ultra. Fugu targets daily use, balancing performance against latency: while maintaining high quality, it tries to respond as quickly as possible. It therefore does not always run complex multi-agent collaborations. Instead, it uses a lightweight selection mechanism to quickly determine which worker model is better suited to the current task.

Fugu Ultra prioritizes quality. It uses a more complex orchestration method: breaking a task into multiple sub-tasks, assigning different agents to handle them, and then integrating the results. This method may take longer to respond but is better suited to high-difficulty problems such as complex coding tasks, mathematical reasoning, scientific problems, and multi-step planning.
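The decompose, assign, and integrate pattern can be sketched in a few lines. This is only an illustration of the general shape: the `run_workflow` helper, the worker names, and the fixed plan are hypothetical, whereas the report describes the workflow as being generated by the model itself.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]


def run_workflow(plan: List[Tuple[str, str]],
                 workers: Dict[str, Worker],
                 integrate: Callable[[List[str]], str]) -> str:
    """Run each (worker_name, subtask) pair in order, then merge the results."""
    results = [workers[worker_name](subtask) for worker_name, subtask in plan]
    return integrate(results)


# Stand-in workers that just label their input; real ones would call models.
workers = {
    "planner": lambda s: f"plan({s})",
    "coder": lambda s: f"code({s})",
}

# A hand-written plan; an orchestrator would produce this decomposition.
plan = [("planner", "outline solver"), ("coder", "implement solver")]

final = run_workflow(plan, workers, integrate=lambda parts: " + ".join(parts))
print(final)
```

Each extra sub-task adds at least one more model call, which is one way to see why this mode trades response time for quality.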

What the two have in common is complete modularity that is independent of the underlying models. Sakana Fugu does not need access to the weights of worker models, and the worker models do not even need to be open-source. When a new model is released, it can be added directly to the worker model pool, and users can customize the list of available models according to their cost, privacy, compliance, and other requirements.
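Customizing the worker pool can be pictured as filtering a list of model descriptions against user constraints. The sketch below is an assumption-laden example: the `WorkerSpec` fields, the model names, and the cost figures are invented for illustration and do not come from the report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerSpec:
    name: str
    cost_per_call: float   # hypothetical unit cost
    on_premises: bool      # True if data stays on the user's own infrastructure


def filter_pool(pool: List[WorkerSpec],
                max_cost: float,
                require_on_premises: bool = False) -> List[WorkerSpec]:
    """Keep only the workers that satisfy the cost and privacy constraints."""
    return [
        w for w in pool
        if w.cost_per_call <= max_cost
        and (w.on_premises or not require_on_premises)
    ]


pool = [
    WorkerSpec("model_a", 0.03, False),
    WorkerSpec("model_b", 0.01, True),
    WorkerSpec("model_c", 0.10, True),
]

# A newly released model is just another entry in the pool.
pool.append(WorkerSpec("model_d", 0.02, True))

allowed = filter_pool(pool, max_cost=0.05, require_on_premises=True)
print([w.name for w in allowed])
```

Nothing here touches model weights; the pool only needs a description of each worker and a way to call it.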

03.

Solving Rubik's Cubes, Playing Blind Chess, and Not Stumped by the Car-Washing Problem

The appendix of the Sakana Fugu technical report describes several experiments:

One is the “one-shot Rubik's Cube solver”. The model must write a Rubik's Cube solving program in a single attempt using the Python standard library, and the program is then tested on 300 scrambled cubes. The report states that both Fugu and Fugu Ultra solved all of the cubes, with Fugu Ultra using fewer moves on average and Fugu running faster.

Another is the “blind chess test”. The model plays chess based only on the move history, without seeing the board, a list of legal moves, or a FEN string. This experiment mainly tests whether the model can maintain an internal state over a long period. In several representative games shown in the report, Fugu defeated multiple baseline models as well as a strength-limited Stockfish.

There is also an “online stock trading” experiment. The model can see only past and current anonymized market data and cannot peek at future prices. Each week it must decide whether to buy, hold, or sell. The report states that Fugu Ultra achieved a higher average return across five runs.
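The “no peeking at future prices” constraint is easy to state in code: at each week the strategy receives only the prices up to that week. The following is a minimal sketch of such an evaluation loop, assuming a non-empty price list; the `backtest` helper, the toy `momentum` rule, and the price series are hypothetical and are not the setup used in the report.

```python
from typing import Callable, Sequence


def backtest(prices: Sequence[float],
             decide: Callable[[Sequence[float]], str],
             cash: float = 100.0) -> float:
    """Replay weekly prices; the strategy sees only prices up to the current week."""
    shares = 0.0
    for week in range(len(prices)):
        visible = prices[: week + 1]      # past and current prices only
        action = decide(visible)
        price = prices[week]
        if action == "buy" and cash > 0:
            shares, cash = cash / price, 0.0
        elif action == "sell" and shares > 0:
            cash, shares = shares * price, 0.0
        # Any other action, including "hold", changes nothing.
    # Value the final position at the last observed price.
    return cash + shares * prices[-1]


def momentum(visible: Sequence[float]) -> str:
    """Toy rule: buy after a weekly rise, sell after a fall or flat week."""
    if len(visible) < 2:
        return "hold"
    return "buy" if visible[-1] > visible[-2] else "sell"


final_value = backtest([10.0, 11.0, 12.0, 11.0, 13.0, 14.0], momentum)
print(round(final_value, 2))
```

Slicing with `prices[: week + 1]` is the whole safeguard: the decision function never receives an index beyond the current week.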

These experiments may not directly reflect the model's real-world capabilities, but they demonstrate what Fugu sets out to prove: orchestrator models can handle tasks that require long-running operation, strategy adjustment, and multi-step execution.

Some netizens tried Fugu Ultra on questions that have tripped up many models, such as how many “r”s are in “strawberry”, whether 5.11 is greater than 5.1, and the classic car-washing problem. Fugu Ultra answered all three correctly, and they exclaimed that it was as good as Fable.
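For reference, the first two questions have answers that can be checked mechanically. The snippet below is just a sanity check of the expected answers, treating 5.11 and 5.1 as decimal numbers; it says nothing about how any model arrives at them.

```python
from decimal import Decimal

# Count occurrences of the letter "r" in "strawberry".
r_count = "strawberry".count("r")

# Compare 5.11 and 5.1 as exact decimal numbers.
is_greater = Decimal("5.11") > Decimal("5.1")

print(r_count, is_greater)
```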

What's most notable in the Sakana Fugu technical report is that it proposes a new path for model research.

In the past, the question was often which model is the strongest. Sakana Fugu poses a new one: how to make multiple cutting-edge models collaborate more effectively.

This will bring several changes. First, model capabilities will become more modular: when a new model is released, it can be added directly to the worker pool and serve as an expert in a certain type of task. Second, users will have more control: enterprises and individuals can configure the model pool according to their needs for privacy, compliance, cost, latency, and supplier preferences. Third, AI competition may expand from “single-model capabilities” to “system-organization capabilities”: those who are better at scheduling models, using tools, designing workflows, and integrating feedback will have stronger capabilities.

Of course, the test results in the technical report come from the vendor itself, and the actual capabilities will depend on the experience of real developers. Multi-model orchestration also brings higher costs and higher latency, especially in deep-collaboration modes like Fugu Ultra. In addition, error attribution in multi-model systems is more complex: once the final answer is wrong, it is hard to tell whether the routing, the worker model, or the integration process caused the error.
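One common engineering response to the attribution problem is to record what happened at each stage so that a wrong answer can be traced back. The sketch below shows the idea with a hypothetical `Trace` structure; the stage names mirror the three failure points named above, and the `ok` flags are assumed to come from per-stage checks, which in practice are the hard part.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    stage: str    # "routing", "worker", or "integration"
    detail: str   # what was decided or produced at this stage
    ok: bool      # result of a per-stage check (assumed to exist)


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def record(self, stage: str, detail: str, ok: bool) -> None:
        """Append one stage's outcome to the trace."""
        self.steps.append(Step(stage, detail, ok))

    def first_failure(self) -> Optional[Step]:
        """Return the earliest stage whose check failed, if any."""
        return next((s for s in self.steps if not s.ok), None)


trace = Trace()
trace.record("routing", "sent math question to worker_b", ok=True)
trace.record("worker", "worker_b returned an unverified proof", ok=False)
trace.record("integration", "merged worker output into answer", ok=True)

failure = trace.first_failure()
print(failure.stage if failure else "no failure recorded")
```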

Furthermore, the orchestrator model itself may be biased. If it misjudges the task type or over-relies on a certain model, overall performance may suffer. So although the Sakana Fugu approach has great potential, putting it into practice still requires a great deal of engineering verification.

04.

Conclusion: A New Way to Enter the Large-Model Training Field

The release of the Sakana Fugu series suggests that the next stage of AI may be not only about larger and stronger single models but also about more collaborative model systems.

If the large-model competition so far has been about cultivating a “super intelligence”, then Sakana Fugu's direction is to train a “super commander”: a model that specifically learns how to divide labor, coordinate, verify, and integrate. In a large-model field currently dominated by a few top model makers, this training approach, which only schedules and does not execute, may be a new way to enter the large-model training field.

This article is from the WeChat official account