The Mythos architecture has been reverse-engineered and open-sourced by a 22-year-old developer. Its MoE and attention mechanisms draw inspiration from DeepSeek.
Heard that Mythos was too dangerous and got sealed away? Someone has already "rebuilt it and open-sourced it".
OpenMythos integrates public research with the current mainstream speculation about the Claude Mythos architecture.
OpenMythos implements a Recurrent-Depth Transformer (RDT) with a MoE routing mechanism, achieving iterative depth through weight sharing across experts and conditional computation.
Existing research suggests that this architecture can match the performance of traditional models with roughly half the parameters.
Stack loops, not parameters
The person who put these pieces together is Kye Gomez, a 22-year-old and the founder of the Swarms agent framework.
The RDT architecture he designed has three core points:
- Run the same set of weights up to 16 times in a loop.
- Take different expert paths each time.
- The entire inference process is completed in the latent space.
Combined, these three points make "thinking about a problem more times" more efficient than stacking parameters.
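The three points above can be condensed into a minimal sketch. This is an illustration of the recurrent-depth idea only, not code from the OpenMythos repo: the width, loop count, and names (`loop_block`, `W`) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D, NUM_LOOPS = 64, 16  # hidden width and loop count are illustrative

# One shared block: a single weight matrix reused on every pass,
# instead of 16 distinct layers with 16 distinct weight sets.
W = rng.normal(scale=0.02, size=(D, D))

def loop_block(h):
    # One recurrence step: refine the hidden state with the SAME weights.
    return h + np.tanh(h @ W)  # residual update keeps repeated application stable

h = rng.normal(size=(D,))  # hidden state for one token position
for _ in range(NUM_LOOPS):
    h = loop_block(h)      # 16 passes of computation, zero extra parameters
```

The parameter count is fixed by `W` alone; extra depth comes purely from iterating.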
In the past two years, the standard approach in the AI industry has been to stack hundreds of distinct Transformer layers, with each layer learning different things, resulting in an explosion in parameter count.
RDT doesn't need hundreds of layers. It only uses a few layers and runs up to 16 repeated loops, with each loop continuing the calculation based on the results of the previous round.
Running the same thing 16 times, isn't that a waste of computing power?
RDT's answer is that it won't be repetitive because different "experts" are activated in each loop.
The loop block uses a mixture-of-experts layer, and the MoE router activates a different subset of experts in each loop.
The design of the MoE draws inspiration from DeepSeekMoE: a large number of fine-grained routed experts plus a small number of always-on shared experts.
Gomez summarized this design in one sentence:
MoE provides the breadth of domain knowledge, and the loop provides the depth of inference.
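That division of labor can be sketched as follows. This is a toy illustration of the shared-plus-routed expert pattern, under assumed sizes and names (`moe_layer`, `W_gate`); the real routing in OpenMythos or DeepSeekMoE is more involved (load balancing, per-token batching, etc.).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32
N_ROUTED, N_SHARED, TOP_K = 8, 2, 2  # fine-grained routed experts + always-on shared ones

# Each "expert" is reduced to a single matrix here for brevity.
routed = [rng.normal(scale=0.02, size=(D, D)) for _ in range(N_ROUTED)]
shared = [rng.normal(scale=0.02, size=(D, D)) for _ in range(N_SHARED)]
W_gate = rng.normal(scale=0.02, size=(D, N_ROUTED))  # router weights

def moe_layer(h):
    logits = h @ W_gate
    top = np.argsort(logits)[-TOP_K:]      # router picks the top-k routed experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # softmax over the chosen experts
    out = sum(g * np.tanh(h @ routed[i]) for g, i in zip(gates, top))
    for W in shared:                       # shared experts fire on every call
        out = out + np.tanh(h @ W)
    return h + out, set(top.tolist())

h = rng.normal(size=(D,))
h1, experts1 = moe_layer(h)   # first loop: one subset of experts
h2, experts2 = moe_layer(h1)  # the updated hidden state can route differently
```

Because the router's input changes on every loop iteration, successive passes through the same block can draw on different experts, which is what keeps the 16 loops from being redundant.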
With both breadth and depth, a stability mechanism is needed to ensure that the loop doesn't go out of control.
A new paper from UCSD and Together AI, Parcae: Scaling Laws For Stable Looped Language Models, proposes LTI stable loop injection to keep each round of the loop from diverging.
In experiments, an RDT with 770M parameters caught up with a standard Transformer with 1.3B parameters.
The number of parameters is nearly halved, but the effect is the same.
The last piece of the puzzle is continuous latent space inference. All 16 rounds of inference are completed in the hidden state vector, without generating any intermediate tokens. The answer is only output at the end of the last loop.
This is completely different from Chain-of-Thought. CoT is "think one step, write one step, think another step, write another step", and all intermediate tokens are exposed for humans to read.
RDT is "say one sentence after thinking 16 times", and the inference process is completely internalized.
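The contrast can be made concrete with a small sketch, again under illustrative sizes and hypothetical names (`W_out` for the output head): the loop updates only the hidden vector, and tokens are produced exactly once, after the final pass.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB, NUM_LOOPS = 64, 100, 16
W = rng.normal(scale=0.02, size=(D, D))          # shared loop weights
W_out = rng.normal(scale=0.02, size=(D, VOCAB))  # output head (hypothetical name)

h = rng.normal(size=(D,))
for _ in range(NUM_LOOPS):
    h = h + np.tanh(h @ W)  # all intermediate "thought" stays in this vector

# Project to the vocabulary only once, after the last loop.
token = int(np.argmax(h @ W_out))
```

A CoT-style model would instead hit the output head inside the loop, emitting (and re-reading) tokens at every step; here no intermediate token ever exists.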
Kye also cited a paper from Ohio State University that ran two key experiments on the recurrent Transformer architecture.
First: Systematic generalization.
For knowledge combinations never seen during training, the recurrent Transformer can still answer correctly at inference time, while the standard Transformer fails outright.
This proves that the loop is not repetitive computation but real "deeper thinking".
Second: Depth extrapolation.
Only 20-hop inference chains were seen during training, but 30-hop chains were given at test time.
The recurrent Transformer responds by simply adding a few more loops at inference time, while the standard Transformer collapses outright.
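Depth extrapolation falls out of the loop structure almost trivially, as this toy sketch shows (the loop counts 20 and 30 mirror the experiment; the weights and widths are illustrative): the trained block is simply iterated more times, with no new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
W = rng.normal(scale=0.02, size=(D, D))  # weights are fixed after "training"

def run(h, num_loops):
    # The same trained block, iterated a caller-chosen number of times.
    for _ in range(num_loops):
        h = h + np.tanh(h @ W)
    return h

h0 = rng.normal(size=(D,))
h_train_depth = run(h0, 20)  # depth seen during training
h_test_depth = run(h0, 30)   # deeper chain at test time: just more iterations
```

A fixed stack of layers has no analogous knob: its depth is frozen into the parameter count at training time.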
These results indicate that current large models have memorized a large number of facts during pre-training, and the bottleneck lies in knowledge combination.
They can't string together known facts to answer novel questions. The loop seems to unlock this compositional ability for free.
If these conclusions hold, the mainstream of scaling will shift from "training larger models" to "letting existing models think more times during inference".
With these research results, whether Anthropic's Mythos really uses this architecture seems no longer important.
The speculation about the recurrent Transformer has attracted a lot of attention from the academic community.
More theoretical and experimental verifications are on the way.
Reference links:
[1]https://x.com/KyeGomezB/status/2045660378844024994
[2]https://arxiv.org/abs/2604.07822
[3]https://arxiv.org/abs/2604.12946
This article is from the WeChat official account "QbitAI", which focuses on cutting-edge technology. Republished by 36Kr with permission.