
The AI lab valued at 84 billion has made another big move: it is going to put a "tightening spell" on large models.

新智元2025-09-28 08:26
When training large models, how to manage weights and avoid numerical explosion and loss? A new study titled "Modular Manifolds" by Thinking Machines Lab proposes a new paradigm. It transforms the traditional "firefighting" numerical correction into "preventive" constrained optimization, offering a brand - new approach for better training of large models.

Just now, Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, released new results again!

This is their second research article, "Modular Manifolds", following "Defeating Nondeterminism in LLM Inference".

Blog address: https://thinkingmachines.ai/blog/modular-manifolds/

Training large neural networks is like walking a tightrope: we must carefully maintain their internal "health", preventing key tensors such as weights, activations, and gradients from growing too large or too small, so as to avoid problems such as numerical overflow.

One important idea is unified magnitude management for large models.

First, stabilize the foundation.

Use LayerNorm to normalize the activation vectors, pulling each layer's output back into an appropriate range. This is already common practice.

Normalizing gradient updates is also very common. For example, the Muon optimizer applies spectral normalization to its updates so that the magnitude of each update stays controlled.
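As a rough illustration (not the production Muon code, which uses a tuned quintic iteration), spectral normalization of an update can be approximated with a plain Newton-Schulz iteration that pushes every singular value of the update matrix toward 1:

```python
import numpy as np

def orthogonalize(update, steps=20):
    """Approximately replace an update matrix's singular values with 1.

    Uses a cubic Newton-Schulz iteration; illustrative only, not the
    tuned quintic iteration of the actual Muon optimizer.
    """
    # Divide by the Frobenius norm so all singular values lie in (0, 1].
    X = update / np.linalg.norm(update)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which
        # converges to 1 while leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
O = orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False))  # singular values close to 1
```

After this step, the update stretches every direction of its input by roughly the same amount, so the learning rate alone controls the update's effect.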

Going further, one can "control" the weights themselves directly.

Normalizing the weight matrix is a direction worth trying.

The article proposes a new perspective for rethinking optimization algorithms: constrain the weight tensors to lie on some submanifold, and design the optimization algorithm jointly with these manifold constraints.

This is like changing from "putting out a fire" to "prevention":

Keep the parameters in a healthy range from the very beginning, so that training becomes more stable, more efficient, and easier to interpret.

The Form of Manifold Optimizers

We know that a manifold is just a surface that looks flat locally.

If we zoom in enough, it looks like an ordinary plane.

The locally flat space near a point on the manifold is called the "tangent space".

As shown in Figure 1, a sphere in three dimensions (or a hypersphere in higher dimensions) is a manifold. The red region in the figure is its tangent plane at a particular point.

To keep the weights on the specified manifold, a simple method is to use an ordinary optimizer and project the weights back onto the manifold after each update step.
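For the unit-hypersphere example of Figure 1, this naive "step, then project" scheme is a single renormalization; a minimal sketch (the function name is mine, not from the post):

```python
import numpy as np

def sgd_step_then_project(w, grad, lr):
    """Plain SGD step followed by projection back onto the unit sphere."""
    w_new = w - lr * grad
    return w_new / np.linalg.norm(w_new)  # pull the point back to ||w|| = 1

w = np.array([1.0, 0.0, 0.0])
w = sgd_step_then_project(w, np.array([0.0, -1.0, 0.0]), lr=0.5)
print(np.linalg.norm(w))  # back on the sphere
```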

The problem is that if an optimization step strays far from the manifold and is then forcibly projected back, the nominal learning rate no longer corresponds to the actual displacement of the parameters along the manifold, which weakens our intuition about the relationship between step size and effect.

If we want to seriously design a training algorithm on the manifold, we must first figure out: how to measure the "distance" in the tangent space?

One solution is to optimize directly in the tangent space. In this way, each step is taken along the "surface" of the manifold, and the learning rate can better correspond to the "actual displacement".

A common choice is the Euclidean distance, but other distance measures can be used as well, as shown in Figure 2.

It is worth noting that the choice of distance measure directly affects the direction of the optimal optimization step.

In Figure 3, the pink arrow represents the original gradient, that is, the partial derivative of the loss function with respect to the weights.

In other words, the optimal constrained step generally differs from the raw gradient: we don't necessarily have to move strictly in the gradient's direction.

To express this process mathematically, we can treat finding the optimal update direction under a manifold constraint and a specific distance measure as a constrained optimization problem. Take the hypersphere with the Euclidean norm as an example.

Let g be the gradient, w the current point on the hypersphere, a the update direction, and η the learning rate. The problem to solve is:
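The formula itself is elided here; reconstructing it from the description that follows (the optimal a must lie in the tangent plane and on the circle of radius η), the problem reads, up to sign conventions:

```latex
\max_{a \in \mathbb{R}^{n}} \; g^{\top} a
\quad \text{subject to} \quad
w^{\top} a = 0, \qquad \lVert a \rVert_2 = \eta
```

The resulting a is then subtracted from the weights w.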

Returning to the visual language of Figures 1, 2, and 3, the formula means that the green arrow (the optimal a) must satisfy two conditions simultaneously:

One is that it must fall on the red tangent plane, and the other is that it must be on the yellow circle with a radius of η.

We can apply the method of Lagrange multipliers to solve this problem.

Among them, λ and μ are Lagrange multipliers.
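The Lagrangian itself is elided in the text; consistent with the two constraints just described (tangency, and step length η), it would take a form like:

```latex
\mathcal{L}(a, \lambda, \mu)
  \;=\; g^{\top} a
  \;+\; \lambda \, w^{\top} a
  \;+\; \mu \left( \lVert a \rVert_2^2 - \eta^2 \right)
```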

Differentiate this Lagrangian with respect to a and set the result to zero, then solve for λ and μ using the two constraint conditions to obtain the optimal update direction.

Simply put, the optimal update is: subtract from the gradient its radial component along w (that is, project the gradient onto the tangent space), normalize the result, and multiply by the learning rate.
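Written out, the recipe described in words corresponds to:

```latex
a \;=\; \eta \, \frac{g - (w^{\top} g)\, w}{\bigl\lVert g - (w^{\top} g)\, w \bigr\rVert_2}
```

This vector lies in the tangent plane and has length exactly η.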

The update direction obtained in this way is in the tangent space.

In Figure 4, this small correction back onto the manifold is called the "retraction map".

The complete manifold optimization algorithm is as follows:

In summary, the first-order manifold optimizer consists of three steps:

1. Find the unit-length tangent vector that goes as far as possible in the gradient direction;

2. Multiply this direction by the learning rate and subtract the result from the current weights;

3. Project the updated weights back onto the manifold via the retraction map.
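For the hypersphere case, the three steps above can be sketched as follows (a toy illustration under my own naming, not code from the post):

```python
import numpy as np

def manifold_sgd_step(w, grad, lr):
    """One first-order manifold-optimizer step on the unit sphere."""
    # 1. Unit-length tangent direction most aligned with the gradient.
    tangent = grad - (w @ grad) * w          # project gradient onto tangent space
    direction = tangent / np.linalg.norm(tangent)
    # 2. Scale by the learning rate and subtract from the weights.
    w_new = w - lr * direction
    # 3. Retraction map: project the result back onto the sphere.
    return w_new / np.linalg.norm(w_new)

# Minimize f(w) = w . target on the sphere; the optimum is w = -target.
target = np.array([0.0, 0.0, 1.0])
w = np.array([1.0, 0.0, 0.0])
for _ in range(15):
    w = manifold_sgd_step(w, target, lr=0.1)
print(w)  # close to [0, 0, -1]
```

Note that because the tangent direction is normalized, each step moves the weights by the same arc length regardless of the gradient's magnitude, which is what restores the step-size intuition discussed earlier.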

When implementing this process, we need to decide what kind of manifold to choose as the constraint and how to define the measurement method of "length".

According to different choices of these two aspects, we can get different optimization algorithms. See the following table for details.

Manifold Muon

The typical weight matrix W in a Transformer is a "vector transformer": it maps an input vector x to an output vector y = Wx.

We want to design a manifold constraint and a distance function so that the matrix's action on input vectors is well behaved: the outputs should be neither too large nor too small, and a weight update should neither change the output drastically nor leave it almost untouched.

A good way to think about how a matrix acts on a vector is to use singular value decomposition (SVD), as shown in Figure 5.

SVD shows how the matrix stretches the input vector along different axes by decomposing the matrix.
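A quick numerical way to see this stretching view (a standard NumPy exercise, not from the post): the largest factor by which W can stretch any unit input is exactly its top singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Feeding in the top right singular vector realizes the maximum stretch.
x = Vt[0]                      # unit vector aligned with the top singular value
print(np.linalg.norm(W @ x))   # equals the largest singular value S[0]
print(S)                       # singular values in decreasing order
```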

We want the matrix's "stretching effect" to be close to 1, so we choose the manifold of matrices whose singular values are all equal to 1.

This matrix manifold is mathematically called the Stiefel manifold. Under the assumption of a tall matrix (m ≥ n), it can be equivalently defined as the following set:
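The set itself is elided in the text; the standard definition for a tall matrix, where "all singular values equal 1" is equivalent to having orthonormal columns, is:

```latex
\mathrm{St}(m, n) \;=\; \bigl\{\, W \in \mathbb{R}^{m \times n} \;:\; W^{\top} W = I_n \,\bigr\}
```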

To design an optimizer for the Stiefel manifold, we also need to choose a suitable distance function.

To limit the maximum stretching effect of weight updates on the input vector, the spectral norm, i.e., the largest singular value of the matrix, is a suitable choice.

Although it only bounds the maximum effect, the optimizer tends to saturate this upper limit, which indirectly keeps the minimum effect from becoming too small as well.

It is this idea that led to the proposal of the Muon optimizer.

Combined with the Stiefel manifold constraint, this idea forms the "manifold Muon" problem.

A key finding of the article is that this is a convex optimization problem, which can be solved by a standard method: dual ascent.
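The post's actual dual derivation is specific to the Stiefel constraint and is not reproduced here. As a generic illustration of dual ascent (a toy equality-constrained problem of my own, not manifold Muon itself): minimize ½‖x − c‖² subject to Ax = b. The inner minimization over x has a closed form, and the gradient of the dual function is simply the constraint residual.

```python
import numpy as np

# Toy equality-constrained problem: min 0.5*||x - c||^2  s.t.  A x = b.
c = np.array([1.0, 1.0])
A = np.array([[1.0, 0.0]])   # constraint: x[0] = 0
b = np.array([0.0])

lam = np.zeros(1)            # dual variable
for _ in range(100):
    x = c - A.T @ lam                # argmin_x of the Lagrangian, in closed form
    lam = lam + 0.5 * (A @ x - b)    # dual ascent: gradient = constraint residual

print(x)  # approaches the constrained optimum [0, 1]
```

The same alternation, minimize the Lagrangian, then step the dual variable along the constraint violation, is what the manifold Muon solver runs, just with matrix-valued variables and the Stiefel tangency constraint.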

After derivation, the gradient of the dual function is:

A small experiment verifies the feasibility of the algorithm; see Figure 6 for the experimental setup and results.

Modular Manifolds

There is also an important question here: what will happen when we combine multiple layers to build a complete neural network?

Do we need to pay attention to the interaction between layers and modify the optimization strategy accordingly?

This requires a method that generalizes the derivation above to the entire neural network: the theory of modular manifolds.

The core idea of this theory is to build an abstract mechanism to guide how to reasonably allocate the learning rate between layers.

In essence, how to allocate learning rates across layers, or how to scale a single layer, depends on our understanding of the Lipschitz sensitivity of the network's output with respect to the weights.

We will track this sensitivity during the process of building the network, and the manifold constraint helps us grasp it more accurately.

Reference materials:

https://thinkingmachines.ai/blog/modular-manifolds/
