NTU and The Chinese University of Hong Kong: No More Trade-off Between Controllability and Naturalness, as Tokens Drop to 1/6 and Tighter Control Yields More Natural Motion

Wanting both "precise control" and "natural liveliness"

Want motion generation that follows instructions and still looks smooth and natural?

In existing methods, strong control makes the motion stiff, while preserving naturalness lets the motion drift away from the control signal. One of the two requirements always has to be sacrificed.

In response to this contradiction, research teams from Nanyang Technological University and The Chinese University of Hong Kong proposed MoTok. They argue that existing methods take two types of tasks that should not be mixed and handle them in the same generation stage:

One type is high-level semantic planning, which determines "what to do" in the motion; the other is the reconstruction and control of low-level details, which determines "how to precisely do it".

The former requires the ability to organize motion globally and consistently, while the latter emphasizes local, high-frequency, fine-grained constraints. Within a single stage, the two pull against each other, which produces the trade-off between controllability and naturalness and makes it difficult to achieve both.

MoTok pioneers a diffusion-based discrete motion tokenizer and, with it, a new general paradigm for conditional motion generation that combines the advantages of discrete tokens and continuous diffusion. While using only one-sixth as many tokens as the SOTA method, MoTok reduces the trajectory error by 89% (from 0.72 cm to 0.08 cm) and the FID by 65% (from 0.083 to 0.029). Under enhanced joint trajectory control, the FID falls by a further 58% (from 0.033 to 0.014), so the model breaks free from the dilemma of existing methods and achieves "the more controlled, the more natural".
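The percentages quoted above are relative reductions, and they can be checked directly from the before-and-after values. The short sketch below does only that arithmetic; the function and dictionary names are illustrative and do not come from the paper or its code.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Before/after pairs exactly as quoted in the article.
reported = {
    "trajectory error (cm)": (0.72, 0.08),
    "FID": (0.083, 0.029),
    "FID under enhanced joint trajectory control": (0.033, 0.014),
}

# Round to whole percentages, matching the precision used in the article.
reductions = {
    name: round(relative_reduction(before, after))
    for name, (before, after) in reported.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower")
```

Rounded to whole numbers, the three results match the 89%, 65%, and 58% figures in the text.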

Three-stage decomposition provides a unified paradigm for motion generation

MoTok proposes a general three-stage Perception–Planning–Control paradigm for conditional motion generation: first understand the conditions, then perform semantic planning in the discrete token space, and finally reconstruct the motion and apply fine-grained control over its details through the diffusion-based decoder.
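To make the data flow concrete, here is a deliberately tiny sketch of the three stages as plain Python functions. It is not MoTok's implementation: every name (`Conditions`, `perceive`, `plan`, `control`, `generate`) is hypothetical, the "planner" is a deterministic stand-in for a learned token generator, and the "decoder" simply repeats token values instead of running diffusion. Only the ordering of the stages and the split between global and local conditions follow the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Conditions:
    global_cond: str                    # "what to do overall", e.g. a text prompt
    local_cond: List[Optional[float]]   # per-frame hints; None means unconstrained


def perceive(text: str, trajectory: List[Optional[float]]) -> Conditions:
    """Perception: normalize raw inputs into global and local conditions."""
    return Conditions(global_cond=text.strip().lower(), local_cond=list(trajectory))


def plan(cond: Conditions, num_tokens: int, codebook_size: int = 8) -> List[int]:
    """Planning: emit discrete token ids (toy rule standing in for a learned generator)."""
    seed = sum(ord(ch) for ch in cond.global_cond)
    return [(seed + i) % codebook_size for i in range(num_tokens)]


def control(tokens: List[int], cond: Conditions, frames_per_token: int = 4) -> List[float]:
    """Control: expand tokens to frames, then enforce the per-frame hints."""
    frames = [float(tok) for tok in tokens for _ in range(frames_per_token)]
    for i, hint in enumerate(cond.local_cond[:len(frames)]):
        if hint is not None:
            frames[i] = hint
    return frames


def generate(text: str, trajectory: List[Optional[float]], num_tokens: int) -> List[float]:
    cond = perceive(text, trajectory)
    tokens = plan(cond, num_tokens)
    return control(tokens, cond)


motion = generate("Walk forward", [0.0, None, None, 1.0], num_tokens=3)
print(len(motion), motion[:4])
```

The point of the sketch is the separation of concerns: `plan` never sees frame-level detail, and `control` never decides what the motion is about.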

Through a flexible scheme for injecting global ("what to do overall") and local ("what to focus on at each moment") conditions, the Perception stage can adapt to different conditional inputs and motion generation tasks. The Planning and Control stages each handle what they are better at, which combines the advantages of discrete tokens and continuous diffusion. This removes a limitation shared by existing models, whether global diffusion models or discrete token generation models, which have long squeezed high-level semantic planning and low-level detail reconstruction and control into the same generation stage.

Compress tokens to one-sixth, and the motion quality still improves

In traditional discrete-token-based methods, tokens must retain both high-level semantics for planning and enough low-level detail for reconstruction. This inflates the number of tokens and makes the downstream generator harder to train.

MoTok's approach is to rely on the strong detail reconstruction ability of the diffusion-based decoder and let the discrete tokens keep the semantic information that is most useful for planning. The tokens can therefore be more compact, and the Planning stage has an easier sequence to generate.

The paper reports an illuminating comparative experiment (shown in the table below). The authors first compared decoders alone on exactly the same discrete tokens: with the encoder and codebook frozen, replacing only the original decoder with the MoTok diffusion-based decoder significantly improved reconstruction. They then compared the quality of the tokens themselves: replacing the original tokens with MoTok tokens significantly improved text-to-motion (T2M) generation regardless of which decoder was used afterwards, and in the motion-to-text (M2T) task MoTok tokens were also translated into accurate text descriptions more easily.

For the T2M task, the paper tried two ways of generating discrete tokens: discrete diffusion (DDM) and autoregressive (AR) generation. In both settings the tokenizer improved generation quality. MoTok-DDM-4 uses only one-sixth of the tokens of the SOTA (MoMask) and reduces the FID from 0.045 to 0.039; the higher-capacity MoTok-DDM-2 uses one-third of the tokens and reaches 0.033. MoTok-AR-4 reduces the FID of the autoregressive SOTA (T2M-GPT) from 0.141 to 0.053.
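The article does not explain where the one-sixth and one-third ratios come from. One reading that reproduces both figures, offered here purely as an assumption, is that the numeric suffix in MoTok-DDM-4 and MoTok-DDM-2 is a temporal downsampling factor with a single token per time step, and that the baseline emits six tokens per time step (for example, six quantization layers) at a downsampling factor of 4. The 196-frame clip length is likewise an illustrative choice, and `token_count` is a hypothetical helper.

```python
from fractions import Fraction


def token_count(num_frames: int, downsample: int, tokens_per_step: int = 1) -> int:
    """Tokens for one clip: time steps after downsampling, times tokens per step."""
    return (num_frames // downsample) * tokens_per_step


NUM_FRAMES = 196  # assumed clip length, for illustration only

# Assumed baseline layout: 4x temporal downsampling, six tokens per time step.
baseline = token_count(NUM_FRAMES, downsample=4, tokens_per_step=6)

# Assumed MoTok layouts: one token per time step at 4x and 2x downsampling.
motok_4 = token_count(NUM_FRAMES, downsample=4)
motok_2 = token_count(NUM_FRAMES, downsample=2)

print("MoTok-DDM-4 / baseline:", Fraction(motok_4, baseline))
print("MoTok-DDM-2 / baseline:", Fraction(motok_2, baseline))
```

Under these assumptions the ratios are exactly 1/6 and 1/3. The paper remains the authority on the actual token layouts.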

The more controlled, the more natural: resolving the conflict between text and motion control

In previous work, the quality of text-driven motion generation steadily degrades as joint trajectory conditions are introduced and then gradually strengthened.

The MoTok authors attribute this to the joint trajectory and text conditions conflicting within the same generation stage, where high-frequency, local detail control interferes prematurely with the semantic planning of the motion.

Based on this, MoTok proposes coarse-to-fine control injection: in Planning, the joint trajectory takes part in motion planning as a coarse constraint; in Control, it acts as a fine-grained constraint that is iteratively optimized during the diffusion of continuous features.
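The idea can be illustrated with a one-dimensional toy, given here as a sketch under strong simplifying assumptions rather than as the paper's method. Sparse per-frame targets are first pooled to one hint per token (the coarse constraint), and the expanded "planned" motion is then pulled toward the exact targets over several iterations (the fine constraint). The iterative loop is a plain relaxation that stands in for guidance inside a diffusion decoder; all function names and the values in `targets` are hypothetical.

```python
from typing import List, Optional


def coarse_constraints(targets: List[Optional[float]], frames_per_token: int) -> List[Optional[float]]:
    """Pool per-frame targets into one hint per token; None if the window is unconstrained."""
    coarse: List[Optional[float]] = []
    for start in range(0, len(targets), frames_per_token):
        window = [t for t in targets[start:start + frames_per_token] if t is not None]
        coarse.append(sum(window) / len(window) if window else None)
    return coarse


def expand(coarse: List[Optional[float]], frames_per_token: int, default: float = 0.0) -> List[float]:
    """Stand-in for planning plus decoding: repeat each token-level hint over its frames."""
    return [default if c is None else c for c in coarse for _ in range(frames_per_token)]


def refine(motion: List[float], targets: List[Optional[float]],
           steps: int = 20, step_size: float = 0.5) -> List[float]:
    """Fine stage: repeatedly move constrained frames a fraction of the way to their targets."""
    motion = list(motion)
    for _ in range(steps):
        for i, t in enumerate(targets):
            if t is not None:
                motion[i] += step_size * (t - motion[i])
    return motion


def control_error(motion: List[float], targets: List[Optional[float]]) -> float:
    """Mean absolute deviation on the constrained frames only."""
    errs = [abs(m - t) for m, t in zip(motion, targets) if t is not None]
    return sum(errs) / len(errs)


targets = [0.0, None, None, None, None, None, 1.0, 2.0]
coarse = coarse_constraints(targets, frames_per_token=4)
planned = expand(coarse, frames_per_token=4)
refined = refine(planned, targets)

print("coarse-only error:", control_error(planned, targets))
print("coarse + fine error:", control_error(refined, targets))
```

In this toy, the coarse hints alone leave a visible control error because two different targets share one token, while the fine stage removes it without touching the unconstrained frames.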

By handling "what to do" and "how to precisely do it" in separate stages, MoTok reconciles the text and motion control conditions and breaks free from the dilemma of existing methods.

The paper also reports an ablation experiment on the effectiveness of dual-stream injection (shown in the table below). If only the coarse constraints in the Planning stage (Generator) are kept, the model can still perceive the control intention, but the trajectory control error (Ctrl. Err.) increases significantly. If fine-grained constraints are applied only in the Control stage (Tok. Decoder), the forced trajectory optimization significantly degrades the motion distribution (Ctrl. FID).

In conclusion

MoTok frees high-level semantics and low-level details from constraining each other within a single representation and establishes a more natural connection between "planning" and "control". This gives conditional motion generation the potential for stronger controllability, more natural motion, and better generality across tasks. The paradigm also points to a promising direction for broader scenarios such as embodied agents and digital humans.

Project homepage: https://rheallyc.github.io/projects/motok/

Paper link: https://arxiv.org/pdf/2603.19227v1

GitHub link: github.com/rheallyc/MoTok

This article is from the WeChat official account "QbitAI", author: MoTok team. It is published by 36Kr with authorization.

The views in this article represent only the author. The 36Kr platform provides information storage services only.
