Has the throne of the strongest domestic multimodal model changed hands again? With 671 billion parameters, it has developed "clairvoyance" and is built on DeepSeek.
On November 28, Zhidongxi reported that Kuaishou has open-sourced its new-generation flagship multi-modal large model, Keye-VL-671B-A37B. Built on DeepSeek-V3-Terminus with 671 billion total parameters, the model retains the general capabilities of its base while upgrading visual perception, cross-modal alignment, and complex reasoning, achieving strong multi-modal understanding and complex-reasoning ability.
How powerful is Keye-VL-671B-A37B? Let's get a feel for it through a few examples. How many movie tickets are in the following picture? At first glance, most people might blurt out "three".
Keye-VL-671B-A37B, however, looks more carefully. Combining the text on the tickets, it determines that there are actually only two movie tickets in the picture; the top one is a popcorn snack voucher. Inspecting its thinking process shows that the model not only accurately identifies the text, logos, and layout differences of each ticket, but also reasons further: the tickets on the left and in the middle match the core features of a movie ticket, while the one on the right has no seat information and no screening-session label, so it is actually a stacked food-exchange voucher, not a movie ticket.
Beyond image understanding, Keye-VL-671B-A37B also has strong video understanding and reasoning abilities. When asked how the shots in the following video change, it identifies core elements such as the "blue double-decker tram", "Louis Vuitton", and "Tiffany & Co." and describes the shot transitions in detail.
Kuaishou released a performance comparison between Keye-VL-671B-A37B and other VL models. In the two core areas of general visual understanding and video understanding, Keye-VL-671B-A37B's overall performance exceeds that of leading VL models such as ByteDance's Seed1.5-VL (thinking) and Alibaba's Qwen3-VL-235B-A22B.
Across 26 mainstream benchmarks covering STEM, reasoning, general Q&A, video understanding, OCR, and pure text, Keye-VL-671B-A37B achieved the highest score on 18.
Keye-VL-671B-A37B is now officially open-sourced and can be downloaded from Hugging Face and GitHub.
GitHub:
https://github.com/Kwai-Keye/Keye
Hugging Face:
https://huggingface.co/Kwai-Keye/Keye-VL-671B-A37B
01. Pre-training completed in three stages, using only 300B tokens of high-quality data
Keye-VL-671B-A37B uses DeepSeek-V3-Terminus, which has stronger text-reasoning ability, to initialize the large-language-model base. The visual encoder is initialized from Keye-ViT, a component taken from Keye-VL-1.5, and the two are bridged by an MLP layer. Keye-VL-1.5 is a multi-modal large model Kuaishou open-sourced in early September this year, with 8 billion parameters and an extended context of 128k tokens.
Pre-training of Keye-VL-671B-A37B covers three stages that systematically build the model's multi-modal understanding and reasoning abilities. The model reuses the visual encoder of Keye-VL-1.5, which was already aligned with an 8B model on 1T tokens of multi-modal pre-training data and has strong basic perception ability.
Kuaishou screened roughly 300B tokens of high-quality pre-training data, far less than the multi-trillion-token corpora typical of other large models. Kuaishou says the goal is to efficiently build the model's core perception foundation with limited computing resources, ensuring solid visual understanding at a controllable compute cost.
Pre-training of Keye-VL-671B-A37B proceeds in three steps:
First stage: freeze the ViT and the LLM, and train only the randomly initialized projector to establish the initial alignment between visual and language features.
Second stage: unfreeze all parameters for full pre-training.
Third stage: run annealing training on higher-quality data to improve the model's fine-grained perception.
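The staged freezing schedule above can be sketched as follows. This is a hypothetical illustration, not Kuaishou's code: the component names ("vit", "projector", "llm") and the helper function are assumptions, and a real pipeline would toggle `requires_grad` on the corresponding modules instead of returning names.

```python
# Hypothetical sketch of the three-stage freezing schedule described above.
# Component names and the trainability table are illustrative assumptions.

STAGES = {
    1: {"vit": False, "projector": True,  "llm": False},  # align the projector only
    2: {"vit": True,  "projector": True,  "llm": True},   # full-parameter pre-training
    3: {"vit": True,  "projector": True,  "llm": True},   # annealing on higher-quality data
}

def trainable_components(stage: int) -> list[str]:
    """Return the components whose parameters are updated in a given stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]

print(trainable_components(1))  # ['projector']
print(trainable_components(2))  # ['vit', 'projector', 'llm']
```

Stages 2 and 3 train the same parameter set; what changes in stage 3 is the data (higher quality, annealed learning rate), not which modules are unfrozen.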
Keye's multi-modal pre-training data is built through an automated data pipeline. Kuaishou strictly filtered and resampled the data and added VQA data augmentation so that the data covers common and complex visual formats such as OCR, charts, and tables, improving the model's perception quality and generalization.
In the annealing stage, Kuaishou added chain-of-thought data generated by DeepSeek-V3-Terminus to ensure the model keeps its original strong reasoning ability while visual perception continues to improve.
02. Adopting a multi-stage post-training strategy and verifying that mixed CoT data works better
Post-training of Keye-VL-671B-A37B consists of three steps: supervised fine-tuning (SFT), cold start, and reinforcement learning. The training tasks cover visual Q&A, chart understanding, rich-text OCR, mathematics, code, logical reasoning, and more.
In the SFT stage, the team used more multi-modal and pure-text long chain-of-thought data to sharpen the model's pure-text ability and strengthen its multi-modal ability. The cold-start stage uses reasoning data to boost reasoning ability. The reinforcement-learning stage uses complex reasoning data to improve the model's think and no_think (thinking and non-thinking) modes, and adds video data to strengthen video understanding.
The team ran repeated experiments on the ratio of instruction (Instruct) data to long chain-of-thought (Long-CoT) data in the dataset, breaking with the previous supervised fine-tuning paradigm, which relied one-sidedly on instruction data.
In the process, Kuaishou verified that the mixed mode (Instruct + Long-CoT) beats the single mode (Instruct alone): adding more long chain-of-thought reasoning data to the SFT dataset improves both the model's overall performance and the stability of subsequent training.
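Assembling such a mixed SFT corpus can be sketched minimally as follows. The `cot_fraction` knob, the function name, and the toy records are all illustrative assumptions; the article does not disclose the ratio Kuaishou settled on.

```python
import random

def mix_sft_data(instruct, long_cot, cot_fraction=0.5, seed=0):
    """Combine instruction data with long chain-of-thought data.

    cot_fraction is an illustrative knob: the target share of Long-CoT
    samples in the final mix. Kuaishou tuned this ratio experimentally,
    but the final value is not stated in the article.
    """
    rng = random.Random(seed)
    # Number of CoT samples needed so they make up cot_fraction of the mix.
    n_cot = int(len(instruct) * cot_fraction / (1 - cot_fraction))
    mixed = list(instruct) + rng.sample(long_cot, min(n_cot, len(long_cot)))
    rng.shuffle(mixed)
    return mixed

instruct = [{"type": "instruct", "id": i} for i in range(8)]
long_cot = [{"type": "long_cot", "id": i} for i in range(8)]
mixed = mix_sft_data(instruct, long_cot, cot_fraction=0.5)
print(sum(x["type"] == "long_cot" for x in mixed))  # 8, i.e. a 50/50 mix
```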
The loss curves show that adding more CoT data in the SFT stage significantly reduces the training loss in the cold-start stage.
Performance comparisons on multiple benchmarks also show that the model trained with mixed CoT data significantly outperforms the instruction-only fine-tuned model.
In the cold-start stage, the quality of CoT data is crucial for improving reasoning ability. The reasoning traces of a pure-text model are often long and highly repetitive. To mitigate this over-thinking, the team built a strict data-screening process that filters out thinking chains with redundant reflection behavior.
Experiments on Keye-VL-1.5-8B show that filtering out this redundant data benefits both the model's reasoning and its perception.
03. Using the same RL algorithm as Qwen3 and building a dedicated Verifier model
In the reinforcement-learning stage, Kuaishou did not use the conventional GRPO algorithm: GRPO models importance ratios at the token level, which is unstable when training MoE models.
Instead, Kuaishou adopted GSPO (Group Sequence Policy Optimization), which models importance ratios at the sequence level, as the underlying reinforcement-learning algorithm, improving the stability of training with reinforcement learning from verifiable rewards (RLVR). Notably, GSPO is one of the core algorithms behind Alibaba's Qwen3 series.
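The token-level versus sequence-level distinction can be illustrated numerically. Under a common reading of GSPO, the sequence-level ratio is the length-normalized product of per-token ratios (their geometric mean), which damps the influence of any single outlier token; the log-probabilities below are made up for illustration, and this sketch omits clipping and the group-relative advantage.

```python
import math

def token_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """GSPO-style: a single length-normalized sequence-level ratio,
    equal to the geometric mean of the per-token ratios."""
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))

# Made-up per-token log-probs; the last token's ratio is an extreme outlier.
new = [-1.0, -2.0, -0.5, -6.0]
old = [-1.1, -2.1, -0.6, -2.0]

print(token_ratios(new, old))    # one token gets a tiny ratio (~e**-4)
print(sequence_ratio(new, old))  # the geometric mean smooths the outlier
```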
For reinforcement learning, the quality of the reward signal is crucial. In the reinforcement-learning system of Keye-VL-671B-A37B, Kuaishou first trained a dedicated Verifier to check the logic of the model's thinking process and the consistency between the final answer and the reference answer. The Verifier uses Keye-VL-1.5-8B as its base, and its training includes two stages: SFT and RL.
The SFT stage mixes simple binary-classification tasks, which directly judge whether a generated answer matches the reference answer, with more complex analysis tasks, which require the Verifier to analyze the logic and correctness of a reply in the think-answer format.
In the RL stage, the team first trained on a large-scale preference dataset and then annealed on a high-quality, manually labeled dataset to improve the Verifier's accuracy.
To measure the Verifier's judgment accuracy, the team sampled 10,000 training examples together with the model-generated answers and compared the Verifier against Qwen2.5-VL-72B-Instruct. Among the 150 manually sampled cases where Keye-Verifier and Qwen disagreed, Keye was correct in 128 and Qwen in 22.
A preliminary experiment on Keye-VL-preview shows that, compared with a rule-matching reward, the reward signal from Keye-Verifier raised Keye-VL-preview's average accuracy by 1.45% on several open-source perception benchmarks and by 1.33% on three multi-modal mathematics datasets.
To screen for high-difficulty samples, Kuaishou used Keye-VL-1.5-8B as a filter: it sampled responses on the candidate dataset, scored them with the Verifier, and retained only the data with a pass rate between 25% and 75% for training. Kuaishou also added more video data to the RL dataset to improve video understanding.
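The difficulty filter described above can be sketched as follows. The `grade` callback, the rollout count, and the toy samples are hypothetical stand-ins: in the real pipeline, `grade` would mean "sample a rollout from Keye-VL-1.5-8B and have the Verifier judge it", and the article does not state how many rollouts per prompt were used.

```python
def filter_by_pass_rate(samples, grade, rollouts_per_prompt=8,
                        low=0.25, high=0.75):
    """Keep only prompts the filter model solves between 25% and 75% of the time.

    grade(sample, i) is a hypothetical interface standing in for: draw
    rollout i from the filter model and have the Verifier mark it
    correct (True) or incorrect (False).
    """
    kept = []
    for sample in samples:
        correct = sum(grade(sample, i) for i in range(rollouts_per_prompt))
        pass_rate = correct / rollouts_per_prompt
        if low <= pass_rate <= high:  # not too easy, not too hard
            kept.append(sample)
    return kept

# Toy data: "solved" encodes how many of the 8 rollouts would succeed.
samples = [{"id": 0, "solved": 8}, {"id": 1, "solved": 4}, {"id": 2, "solved": 0}]
grade = lambda s, i: i < s["solved"]
print([s["id"] for s in filter_by_pass_rate(samples, grade)])  # [1]
```

Prompts the filter model always solves (pass rate 1.0) or never solves (0.0) carry little gradient signal under group-relative RL, which is why only the middle band is retained.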
04. Conclusion: multi-modal models move toward a future of "getting things done"
Kuaishou says that, while continuing to improve the base model, the Keye-VL series will further integrate multi-modal agent capabilities and evolve toward a form that is "better at using tools and solving complex problems". The model's multi-turn tool-calling ability will be strengthened so that it can independently call external tools in real tasks to search, reason, and integrate results.
At the same time, Kuaishou will push key directions such as "think with image" and "think with video", so that the model can not only understand images and videos but also reason deeply and in chains around them, uncovering key information in complex visual signals. Ultimately, Kuaishou hopes to build a more general, reliable next-generation multi-modal system with stronger reasoning.
This is original content from [Zhidongxi], a signed account in NetEase's Featured Content Incentive Program. Unauthorized reprinting is prohibited.
This article is from the WeChat public account "Zhidongxi" (ID: zhidxcom), written by Chen Junda and edited by Li Shuiqing. It is published by 36Kr with authorization.