SenseTime has unveiled the large language model "Rixin V6.5". Its multimodal reasoning is comparable to that of Gemini 2.5 Pro, and its cost has been sharply reduced.
The ability to perceive and process multimodal information is the core requirement of AGI and the inevitable path from language models to AGI.
From multimodal perception and reasoning to interaction, the evolution of multimodal intelligence will drive the next stage of AI development.
On July 27, 2025, at the "Boundless Love, Shaping the Future" WAIC 2025 Large Model Forum, co-hosted by the Artificial Intelligence Committee of the All-China Federation of Industry and Commerce and organized by SenseTime, SenseTime released its new "Rixin SenseNova V6.5" (hereinafter "Rixin V6.5") large-model system. The multimodal foundation model has received a breakthrough upgrade, moving AI from being a "productivity tool" toward being "productivity" itself. SenseTime's core product, SenseTime Little Raccoon, has also completed an upgrade of its intelligent agent.
In 1950, Turing defined AI as "human-like ability" through the "imitation game". In practice, however, AI never broke free of the category of "tools", and at times fell into troughs of development. In the era of large models, with breakthroughs in multimodal fusion capabilities, AI has gradually approached the boundary of AGI and truly begun to meet the "human-like" standard.
Xu Li, the first rotating chairperson of the Presidium of the Artificial Intelligence Committee of the All-China Federation of Industry and Commerce and Chairman and CEO of SenseTime, said: "SenseTime has always explored the essence of artificial intelligence, unlocked maximum intelligence through technological innovation, and promoted AI's leap from a 'tool' to a 'human', making it real productivity."
01 Rixin V6.5: A Breakthrough Upgrade That Touches the "Depth of Understanding"
SenseTime's "Rixin V6.5" multimodal foundation large model brings three breakthrough upgrades:
1. Strong reasoning: a multimodal thinking chain with interleaved text and images, with reasoning performance comparable to Gemini 2.5 Pro and Claude 4 Sonnet;
2. High efficiency: an optimized multimodal architecture, with a more than threefold increase in cost-performance;
3. Intelligent agent: clearly leading in data analysis, supporting end-to-end scenario deployment and closing the value loop.
Through advances in multimodal thinking-chain data and the synthesis of text-image interleaved thinking-chain data, the multimodal reasoning and interaction performance of SenseTime's "Rixin V6.5" has improved significantly:
[Core Indicators] Text reasoning and multimodal reasoning have improved significantly, surpassing Gemini 2.5 Pro and Claude 4 Sonnet; multimodal interaction surpasses Gemini 2.5 Flash and GPT-4o, with strong performance across the board.
SenseTime's "Rixin V6.5" is the first to break through text-image interleaved thinking-chain technology, introducing visual thinking into the large model and becoming the first commercial-grade large model in China to achieve text-image interleaved thinking.
In human thinking, visual thinking and logical thinking are equally important, and only their organic combination can form a comprehensive thinking ability. As the saying goes, "A picture is worth a thousand words." A picture can often trigger more effective thinking than a long paragraph of text. Currently, although mainstream multimodal models have achieved the integration of multiple modalities at the input end, the thinking and reasoning process still mainly relies on language reasoning, and there are still shortcomings in graphical and spatial reasoning.
The key to constructing a multimodal thinking chain lies in the graphical representation of information, which is harder than a pure-text thinking chain: the chain must not only present the textual thinking process but also generate images as thinking nodes, which is difficult to produce at scale by hand. SenseTime's R&D team first constructs seed data based on its understanding of the thinking process, uses supervised fine-tuning (SFT) to give the model an initial text-image interleaved thinking ability, and then substantially improves multimodal reasoning through multiple rounds of reinforcement learning.
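SenseTime has not published the data format behind this pipeline, but the idea of a thinking chain whose nodes alternate between text and generated images can be sketched as a simple data structure. Everything below (the `ThoughtNode` type, the field names, the example chain) is a hypothetical illustration, not SenseTime's actual schema:

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical sketch: one node in an interleaved text-image thinking chain.
# An "image" node carries a reference to a generated intermediate image that
# serves as a visual reasoning step between text steps.
@dataclass
class ThoughtNode:
    kind: Literal["text", "image"]
    content: str  # text of the step, or a reference to a generated image

def interleave_ratio(chain: List[ThoughtNode]) -> float:
    """Fraction of nodes in the chain that are visual thinking steps."""
    if not chain:
        return 0.0
    return sum(n.kind == "image" for n in chain) / len(chain)

# A seed example of the kind SFT bootstrapping might use: the model reasons
# in text, renders an intermediate diagram, then reasons over that diagram.
seed_chain = [
    ThoughtNode("text", "Identify the two triangles in the figure."),
    ThoughtNode("image", "generated://sketch_with_triangles_highlighted"),
    ThoughtNode("text", "The highlighted triangles share a base, so their "
                        "areas are proportional to their heights."),
]

print(interleave_ratio(seed_chain))  # one image node out of three
```

Seed chains like this would supply the SFT stage, after which reinforcement learning rewards chains whose interleaved steps lead to correct answers.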
At the same time, SenseTime has improved the model's multimodal fusion architecture to promote early cross-modal fusion. The new architecture uses a significantly lighter visual encoder and a deep, narrow backbone, allowing visual representations to be aligned and integrated with language early in the feed-forward computation, making perception more efficient and modal fusion deeper.
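The shape of this early-fusion layout can be sketched at the token level: a light visual encoder compresses the image into a short run of tokens, which are projected into the language embedding space and concatenated with text tokens before the backbone sees anything. The functions and dimensions below are illustrative stand-ins, not the real architecture:

```python
import random

EMBED_DIM = 8  # illustrative; real models use far larger hidden sizes

def light_visual_encoder(image_pixels, n_tokens=4):
    """Stand-in for a lightweight vision encoder: compress an image
    into just a few visual tokens."""
    random.seed(0)  # deterministic placeholder "features"
    return [[random.random() for _ in range(EMBED_DIM)]
            for _ in range(n_tokens)]

def project_to_text_space(visual_tokens):
    """Stand-in for the vision-to-language projection layer."""
    return [[2.0 * x for x in tok] for tok in visual_tokens]

def early_fuse(visual_tokens, text_tokens):
    """Concatenate modalities BEFORE the backbone: the deep, narrow stack
    then processes one mixed sequence from its first layer onward, instead
    of merging modalities in a late cross-attention stage."""
    return project_to_text_space(visual_tokens) + text_tokens

text_tokens = [[0.0] * EMBED_DIM for _ in range(6)]
fused = early_fuse(light_visual_encoder(None), text_tokens)
print(len(fused))  # 4 visual + 6 text tokens enter the backbone together
```

The design trade-off the article describes follows from this layout: a smaller encoder spends less compute on perception, while the deep backbone gets to fuse both modalities through its entire depth.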
Thanks to these architectural improvements, "Rixin V6.5" raises pre-training throughput by more than 20%, reinforcement-learning efficiency by 40%, and inference throughput by more than 35% while also cutting costs, striking a balance between performance and cost. Compared with "Rixin V6.0", "Rixin V6.5" triples the cost-performance ratio.
02 AI as Productivity: SenseTime Little Raccoon, the Most Powerful Office Intelligent Agent Debuts
Large language models have become work - assisting tools for many people nowadays, but relying solely on large language models is not enough for AI to leap from a "tool" to a "human".
Human daily tasks naturally involve processing multimodal information such as text, images, videos, and web pages. The key to transforming from a productivity tool into productivity lies in the ability to take in, process, and output multimodal information.
Based on the powerful multimodal data-analysis ability of "Rixin V6.5", SenseTime Little Raccoon has been comprehensively upgraded: it can handle complex multimodal inputs, conduct in-depth multimodal fusion analysis, output multimodal results, and deliver professional visual presentation, creating "AI productivity in the office scenario" and enabling AI's leap from "productivity tool" to "productivity".
At the same time, SenseTime Little Raccoon maintains world-class complex data-analysis ability. In comprehensive tests on customer scenarios, Little Raccoon matches the international benchmark Claude 4 Opus in data analysis and intelligent-agent tasks, far exceeding models such as OpenAI o3. In tasks such as time-series calculation, data matching, mathematical calculation, and anomaly detection, its accuracy approaches 100%.
In real-world office scenarios, data arrives in extremely varied forms. In data-analysis scenarios, screenshots, documents, and PDFs are common, and structured information and tables account for only about 70%. Even a seemingly basic Excel table often contains merged cells, missing values, nested sub-tables, and embedded charts, greatly increasing processing difficulty.
SenseTime Little Raccoon performs global analysis with multimodal thinking, conducts multi-step thinking and reflection by constructing a thinking chain, and finally outputs structured results.
In fact, a table that looks simple can hide complex underlying logic. SenseTime Little Raccoon can now untangle such complex tables.
For example, a user uploads a complex Excel table containing merged cells, missing values, sub-tables, embedded charts, and external pictures. SenseTime Little Raccoon accurately parses the table content, establishes the logical relationships between sub-tables, and generates a complete analysis report.
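One concrete difficulty in such tables is merged cells: in an exported grid a merged value appears once and its continuation cells come through empty, so the rows below lose their context. A minimal sketch of the kind of normalization step any such parser needs (the function and the sample data are hypothetical, not Little Raccoon's actual implementation):

```python
def forward_fill_merged(grid):
    """Fill None cells (merged-cell continuations) with the value
    above them, column by column, restoring a regular row-wise table."""
    filled = [row[:] for row in grid]  # work on a copy
    for col in range(len(filled[0])):
        for row in range(1, len(filled)):
            if filled[row][col] is None:
                filled[row][col] = filled[row - 1][col]
    return filled

# "Region" was merged across three rows in the raw export, so only the
# first row carries the value.
raw = [
    ["North", "Q1", 120],
    [None,    "Q2", 95],
    [None,    "Q3", 143],
]
print(forward_fill_merged(raw))
# → [['North', 'Q1', 120], ['North', 'Q2', 95], ['North', 'Q3', 143]]
```

After normalization, each row is self-contained, which is what lets downstream analysis treat the table row by row and relate sub-tables to one another.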
In another case of complex input, a small merchant sees useful table content on video platforms such as Douyin, takes a screenshot, and uploads it. SenseTime Little Raccoon decomposes the task from the image information, removes visual interference, extracts the table information, and exports an editable Excel file for the user to fill in. The entire input-analysis-output flow is carried smoothly by multimodal capabilities.
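The screenshot-to-spreadsheet flow decomposes into three stages: visual extraction, interference removal, and editable export. The sketch below stubs out the model's extraction step with fixed data and uses CSV as a stand-in for the .xlsx export; every function name here is a hypothetical illustration of the pipeline, not a real API:

```python
import csv
import io

def extract_cells(screenshot_bytes):
    """Stub for the model's visual table-extraction step; a real system
    would detect the table region and read cells from the image."""
    return [["Item", "Price"], ["Apples", "3.50"], ["Pears", "4.20"]]

def remove_interference(rows):
    """Drop rows that are clearly not table content (e.g. watermarks,
    platform UI text captured in the screenshot)."""
    return [r for r in rows if any(cell.strip() for cell in r)]

def export_editable(rows):
    """Write an editable CSV (stand-in for an .xlsx export)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

table = remove_interference(extract_cells(b"fake-screenshot"))
print(export_editable(table).splitlines()[0])  # header row: Item,Price
```

The point of the decomposition is that each stage produces a checkable intermediate result, so extraction errors can be caught before the user receives the file.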
Traditional AI tools mostly play an auxiliary role, with the core work still falling to the user. SenseTime Little Raccoon upgrades this interaction paradigm: the AI takes the initiative on core tasks and confirms key information with the user through precise questions, so the interaction feels like colleagues collaborating.
SenseTime Little Raccoon's task-planning function now offers a novel interaction mode that is easier for users to follow. Take the recently popular "Scottish Premiership" as an example.
The user uploads an image of a table and requests an analysis of the "TOP players in the Scottish Premiership". SenseTime Little Raccoon automatically gathers online information, generates a task list based on expert knowledge (such as defining the "TOP5" criteria and analyzing youth-training results), conducts a systematic analysis, and finally produces a high-quality analysis document, which can also be exported in editable formats such as Excel, PPT, and HTML. The overall process is as follows.
Looking at the decomposition steps: after receiving the task, Little Raccoon actively sorts out the task details and asks the user clear questions at key nodes (such as "Do you need to proceed according to the following points 1, 2, 3?") to ensure the task direction is accurate, achieving an efficient mode in which "the AI leads the work while the user makes decisions and controls the process".
It then generates a task list based on expert knowledge (such as defining the "TOP5" criteria and analyzing youth-training results) for systematic analysis, making it clear at a glance what to do next and how.
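The "AI leads, user decides" loop described above can be sketched as draft-confirm-execute: the agent drafts a numbered plan, asks the user to confirm each point, and only executes the confirmed steps. The plan contents and function names below are hypothetical illustrations built from the article's example, not the product's real interface:

```python
def draft_plan(request):
    """Stand-in for the agent drafting a task list from expert knowledge."""
    return [
        "Define the TOP5 criteria (e.g. goals, assists, minutes played)",
        "Collect current Scottish Premiership player statistics",
        "Analyse youth-training graduates separately",
    ]

def confirm_with_user(plan, answer_fn):
    """Present the plan as a numbered question and keep only the steps
    the user confirms; answer_fn stands in for real user interaction."""
    question = "Do you need to proceed according to the following points?\n"
    question += "\n".join(f"{i + 1}. {step}" for i, step in enumerate(plan))
    return [step for step in plan if answer_fn(question, step)]

# A scripted "user" that approves every step stands in for a real dialogue.
confirmed = confirm_with_user(draft_plan("TOP players analysis"),
                              lambda question, step: True)
print(len(confirmed))  # all three steps confirmed, ready to execute
```

Because execution waits for confirmation, the user keeps control of direction while the agent carries the actual workload, which is the interaction upgrade the article describes.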
Professional data plus tool invocation yields a high-quality content pipeline.
Finally, a high-quality analysis document is generated, which can also be exported in editable formats such as Excel, PPT, and HTML.