From Conversational to Decision-Making: Baichuan M3 Redefines the Standards of Medical Large Language Models
On January 13th, Baichuan Intelligence released and open-sourced its new-generation medical-enhanced large model, Baichuan-M3. On HealthBench, the authoritative medical evaluation suite led by OpenAI, and on its hard subset, the model achieved the highest overall score worldwide, clearly surpassing GPT-5.2. In tool-free evaluation of medical hallucination rate, it also reached the lowest level reported to date. On SCAN-bench, an evaluation focused on full-process clinical capability, M3 ranked first on core dimensions including history taking, auxiliary examinations, and diagnosis, demonstrating leading medical reasoning and consultation ability across the board.
In addition, M3 is the first model with native, end-to-end "serious consultation" ability. Like a doctor, it can proactively follow up and close in on the problem step by step, drawing out key medical history and risk signals before conducting in-depth medical reasoning on the complete picture. Evaluations show its consultation ability is significantly above the average level of practicing doctors.
The significance of this release, however, goes beyond another benchmark victory. More importantly, Baichuan-M3 pushes the medical large model into a new position: it no longer stops at dialogue and expression, but begins to genuinely support the complete diagnosis-and-treatment process and to participate in medical decision-making itself. That is why its significance exceeds that of other models: the technological progress of large models can finally be converted into scalable, practical value in healthcare.
"What matters is generating value for patients through decision support," Wang Xiaochuan, founder and CEO of Baichuan Intelligence, said at the press conference.
In medicine, the scenario with the highest demands on safety and responsibility, such a change does not happen by chance. It means someone chose a slower, harder, less glamorous path and gradually pushed the model's capability from demonstrating intelligence to bearing decision-making responsibility.
Why was Baichuan able to reach this stage? Why did this breakthrough happen in medicine rather than in hotter tracks such as code, search, or agents? And why is it only now that these long-accumulated technical choices and engineering routes have converged into a clear result?
The evaluation criteria for medical large models are being rewritten
Almost since the birth of artificial intelligence, medicine has been regarded as one of the industries most likely, and most worth, being transformed by AI.
Before HealthBench, medical AI capability was essentially incomparable. Every model could claim to understand medicine and answer medical questions, but there was no unified coordinate system for evaluation, and no way to compare models side by side.
In May 2025, OpenAI launched HealthBench. The suite brings together a large number of multi-turn dialogue samples built from real clinical scenarios, making it possible to quantify medical capability against a public standard. For a long time it was effectively the highest standard for medical large models, the shared coordinate system every model used to demonstrate its medical ability.
Hence the long-standing default consensus: whoever scored higher on HealthBench understood medicine better. Not because HealthBench covered all of medicine's complexity, but because before it the industry had no standard at all.
At some point, the trend shifted. Since the middle of last year, domestic medical assistants such as A Fu and Xiao He Doctor have launched one after another, OpenAI has launched ChatGPT Health, and Anthropic has launched Claude for Healthcare. Medicine is no longer just a benchmark for testing model intelligence; it has become a product direction that large model vendors must invest in directly. Models now have to face the question of whether their answers can serve as a basis for decisions.
This is no longer just a ranking issue.
It is at this stage that HealthBench's limitations begin to show. It remains important but is no longer sufficient. It can still prove whether a model has medical knowledge and professional expression, but it cannot answer the more fundamental question: is the model qualified to enter the real medical decision-making process?
Clinical decision-making never starts from a standardized question. It starts from highly incomplete, even chaotic, information. Patients often cannot articulate the key points, symptoms overlap, and different risks are tangled together. The real difficulty is not how to give the answer but how to ask the question. A large part of a doctor's professional ability lies in judging information priority: which high-risk signals must be ruled out immediately and which can wait; which information is indispensable for a conclusion and which is merely supplementary.
Precisely here, Baichuan made a choice markedly different from the mainstream. On one hand, it did not abandon the HealthBench arena and still pursues the top position under the existing authoritative standard. On the other, it launched SCAN-bench in parallel, attempting to model and evaluate the complete clinical process itself, a dimension the field has long neglected.
Based on the SCAN principle, Baichuan borrowed the OSCE method long used in medical education and worked with more than 150 front-line doctors to build the SCAN-bench evaluation system. The system uses real clinical experience as the gold standard and breaks the diagnosis-and-treatment process into three stages: history taking, auxiliary examinations, and final diagnosis. Assessment is dynamic and multi-turn, simulating the full path a doctor takes from receiving a patient to reaching a diagnosis. Compared with HealthBench, SCAN-bench is a new dynamic evaluation paradigm that is more end-to-end and covers the whole process.
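To make the staged structure concrete, the three stages above can be sketched as a weighted aggregation. This is an illustrative sketch only: the stage names follow the article, but the weights, function name, and aggregation rule are assumptions, not SCAN-bench's published scoring.

```python
# Illustrative sketch: stage names follow the article; the weights and the
# weighted-sum aggregation are assumptions, not SCAN-bench's actual rubric.
STAGES = ("history_taking", "auxiliary_exams", "diagnosis")

def scan_bench_score(stage_scores, weights=None):
    """Aggregate per-stage scores (each in [0, 1], graded against real
    clinical gold-standard answers) into one end-to-end score."""
    if weights is None:
        # Hypothetical weighting emphasizing history taking, the stage the
        # article identifies as most neglected.
        weights = {"history_taking": 0.4, "auxiliary_exams": 0.3, "diagnosis": 0.3}
    assert set(stage_scores) == set(STAGES), "all three stages must be scored"
    return sum(stage_scores[s] * weights[s] for s in STAGES)
```

The point of the structure is that a model cannot compensate for a botched consultation with a lucky diagnosis: each stage is graded on its own.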
In other words, while the industry is still competing on who answers better, Baichuan has shifted to a more fundamental question: can the model ask like a doctor?
This is what makes M3's release genuinely special: its abilities form a closed loop. It can reason, it does not fabricate, and it knows how to ask for the information it needs. Reasoning answers "can it make judgments"; not fabricating answers "can it be trusted"; consultation answers "is it qualified to enter the decision-making process".
Only when all three hold can a medical large model be said to have evolved from a talkative intelligent system into one that can be entrusted with part of the responsibility for medical decisions.
In terms of results, M3 is a model of multiple firsts. Topping HealthBench means a comprehensive win under the capability standard OpenAI itself defined. On HealthBench Hard, the subset that stresses complex clinical decision-making, M3 took first place with 44.4 points, systematically surpassing GPT-5.2 for the first time. The result is all the more convincing because it verifies not only the professionalism of the answers but also the model's stability and reliability in highly uncertain, difficult reasoning scenarios.
At the same time, M3 achieved the lowest hallucination rate worldwide without using tools, meaning safety has been internalized as the model's own capability rather than patched on through external retrieval, rule constraints, or engineering workarounds. More importantly, M3 also ranked first on SCAN-bench, which targets the complete clinical process, and in the most critical link, consultation, it significantly exceeded the GPT-series models and the human-doctor baseline. The model has genuinely closed the long-neglected gap in clinical information acquisition, the ability that sets the ceiling for medical decision-making.
The real watershed for AI in medicine
If over the past two years the industry focused on making models talk like doctors, Baichuan's judgment with M3 is that expressive ability alone is not enough. A model must have a doctor's thinking structure.
Many "AI doctors" remain at the level of role-play. Their conversations are fluent and their tone professional, but their questions mostly serve to make the dialogue feel complete rather than to collect the information clinical decisions require. Models tend to follow the patient's narrative and chat along; they rarely stratify risk, screen for red-flag signs, or design questions backward from the diagnosis-and-treatment path as real doctors do. The conversations look professional yet cannot support serious judgment, so they end in a safe conclusion like "it is recommended to seek medical advice as soon as possible".
This is the essential difference between being able to talk and being able to make clinical decisions, and it is the backdrop for Baichuan's concepts of "serious consultation" and the "SCAN principle". As Wang Xiaochuan put it at the press conference: "In medicine, patients often cannot fully express themselves and only know their surface symptoms, so doctors must clarify the course of the illness through consultation. Only with enough data can the subsequent tests, diagnoses, and conclusions be done well. Today's large models do not have that ability."
What Baichuan set out to do is decompose the working methods clinicians have long executed from experience into engineering objectives that a model can learn, be evaluated on, and directly optimize through reinforcement learning.
In engineering terms, Baichuan did not pile on features but concentrated on three fundamental problems.
First, a fully dynamic reinforcement learning system.
In the M2 stage, reinforcement learning relied on relatively static verification rules; once the model's ability passed a certain level, the verification system itself became the ceiling. In M3, the Verifier is designed to evolve together with the model: when the model exposes a new error pattern, the verifier generates new constraints; stale, low-value rules are pruned and high-value rules are continuously reinforced. Rules and model raise the ceiling together, addressing the tendency of capability to plateau late in training.
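The article does not disclose the verifier's internals, but the co-evolution idea (new constraints added when new error patterns appear, low-value rules pruned away) can be sketched as follows. All class names, the value-decay mechanism, and the thresholds are illustrative assumptions, not Baichuan's implementation.

```python
# Hedged sketch of a co-evolving verifier; names and mechanics are assumptions.
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One verification constraint with a running estimate of its value."""
    name: str
    check: callable          # response -> bool (True = violation caught)
    value: float = 1.0       # decays when the model stops tripping the rule

@dataclass
class DynamicVerifier:
    rules: list = field(default_factory=list)
    decay: float = 0.9       # smoothing for the per-rule value estimate
    prune_below: float = 0.05

    def score(self, response: str) -> float:
        """Return a (negative) penalty; update and prune rules as we go."""
        penalty = 0.0
        for rule in self.rules:
            violated = rule.check(response)
            # A rule that still catches errors keeps its value; one the
            # model never trips decays toward elimination.
            rule.value = self.decay * rule.value + (1.0 - self.decay) * float(violated)
            if violated:
                penalty += rule.value
        # Prune stale rules so the rule set tracks the model's error frontier.
        self.rules = [r for r in self.rules if r.value >= self.prune_below]
        return -penalty

    def add_rule_for_error(self, name, check):
        """When training exposes a new error pattern, register a constraint."""
        self.rules.append(Rule(name=name, check=check))
```

The key property is that the reward signal never freezes: as the policy improves, low-value constraints disappear and new ones take their place, so the verifier does not become the ceiling it was in the static setup.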
Second, the SPAR algorithm.
Medical consultation is inherently a very long decision chain. If only the final diagnosis is scored, the model cannot know where it went wrong: whether the history was not asked clearly, the examination suggestions were off, or the reasoning path drifted. Through a step-wise penalty and relative-benchmark mechanism, SPAR decomposes the long chain into locally accountable steps, teaching the model to ask the key questions accurately and sufficiently within a limited number of rounds rather than by stretching the dialogue ever longer.
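SPAR's actual formulation is not public. As a toy sketch of the two ingredients the article names, per-step credit relative to a reference trajectory plus a penalty for wasted rounds, one might write the following; the function name, the reference-trajectory scheme, and `miss_penalty` are all assumptions for illustration.

```python
# Hedged sketch of step-wise, relative-benchmark credit assignment.
def spar_advantages(step_scores, baseline_scores, miss_penalty=0.5):
    """
    Assign credit per consultation round relative to a reference trajectory,
    instead of rewarding only the final diagnosis.

    step_scores:     per-round scores of the model's trajectory, e.g. the
                     fraction of key history items elicited in that round
    baseline_scores: per-round scores of a reference (doctor) trajectory
    miss_penalty:    extra punishment for a round that gains no information
    """
    advantages = []
    for model_s, ref_s in zip(step_scores, baseline_scores):
        adv = model_s - ref_s      # relative benchmark: beat the reference step
        if model_s == 0.0:
            adv -= miss_penalty    # step-wise punishment for a wasted round
        advantages.append(adv)
    return advantages
```

Because every round carries its own signed advantage, a policy-gradient update can localize blame to the round where the history went unasked, rather than smearing one terminal reward across the whole dialogue.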
Third, Fact-aware RL. In medicine, the stronger a model's reasoning, the more prone it is to self-affirmation, and the more confident it is, the more dangerous it becomes once the factual basis is shaky. The traditional approach corrects this with external retrieval or rule systems; M3 instead makes low hallucination a direct optimization target of reinforcement learning, turning factual consistency into part of the model's own capability. Dynamic weight adjustment meanwhile prevents the model from degenerating into a conservative "say less, err less" mode, so strong reasoning and high reliability are achieved together.
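The reward design behind Fact-aware RL is likewise unpublished. A toy sketch of its two stated ideas, factuality as a direct reward term and a guard against over-conservatism, could look like this; the function name, `target_coverage`, and the 10-claim normalizer are assumptions, not the actual scheme.

```python
# Hedged sketch: factual consistency as a direct RL reward term,
# with a guard against the "say less, err less" degenerate policy.
def fact_aware_reward(task_reward, claims_supported, claims_total,
                      target_coverage=0.8, fact_weight=1.0):
    """
    task_reward:      base reward for clinical quality of the answer
    claims_supported: factual claims verified against the evidence base
    claims_total:     factual claims the answer makes in total
    """
    if claims_total == 0:
        # No claims at all: safe but useless, so the reward is dampened
        # instead of granted in full (anti-conservatism guard).
        return task_reward * (1.0 - target_coverage)
    hallucination_rate = 1.0 - claims_supported / claims_total
    # Dynamic-weighting idea: penalize hallucination harder as the model
    # grows more assertive (more claims per answer).
    weight = fact_weight * min(1.0, claims_total / 10.0)
    return task_reward - weight * hallucination_rate
```

Under this shaping, the highest reward goes to answers that are both substantive and fully supported; pure evasion and confident fabrication are each strictly worse.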
These three designs point to the same goal: Baichuan wants both capability and safety, strong reasoning and high reliability, refusing to trade one for the other and treating them as cooperative metrics within a single engineering system.
In this way, AI in medicine has truly crossed that watershed.
From health assistant to decision support
Once a model's capability forms the full closed loop of reasoning, not fabricating, and consulting, Baichuan's focus inevitably shifts: from showcasing the model itself to applying its capability in real medical scenarios.
This is why, from the outside, the product cadence of Bai Xiao Ying has visibly accelerated. Features have been added step by step, and a system framework capable of carrying medical workflows is taking shape. What the model needs is no longer a display window but a carrier that can store information, support long-term use, and connect to the real decision chain.
Seen this way, the difference between Baichuan's "serious medicine" and the industry's many "general health" products becomes especially clear.
Products such as A Fu and Xiao He Doctor are closer to health consultation, medical popular science, triage guidance, and emotional companionship; they address information asymmetry and patients' anxiety before seeking care.
Baichuan is trying to enter an entirely different link: doctors can use it to rehearse consultation and treatment reasoning, while patients and their families can use it to understand more systematically the medical logic behind diagnosis, treatment, examinations, and prognosis.
This is a decision-support path of high risk, high responsibility, and high value density. Here the model no longer merely offers reference information or comfort: every judgment it makes may affect the patient's next choice; every question it asks determines whether key information is fully collected; every conclusion it forms must be verifiable before it can enter the medical decision-making process.
The fundamental difference is that while most products in the industry still help users collect health information, Baichuan has chosen a harder, slower path with a higher ceiling.
Looking back at Baichuan's timeline of betting on medicine, the choice was a judgment made well in advance.
At the communication meeting, Wang Xiaochuan summarized his view of the industry's core pain points: a long-term shortage of high-quality doctors, and a sharp imbalance of medical services across regions and populations. The United States has a family-doctor system to absorb primary care; in China, patients crowd into top-tier hospitals, further straining quality medical resources. From long observation of these real contradictions, Baichuan set its goal from the start on solving the problems of medicine itself.
In 2023, at the hottest stage of the large model boom, Baichuan did not first enter tracks such as code, search, or content creation, where commercial value was easier to verify. Instead, it made medicine its core direction. The choice was unglamorous at the time: medical data is sensitive, scenarios are complex, responsibility boundaries are blurred, and product cycles are long, with little quick feedback. "I was questioned by many people in the industry back then," Wang Xiaochuan told us.
At the beginning of 2026, OpenAI launched ChatGPT Health and Anthropic officially launched Claude for Healthcare. Leading international model makers have entered medicine collectively, and companies worldwide now recognize medicine as a battleground for large models.
In this race, as the only domestic large model company focused on medicine, Baichuan has kept breaking through on core capabilities such as low hallucination rate, end-to-end consultation, and complex clinical reasoning. It has achieved a generational lead in the medical large model foundation, transforming from a follower into an industry leader and a definer of new paradigms, carrying the flag for Chinese AI in medicine.