Built on the long-context capabilities of Gemini 1.5, Google's conversational medical system AMIE matches general practitioners' management reasoning across 100 multi-visit scenarios
A recent study by Google DeepMind and Google Research extends the conversational medical system AMIE into a new LLM-based agent system that can carry out clinical management and improve doctor-patient conversations across multiple follow-up visits. AMIE draws on the long-context capabilities of the Gemini model: by combining in-context retrieval with structured reasoning, it keeps its output consistent with current clinical practice guidelines and drug formularies.
Large language models are rapidly entering healthcare. Their applications have extended from literature retrieval and medical record generation to clinical decision support. Among these, diagnostic assistance is one of the more mature directions: models fine-tuned for medicine can produce high-quality differential diagnoses from medical history, physical signs, and examination results, and systems with multi-turn dialogue capabilities can fill in missing history through consultation-style questioning.
However, diagnosis is only the starting point of clinical decision-making. What truly affects the quality of treatment is often the management decisions made after diagnosis: whether further examinations are needed, how to choose a treatment plan, when to adjust medication, how to schedule follow-ups, and how to keep revising the plan as the patient's condition changes. This kind of "management reasoning" is closer to the core of real clinical work. It also places greater demands on the model's combined understanding of evidence-based guidelines, clinical pathways, drug knowledge, and individual patient differences.
Management reasoning is also harder to evaluate than diagnostic reasoning. Diagnostic problems usually have relatively clear reference answers, while management decisions often have no single solution and depend on medical resources, guideline systems, drug availability, and the doctor's experience. In medical education, the main way to assess this kind of integrated ability is still the Objective Structured Clinical Examination (OSCE). However, it relies on interaction with real people and on expert scoring, which makes it hard to use directly for automated evaluation of large language models.
To address this gap, a recent study by Google DeepMind and Google Research extended the conversational medical system AMIE into a new LLM-based agent system. The system can carry out clinical management and improve doctor-patient conversations across multiple follow-up visits. AMIE draws on the long-context capabilities of the Gemini model: by combining in-context retrieval with structured reasoning, it keeps its output consistent with current clinical practice guidelines and drug formularies.
In a randomized, double-blind virtual Objective Structured Clinical Examination (OSCE) study, researchers compared AMIE with 21 primary care physicians (PCPs). The test covered 100 multi-visit case scenarios, and the case design drew on the UK NICE guidelines and BMJ Best Practice. The results showed that, on disease management reasoning as rated by specialist physicians, AMIE's performance was non-inferior to that of the human doctors. On the accuracy of treatment and examination recommendations, compliance with clinical guidelines, and the reliability of the knowledge basis, AMIE scored higher than the physician group.
The relevant research results, titled "Towards Conversational AI for Disease Management", have been published in Nature.
Research Highlights:
* This study extends the conversational medical system AMIE from single-visit diagnosis to end-to-end clinical management reasoning, covering the longitudinal course of disease, multi-visit decision-making, treatment response feedback, and drug prescribing.
* The system uses the long-context capabilities of Gemini, combining in-context retrieval with structured reasoning so that its management plans stay closely aligned with authoritative clinical sources such as the NICE guidelines and BMJ Best Practice.
* On multiple indicators, including the overall appropriateness of the plan, the quality of treatment recommendations, and the accuracy of examination recommendations, the system's performance matched or exceeded that of general practitioners.
View the paper: https://www.nature.com/articles/s41586-026-10764-5
Dataset: From Single Question-Answer to Longitudinal Clinical Scenarios
To evaluate how conversational medical AI performs at long-term management reasoning, the research team built a multi-level data system. It covers multi-visit clinical scenarios and incorporates evidence-based guidelines and drug knowledge for model training, plan generation, and standardized evaluation.
The core evaluation vehicle is a multi-visit virtual OSCE scenario dataset. The study compiled 100 independent cases, evenly distributed across five specialties: cardiology, pulmonology, obstetrics/gynecology and urology, gastroenterology, and neurology/musculoskeletal, with 20 cases in each. All cases were jointly designed by clinicians from Canada and India and built with reference to the diagnostic and treatment pathways in the NICE clinical guidelines and BMJ Best Practice.
Unlike common single-turn medical Q&A, each case is designed around three consecutive visits. A scenario includes not only the patient's initial chief complaint but also longitudinal information such as symptom evolution, treatment response, and returned test results, so as to reproduce as closely as possible the real decision-making rhythm of chronic disease management and follow-up of complex cases. To raise the clinical difficulty, some cases also include inconsistent information and multi-system comorbidities, which test the system's judgment in non-standard situations. In addition to the 100 formal evaluation cases, the study set aside 20 validation scenarios for pilot experiments and score calibration.
The evidence base is a clinical guideline knowledge base. It contains 627 documents in total (527 NICE guidelines and 100 BMJ Best Practice documents), amounting to about 10.5 million tokens. The content covers diagnostic criteria, examination pathways, treatment plans, and follow-up standards. During the evaluation, this knowledge base was available to both the AI system and the participating general practitioners, which simulates consulting guideline material in real clinical practice and keeps the human-machine comparison as fair as possible.
Drug decisions are an indispensable part of management reasoning. The research team therefore also built a dedicated benchmark, RxQA, to evaluate the model's understanding of drug labeling, indications, contraindications, dosages, and medication risks. The benchmark contains 600 multiple-choice questions sourced from drug labels in the US OpenFDA and from the British National Formulary (BNF). The questions fall into two types: basic short questions and long-scenario comprehensive questions. Initial drafts were generated by the Gemini model from the source labeling and then reviewed, revised, and rated for difficulty by 8 practicing pharmacists from the two countries. Because of licensing restrictions, only the 300 questions sourced from OpenFDA are currently public, providing a standardized reference for comparing drug reasoning abilities.
AMIE Model: Combining "Dialogue Ability" with "In-depth Management Ability"
This study builds on Google's previously proposed conversational medical system AMIE and upgrades it specifically for management reasoning. The new system adopts a dual-agent collaborative architecture whose design borrows from the "dual-process theory" in cognitive science: one agent handles fast, continuous doctor-patient dialogue, and the other handles slower but deeper management reasoning. Both agents use Gemini 1.5 Flash as the underlying model to balance real-time response speed with long-context reasoning ability.
Specifically, the system consists of a Dialogue Agent and a Management Reasoning Agent (Mx Agent). The Dialogue Agent is closer to "System 1": it communicates with patients in real time, takes the medical history, explains plans, and maintains the patient state during the dialogue. The Mx Agent is closer to "System 2": it generates structured, traceable management plans from the complete course of the disease and the clinical guidelines. The two agents synchronize information through a shared state module, and the Dialogue Agent can call on the Mx Agent's reasoning results at any time, so that medical advice stays grounded in guidelines while the conversation remains natural.
System architecture of the AMIE model
As the interaction hub, the Dialogue Agent has been upgraded in three respects compared with the original diagnostic model. First, the base model is replaced with Gemini 1.5 Flash, whose long-context capabilities let it handle longer medical records and multi-turn dialogue information. Second, simulated multi-visit dialogues are added to the training data to strengthen the system's understanding of disease progression and long-term management. Third, after supervised fine-tuning, the study adds reinforcement learning from human feedback and AI feedback to improve dialogue quality and decision-making performance.
At inference time, the Dialogue Agent follows a three-step "plan, generate, refine" process: it first plans the focus of the next question or response based on the current state, then generates a natural-language reply for the patient, and finally checks and corrects its own output. To support continuous management across visits, it also maintains a modular state structure, including the patient summary, differential diagnoses, the current management plan, and other information, and keeps updating it in the background so that no dialogue has to start from scratch.
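The turn-level loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the `PatientState` fields, the `respond` function, the prompt wording, and the `llm` callable are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PatientState:
    """Hypothetical modular state carried across turns and visits."""
    patient_summary: str = ""
    differential_diagnoses: List[str] = field(default_factory=list)
    management_plan: str = ""
    visit_number: int = 1


def respond(state: PatientState, patient_message: str,
            llm: Callable[[str], str]) -> str:
    """Run one dialogue turn as three model calls: plan, generate, refine."""
    context = (f"Visit {state.visit_number}. "
               f"Summary: {state.patient_summary}. "
               f"Current plan: {state.management_plan}.")
    # Step 1: decide what the next question or explanation should focus on.
    plan = llm(f"PLAN the next step. {context} Patient said: {patient_message}")
    # Step 2: turn that plan into a natural-language reply.
    draft = llm(f"GENERATE a reply that follows this plan: {plan}")
    # Step 3: self-check the draft and correct it.
    final = llm(f"REFINE this reply after checking it for errors: {draft}")
    # Keep a running summary so the next turn does not start from scratch.
    state.patient_summary = f"{state.patient_summary} {patient_message}".strip()
    return final
```

Passing the model in as a plain callable keeps the sketch independent of any particular API; a real system would also update the differential diagnoses and the management plan, which are left out here.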
The Mx Agent is the core module responsible for in-depth management reasoning. It makes full use of the long-context capabilities of Gemini 1.5 Flash and adopts a "coarse retrieval + full-context reasoning" strategy to reduce the information fragmentation that traditional chunk-based retrieval can cause. The system first indexes all guideline documents with the Gecko 1B embedding model, then generates a natural-language query from the current patient case and selects about 6 highly relevant complete documents from the guideline library, totaling about 256,000 tokens. It then feeds the full text of these guidelines together with the patient's complete medical history into the model, so that the model can reason across documents and across visits in a single call.
To make the output easier to use and audit, the Mx Agent uses JSON mode to constrain generation and structures the output as "analyze the clinical situation, define management goals, formulate management steps and cite the guideline source". Each recommendation must carry the corresponding guideline citation. The system also first generates 4 management drafts independently and then merges and refines them against the original guideline text to improve the completeness and suitability of the final plan.
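The coarse retrieval step can be illustrated with a short sketch: rank whole documents by embedding similarity and keep the best few complete ones. The function names, the document dictionary layout, and the use of the token figure as a hard budget are assumptions made for illustration; the embeddings are toy vectors rather than output from a real embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_guidelines(query_vec: Sequence[float], docs: List[Dict],
                      max_docs: int = 6,
                      token_budget: int = 256_000) -> List[str]:
    """Pick whole documents, most similar first, without chunking them.

    Each doc is a dict with 'id', 'vec' (embedding), and 'tokens' (length).
    A document that would exceed the remaining budget is skipped entirely
    rather than truncated (an assumption of this sketch).
    """
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]),
                    reverse=True)
    chosen: List[str] = []
    used = 0
    for doc in ranked:
        if len(chosen) == max_docs:
            break
        if used + doc["tokens"] > token_budget:
            continue
        chosen.append(doc["id"])
        used += doc["tokens"]
    return chosen
```

Selecting complete documents instead of passages is what allows the later reasoning step to see each guideline in full, at the cost of a much larger prompt.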
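The requirement that every recommendation carries a citation lends itself to a simple automated check. The sketch below assumes a hypothetical JSON layout; the key names `clinical_analysis`, `management_goals`, `management_steps`, `action`, and `citation` are illustrative and are not the schema used in the study.

```python
import json
from typing import Any, Dict

# Hypothetical section names mirroring the three-part framework in the text.
REQUIRED_KEYS = ("clinical_analysis", "management_goals", "management_steps")


def validate_plan(raw: str) -> Dict[str, Any]:
    """Parse a JSON management plan and reject it if it is not auditable."""
    plan = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in plan]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    # Every management step must name the guideline it is based on.
    uncited = [step.get("action", "?")
               for step in plan["management_steps"]
               if not step.get("citation")]
    if uncited:
        raise ValueError(f"steps without a guideline citation: {uncited}")
    return plan


example = json.dumps({
    "clinical_analysis": "Placeholder summary of the case.",
    "management_goals": ["Placeholder goal"],
    "management_steps": [
        {"action": "Placeholder step", "citation": "GUIDELINE-DOC-1 (placeholder)"},
    ],
})
plan = validate_plan(example)
```

A check like this could be run on each of the independent drafts before they are merged, so that uncited recommendations are caught early.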
Not Inferior to General Practitioners in 15 Indicators
To verify the clinical management reasoning ability of the upgraded system, this study adopted a randomized, blind virtual OSCE framework and combined it with the RxQA drug benchmark test. The AMIE system was compared with 21 general practitioners. The overall evaluation was carried out around three dimensions: the overall quality of the management plan, the quality of investigation recommendations, and the quality of treatment recommendations.
In the clinical evaluation, both the system and the general practitioners completed the 100 multi-visit cases. Thirty specialist physicians and the standardized patients scored the results blindly, from the perspectives of professional quality and care experience respectively. That is, the raters did not know whether a plan came from the AI system or from a human doctor, which minimizes the influence of identity bias on the results. The drug test was run in both closed-book and open-book settings to see whether external materials would change the performance of the system and the doctors.
The results showed that, on the overall quality of the management plan, the system was non-inferior to general practitioners on all 15 evaluation dimensions and showed a statistical advantage on several of them. Taking the overall appropriateness of the plan as an example, the system scored 95%, 96%, and 98% across the three visits, versus 72%, 80%, and 81% for general practitioners. On the rate of appropriate treatment recommendations, the system scored 87%, 90%, and 94%, again higher than the 66%, 62%, and 71% of general practitioners.
The system also showed a sustained advantage in the accuracy of examination and treatment recommendations. Its accuracy on treatment recommendations stayed above 95%, while that of general practitioners ranged from 62% to 67%. On guideline compliance, because each of the system's recommendations requires an explicit citation, its traceability is markedly better than that of human doctors. This result suggests that long-context reasoning, combined with the mechanism for integrating the original guideline text, may help improve the model's stability and interpretability in complex management tasks.
Quality of the management plan
In the dual-perspective preference evaluation, the study covered 10 core dimensions of management reasoning, forming 51 comparisons in total. In nearly half of the cases, specialist physicians and patients judged the two sides to be comparable; where there was a clear preference, the system was preferred in 47% of cases, versus 7% for general practitioners. Notably, the ratings of specialist physicians and patients followed similar trends, indicating that the system's advantages show up not only in professional judgment but also in dimensions related to the patient experience.
As the number of visits increases, the system's advantages in time-related dimensions such as ongoing monitoring, the consultation process, and the doctor-patient relationship become more pronounced. This matches the original motivation of the study: the difficulty of management reasoning lies not in whether a single answer is correct but in whether changes in the patient's condition, treatment feedback, and the next-step plan can be connected continuously.
Visual display of the preference proportion of 51 independent dimensions
On drug reasoning, the RxQA benchmark shows that the system outperforms general practitioners on questions that pharmacists rated as high difficulty. In the closed-book setting, the system's accuracy was 50.6% versus 41.5% for general practitioners; in the open-book setting, it was 57.9% versus 47.8%. On low-difficulty questions, the difference between the two sides was not significant. Open-book materials help both the system and the doctors, especially on low-difficulty questions, where the improvement exceeds 20 percentage points; on high-difficulty questions, the improvement is smaller but still statistically significant. This suggests that the model has some relative advantage in complex tasks that require integrating drug information, but that external materials alone cannot fully solve high-difficulty drug reasoning problems.
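The size of these differences is easier to compare in percentage points. The short calculation below uses only the four accuracy figures quoted above; the dictionary layout and function names are illustrative.

```python
# Accuracy figures (percent) as quoted in the text above.
ACCURACY = {
    "closed_book": {"system": 50.6, "gp": 41.5},
    "open_book": {"system": 57.9, "gp": 47.8},
}


def gap(setting: str) -> float:
    """System-minus-GP difference in percentage points for one setting."""
    scores = ACCURACY[setting]
    return round(scores["system"] - scores["gp"], 1)


def open_book_gain(who: str) -> float:
    """Gain in percentage points from closed-book to open-book."""
    return round(ACCURACY["open_book"][who] - ACCURACY["closed_book"][who], 1)


for setting in ACCURACY:
    print(setting, "system vs GP gap:", gap(setting))
for who in ("system", "gp"):
    print(who, "open-book gain:", open_book_gain(who))
```

The system leads by 9.1 points closed-book and 10.1 points open-book, while access to reference material adds 7.3 points for the system and 6.3 points for general practitioners, both well below the gain of more than 20 points reported for low-difficulty questions.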
Accuracy of RxQA drug reasoning
Conclusion
The value of this study does not lie in proving that large medical models can already replace doctors, but in shifting the focus of evaluation from "whether they can diagnose"