Towards autonomous medical artificial intelligence agents

MIMIC-IV dataset

Dataset description

We develop a benchmark of 574 patients derived from the MIMIC-IV database, a publicly available, comprehensive repository of de-identified EHRs from approximately 300,000 patients who received care at Beth Israel Deaconess Medical Center in Boston, MA, USA, between 2008 and 2019, managed by the Massachusetts Institute of Technology (MIT). This database includes semi-structured clinical information related to hospital admissions, ranging from free-text notes such as discharge summaries or radiology reports to tabular information, including ICD-coded patient diagnoses, laboratory and microbiology results, vital parameters, pre-admission and in-hospital medications, and procedural records, such as surgical interventions. In this study, we concentrated on eight target diagnoses. The first four (appendicitis, cholecystitis, diverticulitis and pancreatitis) are abdominal pathologies. Our data preparation was adapted from a prior publication18 to ensure methodological consistency and enhance comparability across studies; we refer readers to that study for more details on the data preparation pipeline. The remaining four target pathologies comprise three internal medicine emergencies (pneumonia, urinary tract infection and pulmonary embolism) and one oncology-related condition, pancreatic cancer. These eight diagnoses were selected to reflect both the high-volume symptom burden that drives emergency department demand (for example, abdominal pain with around 13 million visits, cough with 5.97 million, shortness of breath with 5.89 million and fever with 5.83 million, all leading entry symptoms for our target conditions) and the frequency of the diagnoses themselves (for instance, urinary tract infection and pneumonia with around 1.66 and 1.2 million emergency department visits, respectively, in the USA in 2022), while retaining a small oncologic cohort (pancreatic cancer, 4% of cases in our dataset) to test infrequent, high-acuity presentations22.
A detailed data selection flowchart (CONSORT diagram) summarizing the workflow of our benchmark creation is provided in Extended Data Fig. 10, with word clouds of chief complaints shown in Supplementary Fig. 4. Next, we describe our dataset generation pipeline.

Benchmarking dataset curation

We first restrict our dataset to hospital admissions related to the eight target pathologies under investigation. This process starts with identifying hospital stays in which the relevant ICD-9 or ICD-10 code is recorded as the primary (principal) diagnosis in the diagnosis table, matches the diagnoses extracted from the discharge letter and is accompanied by a corresponding admission diagnosis from the emergency department. This ensures the exclusion of cases in which patients were hospitalized under a completely different initial working diagnosis due to incomplete or pending diagnostic information, or in which signs of the disease only occurred during hospitalization (for example, nosocomial urinary tract infections). Because we restrict decision-making to laboratory and microbiology data from the first 24 h after admission to simulate the first encounter at the emergency department, this filter also removes cases in which the correct diagnosis could only be made later in the hospital stay. Then, for each sample, we extract the patient’s history of present illness (HPI) and the documented findings from the admission physical examination from the discharge summaries using regular expressions. Admission medication is identified either through regular expressions applied to the free-text sections of the discharge notes or, as a fallback, by aggregating entries from the medication reconciliation table associated with the emergency department visit. Subsequently, laboratory and microbiology data are extracted by selecting the earliest available result for each unique parameter recorded within the first 24 h following admission. In cases where a patient had a prior encounter within 24 h before admission (for example, an initial visit to the emergency department without inpatient admission followed by a revisit the next day), the initial encounter from the preceding day is considered the earliest time point.
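The rule of keeping the earliest available result per parameter within the first 24 h can be illustrated with a minimal sketch. The row layout (`label`, `charttime`, `value`), the function name and the example values are assumptions made for illustration and are not taken from the study's code; the revisit rule for encounters on the preceding day is not modeled here.

```python
from datetime import datetime, timedelta


def earliest_results(rows, admit_time, window_hours=24):
    """Keep the earliest row per lab parameter within the window after admission.

    `rows` is a list of dicts with hypothetical keys "label", "charttime"
    and "value"; the real tables may use different column names.
    """
    window_end = admit_time + timedelta(hours=window_hours)
    earliest = {}
    for row in rows:
        t = row["charttime"]
        if not (admit_time <= t <= window_end):
            continue  # outside the first-encounter window
        key = row["label"]
        if key not in earliest or t < earliest[key]["charttime"]:
            earliest[key] = row
    return earliest


# Made-up example rows (dates are arbitrary).
admit = datetime(2150, 1, 1, 7, 30)
rows = [
    {"label": "White Blood Cells", "charttime": datetime(2150, 1, 1, 14, 0), "value": 11.0},
    {"label": "White Blood Cells", "charttime": datetime(2150, 1, 1, 8, 0), "value": 14.2},
    {"label": "Lipase", "charttime": datetime(2150, 1, 1, 9, 0), "value": 32.0},
    {"label": "C-Reactive Protein", "charttime": datetime(2150, 1, 2, 13, 30), "value": 80.0},
]
first = earliest_results(rows, admit)
```

In this example the repeated white blood cell measurement collapses to the 08:00 value, and the C-reactive protein result is dropped because it was charted more than 24 h after admission.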
Laboratory events are mapped to standardized clinical code systems by using the label column of the laboratory events table to associate each test with its corresponding unique identifier in the Observational Medical Outcomes Partnership (OMOP) concept codes23. Laboratory results are then converted into an LLM-readable format by grouping label names, results and reference ranges while maintaining the tabular structure. Similarly, for microbiology data, a unique entry is created for each identified organism, aggregating rows that include antibiotic susceptibility information for that organism into a structured, LLM-readable representation. In cases in which laboratory values or microbiology tests were measured multiple times within the first 24 h, only the initial recorded measurement is considered for downstream analysis. For radiology data, we adhere to the same temporal conventions as described above and extract imaging modalities and anatomical regions from a predefined set of keywords, as outlined previously18. For procedures, there are more than 80,000 possible ICD-9 and ICD-10 codable entries, covering medical interventions such as surgeries and any other clinical action that can be requested and documented within EHR systems. Recent research has demonstrated that LLMs can excel at generating ICD codes when equipped with tools such as retrieval-augmented generation (RAG)10, which enables the model to search a database of relevant codes using natural language queries and provides contextual information, such as a list of candidate ICD codes that match the request, to improve accuracy. Although our work does not primarily focus on medical coding tasks, we leverage the idea of RAG to develop a searchable index of available procedures.
Specifically, we generate embeddings of the full ICD-9 and ICD-10 procedure descriptions using jina-embeddings-v324 from Jina AI and store these embeddings, along with metadata (including the ICD code and original procedure title), in a local Qdrant25 index for efficient retrieval. We then remove any mention of the diagnosis within the reports using placeholders (three underscores), in accordance with the redaction conventions of the MIMIC-IV dataset. Additionally, cases are excluded if imaging data are incomplete, specifically when either the modality or anatomical region could not be identified, or if required imaging studies were unavailable (for example, absence of chest imaging for pneumonia or abdominal imaging for appendicitis). Cases lacking essential clinical information, such as a documented HPI, physical examination findings or blood test results, are also excluded. For pancreatic cancer cases, which often include critical clinical information from previous hospital visits and external referrals, we generated structured patient summaries from complete discharge letters. During manual review, we observed that such information, for example radiology findings (such as identification of a pancreatic mass on CT) or histopathological results after endoscopic retrograde cholangiopancreatography (ERCP), was frequently referenced in the free-text sections of the discharge summaries but not systematically captured in the structured fields of the dataset. This step ensured that the agent could access essential details otherwise unavailable from the current HPI or the structured dataset, such as planned admissions for Whipple surgery or histologic confirmation of diagnosis prior to presentation.
To accomplish this, a language model with the instructions shown in Supplementary Information 15 was used to systematically extract structured information on prior imaging studies, ERCP findings and biopsy histopathology results, as well as whether the diagnosis was already confirmed and whether there were any documented reasons for planned admissions. This additional curation was performed only for pancreatic cancer cases.
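The searchable procedure index described above can be illustrated with a toy, self-contained sketch. It is not the study's implementation: a bag-of-words vector stands in for jina-embeddings-v3, and an in-memory list stands in for the Qdrant index, so only exact word overlap is matched rather than semantic similarity. The class and function names are hypothetical, and the four procedure entries are included for illustration only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lower-cased token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProcedureIndex:
    """In-memory stand-in for a vector store holding vectors plus metadata."""

    def __init__(self):
        self._points = []

    def add(self, code, title):
        # Store the vector together with the ICD code and original title.
        self._points.append({"vector": embed(title),
                             "payload": {"code": code, "title": title}})

    def search(self, query, limit=3):
        # Rank all stored procedures by similarity to the free-text query.
        q = embed(query)
        scored = [(cosine(q, p["vector"]), p["payload"]) for p in self._points]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


index = ProcedureIndex()
index.add("0DTJ4ZZ", "Resection of Appendix, Percutaneous Endoscopic Approach")
index.add("0FT44ZZ", "Resection of Gallbladder, Percutaneous Endoscopic Approach")
index.add("47.01", "Laparoscopic appendectomy")
index.add("51.23", "Laparoscopic cholecystectomy")

hits = index.search("resection of appendix", limit=2)
```

With a real embedding model, a query such as "keyhole removal of the appendix" would be expected to retrieve both appendectomy entries; the toy vectors here cannot do that, which is the main reason to use learned embeddings for this index.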

From the preliminary dataset, 600 cases were randomly selected for manual review by two experienced physicians who independently evaluated each case against a minimal set of requirements extracted from relevant medical guidelines26,27,28,29,30,31,32, which can be found in Extended Data Fig. 10. This evaluation considered all available clinical information including patient history, laboratory and urine results, radiology and microbiology findings, physical examination results, procedures, and both hospital and pre-admission medications. Cases were only excluded if reviewers agreed that a diagnosis was not possible based on the available data. For instance, in the case of pneumonia, the presence of chest imaging by either CT scan or X-ray is a minimal requirement as per medical guidelines; cases lacking this due to missing external imaging reports (for example, for transferred patients where an outside CXR report may have been available on paper at the bedside but not stored in the destination hospital infrastructure) were excluded, as these data were never recorded in the EHR and thus unavailable for evaluation. Importantly, these exclusions were not applied to remove diagnostically ambiguous or difficult emergency presentations, but only to remove encounters that were considered not evaluable in our experiments because crucial information that was available in reality was absent from the dataset extract. Following this process, 26 cases were excluded (1 pancreatitis, 1 appendicitis, 3 urinary tract infections, 6 pneumonias, 15 cholecystitis), resulting in a final benchmark dataset of 574 cases. Notably, physicians did not disagree with the ‘ground truth’ of those cases, but agreed that relevant information was missing. Further details are presented in Extended Data Fig. 10.
As an additional safety step, a board-certified physician independently reviewed a random subset of cases (n = 90), evaluating the complete ground-truth data later available to MIRA and to the physician study (history, examination, laboratories, imaging, microbiology, procedures and medications) to confirm the diagnosis from the underlying data, with all 90 cases (100%) judged as correct.

LLM agent pipeline

We developed a multi-turn conversation framework featuring two AI agents: a patient agent and a physician agent (MIRA). The patient agent simulates a real patient, providing responses solely based on a real patient’s clinical history from the MIMIC-IV dataset without external tool access. By contrast, MIRA, akin to a physician using hospital software, can call specialized ‘tools’ to request additional information about the patient, such as laboratory values or radiology images. Each tool requires populating standardized FHIR parameters such as selected laboratory test value codes or imaging modality, body region and clinical information about the patient. The request is then forwarded to a sandboxed EHR server, where FHIR-compatible observations with data grounded in the real-world MIMIC-IV dataset are generated. The returned FHIR resources are fed back into MIRA’s conversation context for subsequent decision-making. Supplementary Fig. 1 illustrates an example of two tool calls, one for laboratory values and another for imaging requests. These tool calls can occur in parallel, allowing multiple requests to be initiated during a single agent turn within the conversation. Further details regarding the implementation and workflow of these tool calls are provided in the subsequent sections.
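The tool-call loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the tool names, the argument names, the lab code identifiers and the sandbox contents are all hypothetical, the sandbox is a plain dictionary rather than a FHIR server, and the calls issued in one turn are executed sequentially rather than in parallel.

```python
import json

# Hypothetical sandbox contents for one patient. In the paper these values are
# grounded in MIMIC-IV; every code, name and value below is made up.
SANDBOX = {
    "labs": {
        "lab-wbc": {"display": "White blood cells", "value": 14.2, "unit": "K/uL"},
        "lab-lipase": {"display": "Lipase", "value": 32.0, "unit": "IU/L"},
    },
    "imaging": {("CT", "abdomen"): "Placeholder radiology report text."},
}


def request_labs(patient_id, codes):
    """Return one FHIR-style Observation per requested code that has a result."""
    observations = []
    for code in codes:
        lab = SANDBOX["labs"].get(code)
        if lab is None:
            continue  # nothing recorded for this code
        observations.append({
            "resourceType": "Observation",
            "status": "final",
            "code": {"coding": [{"code": code, "display": lab["display"]}]},
            "subject": {"reference": f"Patient/{patient_id}"},
            "valueQuantity": {"value": lab["value"], "unit": lab["unit"]},
        })
    return observations


def request_imaging(patient_id, modality, body_region, clinical_info):
    """Return a FHIR-style DiagnosticReport; clinical_info is accepted but unused here."""
    report = SANDBOX["imaging"].get((modality, body_region))
    return {
        "resourceType": "DiagnosticReport",
        "status": "final" if report else "unknown",
        "code": {"text": f"{modality} {body_region}"},
        "subject": {"reference": f"Patient/{patient_id}"},
        "conclusion": report or "No matching study available.",
    }


TOOLS = {"request_labs": request_labs, "request_imaging": request_imaging}


def run_turn(patient_id, tool_calls):
    """Execute every tool call issued in one agent turn and build context messages."""
    messages = []
    for call in tool_calls:
        result = TOOLS[call["name"]](patient_id, **call["arguments"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": json.dumps(result)})
    return messages


messages = run_turn("example-1", [
    {"name": "request_labs", "arguments": {"codes": ["lab-wbc", "lab-lipase"]}},
    {"name": "request_imaging", "arguments": {
        "modality": "CT", "body_region": "abdomen",
        "clinical_info": "right lower quadrant pain"}},
])
```

Each returned message carries the serialized resources, which is the form in which they would be appended to the physician agent's conversation context before its next turn.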

HL7 FHIR is a widely recognized, standards-based framework designed to enable consistent and interoperable exchange of electronic health information. We adopted FHIR as the communication backbone between MIRA and the EHR, which facilitates the submission of diagnostic or therapeutic requests to a server and the receipt of FHIR observations in response. The server ran locally as a HAPI-FHIR instance33 in Docker (https://www.docker.com/). Resource creation, updates, and retrieval were performed using standard FHIR operations. Core FHIR entities were generated using the open-source fhir.resources34 package. These included an ‘organization’ resource to represent the AI-enabled healthcare facility and a ‘practitioner’ resource to denote a physician entity (MIRA). ‘Synthetic patient’ resources were created from the MIMIC-IV dataset and uploaded to the server dynamically during the runtime of the AI simulation between the patient agent and MIRA, while the practitioner and organization resources remained constant throughout the simulation. Patient resources were created with gender and age derived from MIMIC-IV, with birth dates synthesized using the anchor year of the patient information.
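As a rough illustration of how a synthetic patient resource might be assembled, the sketch below builds FHIR Patient JSON as a plain dictionary rather than through the fhir.resources package. It assumes that the birth year is computed as anchor year minus anchor age and that the birth date is pinned to 1 January; neither detail is stated in the text, and the function name and example values are hypothetical.

```python
import json

# FHIR administrative gender codes for the single-letter values used in MIMIC-IV.
GENDER_MAP = {"M": "male", "F": "female"}


def build_patient(subject_id, gender, anchor_age, anchor_year):
    """Build a minimal FHIR Patient resource as a JSON-serializable dict.

    Assumption: birth year = anchor_year - anchor_age, with the day and month
    fixed to 1 January because the exact date is not available.
    """
    birth_year = anchor_year - anchor_age
    return {
        "resourceType": "Patient",
        "identifier": [{"value": str(subject_id)}],
        "gender": GENDER_MAP.get(gender, "unknown"),
        "birthDate": f"{birth_year:04d}-01-01",
    }


# Static resources that stay the same for every simulated encounter.
organization = {"resourceType": "Organization", "name": "AI-enabled healthcare facility"}
practitioner = {"resourceType": "Practitioner", "name": [{"family": "MIRA"}]}

# Made-up example values; MIMIC-IV anchor years are date-shifted into the future.
patient = build_patient(subject_id=12345, gender="F", anchor_age=52, anchor_year=2150)
payload = json.dumps(patient)
```

The resulting payload is what would be sent to the server's Patient endpoint with a standard FHIR create operation at the start of each simulated encounter.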

Medical coding systems
