Test how well an AI model performs as a doctor in realistic multi-turn patient conversations across 17 medical specialties.
Compare different AI models on clinical reasoning skills beyond standard medical knowledge quiz benchmarks.
Run a quick 100-case experiment to evaluate a new medical AI model before committing to the full benchmark.
Identify specific clinical competency gaps in an AI model using the structured six-area rubric.
Requires a Python environment, an API key for the model under test, and the scenario data, which is downloaded from Hugging Face. A 100-case subset is available for faster first runs.
MedSP1000 is a research benchmark for testing how well AI language models handle interactive medical consultations. It was created by researchers at Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory.
The benchmark is based on standardized patient cases, a method medical schools use to train students: trained actors play patients with specific conditions so that students can practice taking histories, ordering tests, and making treatment decisions in a controlled setting. This project converts that approach into a format that AI models can be tested against.
The dataset contains 1,638 interactive clinical cases spanning 17 medical specialties. In each case, an AI model plays the role of a clinician and holds a multi-turn conversation with a patient simulation agent. A separate environment controller handles things like lab results and clinical state changes. After the encounter ends, an evaluator agent scores the clinician AI's actions against a structured rubric developed from the original medical education materials. The rubric covers six competency areas defined by the accrediting body for US medical training, including patient care, medical knowledge, and communication.
The central finding from the paper is that AI models that score well on standard medical knowledge tests do not automatically perform well in interactive clinical scenarios. The best model tested completed only about 60 percent of the expert-defined rubric items. The best model specifically designed for medical tasks completed only 40 percent. Giving models more time to think before responding produced no measurable improvement.
To run the benchmark yourself, you set up a Python environment, provide an API key for the AI model you want to test, download the scenario data from Hugging Face, and run a shell script that executes the simulated encounters. A smaller subset of 100 pre-validated scenarios is available for quicker experiments.
Results for each case are written to a JSON file showing which rubric items passed and which did not. The full scenario data and rubric files are publicly available.
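As a rough illustration of how a per-case result could be turned into a rubric completion rate, here is a minimal Python sketch. The JSON layout (`case_id`, a `rubric` list with `item`, `competency`, and `passed` fields) and the function names are assumptions made for this example, not the repo's actual schema, so check the real output files before adapting it.

```python
import json
from collections import defaultdict

# ASSUMED schema: hypothetical field names for illustration only,
# not taken from the MedSP1000 repository.
SAMPLE_RESULT = json.loads("""
{
  "case_id": "demo-001",
  "rubric": [
    {"item": "Asks about symptom onset", "competency": "patient_care", "passed": true},
    {"item": "Orders appropriate labs", "competency": "patient_care", "passed": false},
    {"item": "Names likely diagnosis", "competency": "medical_knowledge", "passed": true},
    {"item": "Explains plan in plain language", "competency": "communication", "passed": false}
  ]
}
""")


def overall_completion(result):
    """Fraction of rubric items marked as passed for one case."""
    items = result["rubric"]
    if not items:
        return 0.0
    return sum(1 for entry in items if entry["passed"]) / len(items)


def completion_by_competency(result):
    """Fraction of passed rubric items, grouped by competency area."""
    counts = defaultdict(lambda: [0, 0])  # competency -> [passed, total]
    for entry in result["rubric"]:
        tally = counts[entry["competency"]]
        if entry["passed"]:
            tally[0] += 1
        tally[1] += 1
    return {name: passed / total for name, (passed, total) in counts.items()}


if __name__ == "__main__":
    print(overall_completion(SAMPLE_RESULT))
    print(completion_by_competency(SAMPLE_RESULT))
```

Grouping by competency area is what would surface gaps such as a model that does well on medical knowledge but poorly on communication. To use it on a real file, replace `SAMPLE_RESULT` with `json.load(open(path))` and adjust the field names to match the actual output.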
Verify against the repo before relying on details.