magic-ai4med/medsp1000

★ 13 | Python | Audience · researcher | Complexity · 4/5 | Setup · moderate

Mindmap

mindmap
  root((repo))
    Dataset
      1638 clinical cases
      17 specialties
      100 case subset
    Evaluation
      Six competency areas
      Rubric scoring
      JSON results output
    Benchmark Setup
      Python environment
      Hugging Face data
      Shell script runner
    AI Roles
      Clinician model
      Patient sim agent
      Environment controller
    Key Findings
      60 percent top score
      Thinking time no help
      Medical AI gap


Things people build with this

USE CASE 1

Test how well an AI model performs as a doctor in realistic multi-turn patient conversations across 17 medical specialties.

USE CASE 2

Compare different AI models on clinical reasoning skills beyond standard medical knowledge quiz benchmarks.

USE CASE 3

Run a quick 100-case experiment to evaluate a new medical AI model before committing to the full benchmark.

USE CASE 4

Identify specific clinical competency gaps in an AI model using the structured six-area rubric.

Tech stack

Python, Shell scripts, Hugging Face, JSON, LLM APIs

Getting it running

Difficulty · moderate | Time to first run · 1h+

Requires a Python environment, an API key for the model under test, and downloading scenario data from Hugging Face. A 100-case subset is available for faster first runs.

The license is not stated in this summary.

In plain English

MedSP1000 is a research benchmark for testing how well AI language models handle interactive medical consultations. It was created by researchers at Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory. The benchmark is based on standardized patient cases, a method medical schools use to train students: trained actors play patients with specific conditions so students can practice taking histories, ordering tests, and making treatment decisions in a controlled setting. This project converts that approach into a format that AI models can be tested against.

The dataset contains 1,638 interactive clinical cases spanning 17 medical specialties. In each case, an AI model plays the role of a clinician and has a multi-turn conversation with a patient simulation agent. A separate environment controller handles things like lab results and clinical state changes. After the encounter ends, an evaluator agent scores the clinician AI's actions against a structured rubric developed from the original medical education materials. The rubric covers six competency areas defined by the accrediting body for US medical training, including patient care, medical knowledge, and communication.

The central finding from the paper is that AI models which score well on standard medical knowledge tests do not automatically perform well in interactive clinical scenarios. The best model tested completed only about 60 percent of the expert-defined rubric items. The best model specifically designed for medical tasks completed only 40 percent. Giving models more time to think before responding produced no measurable improvement.

To run the benchmark yourself, you set up a Python environment, provide an API key for the AI model you want to test, download the scenario data from Hugging Face, and run a shell script that executes the simulated encounters. A smaller subset of 100 pre-validated scenarios is available for quicker experiments. Results for each case are written to a JSON file showing which rubric items passed and which did not. The full scenario data and rubric files are publicly available.
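Because each case result is a JSON file recording which rubric items passed, a small script can turn those files into a completion rate per case and a pass rate per competency area. The sketch below is illustrative only. The key names `rubric`, `competency`, and `passed`, and the idea of one file per case in a single directory, are assumptions made for this example and not the repo's documented schema. Check an actual output file and adjust the keys before using it.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_results(results_dir):
    """Load every *.json file in results_dir into a list of dicts.

    ASSUMPTION: one JSON file per case, all in one directory.
    """
    paths = sorted(Path(results_dir).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def completion_rate(case):
    """Return the fraction of rubric items marked as passed in one case.

    ASSUMPTION: case["rubric"] is a list of dicts with a boolean "passed".
    """
    items = case["rubric"]
    if not items:
        return 0.0
    return sum(1 for item in items if item["passed"]) / len(items)


def rate_by_competency(cases):
    """Return the pass rate per competency area, pooled across all cases.

    ASSUMPTION: each rubric item carries a "competency" label.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        for item in case["rubric"]:
            area = item["competency"]
            total[area] += 1
            if item["passed"]:
                passed[area] += 1
    return {area: passed[area] / total[area] for area in total}
```

With files shaped this way, `rate_by_competency(load_results("results"))` would return a dictionary mapping each competency area to its pass rate, which makes the weakest areas easy to spot. The directory name `results` is also a placeholder.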

Copy-paste prompts

Prompt 1

I have downloaded the MedSP1000 dataset from Hugging Face. Walk me step by step through setting up the Python environment and running the shell script to test my AI model on the 100-case subset.

Prompt 2

Explain what the six competency areas in the MedSP1000 rubric are and how the evaluator agent scores a clinician AI's performance against them.

Prompt 3

I ran MedSP1000 on my model and got JSON result files. How do I read these files to see which rubric items my model passed and which it failed?

Prompt 4

Why does MedSP1000 use a patient simulation agent and a separate environment controller instead of a single system? How do these two components interact during a clinical case?

Prompt 5

My model scores well on medical knowledge tests but I want to know how it would do on interactive clinical cases. How does MedSP1000 measure the gap between those two things?


Verify against the repo before relying on details.