explaingit

epflight/fullyopenmeditron

6 Python Audience · researcher Complexity · 5/5 Active License Setup · hard

TLDR

EPFL LiGHT lab's reproducible pipeline for fine-tuning open LLMs on medical reasoning, covering synthetic data generation, multi-GPU Slurm training, and a multi-benchmark eval suite.

Mindmap

mindmap
  root((FullyOpenMeditron))
    Inputs
      Medical exam QA
      Clinical guidelines
      Teacher model gpt-oss-120b
    Outputs
      MeditronFO fine-tunes
      Synthetic corpus
      Benchmark scores
    Use Cases
      Reproduce paper
      Train medical LLM
      Run Auto-MOOVE judge
    Tech Stack
      Python
      vLLM
      Slurm
      PyTorch

Things people build with this

USE CASE 1

Reproduce the MeditronFO fine-tunes on Apertus, OLMo-2, or EuroLLM base models

USE CASE 2

Build a synthetic medical QA corpus using rejection sampling against gold labels

USE CASE 3

Run MedQA, MedMCQA, PubMedQA, and HealthBench evals on a custom model

USE CASE 4

Compare two medical LLMs with the Auto-MOOVE pairwise judge protocol

Tech stack

Python · vLLM · Slurm · PyTorch · CUDA

Getting it running

Difficulty · hard Time to first run · 1 day+

Pipeline targets large Slurm GPU clusters with vLLM; the README mentions a Swiss supercomputer as the original training environment.

Research-use only; the models are not licensed for clinical deployment or commercial use.

In plain English

Fully Open Meditron is a research project from the EPFL LiGHT lab that builds and evaluates large language models for clinical decision support. The repository holds the full open-source pipeline used in the project's paper: data generation, training scripts, and the evaluation tools that test how well the trained models reason about medical cases. The stated goal is that every stage can be reproduced from this one repo.

The team releases what they call MeditronFO fine-tunes of five fully open base models: Apertus-Instruct at 70 and 8 billion parameters, OLMo-2-SFT at 32 billion parameters, and EuroLLM-Instruct at 22 and 9 billion parameters. One open-weight control is built on Gemma-3 at 27 billion parameters.

The pipeline targets large GPU clusters running Slurm and vLLM, and the README mentions it was originally run on a Swiss supercomputer called Clariden. A new_launch.sh helper script handles container setup, GPU allocation, and starting and stopping the vLLM server.

The training corpus is built from several synthetic components. Curated QA distills cleaner answers from existing medical exam questions using rejection sampling: the teacher model is asked up to eight times, stopping as soon as its answer letter matches the gold label. Synthetic Curated QA generates new exam-style questions. Guidelines QA is grounded in a clinical practice guidelines corpus of 46,469 articles from 16 institutions. Synthetic MOOVE produces open-ended clinical vignettes in two steps, first the case prompt, then the teacher answer. The default teacher is gpt-oss-120b, with alternatives such as Qwen3-30B used in ablations. Before training, the corpus is decontaminated against every benchmark the models will later be tested on.

Evaluation has three pieces. A benchmark script runs MedQA, MedMCQA, PubMedQA, MedXpertQA, MMLU-Pro, IFEval, and ARC-Challenge at temperature zero. HealthBench is run separately.
The third piece is Auto-MOOVE, a pairwise judge protocol where one model's answer is compared to another's across nine criteria including reasoning, relevance, harmlessness, fairness, communication, and alignment with guidelines, with answer order randomized to reduce positional bias. The authors say Auto-MOOVE was validated against 204 human raters.

A table of ablations covers removing each corpus subset, adding a 10 percent Tulu replay mixture for instruction-following retention, swapping the teacher model, and swapping the judge model.

The license is research-use only, and the README states clearly that the models are not approved for clinical deployment.
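
The Curated QA rejection-sampling step described above can be sketched in a few lines of Python. This is a minimal illustration and not the repo's code: `ask_teacher` is a hypothetical callable standing in for a request to the teacher model, and the `Answer: X` response format parsed by `extract_letter` is an assumption made for the example. Only the eight-attempt cap and the match against the gold letter come from the description.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed response format: the teacher ends with "Answer: <letter>".
ANSWER_RE = re.compile(r"answer:\s*([A-E])\b", re.IGNORECASE)


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter found in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def rejection_sample(
    question: str,
    gold: str,
    ask_teacher: Callable[[str], str],  # hypothetical teacher-model call
    max_attempts: int = 8,
) -> Optional[Tuple[str, int]]:
    """Query the teacher until its answer letter matches the gold label.

    Returns (accepted_response, attempts_used), or None if every attempt
    within max_attempts disagrees with the gold label.
    """
    for attempt in range(1, max_attempts + 1):
        response = ask_teacher(question)
        if extract_letter(response) == gold.upper():
            return response, attempt
    return None


# Demo with a fake teacher that is wrong once, then right.
canned = iter(["Likely viral. Answer: B", "Bacterial signs. Answer: C"])
result = rejection_sample("Which diagnosis fits best?", "C", lambda q: next(canned))
print(result)
```

Questions for which no attempt matches return `None` in this sketch; whether the real pipeline drops or otherwise handles such questions is not stated here.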
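
The order randomization used by Auto-MOOVE to reduce positional bias can also be illustrated with a short sketch. The assumptions are as follows: `judge` is a hypothetical callable standing in for the LLM judge and returns `"first"` or `"second"`, and a single overall verdict is used rather than per-criterion scoring. The point shown is only the shuffle and the mapping of the verdict back to the original models.

```python
import random
from collections import Counter
from typing import Callable


def judge_pair(
    prompt: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str, str], str],  # hypothetical: returns "first" or "second"
    rng: random.Random,
) -> str:
    """Compare two answers in random presentation order; return 'A' or 'B'."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Undo the shuffle so the result refers to the original models.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


# A judge with pure positional bias always prefers whatever it sees first.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


rng = random.Random(0)
tally = Counter(judge_pair("case", "answer A", "answer B", always_first, rng)
                for _ in range(1000))
print(tally)
```

With randomized order, a judge that always prefers the first position splits its wins between the two models instead of systematically favoring one, which is why the shuffle matters when aggregating many comparisons.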

Copy-paste prompts

Prompt 1
Walk me through the Curated QA, Synthetic MOOVE, and Guidelines QA stages of the FullyOpenMeditron training corpus
Prompt 2
Show me how new_launch.sh sets up containers, GPU allocation, and the vLLM server on a Slurm cluster
Prompt 3
Explain the Auto-MOOVE pairwise judge protocol and its nine evaluation criteria
Prompt 4
How does the rejection sampling loop work for Curated QA with up to 8 teacher attempts?
Prompt 5
What ablations does the repo run on teacher swap, judge swap, and the 10 percent Tulu replay mixture?

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.