Reproduce the MeditronFO fine-tunes on Apertus, OLMo-2, or EuroLLM base models
Build a synthetic medical QA corpus using rejection sampling against gold labels
Run MedQA, MedMCQA, PubMedQA, and HealthBench evals on a custom model
Compare two medical LLMs with the Auto-MOOVE pairwise judge protocol
Pipeline targets large Slurm GPU clusters with vLLM; the README mentions a Swiss supercomputer as the original training environment.
Fully Open Meditron is a research project from the EPFL LiGHT lab that builds and evaluates large language models for clinical decision support. The repository holds the full open-source pipeline used in the project's paper: data generation, training scripts, and the evaluation tools that test how well the trained models reason about medical cases. The stated goal is that every stage can be reproduced from this one repo.

The team releases what they call MeditronFO fine-tunes of five fully open base models: Apertus-Instruct at 70 billion and 8 billion parameters, OLMo-2-SFT at 32 billion parameters, and EuroLLM-Instruct at 22 billion and 9 billion parameters. One open-weight control is built on Gemma-3 at 27 billion parameters.

The pipeline targets large GPU clusters running Slurm and vLLM, and the README mentions it was originally run on a Swiss supercomputer called Clariden. A new_launch.sh helper script handles container setup, GPU allocation, and starting and stopping the vLLM server.

The training corpus is synthetic and has the following components. Curated QA distills cleaner answers from existing medical exam questions using rejection sampling: the teacher model is queried up to eight times until its answer letter matches the gold label. Synthetic Curated QA generates new exam-style questions. Guidelines QA is grounded in a clinical practice guidelines corpus of 46,469 articles from 16 institutions. Synthetic MOOVE produces open-ended clinical vignettes in two steps, first the case prompt and then the teacher answer. The default teacher is gpt-oss-120b, with alternatives such as Qwen3-30B used in ablations. Before training, the corpus is decontaminated against every benchmark the models will later be tested on.

Evaluation has three pieces. A benchmark script runs MedQA, MedMCQA, PubMedQA, MedXpertQA, MMLU-Pro, IFEval, and ARC-Challenge at temperature zero. HealthBench is run separately.
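The rejection-sampling step described for Curated QA can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `rejection_sample`, `extract_letter`, the `Answer: X` output format, and the fake teacher are all hypothetical, and the sketch assumes that a question is simply dropped when no attempt matches the gold label, which the summary does not state.

```python
import re
from typing import Callable, Iterable, Optional

MAX_ATTEMPTS = 8  # "up to eight times", per the description above


def extract_letter(answer: str) -> Optional[str]:
    """Hypothetical parser: read the option letter after 'Answer:'."""
    match = re.search(r"Answer:\s*\(?([A-E])\b", answer)
    return match.group(1) if match else None


def rejection_sample(
    question: str,
    gold: str,
    teacher: Callable[[str], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Query the teacher until its answer letter equals the gold label.

    Returns the first matching answer, or None if every attempt misses
    (assumption: such questions are excluded from the corpus).
    """
    for _ in range(max_attempts):
        candidate = teacher(question)
        if extract_letter(candidate) == gold:
            return candidate
    return None


def make_fake_teacher(replies: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real teacher model; replays canned replies in order."""
    iterator = iter(replies)
    return lambda question: next(iterator)


# The first reply disagrees with the gold label, the second one matches.
teacher = make_fake_teacher(["Reasoning... Answer: B", "Reasoning... Answer: C"])
kept = rejection_sample("Which drug is first-line?", "C", teacher)

# A teacher that never matches exhausts all eight attempts.
stubborn_calls = []
def stubborn_teacher(question: str) -> str:
    stubborn_calls.append(question)
    return "Reasoning... Answer: A"

dropped = rejection_sample("Which drug is first-line?", "C", stubborn_teacher)
```

In a real run the `teacher` callable would wrap a sampled (non-zero temperature) request to the teacher model, since repeated attempts only help if the outputs vary.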
The third evaluation piece is Auto-MOOVE, a pairwise judge protocol in which one model's answer is compared with another's across nine criteria, including reasoning, relevance, harmlessness, fairness, communication, and alignment with guidelines. Answer order is randomized to reduce positional bias. The authors say Auto-MOOVE was validated against 204 human raters.

A table of ablations covers removing each corpus subset, adding a 10 percent Tulu replay mixture for instruction-following retention, swapping the teacher model, and swapping the judge model.

The license is research-use only, and the README states clearly that the models are not approved for clinical deployment.
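The order randomization used by the pairwise judge can be illustrated with a small sketch. Everything here is hypothetical (the `judge_pair` function, the `"first"`/`"second"`/`"tie"` verdict format, and both toy judges), and it scores a single criterion, whereas the protocol described above covers nine. It shows only the general idea: shuffle which answer appears first, then map the judge's positional verdict back to the model that produced it.

```python
import random
from collections import Counter
from typing import Callable

# A judge sees (prompt, first answer, second answer) and returns
# "first", "second", or "tie". This interface is an assumption.
Judge = Callable[[str, str, str], str]


def judge_pair(
    prompt: str, answer_a: str, answer_b: str, judge: Judge, rng: random.Random
) -> str:
    """Present two answers in random order; return 'A', 'B', or 'tie'."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # When swapped, the first slot holds model B's answer.
    first_model, second_model = ("B", "A") if swapped else ("A", "B")
    return first_model if verdict == "first" else second_model


def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge that decides on content (length), not on position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def first_slot_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure positional bias: always prefers the first slot."""
    return "first"


rng = random.Random(0)
args = ("clinical case", "a detailed, guideline-based answer", "short answer")
content_votes = Counter(judge_pair(*args, longer_answer_judge, rng) for _ in range(200))
biased_votes = Counter(judge_pair(*args, first_slot_judge, rng) for _ in range(200))
print(content_votes)  # every vote goes to model A
print(biased_votes)   # votes are split between A and B
```

With a content-based judge the winner is the same regardless of order, while a purely position-biased judge ends up splitting its votes between the two models, so the bias averages out rather than favoring whichever model happened to be listed first.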
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.