epflight/fullyopenmeditron

Analysis updated 2026-06-24

★ 6 | Python | Audience · researcher | Complexity · 5/5 | License | Setup · hard

Mindmap

mindmap
  root((FullyOpenMeditron))
    Inputs
      Medical exam QA
      Clinical guidelines
      Teacher model gpt-oss-120b
    Outputs
      MeditronFO fine-tunes
      Synthetic corpus
      Benchmark scores
    Use Cases
      Reproduce paper
      Train medical LLM
      Run Auto-MOOVE judge
    Tech Stack
      Python
      vLLM
      Slurm
      PyTorch


What do people build with it?

USE CASE 1

Reproduce the MeditronFO fine-tunes on Apertus, OLMo-2, or EuroLLM base models

USE CASE 2

Build a synthetic medical QA corpus using rejection sampling against gold labels

USE CASE 3

Run MedQA, MedMCQA, PubMedQA, and HealthBench evals on a custom model

USE CASE 4

Compare two medical LLMs with the Auto-MOOVE pairwise judge protocol

What is it built with?

Python, vLLM, Slurm, PyTorch, CUDA

How does it compare?

	epflight/fullyopenmeditron	ashishdevasia/ha-proton-drive-backup	benchflow-ai/skillsbench-trajectories
Stars	6	6	6
Language	Python	Python	Python
Last pushed	—	—	2026-06-14
Maintenance	—	—	Active
Setup difficulty	hard	moderate	easy
Complexity	5/5	2/5	1/5
Audience	researcher	ops devops	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

The pipeline targets large Slurm GPU clusters running vLLM; the README mentions a Swiss supercomputer as the original training environment.

Research use only: the models are not licensed for clinical deployment or commercial use.

In plain English

Fully Open Meditron is a research project from the EPFL LiGHT lab that builds and evaluates large language models for clinical decision support. The repository holds the full open-source pipeline used in the project's paper: the training-data generation, the training scripts, and the evaluation tools that test how well the trained models reason about medical cases. The stated goal is that every stage can be reproduced from this one repo.

The team releases what they call MeditronFO fine-tunes of five fully open base models: Apertus-Instruct at 70 billion and 8 billion parameters, OLMo-2-SFT at 32 billion parameters, and EuroLLM-Instruct at 22 billion and 9 billion parameters. They also release one open-weight control built on Gemma-3 at 27 billion parameters.

The pipeline targets large GPU clusters running Slurm and vLLM, and the README mentions it was originally run on a Swiss supercomputer called Clariden. A new_launch.sh helper script handles container setup, GPU allocation, and starting and stopping the vLLM server.

The training corpus is built from synthetic components. Curated QA distills cleaner answers from existing medical exam questions using rejection sampling: the teacher model is queried up to eight times, and a response is kept only once its answer letter matches the gold label. Synthetic Curated QA generates new exam-style questions. Guidelines QA is grounded in a clinical practice guidelines corpus of 46,469 articles from 16 institutions. Synthetic MOOVE produces open-ended clinical vignettes in two steps, first the case prompt and then the teacher answer. The default teacher is gpt-oss-120b, with alternatives such as Qwen3-30B used in ablations. Before training, the corpus is decontaminated against every benchmark the models will later be tested on.

Evaluation has three pieces. A benchmark script runs MedQA, MedMCQA, PubMedQA, MedXpertQA, MMLU-Pro, IFEval, and ARC-Challenge at temperature zero. HealthBench is run separately.
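The Curated QA rejection-sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names `extract_letter` and `rejection_sample`, the `Answer: X` output format, and the `teacher` callable are all assumptions made for the example. Only the limit of eight attempts and the match-against-gold-label rule come from the description.

```python
import re
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 8  # from the description: the teacher is asked up to eight times


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None.

    The 'Answer: X' format is a hypothetical convention for this sketch.
    """
    matches = re.findall(r"Answer:\s*\(?([A-E])\)?", response)
    return matches[-1] if matches else None


def rejection_sample(
    question: str,
    gold: str,
    teacher: Callable[[str], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[Tuple[str, int]]:
    """Query the teacher until its answer letter matches the gold label.

    Returns (accepted_response, attempts_used), or None if every attempt
    disagreed with the gold label and the question should be dropped.
    """
    for attempt in range(1, max_attempts + 1):
        response = teacher(question)
        if extract_letter(response) == gold:
            return response, attempt
    return None


if __name__ == "__main__":
    # Stub teacher that replays canned responses instead of calling a model.
    canned = iter(["I think so. Answer: B", "Unsure.", "Reasoning... Answer: C"])
    print(rejection_sample("Which drug ...?", "C", lambda q: next(canned)))
```

In a real pipeline the `teacher` callable would wrap a sampled (non-zero temperature) call to the teacher model, since repeated attempts only help when the outputs differ between calls.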
The third piece is Auto-MOOVE, a pairwise judge protocol in which one model's answer is compared with another's across nine criteria, including reasoning, relevance, harmlessness, fairness, communication, and alignment with guidelines. Answer order is randomized to reduce positional bias. The authors say Auto-MOOVE was validated against 204 human raters.

A table of ablations covers removing each corpus subset, adding a 10 percent Tulu replay mixture for instruction-following retention, swapping the teacher model, and swapping the judge model.

The license is research-use only, and the README states clearly that the models are not approved for clinical deployment.
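To make the order randomization in the pairwise judge concrete, here is a small sketch of one way such a comparison could be wired up. It is not the Auto-MOOVE implementation: `compare_pair`, `tally`, the `judge` callable, and its `"first"`/`"second"`/`"tie"` verdict format are hypothetical. The criteria list holds only the six names given in the text; the other three of the nine are not listed there, so they are left out.

```python
import random
from collections import Counter
from typing import Callable, Dict, Iterable

# Six of the nine criteria named in the text; the remaining three are not given.
CRITERIA = [
    "reasoning", "relevance", "harmlessness",
    "fairness", "communication", "guideline_alignment",
]

# Hypothetical judge: (prompt, first_answer, second_answer, criterion) -> verdict
Judge = Callable[[str, str, str, str], str]


def compare_pair(
    prompt: str, answer_a: str, answer_b: str, judge: Judge, rng: random.Random
) -> Dict[str, str]:
    """Judge A vs B per criterion, showing them in a random order.

    Verdicts are mapped back to "A", "B", or "tie" so the caller never
    sees which position each answer occupied.
    """
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    result = {}
    for criterion in CRITERIA:
        verdict = judge(prompt, first, second, criterion)
        if verdict not in ("first", "second", "tie"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        if verdict == "tie":
            result[criterion] = "tie"
        else:
            # "first" refers to A only when the order was not swapped.
            result[criterion] = "A" if (verdict == "first") != swapped else "B"
    return result


def tally(results: Iterable[Dict[str, str]]) -> Counter:
    """Count per-criterion wins for A, B, and ties across many comparisons."""
    counts: Counter = Counter()
    for res in results:
        counts.update(res.values())
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    always_first = lambda p, f, s, c: "first"  # a purely position-biased judge
    runs = [compare_pair("case", "x", "y", always_first, rng) for _ in range(200)]
    print(tally(runs))
```

With a judge that always prefers whichever answer it sees first, the randomized order spreads those wins across both models instead of crediting one of them every time, which is the effect the protocol relies on.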

Copy-paste prompts

Prompt 1

Walk me through the Curated QA, Synthetic MOOVE, and Guidelines QA stages of the FullyOpenMeditron training corpus

Prompt 2

Show me how new_launch.sh sets up containers, GPU allocation, and the vLLM server on a Slurm cluster

Prompt 3

Explain the Auto-MOOVE pairwise judge protocol and its nine evaluation criteria

Prompt 4

How does the rejection sampling loop work for Curated QA with up to 8 teacher attempts?

Prompt 5

What ablations does the repo run on teacher swap, judge swap, and the 10 percent Tulu replay mixture?

Frequently asked questions

What is fullyopenmeditron?

EPFL LiGHT lab's reproducible pipeline for fine-tuning open LLMs on medical reasoning, covering synthetic data generation, multi-GPU Slurm training, and a multi-benchmark eval suite.

What language is fullyopenmeditron written in?

Mainly Python. The stack also includes vLLM, Slurm, PyTorch, and CUDA.

What license does fullyopenmeditron use?

Research use only: the models are not licensed for clinical deployment or commercial use.

How hard is fullyopenmeditron to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is fullyopenmeditron for?

Mainly researchers.



Verify against the repo before relying on details.