Reproduce Medprompt benchmark results on medical or general knowledge datasets.
Apply dynamic few-shot selection and chain-of-thought to boost GPT-4 accuracy on your own task.
Use majority-vote ensembling to make a language model's answers more consistent.
Build a more reliable AI question-answering system by adapting the prompting strategies in this repo.
Requires an OpenAI API key with GPT-4 access; API costs apply when running experiments.
This is a collection of resources, code examples, and best practices from Microsoft researchers focused on getting better results from large AI language models, particularly GPT-4. The central contribution is a method called Medprompt, which was originally developed for medical question-answering and has since been extended to general knowledge benchmarks.
Medprompt combines three techniques. The first is dynamic few-shot selection: instead of giving the model the same fixed set of examples every time, the method picks examples that are similar to the question being asked, by comparing them in an embedding space (a vector representation in which similar texts sit close together). The second is self-generated chain-of-thought, where GPT-4 is asked to write out its step-by-step reasoning before answering, which has been shown to improve accuracy on complex questions. The third is majority-vote ensembling, where the model answers the same question multiple times with the answer choices shuffled each time, and the most frequent answer wins.
Using these three techniques together, the researchers showed that a general-purpose model like GPT-4 could match or beat models that were specifically trained on medical data. When applied to the MMLU benchmark, a broad test covering 57 subject areas from mathematics to law to computer science, the extended version called Medprompt+ reached over 90% accuracy, matching the best results reported for Google's Gemini Ultra at the time.
The repository includes runnable Python scripts so others can reproduce the experiments or apply these prompting strategies to their own tasks. The README explains each technique in plain terms before linking to the relevant code. The project is described as evolving, with plans for more case studies and tooling around the prompt engineering process. It is primarily a research artifact aimed at practitioners who want to understand or apply advanced prompting strategies, rather than a finished product or library with a stable API.
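The three techniques described above can be illustrated with a minimal Python sketch. This is not the repository's code: the names select_few_shot, build_cot_prompt, choice_shuffle_vote, and make_stub_model are hypothetical, the embeddings are toy 2-D vectors, and a stub function stands in for GPT-4 so that no API call is made.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, str]  # (toy embedding, worked example text)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_few_shot(query_vec: Vector, pool: List[Example], k: int = 2) -> List[str]:
    """Dynamic few-shot selection: return the k pool examples nearest the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_cot_prompt(question: str, choices: Sequence[str], shots: Sequence[str]) -> str:
    """Show the selected examples, then ask for step-by-step reasoning first."""
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    examples = "\n\n".join(shots)
    return (
        f"{examples}\n\nQuestion: {question}\n{options}\n"
        "Explain your reasoning step by step, then give the answer letter."
    )


def choice_shuffle_vote(
    question: str,
    choices: Sequence[str],
    ask_model: Callable[[str, List[str]], int],
    n_votes: int = 5,
    seed: int = 0,
) -> str:
    """Ask n_votes times with shuffled choices; return the most frequent answer."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_votes):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        picked = ask_model(question, shuffled)  # index into the shuffled list
        votes[shuffled[picked]] += 1            # vote on the text, not the position
    return votes.most_common(1)[0][0]


def make_stub_model(correct: str) -> Callable[[str, List[str]], int]:
    """Hypothetical stand-in for a model: every third call it blindly picks
    the first option (simulated position bias), otherwise it answers correctly."""
    calls = {"n": 0}

    def ask(question: str, shuffled: List[str]) -> int:
        calls["n"] += 1
        if calls["n"] % 3 == 0:
            return 0
        return shuffled.index(correct)

    return ask


# Toy data: assumed embeddings and placeholder example texts.
pool = [
    ([1.0, 0.0], "worked example: vitamin deficiencies"),
    ([0.0, 1.0], "worked example: contract law"),
    ([0.9, 0.1], "worked example: dietary minerals"),
]
question = "Which vitamin deficiency causes scurvy?"
choices = ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"]

shots = select_few_shot([1.0, 0.0], pool, k=2)
prompt = build_cot_prompt(question, choices, shots)
answer = choice_shuffle_vote(question, choices, make_stub_model("Vitamin C"))
print(shots)
print(answer)
```

In a real setup, ask_model would send the output of build_cot_prompt to the model and parse the returned letter into an index. Voting on the choice text rather than its position is what lets the shuffle cancel out a bias toward a particular answer slot.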
Verify against the repo before relying on details.