microsoft/promptbase

★ 5,748 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((PromptBase))
    Core method
      Medprompt
      Dynamic few-shot
      Chain of thought
      Majority vote
    Tech stack
      Python
      GPT-4 API
    Results
      90 percent MMLU
      Medical benchmarks
    Use cases
      Reproduce experiments
      Custom prompting
      Accuracy improvement


Things people build with this

USE CASE 1

Reproduce Medprompt benchmark results on medical or general knowledge datasets.

USE CASE 2

Apply dynamic few-shot selection and chain-of-thought to boost GPT-4 accuracy on your own task.

USE CASE 3

Use majority-vote ensembling to make a language model's answers more consistent.

USE CASE 4

Build a more reliable AI question-answering system by adapting the prompting strategies in this repo.

Tech stack

Python | GPT-4

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires an OpenAI API key with GPT-4 access. API costs apply when running experiments.

In plain English

This is a collection of resources, code examples, and best practices from Microsoft researchers focused on getting better results from large AI language models, particularly GPT-4. The central contribution is a method called Medprompt, which was originally developed for medical question-answering and has since been extended to general knowledge benchmarks.

Medprompt combines three techniques. The first is dynamic few-shot selection: instead of giving the AI the same fixed set of examples every time, the method picks the examples most similar to the question being asked, by comparing them in a mathematical similarity space (an embedding space). The second is self-generated chain-of-thought, where GPT-4 is asked to write out its step-by-step reasoning before answering, which tends to improve accuracy on complex questions. The third is majority-vote ensembling, where the model answers the same question multiple times with shuffled answer choices, and the answer chosen most often wins.

Using these three techniques together, the researchers showed that a general-purpose model like GPT-4 could match or beat models that were specifically trained on medical data. When applied to the MMLU benchmark, a broad test covering 57 subject areas from mathematics to law to computer science, the extended version called Medprompt+ reached over 90% accuracy, which matched the best results reported by Google's Gemini Ultra at the time.

The repository includes runnable Python scripts so others can reproduce the experiments or apply these prompting strategies to their own tasks. The README explains each technique in plain terms before linking to the relevant code. The project is described as evolving, with plans for more case studies and tooling around the prompt engineering process. This is primarily a research artifact aimed at practitioners who want to understand or apply advanced prompting strategies, rather than a finished product or library with a stable API.
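To make two of the techniques described above more concrete, here is a minimal sketch of dynamic few-shot selection and choice-shuffle majority voting. It is not the repository's code: the function names (`select_few_shot`, `choice_shuffle_vote`), the `ask_model` callback, and the toy two-dimensional vectors are all hypothetical stand-ins. In practice the vectors would come from an embedding model and `ask_model` would wrap a GPT-4 call that returns the index of the chosen option.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_few_shot(query_vec, pool, k=2):
    """Return the k examples whose embeddings are closest to the query.

    pool is a list of (embedding, example) pairs. The embeddings are
    assumed to be precomputed by some embedding model (hypothetical here).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [example for _, example in ranked[:k]]


def choice_shuffle_vote(question, choices, ask_model, n_votes=5, seed=0):
    """Ask the model n_votes times with shuffled choices and take the majority.

    ask_model(question, shuffled_choices) is a hypothetical callback that
    returns the index of the option the model picked. Votes are counted by
    answer text, so the position of an option does not matter.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_votes):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        picked_index = ask_model(question, shuffled)
        votes[shuffled[picked_index]] += 1
    return votes.most_common(1)[0][0]


# Toy data: 2-D vectors standing in for real embeddings.
example_pool = [
    ([1.0, 0.0], "Example A"),
    ([0.0, 1.0], "Example B"),
    ([0.9, 0.1], "Example C"),
]
nearest_examples = select_few_shot([1.0, 0.0], example_pool, k=2)


def stub_model(question, options):
    """Stand-in for a GPT-4 call: always locates the option 'aspirin'."""
    return options.index("aspirin")


final_answer = choice_shuffle_vote(
    "Which drug is an antiplatelet agent?",
    ["aspirin", "insulin", "metformin", "lisinopril"],
    stub_model,
)
```

Counting votes by answer text rather than by letter or position is what lets the shuffle cancel out any tendency of a model to favor, say, the first option. The selected examples would normally be placed in the prompt along with their chain-of-thought explanations before the new question is asked.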

Copy-paste prompts

Prompt 1

Using the Medprompt code from microsoft/promptbase, show me how to apply dynamic few-shot selection to my own multiple-choice questions.

Prompt 2

How does Medprompt's majority-vote ensembling work? Show me a Python example using the code in this repo.

Prompt 3

Walk me through setting up and running the Medprompt scripts on a medical question-answering dataset with GPT-4.

Prompt 4

How do I adapt the Medprompt+ approach from this repo to a non-medical benchmark like a law or coding exam?


Verify against the repo before relying on details.