Download the 52,000 instruction-output pairs to fine-tune your own language model without collecting human annotations.
Run the Self-Instruct generation pipeline to produce new instruction data using GPT-3 through your own OpenAI API key.
Benchmark your instruction-tuned model using the 252 human-written evaluation tasks included in the repository.
Adapt the data generation scripts to produce instruction data for a different domain or a non-English language.
Running the generation pipeline requires a paid OpenAI API key; the released dataset can be downloaded without one.
Self-Instruct is a research project that explores a way to train language models to follow instructions better, without requiring large amounts of human-written examples. The central idea is that the model itself generates the training data it later learns from, reducing the need for expensive manual annotation.

The process works as an iterative loop. A small set of 175 human-written seed tasks is used to prompt a language model (in this case GPT-3) to write new tasks and examples of inputs and outputs for those tasks. The resulting generations are filtered to remove low-quality or duplicate items, then added back into the pool. Each round produces more data, which can then be used to fine-tune the model to be more responsive to natural language instructions.

The repository releases the data generated through this process: 52,000 instructions paired with 82,000 input-output examples, all produced by GPT-3. This dataset is available for others to use to fine-tune their own models. The authors note that roughly 46 percent of the generated data points may contain errors or biases, and they encourage caution when using it.

In addition to the dataset, the codebase includes scripts to run the full pipeline from scratch: generating instructions, classifying them, producing instance inputs and outputs, and preparing everything for fine-tuning. The scripts currently work with GPT-3 via the OpenAI API. The repository also includes 252 human-written evaluation tasks used in the original research paper to measure how well instruction-tuned models perform on realistic user requests.

This project is aimed at machine learning researchers and engineers working on instruction-tuned language models. The code and data are open for reuse, though the authors note the work was still in progress at time of release.
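The iterative loop described above (prompt with tasks from the pool, filter the generations, add the survivors back) can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the names `bootstrap`, `is_novel`, and `fake_generate` are hypothetical, the model call is replaced by a stub so the sketch runs without an API key, and the `difflib` similarity check with a 0.7 cutoff is only a stand-in for whatever filtering the actual scripts apply.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def is_novel(candidate: str, pool: List[str], max_similarity: float = 0.7) -> bool:
    """Return True if `candidate` is not too similar to any instruction in `pool`.

    Assumption: difflib's character-level ratio stands in for the real filter.
    """
    cand = candidate.lower().strip()
    return all(
        SequenceMatcher(None, cand, existing.lower().strip()).ratio() < max_similarity
        for existing in pool
    )


def bootstrap(
    seed_tasks: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int = 3,
    prompt_size: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Grow a pool of instructions from seed tasks over several rounds."""
    rng = rng or random.Random(0)
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Sample a few existing tasks to serve as in-context examples.
        examples = rng.sample(pool, min(prompt_size, len(pool)))
        for candidate in generate(examples):
            candidate = candidate.strip()
            # Drop empty generations and near-duplicates of the pool.
            if candidate and is_novel(candidate, pool):
                pool.append(candidate)
    return pool


def fake_generate(examples: List[str]) -> List[str]:
    """Hypothetical stand-in for a model call: one duplicate, one new task, one empty."""
    return [examples[0], "Translate the sentence into French.", ""]


if __name__ == "__main__":
    seeds = [
        "Summarize the article in one sentence.",
        "List three synonyms for the given word.",
    ]
    for instruction in bootstrap(seeds, fake_generate):
        print(instruction)
```

In a real run, `generate` would wrap a call to a language model and return its parsed instructions; the pool then grows only by generations that pass the filter, which is what keeps duplicates from accumulating across rounds.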
Verify against the repo before relying on details.