Train an image generation model on 100 million commercially licensed images without legal uncertainty about dataset rights.
Benchmark your own image generation model against the GPIC evaluation toolkit and compare to the Stanford baseline.
Study how caption length and detail affect text-to-image model quality using the four caption variants per image.
The full training set spans 8,000 archive files on Hugging Face, and the corpus as a whole totals roughly 28 trillion pixels. Even a single shard requires significant bandwidth and storage.
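To put the storage caveat in numbers, here is a rough back-of-envelope estimate built only from the figures quoted in this summary. The 3 bytes per pixel value is an assumption (8-bit RGB after decoding). The on-disk size of the compressed archives is not stated here and will be smaller.

```python
# Back-of-envelope shard sizing from the figures quoted in this summary.
TRAIN_IMAGES = 100_000_000   # "roughly 100 million training images"
VAL_IMAGES = 200_000         # validation images
TEST_IMAGES = 1_000_000      # test images
TOTAL_PIXELS = 28e12         # "approximately 28 trillion pixels" across all splits
TRAIN_ARCHIVES = 8_000       # training archives on Hugging Face

BYTES_PER_PIXEL = 3          # ASSUMPTION: 8-bit RGB once decoded, not the compressed size

total_images = TRAIN_IMAGES + VAL_IMAGES + TEST_IMAGES
images_per_archive = TRAIN_IMAGES / TRAIN_ARCHIVES
avg_pixels_per_image = TOTAL_PIXELS / total_images
avg_side = avg_pixels_per_image ** 0.5  # side length of an equivalent square image
decoded_gb_per_archive = (
    images_per_archive * avg_pixels_per_image * BYTES_PER_PIXEL / 1e9
)

print(f"images per training archive: {images_per_archive:,.0f}")
print(f"average pixels per image:    {avg_pixels_per_image:,.0f}")
print(f"equivalent square side:      {avg_side:.0f} px")
print(f"decoded size per archive:    {decoded_gb_per_archive:.1f} GB (assumed RGB)")
```

Under these assumptions each training archive holds about 12,500 images, and decoding one archive fully yields on the order of 10 GB of raw pixel data. These are approximations derived from rounded figures, so check the actual file sizes on Hugging Face before planning a download.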
GPIC stands for Giant Permissive Image Corpus. It is a large image dataset created by researchers at Stanford University for training and evaluating AI models that generate images. The dataset contains roughly 100 million training images, 200,000 validation images, and 1 million test images, totaling approximately 28 trillion pixels of image data.

A key distinguishing feature is the licensing. All images in the corpus are permissively licensed, so they can be used for both academic research and commercial purposes without the legal uncertainty that surrounds many large image datasets scraped from the web. The images were also safety-filtered and deduplicated before release. Each image comes with captions generated by a vision-language model in four lengths: tag, short, medium, and long.

The corpus is hosted on Hugging Face and distributed as a series of compressed archive files: 8,000 for training, 32 for validation, and 128 for test. Each archive alternates between a JSON metadata file and its corresponding image file. The metadata includes the license, attribution, image dimensions, the dataset split the image belongs to, and the generated caption.

This GitHub repository contains two things: the evaluation toolkit used to measure how well a trained model performs on the GPIC benchmark, and a baseline implementation for comparison. The baseline is a pixel-space image generation approach built on an existing open-source model, with training and sampling scripts included. Researchers can train their own models on the GPIC data and then use the evaluation toolkit to benchmark their results against the provided reference.

The project is from a team at Stanford that includes faculty members Li Fei-Fei and Jiajun Wu. The dataset and associated models are available on Hugging Face, and the accompanying research paper is on arXiv.
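The alternating metadata/image layout described above can be read with a small streaming loop. The following is a minimal sketch, not the repository's own loader. It assumes the archives are tar files, that metadata members end in `.json`, and that each is followed directly by its image. The member names and metadata keys (`license`, `width`, `height`, `split`, `caption`) are hypothetical placeholders, and `iter_pairs` is an illustrative helper. A tiny in-memory archive stands in for a real shard so the example runs without a download.

```python
import io
import json
import tarfile


def iter_pairs(tar):
    """Yield (metadata_dict, image_bytes) from a tar whose members alternate
    between a JSON metadata file and the image it describes (ASSUMED layout)."""
    pending_meta = None
    for member in tar:
        if not member.isfile():
            continue
        data = tar.extractfile(member).read()
        if member.name.endswith(".json"):
            pending_meta = json.loads(data)
        elif pending_meta is not None:
            yield pending_meta, data
            pending_meta = None


def _add(tar, name, payload):
    """Append one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny stand-in archive. Keys and file names are HYPOTHETICAL,
# chosen only to mirror the fields listed in the description.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i in range(2):
        meta = {
            "license": "CC0",
            "width": 4,
            "height": 4,
            "split": "train",
            "caption": f"placeholder {i}",
        }
        _add(tar, f"{i:06d}.json", json.dumps(meta).encode())
        _add(tar, f"{i:06d}.jpg", b"\xff\xd8fake-image-bytes")

# Read it back the same way a downloaded shard would be read.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:*") as tar:
    pairs = list(iter_pairs(tar))

for meta, image_bytes in pairs:
    print(meta["split"], meta["caption"], len(image_bytes))
```

For a real shard, replace the in-memory buffer with `tarfile.open(path)` on a downloaded file, and inspect the first JSON member to learn the actual key names before relying on any of the placeholders used here.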
Verify against the repo before relying on details.