BLIP (Bootstrapping Language-Image Pre-training) is a research project from Salesforce AI Research that produced a model capable of understanding and generating language about images. The README marks the repository as deprecated and no longer supported, and recommends using LAVIS, a newer library from the same team that incorporates BLIP and other models in one place.

The model was trained to handle several tasks that require both reading an image and producing or understanding text about it. Image captioning produces a written description of a photo automatically. Visual question answering takes an image and a typed question, then returns an answer. Image-text retrieval finds the most relevant image for a given text query, or the most relevant text for a given image. Natural language visual reasoning judges whether a statement about an image is true. All four tasks are covered by the code in this repository.

The key idea behind BLIP is a training technique called CapFilt, which stands for Captioning and Filtering. The researchers collected large quantities of image-text pairs from the web, generated their own captions for the images using an initial model, and then used a separate filtering model to remove low-quality or incorrect captions from both the web-sourced and generated text. This cleaned dataset was used to train the final model. Pre-trained weights trained on 14 million and 129 million image-text pairs are available for download.

Running the code requires PyTorch and multiple GPUs for training: the README examples use 8 to 16 A100 GPUs for fine-tuning. For people who only want to try the model without setting up hardware, a Colab notebook demo is available that runs without a GPU, and a web interface was hosted on Hugging Face Spaces. Fine-tuned model weights for each supported task are also provided as direct downloads so researchers can evaluate without training from scratch.
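To make the CapFilt step described above more concrete, here is a minimal sketch in plain Python. It is not the repository's code: the function name `capfilt`, the `Pair` type, and the toy captioner and filter are all hypothetical stand-ins, and a real setup would plug trained models in where the toy functions are. The sketch only illustrates the data flow: each image contributes its web caption and a generated caption as candidates, and a filter decides which candidates are kept.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation: (image identifier, caption text).
Pair = Tuple[str, str]


def capfilt(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> List[Pair]:
    """Return a cleaned list of (image, caption) pairs.

    For every image, the original web caption and a synthetic caption
    produced by `captioner` are both candidates; `keep` decides which
    candidates survive into the cleaned dataset.
    """
    cleaned: List[Pair] = []
    for image, web_caption in web_pairs:
        candidates = [web_caption, captioner(image)]
        for caption in candidates:
            if keep(image, caption):
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_captioner(image: str) -> str:
    # Pretend the image identifier is its label and caption it.
    return f"a photo of a {image}"


def toy_filter(image: str, caption: str) -> bool:
    # Keep a caption only if it mentions the image's label.
    return image in caption


web_pairs = [("dog", "my best friend"), ("cat", "a cat on a sofa")]
cleaned = capfilt(web_pairs, toy_captioner, toy_filter)
print(cleaned)
```

In this toy run the noisy web caption for the dog image is dropped and replaced by the generated one, while the cat image keeps both its web caption and its generated caption, which mirrors the idea of filtering both sources of text.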
Verify against the repo before relying on details.