salesforce/lavis

★ 11,221 | Jupyter Notebook | Audience · researcher | Complexity · 4/5 | License · BSD 3-Clause | Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Image captioning
      Visual question answering
      Text to image
    Models included
      BLIP-2
      InstructBLIP
      BLIP Diffusion
    Inputs supported
      Images and video
      Audio and 3D
      Natural language
    Setup
      pip from PyPI
      Jupyter notebooks
      BSD 3-Clause license


Things people build with this

USE CASE 1

Run BLIP-2 or InstructBLIP on your own images to generate captions or answer natural-language questions about them.

USE CASE 2

Benchmark multiple vision-language models against standard research datasets using consistent evaluation code.

USE CASE 3

Experiment with image-language models for a custom application without rebuilding training infrastructure from scratch.

Tech stack

Python | PyTorch | Jupyter Notebook

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires a GPU with CUDA and PyTorch installed; CPU-only inference is impractically slow for most models.
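Before downloading any model weights, it can help to confirm that PyTorch is installed and can see a CUDA device. The following is a minimal sketch; `detect_inference_device` is a hypothetical helper name used for illustration and is not part of LAVIS.

```python
import importlib.util


def detect_inference_device():
    """Report which device PyTorch would use, or that PyTorch is missing.

    Hypothetical helper for illustration; not part of LAVIS.
    """
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"

    import torch  # imported lazily so the check above can run first

    # torch.cuda.is_available() is True only when a usable CUDA GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference device: {detect_inference_device()}")
```

If this reports `cpu` or `torch-missing`, expect the setup to take longer than the estimate above, since the environment needs fixing before any model can run at a practical speed.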

BSD 3-Clause license: use freely for any purpose, including commercial, as long as you keep the copyright notice and don't use Salesforce's name to endorse your product.

In plain English

LAVIS is a Python library from Salesforce AI Research that brings together a collection of AI models capable of understanding both images and text at the same time. These models can describe what is in a photo, answer questions about an image, match images to relevant text descriptions, and follow natural-language instructions paired with visual input.

The library is meant to make it easier for researchers and developers to try out these vision-and-language models without rebuilding everything from scratch. It provides a consistent interface so you can load different models, run them on images or videos, and evaluate their performance across standard benchmarks using the same code patterns. The library also includes tools to load common research datasets used in this field.

Several notable models are included. BLIP-2 is a general image-language model that can be paired with a large language model to answer questions or generate descriptions. InstructBLIP extends that with instruction-following capabilities, meaning you can give it a task in plain English alongside an image. BLIP-Diffusion is a text-to-image generation model. X-InstructBLIP adds support for video, audio, and 3D input in addition to images.

The library is installable from PyPI, and the README includes working Jupyter notebook examples for captioning images, answering visual questions, and extracting features. Full documentation and a benchmark comparison table are hosted separately.

LAVIS is released under a BSD 3-Clause license. It is primarily a research tool rather than a consumer product, so using it assumes familiarity with Python and machine learning workflows.
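The consistent loading interface described above can be sketched for the captioning case as follows. This is a minimal sketch, assuming the package is installed with `pip install salesforce-lavis` and that the `blip_caption` / `base_coco` checkpoint names from the repository README are still current. `caption_image` is a hypothetical wrapper name, and the first call downloads model weights.

```python
from pathlib import Path


def caption_image(image_path, device=None):
    """Return a list of generated captions for one image file.

    Hypothetical wrapper for illustration; check model names against the repo.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    # Heavy imports are deferred so the path check works without LAVIS installed.
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Assumed model identifiers, taken from the README captioning example.
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption", model_type="base_coco", is_eval=True, device=device
    )

    raw_image = Image.open(path).convert("RGB")
    # The "eval" processor resizes and normalizes; unsqueeze adds a batch dimension.
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    return model.generate({"image": image})


if __name__ == "__main__":
    import sys

    # Usage: python caption.py path/to/image.jpg
    if len(sys.argv) > 1:
        print(caption_image(sys.argv[1]))
```

Swapping the `name` and `model_type` arguments is how the same pattern is meant to reach other models in the library; the exact identifiers for BLIP-2 and InstructBLIP should be taken from the repository rather than from this sketch.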

Copy-paste prompts

Prompt 1

I want to use InstructBLIP to answer questions about images in Python. Show me the minimal code to load the model from LAVIS, pass an image and a question, and print the answer.

Prompt 2

How do I use LAVIS to generate image captions with BLIP-2 and evaluate them on a standard dataset?

Prompt 3

I want to extract visual features from images using a LAVIS model for use in a downstream classifier. Show me how to load the model and get embeddings.

Prompt 4

What's the difference between BLIP-2, InstructBLIP, and X-InstructBLIP in the LAVIS library, and which should I use for video understanding?


Verify against the repo before relying on details.