explaingit

haotian-liu/llava

24,802 · Python · Audience · researcher · Complexity · 4/5 · Stale · License · Setup · hard

TLDR

Open-source AI model that understands images and text together, letting you ask questions about pictures and get conversational answers.

Mindmap

mindmap
  root((LLaVA))
    What it does
      Understands images
      Answers visual questions
      Follows image instructions
    How it works
      Vision encoder
      Language model
      Visual instruction tuning
    Use cases
      Visual Q&A systems
      Image captioning
      Screenshot explanation
    Tech stack
      Python
      Hugging Face
      LLaMA models
      Vision encoders
    Versions
      Original LLaVA
      LLaVA-1.5
      LLaVA-NeXT with video

Things people build with this

USE CASE 1

Build a visual question-answering system that answers questions about uploaded images.

USE CASE 2

Create an AI assistant that can analyze screenshots and explain what's happening in them.

USE CASE 3

Train a custom multimodal model for domain-specific image understanding tasks.

USE CASE 4

Develop an image captioning tool that generates detailed descriptions of pictures.
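Use case 1 could be sketched roughly as follows. This is a minimal sketch, not the repository's own inference script: it assumes the `transformers`, `torch`, `accelerate`, and `Pillow` packages are installed and that the community-converted `llava-hf/llava-1.5-7b-hf` checkpoint on Hugging Face is used. The helper names `build_prompt` and `answer_question` are hypothetical.

```python
# Sketch of single-image question answering with a LLaVA-1.5 checkpoint.
# Assumptions: transformers, torch, accelerate, and Pillow are installed, and
# the community-converted checkpoint below is used instead of the repo's scripts.

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name


def build_prompt(question: str) -> str:
    """Format a question in the LLaVA-1.5 chat style; <image> marks the picture."""
    return f"USER: <image>\n{question.strip()} ASSISTANT:"


def answer_question(image_path: str, question: str, max_new_tokens: int = 128) -> str:
    """Load the model, run generation on one image, and return the reply text."""
    # Heavy imports live here so build_prompt works without a GPU stack.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = LlavaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_prompt(question), return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # The decoded text echoes the prompt; keep only what follows "ASSISTANT:".
    return text.split("ASSISTANT:")[-1].strip()
```

Calling `answer_question("photo.jpg", "What is in this picture?")` would download several gigabytes of weights on first use and realistically needs a CUDA GPU. A real service would load the model once and reuse it across requests rather than reloading it per call as this sketch does.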

Tech stack

Python · PyTorch · Hugging Face · LLaMA · Vision Transformer

Getting it running

Difficulty · hard Time to first run · 1h+

Requires downloading large model weights and a CUDA-capable GPU for inference; installing PyTorch and resolving dependencies can be time-consuming.
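To get a feel for why a GPU with plenty of memory is needed, here is a back-of-the-envelope estimate of the memory required just to hold the weights. It assumes memory is roughly parameter count times bytes per parameter and ignores activations, the KV cache, the vision encoder, and framework overhead, so real usage will be higher. The 7B and 13B sizes are illustrative.

```python
# Rough estimate of GPU memory needed just to hold model weights.
# Assumption: memory ~= parameter count x bytes per parameter. Activations,
# the KV cache, and framework overhead are ignored, so real usage is higher.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, dtype: str = "fp16") -> float:
    """Return the approximate size of the weights in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Illustrative sizes: a 7B and a 13B language-model backbone.
    for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
        for dtype in ("fp16", "int8", "int4"):
            print(f"{name} {dtype}: {weight_memory_gib(n_params, dtype):.1f} GiB")
```

By this estimate a 7B-parameter model in fp16 needs about 13 GiB for weights alone, which is why quantized variants are the usual route on consumer GPUs.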

Use freely for any purpose, including commercial. Keep the license notice and state any changes you make; the license also includes a patent grant.

In plain English

LLaVA (Large Language and Vision Assistant) is a research project and open-source AI model that can understand and discuss images and text together. You can show it a picture and ask questions in plain language, and it responds conversationally: describing what it sees, answering questions, and following instructions related to the image.

The core idea is "visual instruction tuning": training a model to follow human instructions that involve visual content, not just text. LLaVA connects a vision encoder (a system that understands images) to a large language model (LLM, the type of AI behind ChatGPT), so the combined system can reason about images and language together. The project was accepted as an oral presentation at NeurIPS 2023, one of the most competitive AI research conferences.

Later versions (LLaVA-1.5, LLaVA-NeXT) improved on the original, achieving top benchmark scores while using only publicly available training data and completing training in about one day on eight A100 GPUs. LLaVA-NeXT also added video understanding and support for newer language models, including LLaMA-3 and Qwen-1.5.

Researchers and developers use LLaVA as a foundation for building multimodal AI applications such as visual question answering, image captioning, or AI assistants that can look at screenshots and explain them. It is built in Python, and its weights are distributed via Hugging Face.
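The connection between the vision encoder and the language model can be pictured with a toy data-flow sketch. Everything below is illustrative and not the project's code: the sizes are tiny, the weights are random, the encoder and text embeddings are stand-ins, and a single linear projection is assumed as the connector. The point is only that image patch features are mapped into the language model's embedding space and spliced in where the `<image>` placeholder sits.

```python
# Toy data-flow sketch of feeding an image to a language model, LLaVA-style.
# Illustrative only: tiny sizes, random weights, stand-in encoder and embeddings.
import random

VISION_DIM, LLM_DIM, NUM_PATCHES = 4, 6, 3  # hypothetical toy sizes
IMAGE_TOKEN = "<image>"


def fake_vision_encoder(image_id: int) -> list[list[float]]:
    """Stand-in encoder: return one feature vector per image patch."""
    rng = random.Random(image_id)
    return [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]


def project(features: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Linear projector: map each patch feature from VISION_DIM to LLM_DIM."""
    return [
        [sum(f[i] * weight[i][j] for i in range(VISION_DIM)) for j in range(LLM_DIM)]
        for f in features
    ]


def embed_text(token: str) -> list[float]:
    """Stand-in text embedding, seeded by the token so it is deterministic."""
    rng = random.Random(token)
    return [rng.uniform(-1, 1) for _ in range(LLM_DIM)]


def build_llm_inputs(tokens: list[str], image_id: int, weight: list[list[float]]) -> list[list[float]]:
    """Replace the <image> placeholder with the projected patch embeddings."""
    sequence = []
    for tok in tokens:
        if tok == IMAGE_TOKEN:
            sequence.extend(project(fake_vision_encoder(image_id), weight))
        else:
            sequence.append(embed_text(tok))
    return sequence


rng = random.Random(0)
projector_weight = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
inputs = build_llm_inputs(["USER:", IMAGE_TOKEN, "What", "is", "this?"], 42, projector_weight)
print(len(inputs), len(inputs[0]))  # sequence length and embedding width: 7 6
```

The four text tokens plus three image patches give a sequence of seven vectors, all of the language model's width. In a real system the language model would then process this mixed sequence exactly as it processes ordinary text embeddings.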

Copy-paste prompts

Prompt 1
How do I set up LLaVA locally to run image understanding on my own machine?
Prompt 2
Show me how to fine-tune LLaVA on my custom dataset of labeled images.
Prompt 3
What's the difference between LLaVA-1.5 and LLaVA-NeXT, and which should I use for video understanding?
Prompt 4
How can I integrate LLaVA into a Python application to answer questions about user-uploaded images?
Prompt 5
What are the hardware requirements to run LLaVA inference, and can I run it on a consumer GPU?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.