haotian-liu/llava

Analysis updated 2026-05-18

★ 24,755 | Python | Audience · researcher | Complexity · 4/5 | License | Setup · hard

Mindmap

mindmap
  root((LLaVA))
    What it does
      Understands images
      Answers visual questions
      Follows image instructions
    How it works
      Vision encoder
      Language model
      Visual instruction tuning
    Use cases
      Visual Q&A systems
      Image captioning
      Screenshot explanation
    Tech stack
      Python
      Hugging Face
      LLaMA models
      Vision encoders
    Versions
      Original LLaVA
      LLaVA-1.5
      LLaVA-NeXT with video


What do people build with it?

USE CASE 1

Build a visual question-answering system that answers questions about uploaded images.

USE CASE 2

Create an AI assistant that can analyze screenshots and explain what's happening in them.

USE CASE 3

Train a custom multimodal model for domain-specific image understanding tasks.

USE CASE 4

Develop an image captioning tool that generates detailed descriptions of pictures.

What is it built with?

Python, PyTorch, Hugging Face, LLaMA, Vision Transformer

How does it compare?

	haotian-liu/llava	kovidgoyal/calibre	microsoft/jarvis
Stars	24,755	24,777	24,693
Language	Python	Python	Python
Setup difficulty	hard	easy	hard
Complexity	4/5	2/5	4/5
Audience	researcher	general	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1h+

Requires downloading large model weights and a CUDA-capable GPU for inference. Installing PyTorch and resolving dependencies can also be time-consuming.
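As a rough illustration of why the model weights dominate the hardware requirement, the sketch below estimates the memory needed just to hold a model's parameters at different numeric precisions. The 7B and 13B parameter counts and the helper name `weight_memory_gib` are assumptions made for illustration, not figures taken from the repo, and the estimate ignores activations, the key-value cache, and the vision encoder.

```python
# Back-of-envelope estimate of the memory needed to hold model weights only.
# Parameter counts used in the demo are illustrative assumptions.

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the GiB needed to store `num_params` weights at `precision`."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical model sizes; check the repo for the checkpoints it ships.
    for name, params in [("7B", 7e9), ("13B", 13e9)]:
        for precision in BYTES_PER_PARAM:
            gib = weight_memory_gib(params, precision)
            print(f"{name} @ {precision}: ~{gib:.1f} GiB")
```

Under these assumptions, a 7B-parameter model at fp16 needs roughly 13 GiB for weights alone, and real usage is higher once activations and the image encoder are included.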

Use freely for any purpose, including commercial. Keep the license notice and state any changes you make. The license includes a patent grant.

In plain English

LLaVA (Large Language and Vision Assistant) is a research project and open-source AI model that can understand and discuss images and text together. In simple terms, you can show it a picture and ask questions about it in plain language, and it will respond conversationally: describing what it sees, answering questions, and following instructions related to the image.

The core idea is "visual instruction tuning": training an AI to follow human instructions when those instructions involve visual content, not just text. It connects a vision encoder (a system that understands images) to a large language model (LLM, the type of AI behind ChatGPT), allowing the combined system to reason about images and language together.

The project was accepted as an oral presentation at NeurIPS 2023, one of the most competitive AI research conferences. Later versions (LLaVA-1.5, LLaVA-NeXT) improved on the original, achieving top benchmark scores while using only publicly available training data and completing training in about one day on eight A100 GPUs. LLaVA-NeXT also added video understanding and support for newer language models, including LLaMA-3 and Qwen-1.5.

Researchers and developers use LLaVA as a foundation for building multimodal AI applications such as visual question answering, image captioning, or AI assistants that can look at screenshots and explain them. It is built in Python, and its weights are distributed via Hugging Face.
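The connection between the vision encoder and the language model described above can be pictured with a toy sketch. It assumes a single linear projection that maps image patch features into the language model's embedding size, then places the projected image tokens before the text embeddings. Every name, shape, and value here is hypothetical, and no real model is loaded, so treat it as a picture of the data flow and not as the project's implementation.

```python
# Toy sketch of the "vision encoder -> projector -> language model" wiring.
# All shapes, values, and function names are hypothetical illustrations.
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a Python list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_image_features(patch_features, weight, bias):
    """Map each image patch feature into the language model's embedding size."""
    return [linear(patch, weight, bias) for patch in patch_features]


def build_input_sequence(image_tokens, text_embeddings):
    """Place projected image tokens before the text embeddings (a simplification)."""
    return image_tokens + text_embeddings


random.seed(0)
VISION_DIM, LLM_DIM = 4, 6  # assumed toy sizes; real models use far larger ones

# Stand-ins for vision encoder output (3 patches) and embedded text (2 tokens).
patches = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]
text = [[random.random() for _ in range(LLM_DIM)] for _ in range(2)]

# Stand-in for the learned projection weights.
weight = [[random.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

sequence = build_input_sequence(project_image_features(patches, weight, bias), text)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of size 6
```

The point of the sketch is that, after projection, image patches and text tokens share one embedding size, so the language model can process them as a single sequence.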

Copy-paste prompts

Prompt 1

How do I set up LLaVA locally to run image understanding on my own machine?

Prompt 2

Show me how to fine-tune LLaVA on my custom dataset of labeled images.

Prompt 3

What's the difference between LLaVA-1.5 and LLaVA-NeXT, and which should I use for video understanding?

Prompt 4

How can I integrate LLaVA into a Python application to answer questions about user-uploaded images?

Prompt 5

What are the hardware requirements to run LLaVA inference, and can I run it on a consumer GPU?

Frequently asked questions

What is llava?

Open-source AI model that understands images and text together, letting you ask questions about pictures and get conversational answers.

What language is llava written in?

Mainly Python. The stack also includes PyTorch and Hugging Face.

What license does llava use?

Use freely for any purpose, including commercial. Keep the license notice and state any changes you make. The license includes a patent grant.

How hard is llava to set up?

Setup difficulty is rated hard, with roughly 1h+ to a first successful run.

Who is llava for?

Mainly researchers.


Verify against the repo before relying on details.