llava-vl/llava-next

★ 4,657 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((LLaVA-NeXT))
    Model Lines
      OneVision Images
      LLaVA Video
      Interleave Docs
      Critic R1 Evaluator
    Capabilities
      Image Understanding
      Video Analysis
      Mixed Doc Reasoning
      Output Evaluation
    Training
      Fine-tune Scripts
      Custom Datasets
      Reproduce Models
    Distribution
      HuggingFace Weights
      Online Demos
      Research Papers


Things people build with this

USE CASE 1

Ask questions about any photo or image and get a plain-English answer from the AI

USE CASE 2

Analyze video content, including long videos, by having the model watch and summarize or answer questions

USE CASE 3

Process documents that mix images and text together, such as slideshows or illustrated reports

USE CASE 4

Evaluate and score outputs from other AI models using LLaVA-Critic-R1 as a quality judge

Tech stack

Python | PyTorch | Hugging Face | Transformers | CUDA | Reinforcement Learning

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a Python ML environment with PyTorch and CUDA. Model weights are downloaded from Hugging Face, and training scripts are available for fine-tuning. Online demos exist if you want to skip local setup.

Open research release with publicly available model weights and training code. License terms are not explicitly stated in the explanation.
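Before spending a day on setup, it can help to check whether the environment described above is in place. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the default package list are illustrative assumptions, and the repository's own requirements file remains the authority on what must be installed.

```python
import importlib.util

# Assumption: these are the import names of the packages listed in the
# tech stack. The repository may require more (or pinned) dependencies.
REQUIRED_PACKAGES = ("torch", "transformers")


def check_environment(packages=REQUIRED_PACKAGES):
    """Report which packages are importable and whether CUDA is usable.

    Returns a dict mapping each package name to True/False, plus a
    "cuda" key that is True only if torch is installed and sees a GPU.
    """
    report = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }

    cuda_ok = False
    if report.get("torch"):
        try:
            import torch  # imported lazily so the check works without it

            cuda_ok = torch.cuda.is_available()
        except ImportError:
            # torch was found on disk but could not be imported
            cuda_ok = False
    report["cuda"] = cuda_ok
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If any entry prints `missing`, the online demos are likely the faster route to a first result.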

In plain English

LLaVA-NeXT is an open research project producing AI models that can understand both images and text together. You can have the model describe what is in a picture, answer questions about a photo, or analyze video content. The project comes from an academic lab and releases model weights, training code, datasets, and research papers under one shared codebase.

The project has grown to include several distinct model lines. LLaVA-OneVision handles single images, multiple images at once, and video, with models ranging from 0.5 billion to 72 billion parameters. LLaVA-Video focuses specifically on understanding video content, including long videos, and was trained on a dataset of roughly 1.3 million synthetic video question-and-answer pairs created for this project. LLaVA-NeXT-Interleave processes documents that mix images and text in any order, which is useful for tasks that require reasoning across several visual and textual inputs at once.

The most recent addition is LLaVA-Critic-R1, a model trained to evaluate and critique the outputs of other AI models. It is trained using a reinforcement-learning approach and is positioned as a tool for assessing response quality rather than directly answering user questions.

All model checkpoints are distributed through Hugging Face. The repository also contains training scripts so researchers can fine-tune or reproduce the models on their own data. Demos are available at external links for users who want to try the models without setting up anything locally.

The intended audience is AI researchers and developers working on vision-language tasks. The README assumes familiarity with model training and the Python ecosystem for machine learning. It is not aimed at end users looking for a finished product.
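The 0.5 billion to 72 billion parameter range mentioned above translates into very different hardware needs. As a rough back-of-the-envelope sketch (not taken from the repository), the memory needed to hold the weights is approximately the parameter count times the bytes per parameter. The function name `weight_memory_gb` and the precision table are illustrative assumptions, and the estimate covers weights only: activations, the vision encoder's inputs, the attention cache, and any optimizer state for fine-tuning add to it.

```python
# Bytes used to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp16"):
    """Estimate memory for model weights alone, in decimal gigabytes.

    This is a lower bound on what inference needs; it ignores
    activations, caches, and framework overhead.
    """
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # The two ends of the parameter range quoted in the text.
    for label, params in (("0.5B", 500_000_000), ("72B", 72_000_000_000)):
        gb = weight_memory_gb(params, "fp16")
        print(f"{label} parameters at fp16: about {gb:.1f} GB of weights")
```

Under these assumptions, the smallest model needs about 1 GB for half-precision weights and the largest about 144 GB, which is why the larger checkpoints generally call for multiple GPUs or reduced precision.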

Copy-paste prompts

Prompt 1

I have the LLaVA-NeXT repo. How do I load a LLaVA-OneVision checkpoint from Hugging Face and run inference on a single image using Python?

Prompt 2

Using LLaVA-NeXT, write a Python script that takes a video file as input and asks the model to summarize what happens in the video.

Prompt 3

I want to fine-tune a LLaVA-NeXT model on my own image-question-answer dataset. Walk me through the training script arguments I need to set in this repo.

Prompt 4

How do I use LLaVA-Critic-R1 from the llava-next repo to evaluate and score a response generated by another AI model?

Prompt 5

Show me how to run LLaVA-NeXT-Interleave on a document that contains both images and text mixed together, using the code in this repository.


Verify against the repo before relying on details.