Ask questions about any photo or image and get a plain-English answer from the AI
Analyze video content, including long videos, by having the model summarize it or answer questions about it
Process documents that mix images and text together, such as slideshows or illustrated reports
Evaluate and score outputs from other AI models using LLaVA-Critic-R1 as a quality judge
Requires a Python ML environment with PyTorch and CUDA. Model weights are downloaded from Hugging Face, and training scripts are available for fine-tuning. Online demos exist if you want to skip local setup.
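Before a local install, it can help to confirm that the environment meets the requirements above. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and checking for the `huggingface_hub` client package is an assumption (the summary only says weights come from Hugging Face).

```python
import importlib.util


def check_environment(packages=("torch", "huggingface_hub")):
    """Report which packages are importable and whether CUDA is usable.

    Hypothetical helper for illustration; the package list is an assumption.
    """
    # find_spec locates a top-level package without importing it.
    report = {name: importlib.util.find_spec(name) is not None for name in packages}

    cuda_ok = False
    if report.get("torch"):
        try:
            import torch  # imported lazily so the check also runs without PyTorch

            cuda_ok = torch.cuda.is_available()
        except (ImportError, OSError):
            # A broken or partial PyTorch install counts as "no CUDA".
            cuda_ok = False
    report["cuda"] = cuda_ok
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:16s} {'ok' if ok else 'missing'}")
```

If `cuda` is reported as missing, the online demos are likely the easier way to try the models.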
LLaVA-NeXT is an open research project that produces AI models able to understand images and text together. You can ask a model to describe a picture, answer questions about a photo, or analyze video content. The project comes from an academic lab and releases model weights, training code, datasets, and research papers under one shared codebase.

The project has grown to include several distinct model lines. LLaVA-OneVision handles single images, multiple images at once, and video, with models ranging from 0.5 billion to 72 billion parameters. LLaVA-Video focuses specifically on understanding video content, including long videos, and was trained on a dataset of roughly 1.3 million synthetic video question-and-answer pairs created for this project. LLaVA-NeXT-Interleave processes documents that mix images and text in any order, which is useful for tasks that require reasoning across several visual and textual inputs at once.

The most recent addition is LLaVA-Critic-R1, a model trained to evaluate and critique the outputs of other AI models. It is trained with a reinforcement-learning approach and is positioned as a tool for assessing response quality rather than for directly answering user questions.

All model checkpoints are distributed through Hugging Face. The repository also contains training scripts so researchers can fine-tune the models on their own data or reproduce them. Demos are available at external links for users who want to try the models without setting up anything locally.

The intended audience is AI researchers and developers working on vision-language tasks. The README assumes familiarity with model training and the Python ecosystem for machine learning. It is not aimed at end users looking for a finished product.
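The 0.5 billion to 72 billion parameter range mentioned for LLaVA-OneVision spans very different hardware needs. As a rough back-of-the-envelope sketch, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The function name `weight_memory_gb` is hypothetical, and 16-bit weights (2 bytes each) are an assumption. The estimate ignores activations, the key-value cache, and any framework overhead, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes).

    Assumes every parameter is stored at the same precision
    (2 bytes corresponds to 16-bit weights). Illustrative only.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for label, num_params in [("0.5B", 0.5e9), ("72B", 72e9)]:
        print(f"{label}: about {weight_memory_gb(num_params):.0f} GB of weights at 16-bit")
```

Under these assumptions the smallest model needs about 1 GB for weights and the largest about 144 GB, which is why the larger checkpoints typically call for multiple GPUs or reduced precision.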
Verify against the repo before relying on details.