Analysis updated 2026-05-18
Build a visual question-answering system that answers questions about uploaded images.
Create an AI assistant that can analyze screenshots and explain what's happening in them.
Train a custom multimodal model for domain-specific image understanding tasks.
Develop an image captioning tool that generates detailed descriptions of pictures.
| | haotian-liu/llava | kovidgoyal/calibre | microsoft/jarvis |
|---|---|---|---|
| Stars | 24,755 | 24,777 | 24,693 |
| Language | Python | Python | Python |
| Setup difficulty | hard | easy | hard |
| Complexity | 4/5 | 2/5 | 4/5 |
| Audience | researcher | general | researcher |
Figures from each repo's GitHub metadata at analysis time.
Requires downloading large model weights and a GPU with CUDA for inference. PyTorch compilation and dependency resolution can also be time-consuming.
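As a rough guide to how large the model weights can be, the sketch below multiplies parameter count by bytes per parameter. The 7B and 13B parameter counts and the precisions listed are illustrative assumptions, not figures from this analysis, and the estimate ignores activations and runtime overhead.

```python
# Rough weight-size estimate: parameters x bytes per parameter.
# Parameter counts and precisions below are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_size_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9)]:
        for precision in ("fp16", "int4"):
            size = weight_size_gb(params, precision)
            print(f"{name} @ {precision}: ~{size:.1f} GB")
```

Comparing the result with available disk space and GPU memory before starting a download can save a failed first run.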
LLaVA (Large Language and Vision Assistant) is a research project and open-source AI model that can understand and discuss both images and text together. In simple terms, you can show it a picture and ask questions about it in plain language, and it will respond conversationally: describing what it sees, answering questions, and following instructions related to the image.

The core idea is "visual instruction tuning": training an AI to follow human instructions when those instructions involve visual content, not just text. It connects a vision encoder (a system that understands images) to a large language model (LLM, the type of AI behind ChatGPT), allowing the combined system to reason about images and language together.

The project was accepted as an oral presentation at NeurIPS 2023, one of the most competitive AI research conferences. Later versions (LLaVA-1.5, LLaVA-NeXT) improved on the original, achieving top benchmark scores while using only publicly available training data and completing training in about one day on a standard cluster of eight high-end GPUs (A100s). LLaVA-NeXT also added video understanding and support for newer language models, including LLaMA-3 and Qwen-1.5.

Researchers and developers use LLaVA as a foundation for building multimodal AI applications such as visual question answering, image captioning, or AI assistants that can look at screenshots and explain them. It is built in Python, and its weights are distributed via Hugging Face.
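The idea of connecting a vision encoder to a language model can be illustrated with a toy sketch. It is not the project's code: the dimensions, the random matrix standing in for a learned projector, and every function name are hypothetical. It only shows the general pattern of projecting image features into the language model's embedding space and placing them in front of the text tokens.

```python
# Toy sketch of connecting a vision encoder to a language model.
# All dimensions, weights, and helper names are hypothetical.
import random

VISION_DIM = 4  # size of each image-patch feature (assumed)
TEXT_DIM = 3    # size of each language-model token embedding (assumed)


def make_projection(in_dim, out_dim, seed=0):
    """Random in_dim x out_dim matrix standing in for a learned projector."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, matrix):
    """Map each feature vector into the language model's embedding space."""
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*matrix)]
        for vec in features
    ]


def build_input_sequence(image_features, text_embeddings, matrix):
    """Prepend projected image 'tokens' to the text token embeddings."""
    return project(image_features, matrix) + text_embeddings


if __name__ == "__main__":
    patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # 2 fake patches
    question = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 3 fake tokens
    projector = make_projection(VISION_DIM, TEXT_DIM)
    sequence = build_input_sequence(patches, question, projector)
    print(len(sequence), "embeddings of size", len(sequence[0]))
```

In a real system the projector is trained rather than random, and the combined sequence is fed to the language model, which then generates the answer.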
Open-source AI model that understands images and text together, letting you ask questions about pictures and get conversational answers.
Mainly Python. The stack also includes PyTorch and Hugging Face.
Use freely for any purpose, including commercial use. Keep the license notice and disclose any changes you make; the license includes a patent grant.
Setup difficulty is rated hard, with roughly 1h+ to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.