CogVLM and CogAgent are open-source AI models that can look at an image and answer questions about it, carry on a back-and-forth conversation about what they see, and identify specific regions of an image when asked. The repository contains code for running these models yourself and documentation for fine-tuning them on your own data.

CogVLM-17B is the base visual language model. It has two sets of parameters: one for understanding images and one for language, totaling 17 billion parameters. It can handle images at 490 by 490 pixel resolution and was evaluated across a range of standard vision-and-language tests, placing at or near the top on ten of them at the time of its release.

CogAgent is built on top of CogVLM and adds support for a higher image resolution (1120 by 1120 pixels), which makes it better at reading text in images like screenshots and documents. It also adds a specific capability for controlling graphical interfaces: given a screenshot of a computer screen, it can describe what to click or type to complete a task. This is sometimes called a GUI agent. It was evaluated on nine cross-modal benchmarks and on datasets specifically for computer interface automation.

To run the models locally, you need a machine with one or more Nvidia GPUs. The README documents the minimum GPU memory required for different configurations, including a 4-bit compressed mode that can run with roughly 11 GB of GPU memory. Models can be loaded from Hugging Face with a few lines of Python code, or run through a provided command-line tool or a local web interface.

A newer follow-up model called CogVLM2, based on a different underlying language model, was released in May 2024 and is linked from the README as a recommended upgrade for new projects.
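
As a rough illustration of the GPU memory figures and the Hugging Face loading path mentioned above, the sketch below estimates weights-only memory for a 17-billion-parameter model and shows one plausible way to load it in 4-bit mode. The function names `weight_memory_gb` and `load_cogvlm` are hypothetical, and the model ID `THUDM/cogvlm-chat-hf` and tokenizer ID `lmsys/vicuna-7b-v1.5` are assumptions that should be checked against the README. Loading requires an Nvidia GPU plus the `torch`, `transformers`, and `bitsandbytes` packages; the estimate does not.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes).

    Ignores activations, caches, and any layers left unquantized,
    so real usage will be higher than this figure.
    """
    return num_params * bits_per_param / 8 / 1e9


def load_cogvlm(model_id: str = "THUDM/cogvlm-chat-hf",      # assumed model ID
                tokenizer_id: str = "lmsys/vicuna-7b-v1.5",  # assumed tokenizer ID
                use_4bit: bool = True):
    """Load tokenizer and model; a sketch, not the repository's own script."""
    # Imports are deferred so the estimate above works without a GPU stack.
    import torch
    from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                              LlamaTokenizer)

    tokenizer = LlamaTokenizer.from_pretrained(tokenizer_id)
    kwargs = dict(
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,  # the model ships custom code on the Hub
    )
    if use_4bit:
        # 4-bit quantization via bitsandbytes; weights are placed on the GPU
        # by the loader, so no explicit .to("cuda") is needed in this branch.
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    if not use_4bit:
        model = model.to("cuda")
    return tokenizer, model.eval()


if __name__ == "__main__":
    for bits in (16, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(17e9, bits):.1f} GB")
```

By this estimate, 17 billion parameters occupy about 34 GB at 16 bits and about 8.5 GB at 4 bits for weights alone. The difference between 8.5 GB and the roughly 11 GB quoted above is plausibly runtime overhead such as activations, though the README is the authority on actual requirements.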