JoyAI-VL-Interaction is a research project from JD.com that builds an AI model capable of watching a live video feed and deciding on its own when to speak up, without waiting to be asked. Most current AI video assistants are turn-based: they wait for a question, then answer it. This project instead aims for a model that watches a stream continuously and speaks unprompted when something worth responding to happens, such as a pot boiling over or a child approaching something dangerous.

The core model is a vision-language model with about 8 billion parameters. Every second, it makes a three-way decision: speak, stay silent, or hand the task to a more capable model running in the background. It was trained on over four million time-labeled video clips in which each second is marked with the correct one of those three actions. Reinforcement learning was then used to refine the timing behavior.

To keep up with live video without overwhelming computing resources, the system uses a video compression approach that spends fewer processing tokens on predictable frames and saves full detail for moments when the scene actually changes. The system also includes long-term memory, so it can track context across a long session, and a plug-in speech layer for spoken conversation.

The team tested the model against the video-call features in Doubao and Gemini across 58 real event-driven scenarios. In those comparisons, human raters preferred JoyAI-VL-Interaction in about 78 percent of cases versus Doubao and 88 percent versus Gemini. The team notes that those products are backed by much larger models and years of product work, and that JoyAI-VL-Interaction is not yet competitive on general open-ended chat or broad knowledge tasks.

The model weights, training data, training recipe, and deployment system are planned for open release around June 20, 2026. At the time the README was written, only the technical report and a blog post were publicly available.
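
The per-second decision and the change-dependent token spending described above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the project's code: the names `Action`, `Frame`, `token_budget`, `run_stream`, and `toy_policy` are hypothetical, the token counts (16 and 256) and the change threshold are arbitrary, and a hand-written rule stands in for the learned model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class Action(Enum):
    """The three per-second choices named in the summary."""
    SPEAK = "speak"
    SILENT = "silent"
    DELEGATE = "delegate"  # hand off to a larger background model


@dataclass
class Frame:
    second: int
    pixels: Sequence[float]  # toy stand-in for real image features


def frame_change(prev: Sequence[float], cur: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def token_budget(change: float, low: int = 16, high: int = 256,
                 threshold: float = 0.1) -> int:
    """Assumed rule: few tokens for predictable frames, many for changed ones."""
    return high if change >= threshold else low


def toy_policy(frame: Frame, tokens: int) -> Action:
    """Hypothetical hand-written rule standing in for the learned model."""
    peak = max(frame.pixels)
    if peak > 0.9:
        return Action.DELEGATE
    if peak > 0.5:
        return Action.SPEAK
    return Action.SILENT


def run_stream(frames: List[Frame],
               policy: Callable[[Frame, int], Action]
               ) -> List[Tuple[int, int, Action]]:
    """Apply the policy once per frame; log (second, tokens, action)."""
    log: List[Tuple[int, int, Action]] = []
    prev: Optional[Frame] = None
    for frame in frames:
        # The first frame has no predecessor, so treat it as fully changed.
        change = 1.0 if prev is None else frame_change(prev.pixels, frame.pixels)
        tokens = token_budget(change)
        log.append((frame.second, tokens, policy(frame, tokens)))
        prev = frame
    return log


if __name__ == "__main__":
    stream = [
        Frame(0, [0.1, 0.1]),
        Frame(1, [0.1, 0.1]),   # unchanged scene: small token budget
        Frame(2, [0.6, 0.1]),   # scene changes: full budget
        Frame(3, [0.95, 0.1]),
    ]
    for second, tokens, action in run_stream(stream, toy_policy):
        print(second, tokens, action.value)
```

In this toy version the policy ignores the token budget. In a real system the budget would control how much visual detail the model receives before it chooses an action.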
Verify against the repo before relying on details.