Let a Bengali-speaking user run any Android app by voice, hands-free.
Build a phone-literacy tutor that tells the user where to tap instead of doing it for them.
Prototype a vision-driven phone agent for another low-resource language by swapping the Sarvam models.
Demo a screen-aware AI agent that uses accessibility services for real-device automation.
The floating bubble requires a native Android build, and nothing works until you add a Sarvam API key (an OpenAI key is optional) and grant three Android permissions.
Talkie is an Android app that puts a floating bubble on top of whatever else you have open on your phone. You press and hold the bubble, speak a request in your own local language, and Talkie figures out what you want to do and does it for you. According to the README, that includes things like tapping buttons, filling out forms, scrolling, and opening apps. The example language used throughout the README is Bengali.

To use it, you need an Android phone running Android 10 or newer and an API key from Sarvam, a service that handles Bengali speech-to-text and text-to-speech. The Sarvam key is required, and the README mentions a free tier. An OpenAI key is optional. With the OpenAI key, Talkie can look at your screen and take more precise actions; without it, the README says Talkie still works for apps it already knows about by using their deep links.

On first launch, you have to give Talkie three permissions in Android settings: drawing over other apps, the accessibility service, and the microphone. Then you enter your API keys in the Talkie settings screen. After that, the floating bubble is visible everywhere, and the interaction loop is to hold the bubble, speak, and let Talkie work.

There is a setting called Guide mode. With Guide mode off, Talkie does the task itself. With Guide mode on, Talkie does not act for you; instead it tells you in Bengali what to tap or do, which the README pitches as a way for someone to learn how to use a phone or an app.

The README also includes a short diagram of what happens under the hood. Your speech goes to a Sarvam transcription model called Saaras, then GPT-4.1 Vision reads a screenshot of your screen and decides on an action such as a tap at a given coordinate, then the Android accessibility service performs that tap, and finally a Sarvam voice model called Bulbul speaks the result back in Bengali.
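The flow described above (transcribe, decide, act or guide, speak) can be sketched roughly as follows. This is a minimal illustration, not Talkie's actual code: every type, function name, and deep-link URI here is hypothetical, the four services are injected stand-ins for Saaras, the vision model, the accessibility service, and Bulbul, and the replies are in English only for readability. It also assumes that Guide mode reuses the same decided action and simply describes it, which the README does not confirm.

```typescript
// Hypothetical types: none of these names come from the Talkie codebase.
type Action =
  | { kind: "tap"; x: number; y: number }
  | { kind: "openDeepLink"; uri: string };

interface Settings {
  openAiKey?: string; // optional: enables screenshot-based decisions
  guideMode: boolean; // true: explain the step instead of performing it
}

// Injected stand-ins for the real services, so the flow can run without a device.
interface Services {
  transcribe(audio: Uint8Array): Promise<string>;     // role played by Saaras
  decideFromScreen(request: string): Promise<Action>; // role played by the vision model
  perform(action: Action): Promise<void>;             // role played by the accessibility service
  speak(text: string): Promise<void>;                 // role played by Bulbul
}

// Placeholder table of "known apps"; the URIs are made up for illustration.
const KNOWN_DEEP_LINKS: Record<string, string> = {
  camera: "example-camera://open",
  maps: "example-maps://open",
};

// Fallback used when no OpenAI key is configured: match a known app by keyword.
function deepLinkFor(request: string): Action | null {
  const lower = request.toLowerCase();
  for (const keyword of Object.keys(KNOWN_DEEP_LINKS)) {
    if (lower.includes(keyword)) {
      return { kind: "openDeepLink", uri: KNOWN_DEEP_LINKS[keyword] };
    }
  }
  return null;
}

// Turn an action into a short human-readable phrase.
function describe(action: Action): string {
  return action.kind === "tap"
    ? `tap at (${action.x}, ${action.y})`
    : `open ${action.uri}`;
}

// One hold-and-speak turn: returns the text that was spoken back to the user.
async function handleUtterance(
  audio: Uint8Array,
  settings: Settings,
  services: Services
): Promise<string> {
  const request = await services.transcribe(audio);
  const action = settings.openAiKey
    ? await services.decideFromScreen(request)
    : deepLinkFor(request);

  let reply: string;
  if (action === null) {
    reply = "Sorry, I cannot do that without screen reading.";
  } else if (settings.guideMode) {
    // Guide mode: describe the step, never perform it.
    reply = `To do this yourself: ${describe(action)}`;
  } else {
    await services.perform(action);
    reply = `Done: ${describe(action)}`;
  }
  await services.speak(reply);
  return reply;
}
```

Passing the services in as an interface keeps the branching logic (key present or not, Guide mode on or off) testable with fakes, independent of how the real app wires up its network calls and native modules.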
The project is built with TypeScript and there is a quick test path using Expo Go, though the floating bubble itself needs a native build.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.