Analysis updated 2026-05-18
Automate repetitive web tasks on real sites by describing what you want in plain English
Study a from-scratch LLM-driven browser agent implementation without any agent framework dependencies
Use as a starting point for building your own web automation agent with a live-streaming React UI
| | amarjitjim/browserpilot | kitakitaaura/webgraph | lsb11/shopify-capi-validator |
|---|---|---|---|
| Stars | 3 | 3 | 3 |
| Language | JavaScript | JavaScript | JavaScript |
| Setup difficulty | moderate | easy | easy |
| Complexity | 3/5 | 1/5 | 2/5 |
| Audience | developer | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires a Gemini API key. The Python backend and the Node/npm frontend must be started separately, each in its own terminal window.
BrowserPilot is a project that lets an AI agent control a real web browser on your behalf. You type a task in plain English, like "go to Flipkart and search for earbuds under 500 rupees", and the system opens a Chromium browser, reads the current webpage, asks an AI what to do next, carries out that action, takes a screenshot, and repeats this cycle until the task is complete. Every step streams live to a React web interface so you can watch the agent work in real time.
The system is built around a three-step loop: observe, plan, act. In the observe step, it extracts a simplified version of the page's HTML structure (roughly 3,000 tokens) to give the AI a readable summary of what is on screen. The plan step sends that snapshot to Gemini 2.0 Flash, which returns a list of actions as JSON. The act step carries out those actions in the browser: clicking buttons, typing text, navigating to URLs, or scrolling. If an action fails, the error is added to the history so the AI can try a different approach on the next loop iteration.
The project was intentionally built without AI agent frameworks such as LangChain, in order to understand the core mechanics from scratch. The author found that the observe-plan-act loop itself is about 150 lines of code. The hard problems were browser bot detection, unreliable JSON output from the AI, and CSS selector specificity.
Running it requires a Gemini API key and, optionally, a Groq API key. The backend uses Python with FastAPI and Playwright, and the frontend uses React. The project was a 14-day build and was still in progress at the time of the README. The license is MIT.
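The observe-plan-act loop described above, including the way a failed action is fed back through the history, can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `run_agent`, the `observe`/`plan`/`act` callables, the `max_steps` limit, and the `{"type": ...}` action shape with a `"done"` marker are all assumptions made for the example. In the real project, these callables would presumably wrap Playwright and the Gemini API.

```python
from typing import Callable


def run_agent(
    task: str,
    observe: Callable[[], str],
    plan: Callable[[str, str, list], list],
    act: Callable[[dict], None],
    max_steps: int = 10,
) -> list:
    """Run a hypothetical observe-plan-act loop and return the step history."""
    history = []
    for _ in range(max_steps):
        # Observe: get a compact snapshot of the current page.
        snapshot = observe()
        # Plan: ask the model for the next actions, given task and history.
        actions = plan(task, snapshot, history)
        if not actions:
            break
        for action in actions:
            # Assumed convention: a "done" action ends the run.
            if action.get("type") == "done":
                return history
            try:
                # Act: perform the action in the browser.
                act(action)
                history.append({"action": action, "ok": True})
            except Exception as exc:
                # Record the failure so the planner can try something else,
                # then stop this batch and re-observe the page.
                history.append({"action": action, "ok": False, "error": str(exc)})
                break
    return history
```

Passing the three steps in as callables keeps the loop itself small and testable without a browser, which is consistent with the author's remark that the loop is the easy part.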
An AI agent that controls a real Chromium browser via an observe-plan-act loop driven by Gemini 2.0 Flash, with a live React frontend streaming screenshots and step logs over WebSocket.
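Because the planner's JSON output is named as one of the hard problems, a defensive parsing step is a natural fit between plan and act. The sketch below shows one possible approach and is not taken from the repository: the `extract_actions` helper, the `ALLOWED_TYPES` set, and the `"type"` field are hypothetical, chosen to mirror the four action kinds the README mentions (click, type, navigate, scroll).

```python
import json

# Hypothetical action vocabulary, mirroring the actions named in the README.
ALLOWED_TYPES = {"click", "type", "navigate", "scroll"}


def extract_actions(raw: str) -> list[dict]:
    """Pull a JSON array of actions out of free-form model text.

    Tolerates surrounding prose or markdown fences by slicing from the
    first '[' to the last ']'. Returns an empty list if nothing valid
    can be parsed, so the caller can re-prompt instead of crashing.
    """
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only well-formed actions with a recognized type.
    return [
        item for item in parsed
        if isinstance(item, dict) and item.get("type") in ALLOWED_TYPES
    ]
```

An empty result can be treated like any other failed step, for example by recording it in the history and asking the model again.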
Mainly JavaScript. The stack also includes Python and React.
MIT: use freely for any purpose, including commercial, with no restrictions beyond keeping the copyright and license notice.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly developers.
Verify against the repo before relying on details.