explaingit

amarjitjim/browserpilot

Analysis updated 2026-05-18

Stars · 3 | Language · JavaScript | Audience · developer | Complexity · 3/5 | License · MIT | Setup · moderate

TLDR

An AI agent that controls a real Chromium browser via an observe-plan-act loop driven by Gemini 2.0 Flash, with a live React frontend streaming screenshots and step logs over WebSocket.

Mindmap

mindmap
  root((repo))
    What it does
      Browser automation
      Natural language tasks
      AI-driven loop
    How It Works
      Observe page DOM
      Plan with Gemini
      Act with Playwright
    Frontend
      React 18 UI
      Live screenshots
      WebSocket streaming
    Tech Stack
      FastAPI backend
      Playwright browser
      Gemini and Groq

What do people build with it?

USE CASE 1

Automate repetitive web tasks on real sites by describing what you want in plain English

USE CASE 2

Study a from-scratch LLM-driven browser agent implementation without any agent framework dependencies

USE CASE 3

Use as a starting point for building your own web automation agent with a live-streaming React UI

What is it built with?

Python, JavaScript, React, FastAPI, Playwright, Gemini 2.0 Flash, Groq, Vite

How does it compare?

                 | amarjitjim/browserpilot | kitakitaaura/webgraph | lsb11/shopify-capi-validator
Stars            | 3                       | 3                     | 3
Language         | JavaScript              | JavaScript            | JavaScript
Setup difficulty | moderate                | easy                  | easy
Complexity       | 3/5                     | 1/5                   | 2/5
Audience         | developer               | developer             | developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Requires a Gemini API key. The Python backend and the Node/npm frontend must be started separately, each in its own terminal window.

MIT: use freely for any purpose, including commercial, with no restrictions beyond keeping the copyright notice.

In plain English

BrowserPilot is a project that lets an AI agent control a real web browser on your behalf. You type a task in plain English, such as "go to Flipkart and search for earbuds under 500 rupees", and the system opens a Chromium browser, reads the current webpage, asks an AI what to do next, carries out that action, takes a screenshot, and repeats this cycle until the task is complete. Every step streams live to a React web interface so you can watch the agent work in real time.

The system is built around a three-step loop: observe, plan, act. In the observe step, it extracts a simplified version of the page's HTML structure (roughly 3,000 tokens) to give the AI a readable summary of what is on screen. The plan step sends that snapshot to Gemini 2.0 Flash, which returns a list of actions as JSON. The act step carries out those actions in the browser: clicking buttons, typing text, navigating to URLs, or scrolling. If an action fails, the error is added to the history so the AI can try a different approach on the next loop iteration.

The project was deliberately built without AI agent frameworks such as LangChain, in order to understand the core mechanics from scratch. The author found that the observe-plan-act loop itself is about 150 lines of code; the hard problems were browser bot detection, unreliable JSON output from the AI, and CSS selector specificity.

Running it requires an API key for Gemini and, optionally, one for Groq. The backend uses Python with FastAPI and Playwright; the frontend uses React. The project was a 14-day build and was still in progress at the time of the README. The license is MIT.
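The observe-plan-act loop described above can be illustrated with a minimal sketch. This is not the repo's code: the function name `run_agent`, the `"done"` action type, the history string format, and the `max_steps` limit are all assumptions made for illustration. The browser and the model are passed in as plain callables, so the sketch runs without Playwright or a Gemini key.

```python
from typing import Callable


def run_agent(
    task: str,
    observe: Callable[[], str],
    plan: Callable[[str, str, list[str]], list[dict]],
    act: Callable[[dict], None],
    max_steps: int = 10,  # hypothetical safety limit, not taken from the repo
) -> list[str]:
    """Run a generic observe-plan-act loop and return the step history."""
    history: list[str] = []
    for _ in range(max_steps):
        # Observe: get a compact text snapshot of the current page.
        snapshot = observe()
        # Plan: ask the model for a list of actions, given the task,
        # the snapshot, and everything that has happened so far.
        actions = plan(task, snapshot, history)
        for action in actions:
            # Assumed convention: a "done" action ends the run.
            if action.get("type") == "done":
                history.append("done")
                return history
            try:
                # Act: perform the action in the browser.
                act(action)
                history.append(f"ok: {action}")
            except Exception as exc:
                # Record the failure so the next plan can change approach,
                # then drop the remaining actions and re-observe.
                history.append(f"error: {action} -> {exc}")
                break
    return history
```

In a real setup, `observe` would wrap the DOM extraction, `plan` the model call, and `act` the browser actions. Keeping them as injected callables is a design choice of this sketch that makes the loop easy to exercise with fakes.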

Copy-paste prompts

Prompt 1
Walk me through how BrowserPilot's observe-plan-act loop works. What does it extract from the page, what does it send to Gemini, and how does it handle failed actions?
Prompt 2
How do I run BrowserPilot locally? What environment variables do I need and what commands start the backend and frontend?
Prompt 3
How does BrowserPilot handle bot detection on sites like Amazon? What does the stealth Chromium configuration in browser.py do?
Prompt 4
Why does BrowserPilot need defensive JSON parsing in _parse_actions()? What does Gemini sometimes return instead of a valid JSON array?
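As background for Prompt 4, here is a minimal sketch of what defensive parsing of a model's action list could look like. It is not the repo's `_parse_actions()`: the name `parse_actions_sketch` is hypothetical, and the failure modes it handles (a wrapping code fence, prose around the array, a single object instead of an array) are assumed from common LLM behavior, not taken from the project.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence marker, built without typing it literally


def parse_actions_sketch(raw: str) -> list[dict]:
    """Best-effort extraction of a list of action dicts from model output."""
    text = raw.strip()
    # If the reply contains a fenced block, keep only what is inside it.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Try the whole text first, then the outermost [...] span,
    # which covers replies that wrap the array in extra prose.
    candidates = [text]
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            data = [data]  # tolerate a single object instead of an array
        if isinstance(data, list):
            # Keep only dict entries; anything else is not an action.
            return [item for item in data if isinstance(item, dict)]
    # Nothing parseable: return no actions so the caller can re-plan.
    return []
```

Returning an empty list on failure, instead of raising, lets the calling loop treat a bad reply like any other failed step and ask the model again.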

Frequently asked questions

What is browserpilot?

An AI agent that controls a real Chromium browser via an observe-plan-act loop driven by Gemini 2.0 Flash, with a live React frontend streaming screenshots and step logs over WebSocket.

What language is browserpilot written in?

Mainly JavaScript. The stack also includes Python and React.

What license does browserpilot use?

MIT: use freely for any purpose, including commercial, with no restrictions beyond keeping the copyright notice.

How hard is browserpilot to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is browserpilot for?

Mainly developers.

Verify against the repo before relying on details.