zhoujx4/autodeepresearch

★ 12 | HTML | Audience · researcher | Complexity · 4/5 | Setup · moderate

Mindmap

mindmap
  root((autodeepresearch))
    What it does
      Self-improve agent
      Score each round
      Commit all changes
      Show HTML report
    Research Agent
      LangGraph based
      Sub-question splitting
      Tavily web search
      Citations in output
    Scoring
      RACE framework
      AI judge
      Four dimensions
      Token and cost tracking
    Setup
      Python with uv
      OpenAI API key
      Tavily API key

Things people build with this

USE CASE 1

Run iterative self-improvement rounds on a LangGraph research agent and track which code changes raised or lowered benchmark scores

USE CASE 2

Benchmark a web-search research agent against DeepResearch Bench tasks and read a scored HTML report of every experiment round

USE CASE 3

Use the RACE scoring framework and an AI judge to evaluate research answers on comprehensiveness, depth, instruction-following, and readability

USE CASE 4

Preserve the full experiment history including failed rounds as git commits so you can diff any two versions of the agent

Tech stack

Python, LangGraph, LangChain, HTML

Getting it running

Difficulty · moderate | Time to first run · 1h+

Requires both an OpenAI-compatible API key for the research and judge models and a Tavily API key for web search.

In plain English

AutoDeepResearch is a small experiment framework that lets an AI model repeatedly improve its own research capabilities by running, scoring, and recording rounds of self-modification. The concept is borrowed from a project by Andrej Karpathy called autoresearch: give an AI a research task, let it propose and apply a small change to its own code, measure whether the change helped, and log everything regardless of the outcome.

The research agent at the center of the project is built on LangGraph and LangChain, two libraries for connecting AI models to tools and data. When given a question, the agent decomposes it into sub-questions, delegates web searches to a subagent that uses the Tavily search API, and assembles the results into a final answer with citations. The code for this agent lives in a single file called deepresearch_agent.py. That file is the only thing that changes between rounds.

Evaluation happens against a fixed set of five research tasks drawn from a benchmark called DeepResearch Bench. Each task has a reference report and a set of checkpoints describing what a complete answer should cover. A scoring framework called RACE grades each answer on four dimensions: how comprehensive it is, how much analytical depth it shows, how well it follows the task instructions, and how readable it is. An AI judge returns scores from one to five on each dimension. The project also tracks latency and token usage, so a change that produces the same quality with lower cost is counted as a win.

Every experiment round is stored as a git commit. Failed rounds are not rolled back: the repository preserves the full history of what was tried and what happened. A human-readable HTML report at experiments/report.html shows the score trend, per-round change notes, code diffs, and the judge's rationale for each answer. A tab-separated file at experiments/results.tsv holds the machine-readable history for anyone who wants to analyze trends.

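The scoring and "same quality at lower cost is a win" rule described above can be sketched in a few lines of Python. This is an illustration only, not the repository's code: the names `RoundResult`, `quality`, and `is_win`, the unweighted averaging across dimensions and tasks, and the `tolerance` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean

# The four RACE dimensions named in the summary (key spellings are hypothetical).
DIMENSIONS = ("comprehensiveness", "depth", "instruction_following", "readability")


@dataclass
class RoundResult:
    """Hypothetical record of one experiment round."""
    task_scores: list  # one dict per task: dimension name -> judge score (1 to 5)
    total_tokens: int  # token usage for the whole round

    def __post_init__(self):
        # Reject missing dimensions or scores outside the 1-5 judge scale.
        for scores in self.task_scores:
            for dim in DIMENSIONS:
                if not 1 <= scores[dim] <= 5:
                    raise ValueError(f"{dim} score out of range: {scores[dim]}")

    def quality(self) -> float:
        # Assumption: plain average over dimensions, then over tasks.
        return mean(mean(s[d] for d in DIMENSIONS) for s in self.task_scores)


def is_win(candidate: RoundResult, baseline: RoundResult, tolerance: float = 0.05) -> bool:
    """Return True if the candidate round beats the baseline.

    A round wins if quality rises by more than `tolerance`, or if quality
    is unchanged (within `tolerance`) and the round used fewer tokens.
    """
    delta = candidate.quality() - baseline.quality()
    if delta > tolerance:
        return True
    if abs(delta) <= tolerance:
        return candidate.total_tokens < baseline.total_tokens
    return False


if __name__ == "__main__":
    same = {d: 4 for d in DIMENSIONS}
    baseline = RoundResult(task_scores=[same] * 5, total_tokens=120_000)
    cheaper = RoundResult(task_scores=[same] * 5, total_tokens=90_000)
    print("cheaper round counts as a win:", is_win(cheaper, baseline))
```

Because failed rounds are still committed, a check like this would only label a round, not decide whether it is kept in history.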
Setup requires Python with uv, an OpenAI-compatible API key for the research and judge models, and a Tavily API key for web search.
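Before a first run, it can help to confirm that both keys are visible to the process. The sketch below assumes the keys are read from the environment variables `OPENAI_API_KEY` and `TAVILY_API_KEY`, which are the conventional names but are not confirmed by this summary; check the repository for the names it actually expects. The helper `missing_keys` is hypothetical.

```python
import os

# Assumed variable names; the repository may use different ones.
REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("Both API keys are set.")
```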

Copy-paste prompts

Prompt 1

I cloned autodeepresearch. Walk me through setting up my OpenAI-compatible API key and Tavily API key with uv, then running the first experiment round.

Prompt 2

Help me read the experiments/report.html output from autodeepresearch and interpret what the score trend and per-round change notes are telling me.

Prompt 3

I want to add a sixth benchmark task to autodeepresearch. Show me the format the existing five tasks use and how to add a new one with reference checkpoints.

Prompt 4

Explain how autodeepresearch's RACE scoring works and what each of the four dimensions measures so I can interpret the AI judge's scores.

Prompt 5

Show me how to modify deepresearch_agent.py to replace Tavily with a different search API without breaking the autodeepresearch evaluation loop.

Verify against the repo before relying on details.