Run iterative self-improvement rounds on a LangGraph research agent and track which code changes raised or lowered benchmark scores
Benchmark a web-search research agent against DeepResearch Bench tasks and read a scored HTML report of every experiment round
Use the RACE scoring framework and an AI judge to evaluate research answers on comprehensiveness, depth, instruction-following, and readability
Preserve the full experiment history including failed rounds as git commits so you can diff any two versions of the agent
Requires both an OpenAI-compatible API key for the research and judge models and a Tavily API key for web search.
AutoDeepResearch is a small experiment framework that lets an AI model repeatedly improve its own research capabilities by running, scoring, and recording rounds of self-modification. The concept is borrowed from a project by Andrej Karpathy called autoresearch: give an AI a research task, let it propose and apply a small change to its own code, measure whether the change helped, and log everything regardless of the outcome.

The research agent at the center of the project is built on LangGraph and LangChain, two libraries for connecting AI models to tools and data. When given a question, the agent decomposes it into sub-questions, delegates web searches to a subagent that uses the Tavily search API, and assembles the results into a final answer with citations. The code for this agent lives in a single file called deepresearch_agent.py. That file is the only thing that changes between rounds.

Evaluation happens against a fixed set of five research tasks drawn from a benchmark called DeepResearch Bench. Each task has a reference report and a set of checkpoints describing what a complete answer should cover. A scoring framework called RACE grades each answer on four dimensions: how comprehensive it is, how much analytical depth it shows, how well it follows the task instructions, and how readable it is. An AI judge returns scores from one to five on each dimension. The project also tracks latency and token usage, so a change that produces the same quality with lower cost is counted as a win.

Every experiment round is stored as a git commit. Failed rounds are not rolled back: the repository preserves the full history of what was tried and what happened. A human-readable HTML report at experiments/report.html shows the score trend, per-round change notes, code diffs, and the judge's rationale for each answer. A tab-separated file at experiments/results.tsv holds the machine-readable history for anyone who wants to analyze trends.
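The rule that equal quality at lower cost counts as a win can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `RoundResult` class, the `is_win` function, the unweighted mean across the four dimensions, and the `tolerance` value are all assumptions made here for the example, and the real RACE aggregation may weight dimensions differently.

```python
from dataclasses import dataclass
from statistics import mean

# The four RACE dimensions named in the description above.
DIMENSIONS = ("comprehensiveness", "depth", "instruction_following", "readability")


@dataclass
class RoundResult:
    """Hypothetical summary of one experiment round (not the repo's schema)."""
    scores: dict[str, float]  # mean judge score per dimension, on the 1-5 scale
    latency_s: float          # wall-clock seconds across the task set
    tokens: int               # total tokens consumed

    @property
    def quality(self) -> float:
        # Assumption: a plain unweighted mean over the four dimensions.
        return mean(self.scores[d] for d in DIMENSIONS)


def is_win(baseline: RoundResult, candidate: RoundResult, tolerance: float = 0.05) -> bool:
    """Return True if the candidate round should count as an improvement."""
    delta = candidate.quality - baseline.quality
    if delta > tolerance:
        return True   # clearly better quality
    if delta < -tolerance:
        return False  # clearly worse quality
    # Quality is effectively unchanged: require lower cost on at least one
    # axis and no regression on the other.
    cheaper = candidate.tokens < baseline.tokens or candidate.latency_s < baseline.latency_s
    not_worse = candidate.tokens <= baseline.tokens and candidate.latency_s <= baseline.latency_s
    return cheaper and not_worse


flat = {d: 3.0 for d in DIMENSIONS}
baseline = RoundResult(scores=flat, latency_s=100.0, tokens=10_000)
same_quality_cheaper = RoundResult(scores=flat, latency_s=100.0, tokens=8_000)
print(is_win(baseline, same_quality_cheaper))  # True
```

The tolerance band keeps small judge-score noise from being read as a real gain or loss; its width here is arbitrary and would need tuning against how much the judge's scores vary between identical runs.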
Setup requires Python with uv, an OpenAI-compatible API key for the research and judge models, and a Tavily API key for web search.
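Before starting a round, it can help to confirm that both keys are present. The sketch below assumes the keys are supplied through environment variables named OPENAI_API_KEY and TAVILY_API_KEY, which are the conventional names for these services; the repository may use different names or a .env file, and the `missing_keys` helper is hypothetical.

```python
import os

# Assumed variable names; check the repository's own setup notes.
REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_keys(env=None):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "sk-example"}))  # ['TAVILY_API_KEY']
```

Calling `missing_keys()` with no argument checks the real process environment.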
Verify against the repo before relying on details.