explaingit

jurio0304/codex_automated_paper_reader

19 · Python · Audience · researcher · Complexity · 3/5 · Active · License · Setup · moderate

TLDR

Daily pipeline that fetches arXiv and OpenReview papers, deduplicates and pre-scores them with Python, then asks OpenAI Codex to read the shortlist and write a Markdown report.

Mindmap

mindmap
  root((CAPR))
    Inputs
      arXiv metadata
      OpenReview entries
      Research profile
    Outputs
      Daily Markdown report
      Scored candidate JSON
      Run logs
    Use Cases
      Daily literature triage
      Score papers by relevance
      Skip duplicate-batch days
    Tech Stack
      Python
      OpenAI Codex
      YAML
      pytest

Things people build with this

USE CASE 1

Run a nightly job that pulls recent arXiv and OpenReview papers and produces a one-page daily reading list.

USE CASE 2

Configure positive and negative keywords plus an arXiv category set in config.yaml to focus on a single research area.

USE CASE 3

Skip duplicate work when the candidate list matches yesterday's, with Codex writing a short no-new-batch note instead.

USE CASE 4

Add OpenAlex as a third source by flipping the off-by-default flag once you have the API key set up.
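The keyword setup in use case 2 can be pictured as a small rule-based pre-score. The sketch below is only an illustration of that idea, not the repository's actual scoring code: the `Paper` class, the `keyword_score` function, and the weights (2.0 for a title hit, 1.0 for an abstract hit, minus 1.0 for a negative keyword) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical minimal record; the real schema may carry more fields."""
    title: str
    abstract: str


def keyword_score(paper, positive, negative, title_weight=2.0):
    """Score a paper with simple substring rules (weights are assumptions).

    A positive keyword found in the title adds `title_weight`; found only in
    the abstract it adds 1.0. Each negative keyword found anywhere subtracts 1.0.
    """
    title = paper.title.lower()
    abstract = paper.abstract.lower()
    score = 0.0
    for keyword in positive:
        keyword = keyword.lower()
        if keyword in title:
            score += title_weight
        elif keyword in abstract:
            score += 1.0
    for keyword in negative:
        keyword = keyword.lower()
        if keyword in title or keyword in abstract:
            score -= 1.0
    return score


example = Paper(
    title="Retrieval-Augmented Generation for Code",
    abstract="We study dense retrieval with a survey of benchmarks.",
)
example_score = keyword_score(
    example,
    positive=["retrieval", "generation"],
    negative=["survey"],
)
```

Because the rules are plain substring checks, each score can be traced back to the keywords that produced it, which fits the summary's description of the pre-score as a transparent starting point for Codex rather than the final ranking.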

Tech stack

Python · Codex · YAML · pytest

Getting it running

Difficulty · moderate · Time to first run · 30 min

You need a working OpenAI Codex CLI plus your own Paper_Reader.txt research profile, and the README warns about broken proxy variables in CI.

MIT license, so you can use, modify, and redistribute it freely, including in commercial work, as long as you keep the copyright notice.

In plain English

Codex Automated Paper Reader, or CAPR, is a small workflow for keeping up with new academic papers. Each day it pulls a list of recent papers from public sources, cleans the list up, and then hands it to Codex (OpenAI's coding agent) to read, score, and turn into a short daily report in Markdown. The author describes it as an early-stage research helper rather than a finished product.

The project is deliberately split into two layers. A Python script called paper-daily does the mechanical work: it fetches paper metadata from arXiv and OpenReview, with an optional source called OpenAlex turned off by default. It puts everything into a single JSON schema, removes duplicates using source IDs and normalized titles, and applies a transparent rule-based score to give Codex a starting point. It also checks the network before fetching so that proxy or connection errors are not mistaken for an empty result, and falls back to scraping arXiv's HTML page when the export API is rate limited.

The second layer is Codex itself. Once the candidate list is built, Codex reads the JSON file, scores each paper on things like method relevance, novelty, transferability and practicality, opens the paper page or PDF for the most promising ones, and writes the final report into a reports folder. The README is explicit that the recommendation step is done by Codex, not by template matching on keywords, so the user is expected to fill in their own research background and scoring preferences inside a local Paper_Reader.txt file that is ignored by Git.

Getting started involves cloning the repo, creating a Python 3.10 virtual environment, installing the requirements, and running daily_papers.py with a date and a fetch stage. The script writes raw and processed JSON files plus a log under data/ and logs/ folders. A flag in the raw file marks whether the day's candidate list is identical to the previous one, in which case Codex is told to write a short no-new-batch note instead of repeating yesterday's top ten.

The README also covers configuration in config.yaml (research profile, positive and negative keywords, arXiv categories, OpenReview venues, candidate limits), how to handle broken proxy variables in automation environments, a pytest test command, and a pointer to the MIT license. The project ships both Chinese and English prompt templates.
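The deduplication and identical-batch check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's implementation: the dictionary keys `source`, `id`, and `title`, the function names, and the use of a SHA-256 fingerprint to compare two days' candidate lists are all hypothetical choices made for illustration.

```python
import hashlib
import json
import re
import unicodedata


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


def dedupe(candidates):
    """Keep the first paper seen for each (source, id) pair and each normalized title."""
    seen_ids = set()
    seen_titles = set()
    unique = []
    for paper in candidates:
        source_id = (paper["source"], paper["id"])  # assumed field names
        title_key = normalize_title(paper["title"])
        if source_id in seen_ids or title_key in seen_titles:
            continue
        seen_ids.add(source_id)
        seen_titles.add(title_key)
        unique.append(paper)
    return unique


def batch_fingerprint(candidates):
    """Order-independent hash of a candidate list, used to compare two days."""
    keys = sorted(f'{paper["source"]}:{paper["id"]}' for paper in candidates)
    return hashlib.sha256(json.dumps(keys).encode("utf-8")).hexdigest()


raw = [
    {"source": "arxiv", "id": "2401.00001", "title": "Attention Is All You Need"},
    {"source": "openreview", "id": "abc", "title": "Attention is all you need!"},
    {"source": "arxiv", "id": "2401.00002", "title": "A Different Paper"},
    {"source": "arxiv", "id": "2401.00002", "title": "A Different Paper (v2)"},
]
today = dedupe(raw)
yesterday = list(reversed(today))
same_batch = batch_fingerprint(today) == batch_fingerprint(yesterday)
```

In this sketch the second entry is dropped because its normalized title matches the first, and the fourth because its source ID repeats the third. Sorting the IDs before hashing means a reordered but otherwise identical list still counts as the same batch, which is the condition under which the summary says Codex writes the short no-new-batch note.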

Copy-paste prompts

Prompt 1
Set up CAPR on Ubuntu with Python 3.10, install requirements, and run daily_papers.py for today with the fetch stage only.
Prompt 2
Write a Paper_Reader.txt that describes my research background in retrieval-augmented generation and lists my scoring priorities.
Prompt 3
Add a GitHub Actions workflow that runs CAPR every morning at 07:00 UTC and commits the report into reports/.
Prompt 4
Debug why the arXiv export API keeps returning empty results behind a corporate proxy, including how to use the HTML-scrape fallback.
Prompt 5
Extend the pipeline with a Semantic Scholar source while keeping the existing dedup-by-normalized-title logic intact.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.