explaingit

liumengxuan04/translate-paper-pdf-to-md

20 stars · Python · Audience: researcher · Complexity: 3/5 · Active · Setup: moderate

TLDR

A Codex skill that translates English academic PDFs into target-language Markdown while preserving sections, figures, tables, equations, and references.

Mindmap

mindmap
  root((translate-paper-pdf-to-md))
    Inputs
      English PDF paper
      Translation preferences
      Optional crop spec JSON
    Outputs
      Target language Markdown
      Cropped figure assets
      Validation report
    Use Cases
      Read foreign papers
      Localize research notes
      Re-edit translated drafts
    Tech Stack
      Python
      Codex
      pdftotext
      ImageMagick

Things people build with this

USE CASE 1

Translate an English research PDF into Chinese Markdown for study

USE CASE 2

Extract figures and tables from a paper into an assets folder

USE CASE 3

Validate that a translated Markdown paper has all images and citations

USE CASE 4

Add a Codex skill that asks for tone and terminology before translating

Tech stack

Python · Codex · pdftotext · pdftocairo · pdfimages · ImageMagick

Getting it running

Difficulty: moderate · Time to first run: 30 min

Requires Codex, plus the command-line tools pdfinfo, pdftotext, pdftocairo, pdfimages, and ImageMagick's convert, all available on PATH.
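A quick way to confirm the prerequisites before a first run is to look each tool up on PATH. The sketch below is illustrative only and is not part of the repository; the function name `find_missing_tools` is hypothetical, and the tool list is simply copied from the requirements above.

```python
import shutil

# Tool names copied from the prerequisites listed above.
REQUIRED_TOOLS = ["pdfinfo", "pdftotext", "pdftocairo", "pdfimages", "convert"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be located on PATH.

    `which` is injectable so the lookup can be replaced in tests;
    by default it is the standard library's shutil.which.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported missing, install the corresponding package for your platform (pdfinfo, pdftotext, pdftocairo, and pdfimages ship together in Poppler's utilities) and rerun the check.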

In plain English

This repository is a skill for Codex, OpenAI's coding agent, that helps you turn an English academic PDF paper into a Markdown document in another language, most often Chinese. The README is written in Chinese with an English version linked. The author is clear that this is not a one-click machine translation pipeline. It is meant for people who actually want to read, study, or re-edit a paper, so the workflow asks questions and produces an editable result rather than dumping raw translated text.

The skill keeps the structure of the original paper. It preserves section hierarchy, figure and table numbering, equation labels, citations, acknowledgements, and the reference list. Figures and complex tables that cannot be rebuilt as Markdown are cropped from the PDF pages and saved into an assets folder, while simpler tables and equations are rewritten as Markdown tables and LaTeX expressions. Before the actual translation starts, the skill asks for preferences: which target language or region, the paper's field, the tone and intended reader, terminology choices, and how to handle figures and tables.

Installation is a copy or symlink of the skill folder into the Codex skills directory. You then invoke it by name in a Codex prompt, for example asking it to translate a PDF at a given path into Chinese Markdown. Two helper Python scripts ship with the repository: one extracts text, layout, and page images from the PDF and can crop figures based on a JSON spec, and the other validates that the final Markdown has all expected images, references, figure numbers, table numbers, and equation tags.

The author gives a rough cost estimate of about 0.7 US dollars for a full pass on a 23-page paper, with the caveat that real costs vary by model, paper length, figure density, and retries. The Python scripts only need the standard library, but you are expected to have command-line tools like pdfinfo, pdftotext, pdftocairo, pdfimages, and ImageMagick's convert installed.
The repository has 20 stars and is written in Python.
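
To make the validation idea concrete, here is a minimal standard-library sketch of one check of that kind: finding Markdown image links whose target file does not exist. It is not the repository's validate_markdown_assets.py, whose actual checks and interface should be read from the repo. The function name `missing_images` and the regular expression are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches Markdown image syntax ![alt](target "optional title") and
# captures only the target path. Illustrative pattern, not the repo's own.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def missing_images(markdown_text, base_dir):
    """Return local image targets in `markdown_text` that are not files
    under `base_dir`. Remote http(s) links are skipped."""
    base = Path(base_dir)
    missing = []
    for target in IMAGE_LINK.findall(markdown_text):
        if target.startswith(("http://", "https://")):
            continue
        if not (base / target).is_file():
            missing.append(target)
    return missing
```

A typical call would pass the text of the translated Markdown file and the folder that contains it, for example `missing_images(Path("paper_zh.md").read_text(encoding="utf-8"), ".")`, where the file name is only an example.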

Copy-paste prompts

Prompt 1
Install the translate-paper-pdf-to-md skill into my Codex skills directory and show me how to invoke it on a PDF
Prompt 2
Use the translate-paper-pdf-to-md skill to convert paper.pdf into formal academic Chinese Markdown for a distributed systems audience
Prompt 3
Run extract_pdf_assets.py on this PDF with a crop spec for figures 1 to 4 and tables 1 to 2
Prompt 4
Validate paper_zh.md with validate_markdown_assets.py and list any missing image links or references
Prompt 5
Adapt this skill to target Japanese Markdown output while keeping the same figure cropping workflow

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.