oomol-lab/pdf-craft

★ 5,630 | Python | Audience · researcher | Complexity · 3/5 | Setup · hard

Mindmap

mindmap
  root((pdf-craft))
    What it does
      Scanned PDF to Markdown
      Scanned PDF to EPUB
      Local processing
    Features
      OCR via DeepSeek
      Tables and formulas
      Auto table of contents
      Footnote handling
    Requirements
      NVIDIA GPU with CUDA
      Poppler
      Python pip
    Use cases
      Book digitization
      Research document prep
      Offline conversion


Things people build with this

USE CASE 1

Convert a scanned PDF book into a clean Markdown file with tables and formulas preserved.

USE CASE 2

Turn a scanned PDF into an EPUB with an automatically generated table of contents for e-reader use.

USE CASE 3

Process old scanned documents offline without sending any files to an external server.

Tech stack

Python, CUDA, DeepSeek OCR, Poppler

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires an NVIDIA GPU with CUDA configured and Poppler installed; there is no CPU fallback.
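Because setup is the hard part, a quick preflight check can save time before installing anything. The sketch below only looks for executables on PATH: `pdftoppm` and `pdfinfo` ship with Poppler, and `nvidia-smi` ships with the NVIDIA driver. Whether pdf-craft calls these exact tools is an assumption, and `missing_tools` is a hypothetical helper, not part of the library.

```python
import shutil

# Executables this sketch looks for. pdftoppm and pdfinfo ship with Poppler;
# nvidia-smi ships with the NVIDIA driver. That pdf-craft relies on these
# exact tools is an assumption made for illustration.
REQUIRED_TOOLS = {
    "pdftoppm": "Poppler",
    "pdfinfo": "Poppler",
    "nvidia-smi": "NVIDIA driver",
}


def missing_tools(which=shutil.which):
    """Return {tool: provider} for every required tool not found on PATH."""
    return {
        tool: provider
        for tool, provider in REQUIRED_TOOLS.items()
        if which(tool) is None
    }


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        for tool, provider in missing.items():
            print(f"missing: {tool} (part of {provider})")
    else:
        print("Poppler and the NVIDIA driver tools were found on PATH")
```

Finding `nvidia-smi` shows that a driver is installed, not that CUDA is configured correctly for the OCR model, so treat a clean result as a necessary first step rather than proof that conversion will run.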

The source material does not clearly state the project's license.

In plain English

pdf-craft is a Python library for converting PDF files into Markdown or EPUB format. It is built specifically for scanned books, where pages are images rather than searchable text. The library uses a document recognition model called DeepSeek OCR to read text from those scanned pages, handling complex content such as tables, mathematical formulas, and footnotes. The conversion runs entirely on your own machine without sending files to any outside server.

To use it, you need a compatible NVIDIA graphics card with CUDA configured, plus Poppler, a tool for parsing PDF files. The library itself is installed via pip, the standard Python package installer.

During conversion, pdf-craft analyzes the document's layout: it pulls out the main body text while discarding headers, footers, and other repeated page elements. For EPUB output specifically, it builds a table of contents automatically. Footnotes, embedded images, and other assets attached to footnotes are carried through to the final file intact.

Starting with version 1.0.0, the library dropped the large language model it previously used for text correction. The older approach made network calls and introduced delays or occasional failures. The current version relies entirely on the local OCR model, so the process runs faster and works without an internet connection. Users who depended on the LLM correction step can still use the older v0.2.8 release.

An online demo lets anyone try the conversion workflow in a browser without installing anything locally. The Python API exposes functions for Markdown and EPUB conversion, each accepting optional parameters for DPI settings, image size limits, language, table rendering format, and formula display style.
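To make the idea of discarding repeated page elements concrete, here is a minimal sketch of one common heuristic: treat a first or last line as a running header or footer when the same text appears at the edge of most pages. This is not pdf-craft's actual method, which works on layout analysis of page images; `strip_repeated_lines` and its `min_fraction` threshold are hypothetical names chosen for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop first/last lines that recur on at least min_fraction of pages.

    pages is a list of pages, each page a list of text lines.
    Illustrative heuristic only; not how pdf-craft detects headers.
    """
    # With fewer than two pages nothing can be called "repeated".
    if len(pages) < 2:
        return [list(lines) for lines in pages]

    edge_counts = Counter()
    for lines in pages:
        # Count each distinct edge line once per page.
        edge_counts.update({line.strip() for line in lines[:1] + lines[-1:]})

    threshold = min_fraction * len(pages)
    repeated = {text for text, n in edge_counts.items() if text and n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and body[0].strip() in repeated:
            body = body[1:]
        if body and body[-1].strip() in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


if __name__ == "__main__":
    sample = [
        ["My Book", "Chapter one text.", "More text."],
        ["My Book", "Second page text."],
        ["My Book", "Third page text.", "The end."],
    ]
    for page in strip_repeated_lines(sample):
        print(page)
```

Exact-match counting catches a running title but misses page numbers, which change on every page; a real pipeline would also need position and pattern cues to handle those.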

Copy-paste prompts

Prompt 1

Using pdf-craft, write Python code to convert a scanned PDF book into EPUB format with an auto-generated table of contents.

Prompt 2

I want to convert a scanned academic paper with mathematical formulas into Markdown using pdf-craft. What parameters should I set for formula display and table format?

Prompt 3

My scanned PDF has headers and footers on every page that I want stripped out. How does pdf-craft handle those during conversion?

Prompt 4

Walk me through setting up pdf-craft from scratch on Linux with CUDA and Poppler, then converting my first PDF.

Prompt 5

What are the differences between pdf-craft v0.2.8 with LLM correction and v1.0.0 with local OCR only, and when would I prefer the older version?


Verify against the repo before relying on details.