Build a question-answering system over financial reports that understands context across multiple sections.
Create a legal document assistant that accurately retrieves relevant clauses from long contracts.
Develop a technical manual search tool that reasons through document structure to find precise answers.
PageIndex is a system for searching long documents, typically PDFs of professional reports, without using a vector database. Its README pitches it as "vectorless, reasoning-based RAG." RAG, short for "retrieval-augmented generation," is the common pattern in which an AI model is fed relevant snippets from a knowledge source before it answers. The usual approach is to chop documents into chunks, turn each chunk into a numeric vector, and look up the nearest matches.

PageIndex argues that nearest-vector matching measures similarity, not actual relevance, and that professional documents need reasoning instead. Inspired by AlphaGo, PageIndex first converts each document into a hierarchical "table of contents" tree, then has a large language model walk that tree the way a human expert would: opening the section that looks most likely to contain the answer, drilling down, and backtracking when a branch does not pan out. The README highlights four advantages over traditional RAG: no vector database, no chunking, human-like retrieval, and better explainability, because answers point back to specific pages and sections rather than to opaque embeddings. The README reports 98.7% accuracy on the FinanceBench benchmark.

The open-source repository lets you self-host with standard PDF parsing, and the README also describes a hosted cloud service with stronger OCR, a chat platform, and access via the Model Context Protocol or an HTTP API. Quick examples are provided, including an agentic demo using the OpenAI Agents SDK and a vision-based variant that reads page images directly. The primary language is Python. This summary is based on an excerpt of the README, not the full text.
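The tree-walking retrieval described above can be illustrated with a minimal sketch. This is not PageIndex's API or schema: the `Node` class, the `search` function, and the `keyword_score` and `mentions_total_debt` helpers are all hypothetical names invented for this example, and a simple word-overlap score stands in for the language model's judgment about which section to open next. The sketch only shows the shape of the idea: rank the children of the current section, descend into the most promising one, and backtrack if it does not contain the answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One entry in a hypothetical table-of-contents tree (not PageIndex's schema)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


def keyword_score(question: str, node: Node) -> float:
    """Stand-in for an LLM relevance judgment: count words shared with the question."""
    question_words = set(question.lower().split())
    node_words = set(f"{node.title} {node.summary}".lower().split())
    return float(len(question_words & node_words))


def search(
    node: Node,
    question: str,
    score: Callable[[str, Node], float],
    is_answer: Callable[[Node], bool],
    visited: List[str],
    path: Optional[List[Node]] = None,
) -> Optional[List[Node]]:
    """Depth-first walk that tries the best-scoring child first and backtracks on failure."""
    path = (path or []) + [node]
    visited.append(node.title)  # record every section opened, for explainability
    if not node.children:
        return path if is_answer(node) else None
    ranked = sorted(node.children, key=lambda c: score(question, c), reverse=True)
    for child in ranked:
        found = search(child, question, score, is_answer, visited, path)
        if found is not None:
            return found
    return None  # nothing under this section: the caller tries its next candidate


def build_demo_tree() -> Node:
    """A made-up annual report outline used only for illustration."""
    return Node("Annual Report", 1, 80, children=[
        Node("Risk Factors", 10, 20, "debt covenants and balance sheet exposure", [
            Node("Liquidity Risk", 12, 15, "qualitative discussion of liquidity"),
        ]),
        Node("Financial Statements", 40, 70, "audited statements", [
            Node("Balance Sheet", 41, 44, "assets, liabilities, and total debt figures"),
            Node("Cash Flow Statement", 45, 48, "operating and investing cash flows"),
        ]),
    ])


def mentions_total_debt(node: Node) -> bool:
    """Toy answer check; a real system would ask a model to read the pages."""
    return "total debt" in node.summary.lower()


visited: List[str] = []
result = search(build_demo_tree(), "total debt on the balance sheet",
                keyword_score, mentions_total_debt, visited)
if result is not None:
    leaf = result[-1]
    print(" > ".join(n.title for n in result), f"(pages {leaf.start_page}-{leaf.end_page})")
print("Sections opened:", visited)
```

In this toy outline the "Risk Factors" branch scores highest at first, fails the answer check, and is abandoned in favor of "Financial Statements". The returned path of section titles and page ranges is the kind of traceable result the explainability claim refers to, in contrast to a list of nearest-vector chunk IDs.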
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.