Upload a long research paper or contract and ask specific questions to get direct answers with page citations.
Run the app locally or via Docker to privately query PDFs without storing data in a third-party service.
Use it as a starting template for building a document Q&A system without LangChain or a vector database.
Requires a paid OpenAI API key to generate answers.
pdfGPT is a Python application that lets you ask questions about a PDF document and get answers generated by an AI model. You upload a PDF file or provide a URL to one, then type a question, and the application finds the most relevant sections of the document and sends them to OpenAI to generate a precise answer. The author claims it was one of the earliest open-source systems of this kind, first released in 2021, and argues it remains more accurate than many later alternatives because of its simple architecture.

The technical approach works like this: the application splits the PDF into small chunks of about 150 words each. It then generates a numerical representation (called an embedding) of each chunk using a deep learning encoder called the Universal Sentence Encoder. When you ask a question, the application generates an embedding of your question and uses a nearest-neighbor search to find the five chunks most similar to it. Those five chunks are inserted into a prompt sent to OpenAI, which generates the final answer. The responses can include page number citations in square brackets so you can locate the source in the original document.

One design choice that distinguishes this project from some alternatives is that it does not use a vector database or a third-party orchestration library like LangChain. The embeddings are saved to a file on disk and reloaded on subsequent queries.

The application supports OpenAI GPT models including GPT-3.5 Turbo and GPT-4. A Docker Compose file is included for running the application in a container. A live demo is hosted on Hugging Face Spaces. The project is MIT licensed and open to contributors, though the README notes that documentation has not been kept fully up to date.
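The retrieval steps described above (chunking into roughly 150 words, embedding each chunk, nearest-neighbor search, prompt assembly with page tags) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not pdfGPT's actual code: every function name here (`chunk_pages`, `toy_embed`, `top_chunks`, `build_prompt`) is hypothetical, the prompt wording is invented, and `toy_embed` is a hashed bag-of-words stand-in for the Universal Sentence Encoder so that the example runs without downloading a model.

```python
import math
import re
import zlib

CHUNK_WORDS = 150  # chunk size mentioned in the description
TOP_K = 5          # number of chunks retrieved per question
DIM = 1024         # size of the toy embedding (arbitrary, illustrative)


def chunk_pages(pages, chunk_words=CHUNK_WORDS):
    """Split each page into chunks of at most `chunk_words` words, tagged with its page number."""
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), chunk_words):
            piece = " ".join(words[start:start + chunk_words])
            chunks.append({"page": page_number, "text": piece})
    return chunks


def toy_embed(text, dim=DIM):
    """ASSUMPTION: stand-in for a real sentence encoder.

    Builds a hashed bag-of-words vector and L2-normalizes it, so the dot
    product of two embeddings equals their cosine similarity.
    """
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def top_chunks(question, chunks, embeddings, k=TOP_K):
    """Brute-force nearest-neighbor search: return the k chunks most similar to the question."""
    query = toy_embed(question)
    scored = [
        (sum(q * e for q, e in zip(query, emb)), index)
        for index, emb in enumerate(embeddings)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunks[index] for _, index in scored[:k]]


def build_prompt(question, selected):
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(f"[Page {c['page']}] {c['text']}" for c in selected)
    return (
        "Answer the question using only the search results below. "
        "Cite page numbers in square brackets.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Tiny made-up "document": one string per page.
pages = [
    "The lease term is twelve months beginning on the first of January.",
    "The tenant pays a security deposit equal to two months of rent.",
    "Either party may terminate the agreement with sixty days written notice.",
]
chunks = chunk_pages(pages)
embeddings = [toy_embed(c["text"]) for c in chunks]  # computed once per document

question = "What security deposit does the tenant pay?"
selected = top_chunks(question, chunks, embeddings, k=2)
prompt = build_prompt(question, selected)
print(prompt)
```

In a real pipeline the `toy_embed` function would be replaced by a proper sentence encoder, the `embeddings` list would be written to disk and reloaded for later questions, and `prompt` would be sent to an OpenAI model. Keeping the page number attached to each chunk is what lets the model cite pages in square brackets.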
Verify against the repo before relying on details.