adithya-s-k/omniparse

★ 6,817 Python Audience · developer Complexity · 4/5 Setup · hard

Mindmap

mindmap
  root((repo))
    Input Types
      Documents PDF Word
      Images PNG JPG
      Audio MP3 WAV
      Video MP4 MKV
      Web Pages
    Output
      Structured Markdown
    AI Models Used
      Surya OCR
      Florence-2 Vision
      Whisper Audio
    Interfaces
      REST API
      Gradio UI
    Deployment
      Local Server
      Docker


Things people build with this

USE CASE 1

Convert a folder of PDFs into clean markdown to feed as context to a language model.

USE CASE 2

Transcribe audio or video files to text using a local Whisper model with no data leaving your machine.

USE CASE 3

Parse a web page by URL into structured markdown for use in an AI summarization pipeline.

USE CASE 4

Use the Gradio interface to test file parsing interactively without writing any code.
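
Use case 1 can be sketched as a small batch loop. This is a minimal illustration rather than anything taken from the repo: `convert_folder` and the `parse_pdf` callable are hypothetical names, and `parse_pdf` stands in for whatever function sends a single PDF to your running OmniParse server and returns its markdown.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(source_dir, output_dir, parse_pdf: Callable[[Path], str]) -> List[Path]:
    """Write one .md file per PDF found in source_dir and return the written paths.

    parse_pdf is a hypothetical callable (an assumption, not part of OmniParse)
    that takes a PDF path and returns markdown text.
    """
    source, output = Path(source_dir), Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort so the processing order is stable between runs.
    for pdf_path in sorted(source.glob("*.pdf")):
        markdown = parse_pdf(pdf_path)
        target = output / (pdf_path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Passing the parser in as an argument keeps the loop independent of the server's actual endpoints, so it can be tried with a stub before any models are downloaded.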

Tech stack

Python, Gradio, Docker, Whisper, Florence-2

Getting it running

Difficulty · hard Time to first run · 1h+

Linux only because of specific dependencies. Several AI models must be downloaded on first run, and a GPU is recommended.

License not specified in the explanation.

In plain English

OmniParse is a tool that takes files of almost any type and converts them into clean, structured text that AI systems can use. If you are building an application on top of a language model and need to feed it content from PDFs, presentations, images, audio recordings, videos, or websites, OmniParse handles the conversion step. The output is formatted markdown, a simple text format that AI tools and many other programs understand well.

The tool runs entirely on your own machine, with no calls to outside services. It uses several AI models internally: an OCR model called Surya and a vision model called Florence-2 handle documents and images, while a model called Whisper handles audio and video transcription. These models are downloaded when you set up the server. The server runs only on Linux, a requirement attributed to specific dependencies.

You start OmniParse as a local server, then send files to it through API endpoints. For example, you can post a PDF to one endpoint and get back structured markdown, or post an audio file and get back a text transcript. There is also an endpoint for crawling and parsing a web page by URL. A simple graphical interface built with a library called Gradio is included for interactive use without writing any code.

Supported file types include Word documents, PDFs, PowerPoint files, common image formats (PNG, JPG, TIFF, HEIC), video formats (MP4, MKV, AVI, MOV), audio formats (MP3, WAV, AAC), and dynamic web pages. The README also mentions Docker as a deployment option for running the server inside a container.

The project is at an early stage, and the README notes that integrations with popular AI frameworks are coming soon. It uses a GPU if one is available, and the documentation notes that a modest GPU is sufficient.
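
The post-a-file flow described above might look roughly like the following sketch, which uses only the Python standard library. The host and port (`localhost:8000`), the endpoint path (`/parse_document`), and the form field name (`file`) are assumptions made for illustration, so check the repo's README for the real values. The helper names are hypothetical.

```python
import os
import uuid
import urllib.request

# Assumptions, not confirmed by this page: where the server listens,
# which path accepts documents, and the multipart field name "file".
BASE_URL = "http://localhost:8000"
DOCUMENT_ENDPOINT = "/parse_document"


def build_upload_request(endpoint, filename, payload, content_type="application/pdf"):
    """Build a multipart/form-data POST request without sending it."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=head + payload + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def parse_document(pdf_path):
    """Send a PDF to the assumed endpoint and return the raw response text."""
    with open(pdf_path, "rb") as handle:
        request = build_upload_request(
            DOCUMENT_ENDPOINT, os.path.basename(pdf_path), handle.read()
        )
    # Requires a running server; the response format is not specified here,
    # so the body is returned undecoded beyond UTF-8 text.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it possible to inspect the URL and body before the server is up. A library such as `requests` would shorten the upload code, but the standard library keeps the sketch dependency-free.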

Copy-paste prompts

Prompt 1

I want to use OmniParse to convert a PDF into markdown so I can feed it to a language model. Show me how to start the server and send a POST request to parse the PDF.

Prompt 2

I have a folder of MP4 lecture videos I want to transcribe to text using OmniParse. What endpoint do I call, what parameters do I pass, and what does the response look like?

Prompt 3

How do I deploy OmniParse using Docker on a Linux machine? Show me the docker run command and any required GPU flags.

Prompt 4

I want to crawl a website URL and get back structured markdown using OmniParse. Show me the API call I need to make.


Verify against the repo before relying on details.