drknght4/nextcloud-qdrant-pipeline

★ 16 | Python | Audience · ops devops | Complexity · 4/5 | License · MIT | Setup · hard

Mindmap

mindmap
  root((repo))
    What It Does
      PDF ingestion
      Semantic chunking
      Meaning-based search
    Pipeline Steps
      Split PDF into chunks
      Embed with bge-m3
      Store in Qdrant
    Tech Stack
      Python 3.11+
      Nextcloud WebDAV
      Qdrant vector database
      Infinity embedding server
    Features
      SHA256 deduplication
      Telegram notifications
      Background polling
    Use Cases
      Personal document search
      Self-hosted RAG

Things people build with this

USE CASE 1

Drop a PDF into your Nextcloud inbox and have it picked up on the next 30-second poll, then automatically chunked, embedded, and made searchable by meaning.

USE CASE 2

Ask a plain-language question against your personal document archive and get back the most relevant passages without knowing exact keywords.

USE CASE 3

Set up SHA256-based deduplication so re-uploading an unchanged PDF is skipped and only updated versions trigger a re-index.

USE CASE 4

Receive a Telegram notification each time a PDF is processed or fails so you know the background pipeline is healthy.
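
The deduplication rule from use case 3 can be sketched in a few lines. This is an illustration only, not the repository's code: the function names `sha256_of` and `decide_action`, and the in-memory `index` dict keyed by filename, are assumptions made for the example. Where the real pipeline stores its fingerprints, and how it recognizes two uploads as "the same file", should be checked against the repo.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA256 fingerprint of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def decide_action(filename: str, data: bytes, index: dict[str, str]) -> str:
    """Decide what to do with an incoming PDF.

    `index` is a hypothetical filename -> fingerprint map of files that
    were already ingested. Returns "skip", "reindex", or "ingest".
    """
    digest = sha256_of(data)
    known = index.get(filename)
    if known == digest:
        # Same name, same bytes: nothing changed, so do no work.
        return "skip"
    # Same name with different bytes means an updated version; the old
    # entries would be deleted before ingesting again. A name that was
    # never seen is a plain first-time ingest.
    action = "reindex" if known is not None else "ingest"
    index[filename] = digest
    return action


index: dict[str, str] = {}
print(decide_action("report.pdf", b"version 1", index))  # ingest
print(decide_action("report.pdf", b"version 1", index))  # skip
print(decide_action("report.pdf", b"version 2", index))  # reindex
```

Hashing the file contents rather than comparing timestamps means a file that is re-uploaded unchanged is recognized even if its modification time differs.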

Tech stack

Python, Nextcloud, Qdrant, bge-m3, WebDAV, Telegram

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires three self-hosted components running together: Nextcloud with WebDAV, an Infinity server running bge-m3, and a Qdrant database instance.

Use freely for any purpose including commercial use, as long as you keep the copyright notice.

In plain English

This project creates an automated pipeline for making your PDF documents searchable by meaning rather than by exact keywords. You drop a PDF into a specific folder in your Nextcloud storage (Nextcloud is self-hosted cloud storage, similar to Dropbox but run on your own server), and the pipeline picks it up, processes it, and stores it in a way that lets you later ask questions in plain language and get back the relevant passages.

The processing happens in three steps. First, the PDF is split into smaller text chunks, with the system trying to detect section headers and preserve the document structure rather than cutting blindly by character count. Second, each chunk is converted into a set of numbers that represent its meaning, using an AI embedding model called bge-m3. Third, those numerical representations are stored in Qdrant, a database built specifically for storing and searching this kind of data. Once the pipeline has run, you can query the collection with a question and get back the passages whose meaning is closest to what you asked.

The pipeline also handles re-uploads and updated files. Each PDF gets a fingerprint (a SHA256 hash) when it is first ingested. If you drop the same file again, the system detects the match and skips it. If you drop an updated version of the same file with a different fingerprint, the old entries are deleted and the new version is ingested. Successfully processed files are moved to a processed folder, and failed ones go to a separate failed folder. Telegram notifications report the outcome of each file.

The whole system runs as a background service that checks your Nextcloud inbox folder every 30 seconds. It requires Nextcloud with WebDAV enabled, an Infinity server running the bge-m3 embedding model, and a Qdrant instance. All three can be self-hosted. The project is written in Python 3.11 or newer and is licensed under MIT.
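
The first step, splitting on detected section headers instead of fixed character counts, can be illustrated with a small sketch. The heuristic below (a short line that is numbered like `2.1 Methods`, or written in all caps, counts as a header) and the names `is_header`, `chunk_by_sections`, and `max_chars` are assumptions chosen for the example; the repository's actual detection rules may differ.

```python
import re

# Hypothetical heuristic: numbered headings such as "2.1 Methods".
_NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def is_header(line: str) -> bool:
    """Guess whether a line is a section header."""
    text = line.strip()
    if not text or len(text) > 80 or text.endswith("."):
        return False
    if _NUMBERED.match(text):
        return True
    # Short lines whose letters are all uppercase, e.g. "INTRODUCTION".
    letters = [c for c in text if c.isalpha()]
    return len(letters) >= 3 and all(c.isupper() for c in letters)


def chunk_by_sections(text: str, max_chars: int = 1000) -> list[dict]:
    """Split text into chunks that never cross a detected header.

    Each chunk is {"section": header or None, "text": body}. Sections
    longer than max_chars are split further on line boundaries.
    """
    chunks: list[dict] = []
    section = None
    buf: list[str] = []
    size = 0

    def flush() -> None:
        # Emit the buffered lines as one chunk under the current section.
        nonlocal buf, size
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"section": section, "text": body})
        buf, size = [], 0

    for line in text.splitlines():
        if is_header(line):
            flush()  # close the previous section before starting a new one
            section = line.strip()
            continue
        if buf and size + len(line) > max_chars:
            flush()  # section is too long: start another chunk in it
        buf.append(line)
        size += len(line) + 1
    flush()
    return chunks
```

Keeping the header with each chunk means a retrieved passage can be shown with the section it came from, which a blind character-count split would lose.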

Copy-paste prompts

Prompt 1

I have Nextcloud with WebDAV enabled and a running Qdrant instance. Walk me through setting up nextcloud-qdrant-pipeline to monitor my inbox folder and ingest PDFs automatically.

Prompt 2

The pipeline uses bge-m3 via an Infinity server. How do I set up the Infinity server locally and point the pipeline configuration at it?

Prompt 3

My PDF ingestion failed and the file moved to the failed folder. What are the most common reasons a PDF fails in this pipeline and how do I debug it?

Prompt 4

I want to query my Qdrant collection from Python after PDFs have been ingested. Show me the query code using the same bge-m3 embedding model the pipeline uses.

Verify against the repo before relying on details.