alirezasalemi7/grepseek

★ 14 | Python | Audience · researcher | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((GrepSeek))
    Approach
      Direct Corpus Interaction
      Shell command generation
      No vector index needed
    Training
      Supervised fine-tuning
      GRPO reinforcement learning
      10000 search trajectories
    Tech Stack
      Qwen3.5 9B
      grep and ripgrep
      Python
    Results
      7 QA benchmarks
      7.6x faster search
      14 GB RAM only


Things people build with this

USE CASE 1

Reproduce the GrepSeek paper results on seven question-answering benchmarks without building a vector index.

USE CASE 2

Fine-tune a language model to search raw text corpora using shell commands instead of dense retrieval.

USE CASE 3

Run the parallel sharded search engine for up to 7.6x faster Wikipedia corpus search with byte-identical results.
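As a rough illustration of why a sharded search can return the same results as a single sequential scan, here is a minimal sketch in pure Python. It is not the repository's engine: the function names (`split_into_shards`, `search_shard`, `sharded_search`) are hypothetical, the corpus is an in-memory list of lines, and Python's `re` module stands in for grep or ripgrep. The key idea it shows is that contiguous shards searched in parallel and merged in shard order reproduce the sequential output exactly.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(lines, num_shards):
    """Split lines into contiguous shards, keeping each shard's starting offset."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division, at least 1
    return [(start, lines[start:start + size]) for start in range(0, len(lines), size)]


def search_shard(shard, pattern):
    """Return (global_line_index, line) for every matching line in one shard."""
    start, lines = shard
    regex = re.compile(pattern)
    return [(start + i, line) for i, line in enumerate(lines) if regex.search(line)]


def sharded_search(lines, pattern, num_shards=4):
    """Search all shards in parallel and merge the hits in corpus order."""
    shards = split_into_shards(lines, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, so concatenating them
        # preserves the original line order regardless of which shard finishes first.
        per_shard = list(pool.map(search_shard, shards, [pattern] * len(shards)))
    return [hit for hits in per_shard for hit in hits]


if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France.",
        "Berlin is in Germany.",
        "The Eiffel Tower is in Paris.",
    ]
    for index, line in sharded_search(corpus, r"Paris", num_shards=2):
        print(index, line)
```

Threads are used here only to keep the sketch short; because of Python's global interpreter lock they will not speed up regex matching much. A practical version would more likely launch one external search process per shard file, which is an assumption about the design rather than something this page confirms.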

Tech stack

Python · Qwen3.5 · ripgrep · GRPO · Jupyter · Apache

Getting it running

Difficulty · hard Time to first run · 1h+

Training requires GPU resources. The released Colab notebook lets you try inference without a local GPU.

Use freely for any purpose, including commercial use, as long as you include the license and copyright notice.

In plain English

GrepSeek is a research project that trains a compact AI model to answer factual questions by running shell search commands directly on a raw text corpus. Instead of building a vector database or a pre-computed search index, the model learns to write grep and ripgrep commands against a 14 GB Wikipedia corpus stored as plain text. The approach is called Direct Corpus Interaction. The project comes with code, training scripts, a trained model, and a dataset, all published alongside an academic paper.

The model is a 9 billion parameter language model from the Qwen3.5 family, fine-tuned in two stages. The first stage uses a dataset of 10,000 example search trajectories generated by a teacher model, which teach the agent how to break a question down into a sequence of shell commands. The second stage uses reinforcement learning (a method called GRPO), in which the model is rewarded for finding answers that match the correct text. Across seven question-answering benchmarks, the combined approach achieves the best average score, outperforming dense retrieval systems that require large vector indices.

One practical advantage is cost and simplicity. Setting up a dense vector index for the same Wikipedia corpus requires 70 GB of RAM or many hours of GPU processing. GrepSeek needs only the raw text and about 14 GB of RAM, with roughly one minute of setup. The repository also includes a sharded parallel search engine that runs corpus searches up to 7.6 times faster than plain grep while producing byte-identical results.

The codebase is split into folders for data generation, supervised fine-tuning, reinforcement learning training, and inference. A Jupyter notebook lets anyone try the released model on Google Colab without writing training code. The project is licensed under Apache 2.0.
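To make the reinforcement learning stage described above more concrete, here is a minimal sketch of an answer-match reward and the group-relative advantage that GRPO is generally built on. Everything in it is an assumption for illustration: the names `normalize`, `match_reward`, and `group_relative_advantages` are hypothetical, the normalization rules (lowercasing, dropping punctuation and articles) are a common question-answering convention rather than something this page confirms, and the repository's actual reward may differ.

```python
import re
import string
from statistics import mean, pstdev


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def match_reward(prediction, gold_answers):
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(gold) for gold in gold_answers) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled answers to the same question.

    Answers that beat the group average get a positive advantage and the rest
    get a negative one; eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    samples = ["The Eiffel Tower.", "Louvre", "eiffel tower", "Notre-Dame"]
    rewards = [match_reward(s, ["Eiffel Tower"]) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

The point of the group-relative step is that no separate value model is needed: each sampled answer is scored only against the other samples drawn for the same question.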

Copy-paste prompts

Prompt 1

I want to use the GrepSeek model from alirezasalemi7/grepseek to answer factual questions from a plain-text corpus. How do I load the released model in the Colab notebook and run inference on a custom question?

Prompt 2

Walk me through the two-stage training pipeline in GrepSeek: what does the supervised fine-tuning stage teach the model and how does the GRPO reinforcement learning stage change its behavior?

Prompt 3

I want to add a different text corpus to GrepSeek instead of Wikipedia. What format does the corpus need to be in and which part of the codebase handles setting up the sharded search engine?

Prompt 4

How does GrepSeek compare to dense retrieval in terms of RAM requirements and index setup time? I want to explain the practical tradeoff to my team before we decide which approach to use.


Verify against the repo before relying on details.