Train a query-side adapter that improves Hit Rate on a domain-specific RAG corpus
Generate synthetic question/chunk pairs from a corpus using Claude Haiku
Run an LLM-judge eval that scores RAG answers on six 1-5 metrics
Sweep retrieval hyperparameters with Optuna keyed on validation Hit Rate at 5
Requires a Claude API key for synthetic-question generation and a GPU-friendly PyTorch setup. Optuna sweeps take a while to converge.
Fine-tuned RAG is a small research project that tries to improve the retrieval step of a Retrieval-Augmented Generation (RAG) pipeline. In a standard RAG system, a question is embedded as a numeric vector and compared against the vectors of every document chunk in a database, and the closest matches by cosine similarity are returned. The author argues that this default treats every dimension of the vector as equally important, which adds noise on a focused corpus such as legal text, where only some dimensions actually help. The project tests this idea on a public legal dataset, isaacus/legal-rag-bench.
The author builds a small neural network that sits between the query and the database. It takes the raw query embedding and produces a transformed embedding in which useful dimensions are amplified and noisy ones are dampened. The database side stays unchanged, so the system can still use any standard embedding model and any standard vector store.
Training data comes from the corpus itself. For each chunk in the training set, Claude Haiku is asked to write up to five realistic legal questions that the chunk could answer, producing about 14,000 question/chunk pairs from around 4,800 chunks. The network is then trained with Multiple Negatives Ranking loss, which pulls each transformed question vector toward its correct chunk and pushes it away from all other chunks in the batch. A masking trick avoids penalising the model when several questions in a batch point at the same chunk. Optuna runs Bayesian hyperparameter search and keeps the checkpoint with the best validation Hit Rate at 5.
The README reports that on 100 held-out questions, the fine-tuned retriever beats the raw embedding baseline on Hit Rate at 20. Both pipelines are also run through the same LLM to write answers, and an LLM judge scores those answers on six metrics from 1 to 5. The fine-tuned version wins on every metric, with the biggest gains in completeness and faithfulness.
The code is split into three pipeline scripts: pipeline_generate.py builds the synthetic question and chunk caches, pipeline_train.py trains the network on those caches, and pipeline_eval.py compares retrieval quality and runs the LLM judge. Shared logic lives under src in model.py, train.py, retrieval.py, and generation.py, with all hyperparameters and paths centralised in config.py.
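The masked Multiple Negatives Ranking loss and the Hit Rate at k metric described above can be illustrated with a minimal sketch. This is not the repository's code: it is written in dependency-free Python for readability (the project itself uses PyTorch), and the function names, the chunk_ids argument, and the similarity scale of 20 are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def masked_mnr_loss(queries, chunks, chunk_ids, scale=20.0):
    """Mean in-batch cross-entropy with duplicate positives masked out.

    queries[i] is the (transformed) embedding of question i, chunks[i] is the
    embedding of its source chunk, and chunk_ids[i] identifies that chunk.
    A candidate j != i that shares question i's chunk id is skipped, so it is
    never treated as a negative. `scale` is an assumed temperature.
    """
    total = 0.0
    for i, query in enumerate(queries):
        positive_score = scale * cosine(query, chunks[i])
        scores = [positive_score]
        for j, chunk in enumerate(chunks):
            if j == i or chunk_ids[j] == chunk_ids[i]:
                continue  # the positive itself, or a duplicate of it
            scores.append(scale * cosine(query, chunk))
        # Numerically stable log-sum-exp over the surviving candidates.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - positive_score
    return total / len(queries)


def hit_rate_at_k(ranked_ids, gold_ids, k):
    """Fraction of questions whose gold chunk appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Two questions share chunk 7; a third question belongs to chunk 9.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chunks = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

masked = masked_mnr_loss(queries, chunks, chunk_ids=[7, 7, 9])
unmasked = masked_mnr_loss(queries, chunks, chunk_ids=[0, 1, 2])
rate = hit_rate_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "z"], k=2)
```

In the toy batch, treating the three rows as distinct chunks makes the first two questions compete against their own correct chunk, so the unmasked loss is noticeably higher than the masked one even though every question already matches its chunk perfectly. That false penalty is the effect the masking trick is described as removing.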
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.