explaingit

p-e-w/heretic

📈 Trending | 20,576 | Python | Audience · developer | Complexity · 3/5 | Active | License | Setup · hard

TLDR

Python tool that automatically removes safety restrictions from language models using directional ablation and parameter optimization, without manual retraining.

Mindmap

mindmap
  root((repo))
    What it does
      Removes safety alignment
      Optimizes model parameters
      Supports quantization
    How it works
      Directional ablation
      KL divergence minimization
      Optuna optimizer
    Supported models
      Dense transformers
      Multimodal models
      MoE architectures
    Use cases
      Research on model behavior
      Custom model variants
      Benchmark testing
    Getting started
      pip install heretic-llm
      Point at HF model ID
      Run optimization

Things people build with this

USE CASE 1

Research how language models respond to safety constraints and what happens when they're removed.

USE CASE 2

Create custom versions of open-source models with different safety behaviors for specific use cases.

USE CASE 3

Benchmark and test model capabilities before and after safety alignment modifications.

Tech stack

Python, PyTorch, Transformers, Optuna, bitsandbytes, Hugging Face

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires GPU/CUDA, large model downloads, and complex PyTorch/bitsandbytes setup for parameter optimization.
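As a rough illustration of why quantization matters for the GPU requirement, the sketch below estimates the memory needed for model weights alone at different precisions. The 8-billion-parameter model size and the function name `weight_memory_gb` are assumptions made for this example. The figures leave out activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical 8-billion-parameter model; not a figure from the README.
NUM_PARAMS = 8e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 16 GB at 16-bit precision, 8 GB at 8-bit, and 4 GB at 4-bit, which is why a quantized run can fit on a smaller GPU.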

Use it freely, but if you run it as a network service, you must release your changes to users. Strongest copyleft for SaaS.

In plain English

Heretic is a command-line tool that automatically removes built-in refusal behavior from large language models. The README calls this behavior censorship or safety alignment: the training that makes a model decline certain prompts. Heretic edits the model so it stops refusing while trying to preserve the rest of its capabilities.

It does this with directional ablation, also known as abliteration, which identifies internal directions in the model that correspond to refusal behavior and removes them. Heretic wraps that technique in an automatic parameter optimizer built on Optuna's TPE search, so it can find good settings on its own. The search jointly minimizes two numbers: how often the model refuses harmful prompts, and the KL divergence from the original model on harmless prompts (a measure of how far the outputs have shifted). The result is a decensored version that stays close to the original.

Someone would use Heretic to publish or experiment with an uncensored variant of an open-weights model without doing the interpretability work themselves. The README notes the community has already published over 3000 models produced with it. The tool can also save the result, upload it to Hugging Face, let you chat with it, or run standard benchmarks.

It is written in Python and needs a Python 3.10+ environment with PyTorch 2.2+, with optional bitsandbytes quantization to reduce VRAM use. It supports most dense transformer models, along with several mixture-of-experts and multimodal architectures, though not pure state-space models. The full README is longer than the portion this summary was generated from.
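To make the two core ideas concrete, here is a minimal pure-Python sketch of directional ablation and the KL divergence measure on toy data. It is not Heretic's code. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, and it applies the projection to a single activation vector, whereas a real tool would operate on the model's tensors. All function names and numbers are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: normalized difference of mean activations."""
    dim = len(harmful_acts[0])
    mean_harmful = [sum(a[i] for a in harmful_acts) / len(harmful_acts) for i in range(dim)]
    mean_harmless = [sum(a[i] for a in harmless_acts) / len(harmless_acts) for i in range(dim)]
    return normalize([h - b for h, b in zip(mean_harmful, mean_harmless)])


def ablate(vector, direction, weight=1.0):
    """Subtract `weight` times the component of `vector` along the unit `direction`."""
    dot = sum(v * d for v, d in zip(vector, direction))
    return [v - weight * dot * d for v, d in zip(vector, direction)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy activations: the two prompt groups differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], direction)

# After full ablation the vector has no component left along the direction.
residual = sum(a * d for a, d in zip(ablated, direction))
print("direction:", direction)
print("ablated:", ablated)
print("residual component:", residual)

# KL divergence is zero for identical distributions and grows as they drift apart.
print("KL, unchanged:", kl_divergence([0.5, 0.5], [0.5, 0.5]))
print("KL, shifted:", kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

The `weight` argument stands in for the kind of setting an optimizer could tune: a larger value removes more of the refusal direction, while the KL term penalizes drifting too far from the original model's outputs.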

Copy-paste prompts

Prompt 1
I want to use Heretic to remove safety restrictions from a Llama 2 model. Walk me through the installation and basic command to get started.
Prompt 2
How do I use Heretic to optimize a multimodal model and save the result to Hugging Face Hub?
Prompt 3
Show me how to run Heretic with quantization enabled to reduce VRAM usage on a smaller GPU.
Prompt 4
What does directional ablation do in Heretic, and how does it preserve the model's original capabilities?

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.