Analysis updated 2026-05-18
Research how language models respond to safety constraints and what happens when they're removed.
Create custom versions of open-source models with different safety behaviors for specific use cases.
Benchmark and test model capabilities before and after safety alignment modifications.
| | p-e-w/heretic | othmanadi/planning-with-files | netbox-community/netbox |
|---|---|---|---|
| Stars | 20,576 | 20,504 | 20,438 |
| Language | Python | Python | Python |
| Setup difficulty | hard | easy | hard |
| Complexity | 3/5 | 2/5 | 4/5 |
| Audience | developer | vibe coder | ops devops |
Figures from each repo's GitHub metadata at analysis time.
Requires GPU/CUDA, large model downloads, and complex PyTorch/bitsandbytes setup for parameter optimization.
Heretic is a command-line tool that removes built-in refusals (what the README calls censorship, or safety alignment) from large language models, the kind that power chatbots. Most modern open-weight models have been trained to refuse certain requests. Heretic alters the model's internal weights so those refusals stop happening, without the expensive process of further training the model on new data.

The underlying technique is directional ablation, also known as abliteration, based on published research by Arditi et al. and Lai. The novel part is that Heretic finds the right abliteration parameters automatically, using a TPE-based hyperparameter optimizer powered by Optuna. The optimizer minimizes two things at once: how often the model refuses a set of harmful prompts, and the KL divergence (a statistical measure of how much a probability distribution has changed) from the original model on harmless prompts. The goal is a model that stops refusing but keeps as much of its original intelligence as possible. According to the README's benchmark table, Heretic's Gemma-3 12B result matches manual abliterations on refusal suppression while showing much lower KL divergence.

To use it, prepare a Python 3.10-or-newer environment with PyTorch 2.2 or newer, run pip install heretic-llm, and point the heretic command at a model name. The whole process is unsupervised, so you do not need to understand transformer internals. Heretic benchmarks your hardware at startup to pick a good batch size, and the README says that decensoring an 8-billion-parameter model takes about 45 minutes on an RTX 3090. Memory use can be cut with bitsandbytes 4-bit quantization. When the run finishes, you can save the model, upload it to Hugging Face, chat with it, or run benchmarks. An optional research extra adds interpretability features. Heretic supports most dense transformer models, several mixture-of-experts variants, and some hybrid architectures.
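The two quantities in that description can be illustrated with a toy sketch. This is not Heretic's code: it assumes the refusal direction is taken as the normalized difference of mean activations between harmful and harmless prompts, and that ablation subtracts a weight matrix's component along that direction. The function names, the `scale` parameter, and the toy numbers are all hypothetical, chosen only to show the idea in plain Python.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed recipe)."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(weight, direction, scale=1.0):
    """Return W - scale * r (r^T W) for a unit vector r.

    weight is a d_out x d_in matrix stored as a list of rows. With scale=1.0
    the layer's output has no component left along r. The scale knob is a
    hypothetical stand-in for the kind of parameter an optimizer could tune.
    """
    d_out, d_in = len(weight), len(weight[0])
    # coeffs[j] is the component of column j of W along r.
    coeffs = [sum(direction[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - scale * direction[i] * coeffs[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy activations: only the first coordinate separates the two prompt sets.
harmful = [[2.0, 1.0], [4.0, 1.0]]
harmless = [[0.0, 1.0], [0.0, 1.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0]]
W_ablated = ablate(W, r)

# Made-up next-token distributions before and after a modification.
drift = kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

In this toy setup the first row of `W_ablated` is zeroed out, because that row is the only one that writes along `r`. An automated search of the kind described above would try different ablation settings and score each one by the refusal count together with a drift value like `drift`, preferring settings that keep both low.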
Python tool that automatically removes safety restrictions from language models using directional ablation and parameter optimization, without manual retraining.
Mainly Python. The stack also includes PyTorch and Transformers.
Use it freely, but if you run it as a network service, you must release your changes to that service's users. This is the strongest form of copyleft for SaaS.
Setup difficulty is rated hard, with roughly 1h+ to a first successful run.
Mainly developer.
Verify against the repo before relying on details.