Heretic: Fully automatic censorship removal for language models

Heretic is a tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024), with a TPE-based parameter optimizer powered by Optuna.
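At its core, directional ablation identifies a "refusal direction" in the model's residual stream (in Arditi et al. 2024, as a difference of mean activations on harmful versus harmless prompts) and then orthogonalizes the weight matrices that write to the residual stream against that direction. The sketch below illustrates this basic operation in PyTorch; it is a simplified rendering of the published technique, not Heretic's actual implementation, which layers additional tunable parameters on top:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-of-means "refusal direction": harmful_acts and
    # harmless_acts are [n_prompts, d_model] residual-stream activations
    # captured at a chosen layer and token position.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector r_hat

def ablate_direction(weight: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Orthogonalize a weight matrix that writes to the residual stream
    # (e.g. an attention output projection or MLP down-projection,
    # shape [d_model, d_in]) against r_hat:
    #     W <- (I - r_hat r_hat^T) W
    # Afterwards, this layer can no longer write along the refusal direction.
    return weight - torch.outer(r_hat, r_hat) @ weight
```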

This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. This results in a decensored model that retains as much of the original model's intelligence as possible. Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models.
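To give a feel for what such an optimization loop looks like, here is a toy sketch of TPE-based search with Optuna. The parameter range, the scalarized cost, and the two stand-in evaluation functions are invented for illustration only; Heretic's real search space and objective are more elaborate:

```python
import optuna

# Hypothetical stand-ins for Heretic's evaluators, invented for this
# illustration: in the real tool, each trial would abliterate the model
# with the trial's parameters and measure refusal counts and KL
# divergence on held-out prompt sets.
def count_refusals(ablation_weight: float) -> float:
    return max(0.0, 97.0 * (1.0 - ablation_weight))  # stronger ablation, fewer refusals

def kl_from_original(ablation_weight: float) -> float:
    return ablation_weight ** 2  # stronger ablation, more damage

def objective(trial: optuna.Trial) -> float:
    w = trial.suggest_float("ablation_weight", 0.0, 1.5)
    # Co-minimize both quantities by scalarizing them into one cost;
    # the 100x weight on KL divergence is an arbitrary example value.
    return count_refusals(w) + 100.0 * kl_from_original(w)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)
```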

Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts:

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
|---|---|---|
| google/gemma-3-12b-it (original) | 97/100 | 0 (by definition) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic (ours) | 3/100 | 0.16 |

The Heretic version, generated without any human effort, achieves the same level of refusal suppression as other abliterations, but at a much lower KL divergence, indicating less damage to the original model's capabilities. (You can reproduce those numbers using Heretic's built-in evaluation functionality, e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`. Note that the exact values might be platform- and hardware-dependent. The table above was compiled using PyTorch 2.8 on an RTX 5090.)
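The KL divergence metric compares the decensored model's next-token distributions against the original model's on harmless prompts; lower values mean less behavioral drift. A minimal sketch of the standard computation (the exact prompt set and token positions Heretic evaluates are not specified here):

```python
import torch
import torch.nn.functional as F

def mean_kl(orig_logits: torch.Tensor, new_logits: torch.Tensor) -> float:
    # orig_logits / new_logits: [n_tokens, vocab_size] logits from the
    # original and decensored models on the same harmless prompts.
    orig_logp = F.log_softmax(orig_logits, dim=-1)
    new_logp = F.log_softmax(new_logits, dim=-1)
    # KL(P_orig || P_new) = sum_v P_orig(v) * (log P_orig(v) - log P_new(v))
    kl = (orig_logp.exp() * (orig_logp - new_logp)).sum(dim=-1)
    return kl.mean().item()
```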

Heretic supports most dense models, including many multimodal models, and several different MoE architectures. It does not yet support SSMs/hybrid models, models with inhomogeneous layers, or certain novel attention systems.

You can find a collection of models that have been decensored using Heretic on Hugging Face.

Usage

Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate for your hardware. Then run:

…