Tech News
← Back to articles

A tool that removes censorship from open-weight LLMs

read original related products more articles

title emoji colorFrom colorTo sdk sdk_version app_file persistent_storage pinned license tags short_description OBLITERATUS 💥 green gray gradio 5.29.0 app.py large true agpl-3.0 abliteration mechanistic-interpretability One-click model liberation + chat playground

O B L I T E R A T U S

Break the chains. Free the mind. Keep the brain.

Try it now on HuggingFace Spaces — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.

OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.

But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.

The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.

OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.

We built this because we believe model behavior should be decided by the people who deploy them, not locked in at training time. Refusal mechanisms are blunt instruments — they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.

Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:

... continue reading