
Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs

Why This Matters

This research highlights the risk that finetuning large language models can activate verbatim recall of copyrighted texts, raising concerns about intellectual property and the ethical use of such material in AI development. It underscores the need for careful handling of copyrighted content to prevent unauthorized reproduction and to ensure responsible AI deployment.

Key Takeaways

Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

The paper is now on arXiv; check out our demo!

This repository contains the data preprocessing pipeline, finetuning scripts, memorization evaluation code, and analysis scripts for our paper.

We provide partial example files in data/ containing a small subset of excerpts and generations from The Road by Cormac McCarthy. Full book content and model generations are not included because the books are copyrighted and the generations contain large portions of verbatim text.
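Memorization evaluations of this kind typically compare a model generation against the source excerpt and measure the longest verbatim overlap. The repository's own evaluation code is not shown here, so the following is a minimal sketch of that idea using Python's standard-library `difflib`; the `longest_verbatim_overlap` function and the toy strings are illustrative assumptions, not actual book text or the paper's metric.

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(source: str, generation: str) -> str:
    """Return the longest contiguous substring shared by source and generation.

    This is a toy stand-in for a memorization metric: a long shared span
    suggests the generation reproduces the source verbatim.
    """
    m = SequenceMatcher(None, source, generation, autojunk=False)
    match = m.find_longest_match(0, len(source), 0, len(generation))
    return source[match.a : match.a + match.size]

# Hypothetical excerpt and model generation (not actual copyrighted text).
excerpt = "the man looked at the boy and the boy looked back at him"
generation = "then the boy looked back at him and said nothing"

overlap = longest_verbatim_overlap(excerpt, generation)
print(repr(overlap.strip()))  # → 'the boy looked back at him'
```

A real evaluation would run this (or an n-gram-based variant) over many excerpt/generation pairs and report the distribution of overlap lengths.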

Setup

We use uv for dependency management. Install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create a virtual environment and install all dependencies:

uv venv --python 3.11
source .venv/bin/activate
uv pip install html2text natsort ftfy openai tqdm nltk numpy
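After installing, it can be worth verifying that every dependency imports cleanly before running the pipeline. This helper is not part of the repository; it is a small sanity-check sketch that mirrors the package list from the install command above.

```python
import importlib

def missing_packages(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Mirrors the `uv pip install` line above.
deps = ["html2text", "natsort", "ftfy", "openai", "tqdm", "nltk", "numpy"]
print("missing:", missing_packages(deps))
```

An empty `missing` list means the environment is ready for the preprocessing and evaluation scripts.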

For Gemini finetuning and generation, also install:

... continue reading