Exploring the internal representations of Pangram 3.3.2

Since ChatGPT’s debut in 2022, AI-assisted writing has expanded at a staggering pace. Because AI-generated text now appears across so much of what we read, it has become obvious that some forms of writing lose their value when produced by a machine. In academia, essays are meant to cultivate student reasoning. In the marketplace, product reviews are valuable because they reflect the experiences of other people.

Pangram is a research company that builds state-of-the-art AI detection models to address this problem. Our flagship product is an AI text detection model with industry-leading low false positive rates, multilingual capabilities, and the ability to distinguish fully AI-generated text from AI-assisted text.

Since publishing our first whitepaper in 2024, we’ve had a front-row seat to wave after wave of AI advancements. Our researchers have wrestled with overly strict content filters, seen their fair share of mode collapse, and dodged waves of em-dashes and the word “delve”. (On mode collapse in language models, our researchers particularly recommend Gwern’s article on the topic.)

Our flagship model is an LLM fine-tuned for this sequence classification task: given a passage of text, it predicts whether the text was written by a human or by AI. We do not use custom metrics like perplexity or burstiness, and we do not do any manual feature extraction. We do have a customer-facing product called AI Phrases, which shows users the phrases that appear more frequently in AI text, but these phrases are not directly used as features for the model. After a while, one gets curious. What does the model see?

For us as researchers, this question matters. We have strong incentives to prevent shortcut learning, fix unintended model behavior, and understand the problem deeply. In this post, we outline our initial interpretability efforts using document-level analysis.