In 2025, 61.8 million images and videos of suspected child sexual abuse were reported to authorities. No human can review that volume. This is the story of how machines do it — and why the hardest part isn't what you think.
I. 61.8 Million Files. One Year.

Child Sexual Abuse Material — known as CSAM — is any image or video depicting the sexual abuse or exploitation of a child. It is not abstract. Every file is a record of a real crime against a real child. And when that file is shared, the child is victimized again.

In 2025, the National Center for Missing & Exploited Children (NCMEC) received 21.3 million reports of suspected child sexual exploitation. Those reports contained 61.8 million images, videos, and files. If a human analyst spent one second on each file, it would take nearly two years of unbroken work. No sleep. No meals. Just file after file.

Over 1.5 million of those reports involved generative AI. Some of this material depicts entirely fictional children. But a growing share is generated using the likenesses of real, identifiable children — children who have never suffered contact abuse, but who are now victims nonetheless. And all of it — real or synthetic — floods into the same investigation pipeline, where human analysts must treat every image as potentially depicting a real child in danger.

No human can review this volume. No one should have to.

- 226M+ total reports since the CyberTipline was created
- 21.3M reports in 2025 alone
- 61.8M files — images, videos, and documents
- 1.5M reports with an AI-generated nexus

Can a machine recognize an abusive image without understanding what it's looking at? The answer depends on whether the material has been seen before.

Source: NCMEC 2025 Impact Snapshot
II. This Is Bigger Than You Think

Child safety is a much bigger problem than CSAM detection alone. It includes online grooming (1.4 million enticement reports in 2025 — up 156%), sextortion, child sex trafficking, victim identification, and survivor support. Each is a discipline with its own tools and challenges. This essay focuses on one critical piece: how machines detect abuse material at internet scale. This is the engineering foundation. Without detection, there is nothing to report, nothing to block, no one to rescue. But before we look at any technology, there is something more fundamental to understand.

Privacy, dignity, and the constraint that shapes everything

Every person who uploads a photo online — a family picture, a selfie, a sunset — has a right to privacy. Detection systems scan billions of these images. The overwhelming majority are completely innocent. The system must protect children without violating everyone else's privacy in the process. This is the design constraint that shapes every technology in this essay.

Perceptual hashing — originally developed for content identification and duplicate detection — turns out to have a remarkable property: no human or machine ever needs to see the original image. The image is reduced to a short numerical fingerprint — a string of bits that cannot be reversed back into the image. Only the fingerprint is compared. Only the fingerprint travels through the system.

An analogy: imagine airport security could check your luggage by scanning its weight and density profile, without ever opening it. If the profile matches a known threat signature, the bag is flagged. If it doesn't match — which is 99.99% of the time — no one ever looks inside. Your privacy is preserved not because the system trusts you, but because the system was designed to never need to look.

This is why perceptual hashing became the foundation of CSAM detection.
It is the only approach where the privacy of billions of innocent users and the safety of exploited children are not in fundamental tension. The system works because it doesn't look.
III. The Image You've Seen vs. The One You Haven't

There are two fundamentally different kinds of abuse material, and they require different technologies to detect. Understanding this distinction is the most important concept in child safety engineering.

Known CSAM. Material that has been seen before. Reported, verified by a human expert, and fingerprinted. A database of digital mugshots. Detection: perceptual hashing. Compute a fingerprint of the new image, compare it against the database. If it matches — the image is known CSAM.
  ✓ Strengths: Very low false positive rate (matching against verified material, not classifying). Runs at massive scale. No one views the content.
  ✗ Limitations: Completely blind to material that has never been seen before. False positives can still occur with collision-prone images (blank/solid color images whose hashes happen to be close to database entries).

Unknown CSAM. Material that has never been seen. Newly produced abuse. AI-generated content. First-time uploads. Nothing in any database matches it. Detection: machine learning classifiers. A model trained to recognize what abuse looks like — not by matching, but by learning patterns.
  ✓ Strengths: Can detect genuinely new material, including AI-generated content.
  ✗ Limitations: Higher false positive rate. More compute. Harder privacy and policy questions.

Most real-world systems combine both. Hashing catches the known material cheaply. Classifiers catch the unknown at higher cost. Together, they form a layered defense.
IV. 256 Bits That Describe Any Image

A photograph is made of millions of colored dots — pixels. A computer sees it as a giant grid of numbers: the exact red, green, and blue intensity of every single dot. A typical 12-megapixel photo is about 36 million numbers.

Now imagine you could describe that photo using just 256 ones and zeroes. Not 36 million numbers. Two hundred and fifty-six. These 256 bits wouldn't tell you the color of any specific pixel. Instead, they'd capture something deeper: the image's structure. The overall pattern of light and dark. Where the big shapes are. Whether the top half is brighter than the bottom. Whether there are strong horizontal lines or mostly smooth gradients.

That's what a perceptual hash is. It's a compact description of what an image looks like to a human eye — its visual essence — stripped of every detail that doesn't matter for recognition.

And here's the critical property: if you resize the photo, those 256 bits barely change. Add a watermark? Barely change. Screenshot it, re-save it as JPEG at lower quality, crop the edges slightly? The bits stay almost the same. Because the visual essence — the arrangement of light, dark, shapes, and structure — survives these transformations.

This is the opposite of how cryptographic hashes work. A cryptographic hash like SHA-256 is designed so that changing a single pixel produces a completely different output — the "avalanche effect." A perceptual hash is designed so that visually similar images produce similar outputs. Two images that look the same to your eye will have hashes that are nearly identical, even if every single pixel value is technically different.

Key Insight

This means we can compare these short descriptions against a database of known abuse material without ever storing, transmitting, or viewing the actual images. Only the 256 bits — 32 bytes — travel through the system. The image itself never leaves the device where it was scanned.
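The contrast between the two hashing philosophies is easy to demonstrate. Below is a toy sketch: a tiny 8×8 brightness grid and an average hash (the aHash mentioned later in this chapter, far simpler than PDQ), purely to illustrate the avalanche effect versus perceptual stability. Nothing here is a production algorithm.

```python
import hashlib

def ahash(pixels):
    """Toy perceptual hash (aHash-style): threshold each pixel against the mean.
    `pixels` is a flat list of brightness values for a tiny 8x8 image."""
    mean = sum(pixels) / len(pixels)
    return ''.join('1' if p > mean else '0' for p in pixels)

def sha256_hex(pixels):
    """Cryptographic hash of the raw pixel bytes."""
    return hashlib.sha256(bytes(pixels)).hexdigest()

def hamming(a, b):
    """Count the bit positions where two equal-length bit strings disagree."""
    return sum(x != y for x, y in zip(a, b))

# A toy 8x8 image: a bright band on a dark background.
original = [200 if 16 <= i < 48 else 50 for i in range(64)]
# The "same" image with mild sensor noise: every pixel nudged by +/-3.
noisy = [p + (3 if i % 2 else -3) for i, p in enumerate(original)]

print(sha256_hex(original) == sha256_hex(noisy))  # False: avalanche effect
print(hamming(ahash(original), ahash(noisy)))     # 0: visual essence unchanged
```

The cryptographic hash scrambles completely on a three-unit brightness nudge; the perceptual hash does not move at all, because no pixel crossed the mean.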
Try it yourself

Drag the slider to add noise. Watch two completely different behaviors: the cryptographic hash (SHA-256) changes completely with any modification — that's by design. The perceptual hash — computed using a real DCT algorithm right here in your browser — stays stable until the image is damaged beyond recognition. This is the property that makes CSAM detection possible.

[Interactive: Original → With Noise · Noise: 0% · SHA-256: (changes completely) · PDQ: (stays stable)]

The algorithms

Several perceptual hashing algorithms exist, each making different tradeoffs between hash size, robustness, and computational cost. The industry standard for CSAM detection is PhotoDNA (Microsoft, 2009) — it produces a 144-byte hash using gradient analysis and is deployed at Facebook, Google, Twitter, and most major platforms. But it's proprietary: you need a license from the Technology Coalition to use it. The most widely deployed open-source alternative is PDQ (Meta, 2019) — a 32-byte hash based on frequency analysis. Others include pHash, aHash, and dHash — simpler algorithms with higher collision rates that are useful for general duplicate detection but not suitable for safety-critical applications where false positives have serious consequences.

PhotoDNA · Microsoft · 2009 · 144 bytes (1,152 bits) · Images · Proprietary (Tech Coalition) · Industry standard since 2009. Deployed at nearly every major platform.
PDQ · Meta · 2019 · 32 bytes (256 bits) · Images · Open source (MIT + patents) · Deep-dived below. Also the frame-level hash used inside TMK for video.

How PDQ actually works

PDQ takes any image — a photo, a screenshot, a scan, any resolution, any format — and reduces it to exactly 256 binary decisions. Not an approximation. Not a summary. A precise sequence of yes-or-no answers about the image's visual frequency content. Here's the complete pipeline.

1. Strip the color

Color is noise for recognition. A red car and a blue car are the same shape.
PDQ converts every image to luminance-only (grayscale) using the standard perceptual weighting: 29.9% red, 58.7% green, 11.4% blue — the same formula your TV uses, because human eyes are most sensitive to green light. The image is now a single grid of brightness values.

2. Blur and shrink to 64×64

This is the robustness step. PDQ applies a Jarosz filter — a fast approximation of Gaussian blur — to wash out fine details: individual pixels, compression artifacts, watermark text, sensor noise. Then it downsamples to exactly 64 × 64 pixels. Think of it as looking at the photo through frosted glass from across the room. Only the structure survives — the big shapes, the broad regions of light and dark. This is why resizing, JPEG compression, and minor cropping barely affect the final hash: those operations destroy details that PDQ has already thrown away.

3. Decompose into frequencies (DCT)

This is the heart of the algorithm. Instead of asking "what does each pixel look like?", PDQ asks "what patterns are present in this image?" It applies a 2D Discrete Cosine Transform (DCT) — the same mathematical transform used inside JPEG compression. The DCT decomposes the 64×64 grid into 4,096 frequency components, arranged from low to high. Low-frequency components describe the broad shapes: the left side is darker than the right. High-frequency components describe fine textures: there's a tiny stripe pattern in the corner. It's exactly like listening to a musical chord and decomposing it into individual notes — the bass notes are the overall shape, the treble notes are the fine detail.

PDQ keeps only the lowest 16×16 = 256 frequency components. Everything else — the fine textures, the noise, the pixel-level detail — is discarded. This is why PDQ is robust: it literally cannot see the details that change between versions of the same image.

4. Threshold into 256 bits

Take the 256 low-frequency DCT values. Compute their median (the middle value when sorted).
For each coefficient, ask one question: is it above or below the median? Above → 1. Below → 0. That's the entire fingerprint. 256 binary decisions. 32 bytes.

The median threshold is crucial: it means the hash is relative, not absolute. Two images with different overall brightness but the same pattern of light and dark will produce the same hash — because the median shifts with brightness, but the relative ordering of coefficients stays the same.

[Animation: PDQ pipeline: Original → Grayscale → Blur & Shrink → DCT → 256-bit hash]
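The last two steps can be sketched in a few dozen lines. This is a simplified pHash-style reconstruction, not the real PDQ implementation (it omits the Jarosz filter, the downsampling, and PDQ's quality metric), and it assumes the 64×64 grayscale grid has already been prepared:

```python
import math
from statistics import median

N, K = 64, 16  # downsampled image size; low-frequency block kept (16x16 = 256 bits)

# Precomputed DCT-II cosine basis: COS[u][x] = cos((2x+1) * u * pi / (2N))
COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
       for u in range(N)]

def dct_1d(vec):
    """Unnormalized DCT-II of a length-N vector (normalization doesn't
    affect the median threshold, so we skip it)."""
    return [sum(COS[u][x] * vec[x] for x in range(N)) for u in range(N)]

def dct_hash(gray):
    """gray: N x N grid of brightness values. Returns a 256-char bit string."""
    rows = [dct_1d(row) for row in gray]                               # DCT along rows
    cols = [dct_1d([rows[x][v] for x in range(N)]) for v in range(N)]  # then columns
    # cols[v][u] is the 2D coefficient at frequency (u, v);
    # keep only the K x K lowest-frequency block.
    coeffs = [cols[v][u] for u in range(K) for v in range(K)]
    m = median(coeffs)                                                 # relative threshold
    return ''.join('1' if c > m else '0' for c in coeffs)

# A synthetic textured 64x64 image, and the same image uniformly brightened.
image = [[(73 * x + 151 * y + 13 * x * y) % 251 for y in range(N)] for x in range(N)]
brighter = [[v + 20 for v in row] for row in image]

# Uniform brightness shifts only the DC coefficient; the median threshold
# absorbs it, so the fingerprint is unchanged.
print(dct_hash(image) == dct_hash(brighter))  # True
```

This demonstrates the "relative, not absolute" property directly: adding a constant to every pixel changes only the (0,0) frequency component, and because that component stays above the median either way, all 256 bits survive.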
[Interactive: the 256-bit fingerprint (gold = 1, dark = 0); every ~900ms, one random bit flips.] This is the complete fingerprint — 32 bytes that capture the essence of the image.

Comparing two fingerprints

You have two 256-bit hashes. Line them up, position by position. Count the bits where they disagree. That count is the Hamming distance — a number from 0 (identical) to 256 (every bit different). If the two images are versions of the same photo — resized, watermarked, re-compressed — only a handful of bits will differ: maybe 3, 8, or 12. If they're completely unrelated images, probability theory predicts that about half the bits will differ: ~128 out of 256 (since each bit is essentially a coin flip relative to another random image). Set a threshold — PDQ's recommended default is 31. Below the threshold? Match. Above? Different image.

A single comparison is just an XOR of two 32-byte values followed by a popcount (count the 1-bits in the result). On modern CPUs with SIMD instructions, this takes roughly 10 nanoseconds — fast enough to compare against millions of known hashes in under a second.

Distance, bit by bit

Two fingerprints, aligned. Each column is one bit position. Gold highlights show where the two hashes disagree. The total count of disagreements is the Hamming distance — the single number that decides match or no match.

[Interactive: Original vs. Modified · Modification level · Hamming distance: 0 / 64. Gold = the two hashes disagree at this bit position. The more noise you add, the more bits flip — but notice how the hash is remarkably stable for moderate changes.]

The threshold decides everything

Every image in a database sits at some Hamming distance from your query. The threshold draws a line: everything to the left is called a match. Drag it and watch what happens to the verdicts.

[Interactive: Threshold: 31 bits. Green bars = known CSAM variants (should be flagged). Gray bars = innocent images (should not).]
Move the threshold and watch the tradeoff: too tight misses real matches, too loose creates false positives.

Five modifications, five distances

The same image, altered five different ways. Small visual changes (resize, watermark, compression) produce small Hamming distances — they stay well inside the match zone. Destructive changes (heavy crop) or unrelated images land far outside. The threshold cleanly separates them.

- Resized: 3 bits → MATCH
- Watermarked: 8 bits → MATCH
- Compressed: 12 bits → MATCH
- Heavy crop: 47 bits → NO MATCH
- Different image: 134 bits → NO MATCH

(Threshold: 31 bits)
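In code, the XOR-plus-popcount comparison is a one-liner. A sketch using Python's arbitrary-precision integers to stand in for the 32-byte hashes (the hash values below are made up for illustration; real systems run SIMD popcounts over raw buffers):

```python
def hamming_distance(h1: int, h2: int) -> int:
    """XOR the two 256-bit hashes, then count the 1-bits (popcount)."""
    return bin(h1 ^ h2).count('1')

THRESHOLD = 31  # PDQ's recommended default match distance

def verdict(distance: int) -> str:
    return "MATCH" if distance <= THRESHOLD else "NO MATCH"

# A hypothetical known hash, and two derived comparisons.
known = 0xABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789
resized = known ^ (1 << 5) ^ (1 << 77) ^ (1 << 130)  # exactly 3 bits flipped
far = known ^ int('1' * 130, 2)                      # 130 low bits flipped

print(hamming_distance(known, resized))          # 3
print(verdict(hamming_distance(known, resized))) # MATCH
print(verdict(hamming_distance(known, far)))     # NO MATCH
```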
V. Video: When Time Becomes a Dimension

Images are hard. Video is exponentially harder. A 10-minute video at 30 frames per second is 18,000 individual pictures. You can't hash the file itself — re-encoding to a different codec, re-compressing at a lower bitrate, or changing the container from MP4 to MKV changes every byte of the file while leaving the visual content identical. You need an algorithm that understands what the video looks like over time — and that's invariant to how the video is encoded.

The most widely deployed open-source solution is TMK+PDQF (Temporal Match Kernel, Meta AI Research). The "+PDQF" means it uses PDQ as its per-frame hash — the same algorithm from Chapter IV. Other approaches include vPDQ (which stores individual per-frame PDQ hashes with timestamps — better for finding clips inside longer videos) and proprietary systems like Videntifier (used by NCMEC and Interpol for victim identification). We'll focus on TMK, because its design reveals a deep insight about how to compress time itself into a fixed-size fingerprint.

TMK+PDQF · Approach: averages per-frame PDQ hashes (Level 1) + Fourier decomposition of temporal signals (Level 2) · Best for: full-video matching at scale · Limitation: cannot match clips or excerpts from longer videos
vPDQ · Approach: stores individual PDQ hashes per frame with timestamps · Best for: clip and segment matching via subsequence search · Limitation: variable-size output (proportional to video length), harder to index

How TMK turns a video into a fixed-size fingerprint

Every frame of a video gets hashed with PDQ, producing a 256-bit vector per frame. A 10-minute video at 30fps becomes 18,000 hash vectors — a sequence of 256-dimensional binary points unfolding over time. TMK's job is to compress that entire sequence into a fixed-size descriptor that is the same size regardless of whether the video is 10 seconds or 2 hours. It does this in two levels, each capturing a fundamentally different kind of information.
Think of it like music

If someone asks you "what does this song sound like?", you might answer two ways:

"It uses piano, bass, and drums." This describes the overall tonal palette — what instruments are present across the whole song. But two completely different jazz songs might both use piano, bass, and drums. This description alone can't tell them apart.

"It goes da-da-DUM... da-da-DUM... then builds to a crescendo." This describes the temporal structure — the rhythm, the phrasing, how the energy rises and falls. Now you can distinguish the songs, even if they use the same instruments.

TMK works exactly this way. Level 1 captures what the video looks like on average — the "tonal palette." Level 2 captures how the visual content evolves over time — the "rhythm and melody." Together, they form a complete fingerprint.

Level 1: The visual average (128 bytes)

Remember: each frame has a 256-bit PDQ hash — 256 values that are either 0 or 1. TMK treats each of those 256 bit positions as a separate measurement. For bit position #1, it looks at the value across all 18,000 frames and computes the average. Then it does the same for bit #2, bit #3, all the way to bit #256. The result is 256 averaged values — each one a number between 0.0 and 1.0 — packed into 128 bytes.

Think of it as describing a movie by blending every frame into one ghostly composite. Titanic would be a blue-ish blur (lots of ocean and sky across 3 hours). The Matrix would be a dark green-ish haze (dark interiors, green-tinted digital rain). The composite is blurry and abstract, but it captures the video's overall visual character.

Comparing two Level 1 descriptors is trivial: compute the cosine similarity — a dot product of two 256-element vectors, normalized by their magnitudes. That's a few hundred multiply-adds; with SIMD, on the order of tens of nanoseconds per candidate. You can scan a Level 1 query against millions of known videos in under 100 milliseconds. Level 1 eliminates 99%+ of candidates instantly.
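Level 1 can be sketched directly from the description above. The toy below uses 8-bit "frame hashes" instead of 256-bit PDQ hashes and a plain Python dot product instead of SIMD; it illustrates the per-bit averaging and cosine comparison, not Meta's actual implementation (which also quantizes the averages to pack them into 128 bytes):

```python
import math

def level1(frame_hashes):
    """Average each bit position across all frames of the video."""
    n = len(frame_hashes)
    return [sum(frame[i] for frame in frame_hashes) / n
            for i in range(len(frame_hashes[0]))]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# A toy "video": 10 frames, each an 8-bit hash, alternating two scenes.
scene_a, scene_b = [1, 0, 0, 1, 0, 1, 1, 0], [0, 1, 1, 0, 1, 0, 0, 1]
video = [scene_a] * 5 + [scene_b] * 5

# Simulate re-encoding noise: one frame's hash loses a single bit.
reencoded = [list(f) for f in video]
reencoded[0][0] = 0

print(cosine_similarity(level1(video), level1(reencoded)) > 0.99)  # True: still a match
```

One flipped bit in one frame barely moves the 10-frame averages, so the cosine similarity stays near 1.0.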
But it has a blind spot: a video played backward has the exact same Level 1 descriptor as the original. The average of all frames is the same regardless of the order. Two videos with identical scenes in a different sequence? Same Level 1. That's why we need Level 2.

[Filmstrip: key frames extracted from the video]

Level 2: Temporal structure via Fourier analysis (15,360 bytes)

Level 1 throws away all information about when things happen. Consider: Video A shows a dark room for 5 minutes, then a bright outdoor scene for 5 minutes. Video B shows the bright scene first, then the dark room. Identical Level 1 averages. Completely different videos.

Level 2 captures temporal structure by applying Fourier analysis — the same mathematical tool that decomposes a musical chord into individual notes. Here's how it works, step by step.

Take bit position #1 of the PDQ hash across all 18,000 frames. You get a time series: a sequence of 0s and 1s that fluctuates as the video plays. This time series has a rhythm — maybe it flickers rapidly during action sequences and stays steady during dialogue. Now do the same for bit #2, bit #3, all the way through bit #256. You now have 256 separate time series, each one a different "channel" that tracks a different visual property over the life of the video.

For each of these 256 time series, TMK asks: how much does this signal correlate with a cosine wave at frequency f? And: how much does it correlate with a sine wave at frequency f? It asks this for multiple frequencies — typically around 30 different periods. The correlation with cosine captures symmetric temporal patterns; the correlation with sine captures asymmetric ones (like a gradual brightening vs. a gradual darkening).

A key detail: the time axis is normalized. Whether the video is 30 seconds or 3 hours, the temporal coordinates are mapped to a 0-to-1 range before the Fourier analysis.
This is why TMK is naturally invariant to frame rate differences — a video at 24fps and the same video at 30fps will produce nearly identical Level 2 descriptors, because the temporal patterns are the same when measured in relative time.

The result: for each of 256 bit positions × ~30 frequency pairs (cosine + sine) = roughly 15,360 bytes of temporal fingerprint. This is the video's "melody" — its complete temporal signature. Two videos with the same scenes in a different order will have completely different Level 2 descriptors.

In practice, Level 2 is only computed for the handful of candidates that survive Level 1 pre-filtering. You scan millions of videos in milliseconds using Level 1's cheap dot product, then confirm the ~50 survivors with Level 2's richer comparison.

[Figure: Fourier basis functions. TMK correlates each bit's time series against cosine waves at different frequencies: high frequency (fast visual changes), mid frequency (scene transitions), low frequency (overall arc). For each of the 256 bit-position time series, TMK measures how strongly it correlates with cosine (and sine) waves at ~30 different frequencies.]
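Level 2's correlation step can be sketched the same way. The toy below uses one-bit "frame hashes" and three frequencies instead of 256 bits and ~30 periods, and it skips TMK's weighting and quantization; it exists only to show that reordering scenes, which Level 1 cannot see, changes the Fourier correlations:

```python
import math

def level2(frame_hashes, periods=(1, 2, 3)):
    """Correlate each bit position's 0/1 time series with cosine and sine
    waves at a few frequencies, over a normalized 0..1 time axis."""
    n = len(frame_hashes)
    desc = []
    for i in range(len(frame_hashes[0])):
        series = [frame[i] for frame in frame_hashes]
        for p in periods:
            # t / n maps frame index onto a normalized timeline
            c = sum(v * math.cos(2 * math.pi * p * t / n)
                    for t, v in enumerate(series)) / n
            s = sum(v * math.sin(2 * math.pi * p * t / n)
                    for t, v in enumerate(series)) / n
            desc.append((c, s))
    return desc

# Video A: dark scene then bright scene. Video B: the same scenes, reversed.
video_a = [[0]] * 10 + [[1]] * 10   # one-bit "hashes" for clarity
video_b = [[1]] * 10 + [[0]] * 10

# Level 1 (the per-bit average) is identical: 0.5 for both.
print(sum(f[0] for f in video_a) == sum(f[0] for f in video_b))  # True
# Level 2 tells them apart: the sine correlations change sign.
print(level2(video_a) == level2(video_b))                        # False
```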
High correlations at fast frequencies mean rapid visual changes; high correlations at slow frequencies mean gradual shifts over the video's duration.

What TMK cannot do: clips and excerpts

TMK matches whole videos: the same video re-encoded, re-compressed, watermarked, resolution-changed, or frame-rate-converted. It does not match clips — a 30-second excerpt cut from a 2-hour video will not match the original. Why? Both levels break down. Level 1 averages all frames — so a 30-second clip has a completely different average than the 2-hour video it came from (the clip sees only one scene; the full video's average is diluted by hours of other content). Level 2 also fails: the Fourier analysis is computed over the entire normalized timeline, so the temporal frequencies of a clip describe a completely different temporal structure than the full video. A scene that occurs at the 5% mark of a 2-hour film occurs at the 100% mark of a 1-minute clip — the Fourier coefficients won't match.

For clip matching, the industry uses vPDQ — which stores the individual per-frame PDQ hashes with their timestamps rather than compressing them into a fixed-size descriptor. Finding a clip becomes a subsequence search: look for a consecutive run of matching frame hashes inside the longer video's hash sequence. It's like searching for a sentence inside a book, word by word, instead of comparing the book's summary to the sentence's summary.

In practice, the vast majority of CSAM video redistribution involves uploading the whole file re-encoded — which TMK catches reliably. Clip-based evasion is a known gap. Real-world systems deploy both TMK (for whole-video matching at massive scale) and vPDQ (for clip detection where the variable-size index is manageable).
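The subsequence search that vPDQ-style clip matching performs can be sketched as a sliding-window scan. Everything below is illustrative: the "frame hashes" are pseudorandom 256-bit stand-ins generated with SHA-256 (used only as a source of stable random bits, not as a perceptual hash), and the matching logic is a naive loop rather than a real index:

```python
import hashlib

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count('1')

def find_clip(full_video, clip, per_frame_threshold=31):
    """Slide the clip along the full video's frame-hash sequence and return
    the first offset where every clip frame matches the aligned video frame."""
    for start in range(len(full_video) - len(clip) + 1):
        if all(hamming(full_video[start + k], clip[k]) <= per_frame_threshold
               for k in range(len(clip))):
            return start
    return None

def fake_frame_hash(i: int) -> int:
    """Stand-in for a per-frame PDQ hash: a stable pseudorandom 256-bit value."""
    return int.from_bytes(hashlib.sha256(f"frame-{i}".encode()).digest(), "big")

full_video = [fake_frame_hash(i) for i in range(12)]         # the "long" original
# A clip: frames 5..7, each with 2 bits of simulated re-encoding noise.
clip = [h ^ (1 << 3) ^ (1 << 200) for h in full_video[5:8]]

print(find_clip(full_video, clip))  # 5
```

Note the cost: the clip is compared at every offset, and the index grows with video length; that is exactly the "variable-size output, harder to index" limitation named above.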
VI. What If It's Never Been Seen Before?

Everything we've discussed so far — PDQ, PhotoDNA, TMK — works like a police lineup. You have a suspect (the uploaded image). You have a book of mugshots (the NCMEC database). You flip through the book, page by page, asking: is this one a match? But what if the suspect isn't in the book? What if the image has never been seen before?

In 2025, NCMEC received 1.5 million reports involving generative AI. Some depict entirely fictional children. Others use the faces and likenesses of real, identifiable children to generate abuse imagery — making those children victims even when no contact abuse occurred. In either case, no fingerprint exists in any database. Perceptual hashing is completely blind to material it has never seen before.

But AI-generated CSAM causes a second, less obvious harm that may be even more damaging. Victim identification specialists — the people whose job is to look at abuse imagery and determine whether a real child needs to be rescued — now cannot reliably distinguish AI-generated images from photographs of real abuse. Every AI image that enters the pipeline must be investigated as if it depicts a real child. The result: investigation capacity that should be spent rescuing children in active abuse situations is consumed by synthetic images. The pipeline doesn't just receive more volume. It receives volume that actively degrades its ability to find real victims.

Detecting unknown CSAM — whether newly produced or AI-generated — requires a fundamentally different approach.

From matching to judging

Instead of comparing against a book of mugshots, a classifier is more like training a detective. You show the detective thousands of examples: this is CSAM, this is not, this is, this isn't. Over time, the detective learns to recognize patterns — not specific images, but the characteristics of abuse material.
Now you can show the detective an image they've never seen before, and they'll give you a probability: "I'm 94% sure this is CSAM." That's what a machine learning classifier does.

This is fundamentally harder than hashing. A hash comparison is binary: the fingerprint matches a known entry, or it doesn't. A classifier is making a judgment. And judgments can be wrong — in both directions.

Approaches

Image classifiers. Neural networks trained to directly classify images. Google's Content Safety API is the best-known example.
  ✓ Pros: Can detect never-before-seen material, including AI-generated CSAM.
  ✗ Cons: Higher false positive rate. Privacy implications of classification.

Pre-filter pipelines. A chain of lightweight models: Is this a person? → Is there nudity? → Does this appear to be a minor? Only survivors are escalated.
  ✓ Pros: Reduces volume dramatically. Each stage is cheap.
  ✗ Cons: Each stage can miss. Age estimation is unreliable.

Hybrid defense. Layer 1: Perceptual hashing (fast, cheap, catches known). Layer 2: Classifier (catches unknown). Layer 3: Human review for confirmed matches. Each layer catches what the previous missed.
  ✓ Pros: Best overall coverage. Industry-recommended approach.
  ✗ Cons: Most complex to implement and operate.
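The hybrid defense can be sketched as a short decision function. Everything here is schematic: the hash values, the thresholds, and especially `stub_classifier` (which returns a canned score) are hypothetical placeholders. A real Layer 2 runs a neural network on pixels, not on labels, and nothing is actioned without the human review of Layer 3:

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count('1')

def stub_classifier(pixels) -> float:
    """Placeholder for a real classifier. Returns a canned probability."""
    return 0.95 if pixels == "novel-material" else 0.05

def hybrid_check(image, known_hashes, hash_threshold=31, classifier_threshold=0.9):
    # Layer 1: perceptual hash match against the verified database (cheap, low FP).
    for known in known_hashes:
        if hamming(image["hash"], known) <= hash_threshold:
            return ("flag_for_human_review", "hash_match")
    # Layer 2: classifier for never-before-seen material (costly, higher FP).
    if stub_classifier(image["pixels"]) >= classifier_threshold:
        return ("flag_for_human_review", "classifier")
    # Layer 3 is implicit: flagged items go to humans; nothing is auto-actioned.
    return ("allow", None)

DB = [0b1011 << 100]  # one hypothetical verified hash in the database

resized_known = {"hash": (0b1011 << 100) ^ (1 << 2), "pixels": "known-material"}
never_seen    = {"hash": int("1" * 128, 2),          "pixels": "novel-material"}
family_photo  = {"hash": int("10" * 128, 2),         "pixels": "vacation"}

print(hybrid_check(resized_known, DB))  # ('flag_for_human_review', 'hash_match')
print(hybrid_check(never_seen, DB))     # ('flag_for_human_review', 'classifier')
print(hybrid_check(family_photo, DB))   # ('allow', None)
```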
VII. Why 99.99% of the Work Is Proving Nothing Matches

You have a fingerprint. The database has millions. Now what? Most explanations stop at the hash. But the search is where the real engineering challenge lives.

The instinct is to treat this as a nearest-neighbor problem: find the closest match. But that's the wrong framing. Nearest-neighbor search asks: "What's closest?" You can use clever data structures, stop early, prune branches. CSAM detection asks the opposite: "Is anything close enough?" And when the answer is no — which is 99.99% of the time — you must have checked every single entry. There is no shortcut for proving absence. This asymmetry is the heart of the engineering challenge.

The threshold defines everything

[Interactive: each point is a known fingerprint. Drag the slider to change the threshold radius around your query. Distance Threshold: 30 · Flagged: 0 · Missed: 0]

Two different questions. Nearest neighbor: "Find the closest point." Stops early. Threshold proof: "Prove nothing is close enough." Must check everything.

The brute-force scan

Watch a query scanned against the full database. 99.99% of the work is confirming what isn't there.

[Interactive: Checked: 0 / 500]

At scale, this happens thousands of times per second. Each comparison: nanoseconds. But proving absence still means visiting every point.

The hardest question in child safety

You've now seen the threshold slider. You know that moving it changes who gets flagged. This is where the engineering meets the ethics.

Every false positive means an innocent person's content was flagged — a family photo, a medical image, a piece of art. It means unnecessary investigation, potential harm to reputation, and erosion of trust in the system. At scale, even a 0.01% false positive rate means thousands of wrongful flags per day.

Every false negative means a child's abuse was missed. An image that should have been caught slipped through. That child continues to be victimized every time the image is viewed, shared, or traded.
The distribution never ends. There is no threshold that eliminates both. Setting the threshold is a moral decision, not just a technical one.

In practice, the industry errs heavily toward minimizing false negatives — catching every possible match — and then uses human review to resolve false positives. This means the system flags aggressively but confirms carefully. The cost of a false positive is an investigation. The cost of a false negative is a child.

This is also why the hybrid approach from Chapter VI matters. Perceptual hashing against a verified database has a low false positive rate — but not zero. Certain images (blank, solid-color, simple gradients) produce hashes that collide with database entries by coincidence, not because they depict abuse. Production systems include collision detection to filter these out before matching. Classifiers for unknown material have a higher false positive rate still (the model is making a judgment, not a comparison). By layering them — hashing first, then classifiers, then human review — the system can be both aggressive and precise. But no layer is perfect, and the threshold remains a human decision.
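The asymmetry this chapter describes is visible in even the simplest implementation: a match can be collected along the way, but a clean "no match" verdict requires visiting every entry. A toy sketch (pseudorandom SHA-256-derived stand-ins for the hashes again, and a linear scan rather than the SIMD-vectorized scans production systems use):

```python
import hashlib

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count('1')

def fake_hash(s: str) -> int:
    """Pseudorandom 256-bit stand-in for a perceptual hash."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

def threshold_scan(query, database, threshold=31):
    """Answer "is anything within `threshold` bits?". Proving absence has
    no early exit: every entry must be checked."""
    matches, checked = [], 0
    for entry in database:
        checked += 1
        if hamming(query, entry) <= threshold:
            matches.append(entry)
    return matches, checked

database = [fake_hash(f"known-{i}") for i in range(500)]

# An innocent upload: no match, and proving that took all 500 comparisons.
print(threshold_scan(fake_hash("family-photo"), database)[1])  # 500

# A re-shared known image with 3 bits of noise: found, but the scan still
# visits every entry, because there could be more than one match.
noisy = database[42] ^ (1 << 7) ^ (1 << 70) ^ (1 << 199)
matches, checked = threshold_scan(noisy, database)
print(len(matches), checked)  # 1 500
```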