
Recreating Epstein PDFs from raw encoded attachments


There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators, to censoring pictures of random women in a way that makes them look guiltier than they really are, to forgetting to redact credentials that let all of Reddit log into Epstein’s account and trample over the evidence, to the sheer ineptitude that left most of the latest batch corrupted by incorrectly converted Quoted-Printable encoding artifacts, it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this (admittedly gargantuan) undertaking. But the most damning evidence has all been thoroughly redacted… hasn’t it? Well, maybe not.

I had been thinking of writing an article on the mangled quoted-printable encoding the day this latest dump came out, in response to all the misinformed musings and conjecture littering social media (my dilly-dallying cost me, as someone beat me to the punch). While searching through the latest archives for some SMTP headers to use in that article, I came across a curious artifact: not only had the emails been badly transcoded into plain text, but some binary attachments were actually included in the dumps in their over-the-wire Content-Transfer-Encoding: base64 form, and the unlucky intern assigned to the documents in question didn’t realize the significance of what they were looking at and saw no point in censoring seemingly meaningless page after page of encoded text!
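For context, a base64-encoded attachment in a raw email looks something like this. (This is just the generic MIME shape with the attachment’s filename slotted in, not a verbatim excerpt of the actual headers from the dump.)

Content-Type: application/pdf; name="DBC12 One Page Invite with Reply.pdf"
Content-Disposition: attachment; filename="DBC12 One Page Invite with Reply.pdf"
Content-Transfer-Encoding: base64

JVBERi0...

The encoded lines run 76 characters each, page after page, until the closing MIME boundary; the JVBERi0 prefix is simply what the leading %PDF- bytes of any PDF file look like once they have been run through base64.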

Just take a look at EFTA00400459, an email from correspondence between (presumably) one of Epstein’s assistants and Epstein lackey/co-conspirator Boris Nikolic and his friend, Sam Jaradeh, inviting them to a ████████ benefit:

Those characters go on for 76 pages, and represent the file DBC12 One Page Invite with Reply.pdf encoded as base64 so that it can be included in the email without breaking the SMTP protocol. Converting it back to the original PDF is, theoretically, as easy as copy-and-pasting those 76 pages into a text editor, stripping the leading > quote markers, and piping the result into base64 -d > output.pdf … or it would be, if we had the original (badly converted) email and not a partially redacted scan of a printout of said email with some shoddy OCR applied.
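In other words, the happy path would be a single pipeline along these lines (a sketch only, using fish’s string helpers; attachment.txt is a stand-in for wherever you pasted those 76 pages):

# strip the "> " quote prefixes, glue the lines back together, decode
cat attachment.txt \
    | string trim -c " >" \
    | string join "" \
    | base64 -d > output.pdf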

If you tried to actually copy that text as digitized by the DoJ from the PDF into a text editor, here’s what you’d see:

You can ignore the EFTA00400459 on the second line; that (or some variant thereof) will be interspersed into the base64 text since it’s stamped at the bottom of every page to identify the piece of evidence it came from. But what else do you notice? Here’s a hint: this is what proper base64 looks like:

Notice how in this sample everything lines up perfectly at the right margin (when using a monospaced font)? And how that’s not the case when we copied-and-pasted from the OCR’d PDF? That’s because it wasn’t a great OCR job: extra characters have been hallucinated into the output, some of them not even legal base64 characters (commas and square brackets, for instance), while other characters have been omitted altogether, giving us content we can’t use:1

> pbpaste \
    | string match -rv 'EFTA' \
    | string trim -c " >" \
    | string join "" \
    | base64 -d >/dev/null
base64: invalid input
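A cheap way to see where things went wrong (a rough sketch, assuming the attachment uses the usual 76-character MIME line length): once the EFTA stamps and quote prefixes are stripped, flag every line that isn’t exactly 76 legal base64 characters. Hallucinated or dropped characters show up as lines of the wrong length or lines containing an illegal symbol; the genuine final line of the attachment is shorter than 76 characters, so it gets flagged too and just needs a manual glance.

# print, with line numbers, every line that isn't exactly 76 legal base64 characters
pbpaste \
    | string match -rv 'EFTA' \
    | string trim -c " >" \
    | grep -nvE '^[A-Za-z0-9+/=]{76}$'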

I tried the easiest alternative I had at hand: I loaded the PDF into Adobe Acrobat Pro and re-ran its OCR on the document, but got even worse results, with spaces injected into the middle of the base64 content (easily fixable) on top of other characters being completely misread and butchered; it really didn’t like the cramped monospace text at all. So I turned to doing it manually with tesseract, which, while far from state-of-the-art, can still be useful here because it lets you do things like restrict its output to a given subset of characters, constraining the space of valid results and hopefully coercing it into making fewer mistakes.
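Roughly, that looks like the following (page-01.png and the output name are hypothetical; depending on the tesseract version, the whitelist may only be fully honored by the legacy engine, selected with --oem 0):

# constrain tesseract to the base64 alphabet plus the ">" quote marker;
# --psm 6 treats the page as a single uniform block of text
tesseract page-01.png page-01 --psm 6 \
    -c tessedit_char_whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=>'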

Only one problem: tesseract can’t read PDF input (not by default, anyway). No problem, I’ll just use imagemagick / ghostscript to convert the PDF into individual PNG images (to avoid further generational loss) and feed those to tesseract, right? But that didn’t quite work out: they seem (?) to try to load and convert all 76 pages/PNG files at once, and then naturally crash on the oversized input (but only after taking forever and generating 76 invalid output files that you’re then forced to clean up, of course).
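The obvious workaround (a sketch of one way to do it, not necessarily what I ultimately settled on) is to stop asking for all 76 pages in a single go and have ghostscript rasterize one page per invocation instead:

# rasterize one page at a time at 300 dpi to dodge the all-at-once blowup;
# EFTA00400459.pdf and the page-NN.png names are just stand-ins
for p in (seq 1 76)
    gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
        -dFirstPage=$p -dLastPage=$p \
        -sOutputFile=(printf 'page-%02d.png' $p) EFTA00400459.pdf
end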
