As it turns out they don't actually want you to do this (and have some interesting ways to stop you)
TL;DR
I bought my first ebook from amazon
Amazon's Kindle Android app was really buggy and crashed a bunch
Tried to download my book to use with a functioning reader app
Realized Amazon no longer lets you do that
Decided to reverse engineer their DRM system out of spite
Discovered multiple layers of protection including randomized alphabets
Defeated all of them with font matching wizardry
You can now download the books you own with my code
Part 1: Amazon Made This Personal
The One Time I Tried To Do Things The Right Way
I've been "obtaining" ebooks for years. But this ONE time, I thought: "Let's support the author."
Download Kindle app on Android. Open book.
Crash.
I Just Wanted To Read My Book
App crashes. Fine, I'll use the web reader.
Oh wait, can't download it for offline reading. What if I'm on a plane?
Hold on, I can't even export it to Calibre? Where I keep ALL my other books?
So let me get this straight:
I paid money for this book
I can only read it in Amazon's broken app
I can't download it
I can't back it up
I don't actually own it
Amazon can delete it whenever they want
This is a rental, not a purchase.
This does not say "Rent"
It Becomes Personal
I could've refunded and pirated it in 30 seconds. Would've been easier.
But that's not the point.
The point is I PAID FOR THIS BOOK. It's mine. And I'm going to read it in Calibre with the rest of my library even if I have to reverse engineer their web client to do it.
Part 2: Reversal Time
Kindle Cloud Reader (the web version) actually works. While looking through the network requests, I spotted this:
https://read.amazon.com/renderer/render
To download anything, you need:
1. Session cookies - standard Amazon login
2. Rendering token - from the startReading API call
3. ADP session token - extra auth layer
Sending the same headers and cookies the browser does returns a TAR file.
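In Python, the fetch looks roughly like this. The URL comes straight from the network tab; the query parameter and header names are my assumptions, so copy the real ones from your browser's dev tools while logged in at read.amazon.com:

```python
import requests

def download_batch(cookies: dict, adp_token: str, rendering_token: str,
                   asin: str, start: int) -> bytes:
    """Fetch one render batch as a TAR archive.

    Only the endpoint URL is confirmed; the parameter and header names
    below are hypothetical stand-ins for whatever your browser sends.
    """
    resp = requests.get(
        "https://read.amazon.com/renderer/render",
        cookies=cookies,  # your Amazon session cookies
        params={"asin": asin, "startingPosition": start, "numPage": 5},  # hypothetical names
        headers={
            "x-adp-session-token": adp_token,    # from the startReading API call
            "rendering-token": rendering_token,  # hypothetical header name
        },
    )
    resp.raise_for_status()
    return resp.content  # a TAR file, unpacked below
```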
What's Inside The TAR?
```
page_data_0_4.json   # The "text" (spoiler: it's not text)
glyphs.json          # SVG definitions for every character
toc.json             # Table of contents
metadata.json        # Book info
location_map.json    # Position mappings
```
Part 3: Amazon's Obfuscation Layers of Ebook Hell
Downloaded the first few pages, expected to see text. Got this instead:
{ "type": "TextRun", "glyphs": [24, 25, 74, 123, 91, 18, 19, 30, 4, ...], "style": "paragraph" }
These aren't letters. They're glyph IDs. Character 'T' isn't Unicode 84, it's glyph 24.
And glyph 24 is just a series of numbers defining a stroke path; it's just an image of a letter.
It's a substitution cipher! Each character maps to a non-sequential glyph ID.
The Alphabet Changes Every. Five. Pages.
Downloaded the next batch of pages. Same letter 'T' is now glyph 87.
Next batch? Glyph 142.
They randomize the entire alphabet on EVERY request.
This means:
You can only get 5 pages at a time (API hard limit)
Each request gets completely new glyph mappings
Glyph IDs are meaningless across requests
You can't build one mapping table for the whole book
Let Me Show You How Bad This Is
For my 920-page book:
184 separate API requests needed
184 different random alphabets to crack
361 unique glyphs discovered (a-z, A-Z, punctuation, ligatures)
1,051,745 total glyphs to decode
Fake Font Hints (They're Getting Sneaky)
Some SVG paths contained this garbage:
```
M695.068,0 L697.51,-27.954 m3,1 m1,6 m-4,-7 L699.951,-55.908 ...
```
Those tiny m3,1 m1,6 m-4,-7 commands? They're micro-MoveTo operations.
Why this is evil:
Browsers handle them fine (native Path2D)
Python SVG libraries create spurious connecting lines
Makes glyphs look corrupted when rendered naively
Breaks path-sampling approaches
This is deliberate anti-scraping. The glyphs render perfectly in the browser, but we can't just compare raw paths in our parser.
Take a look
Fun!
Eventually I figured out that filling in the complete path mitigated this.
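In Python terms, something like the sketch below: wrap each raw path in a minimal SVG and fill it, since the decoy micro-MoveTo subpaths have zero area and vanish under a fill. The viewBox numbers are my guess at Bookerly's glyph coordinate space, not values from Amazon's data:

```python
import cairosvg

def render_glyph(path_d: str, size: int = 512) -> bytes:
    """Rasterize a raw glyph path by filling the complete path.

    Filling ignores the zero-area decoy subpaths created by the
    micro-MoveTo commands, so no spurious connecting lines appear.
    Orientation just needs to be consistent across every render.
    """
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'viewBox="-200 -1800 2200 2200">'  # guessed bounds, not Amazon's
        f'<path d="{path_d}" fill="black"/>'
        '</svg>'
    )
    return cairosvg.svg2png(
        bytestring=svg.encode(), output_width=size, output_height=size
    )
```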
Multiple Font Variants
Not just one font. FOUR variants:
bookerly_normal (99% of glyphs)
bookerly_italic (emphasis)
bookerly_bold (headings)
bookerly_bolditalic (emphasized headings)
Plus special ligatures: ff, fi, fl, ffi, ffl
More variations = more unique glyphs to crack = more pain.
OCR Is Mid (My Failed Attempt)
Tried running OCR on rendered glyphs. Results:
178/348 glyphs recognized (51%)
170 glyphs failed completely
OCR just sucks at single characters without context. Confused 'l' with 'I' with '1'. Couldn't handle punctuation. Gave up on ligatures entirely.
OCR needs words and sentences to work well. Single characters? Might as well flip a coin.
Part 4: The Solution That Actually Worked
Every request includes `glyphs.json` with SVG path definitions:
{ "24": { "path": "M 450 1480 L 820 1480 L 820 0 L 1050 0 L 1050 1480 ...", "fontFamily": "bookerly_normal" }, "87": { "path": "M 450 1480 L 820 1480 L 820 0 L 1050 0 L 1050 1480 ...", "fontFamily": "bookerly_normal" } }
Glyph IDs change, but SVG shapes don't.
Why Direct SVG Comparison Failed
First attempt: normalize and compare SVG path coordinates.
Failed because:
Coordinates vary slightly between instances of the same character
The same shape can be encoded with different path commands (relative vs. absolute, curves split differently)
Pixel-Perfect Matching
Screw coordinate comparison. Let's just render everything and compare pixels.
Render that A
1. Render every SVG as an image
Use cairosvg (lets us handle those fake font hints correctly)
Render at 512 x 512px for accuracy
2. Generate perceptual hashes
Hash each rendered image
The hash becomes the unique identifier
Same shape = same hash, regardless of glyph ID
3. Build normalized glyph space
Map all 184 random alphabets to hash-based IDs
Now glyph "a1b2c3d4..." always means letter 'T'
4. Match to actual characters
Download Bookerly TTF fonts
Render every character (A-Z, a-z, 0-9, punctuation)
Use SSIM (Structural Similarity Index) to match
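Here's a sketch of steps 1 through 3, reusing `render_glyph` from earlier. imagehash's phash is one reasonable perceptual hash; the batch structure is my assumption about how the parsed `glyphs.json` files are held:

```python
import io

import imagehash
from PIL import Image

def glyph_hash(png_bytes: bytes) -> str:
    """Perceptual hash of a rendered glyph image. Identical shapes collapse
    to the same hash no matter which random ID Amazon assigned them."""
    img = Image.open(io.BytesIO(png_bytes)).convert("L")
    return str(imagehash.phash(img, hash_size=16))

def normalize(batches: list[dict]) -> tuple[dict, list[dict]]:
    """Collapse all per-request alphabets into one hash-keyed glyph space.

    `batches` is assumed to be a list of parsed glyphs.json dicts, one per
    five-page request: {glyph_id: {"path": ..., "fontFamily": ...}}.
    """
    normalized = {}   # hash -> glyph metadata (one entry per unique shape)
    batch_maps = []   # per batch: local glyph ID -> hash
    for glyphs in batches:
        local = {}
        for gid, meta in glyphs.items():
            h = glyph_hash(render_glyph(meta["path"]))  # render_glyph: see above
            normalized.setdefault(h, meta)
            local[gid] = h
        batch_maps.append(local)
    return normalized, batch_maps
```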
Why SSIM Is Perfect For This
SSIM compares image structure, not pixels directly. It handles:
Slight rendering differences
Anti-aliasing variations
Minor scaling issues
For each unknown glyph, find the TTF character with highest SSIM score. That's your letter.
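A minimal version with scikit-image, assuming the unknown glyph and every reference are grayscale uint8 arrays of the same shape (the reference dict is built in the next section):

```python
import numpy as np
from skimage.metrics import structural_similarity

def best_match(unknown: np.ndarray, references: dict[str, np.ndarray]):
    """Return the reference character whose rendering is structurally most
    similar to the unknown glyph, plus the winning SSIM score."""
    scores = {ch: structural_similarity(unknown, ref)
              for ch, ref in references.items()}
    ch = max(scores, key=scores.get)
    return ch, scores[ch]
```

In practice you'd also keep the score around, so low-confidence matches can be flagged for manual review instead of silently guessed.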
Handling The Edge Cases
Ligatures: ff, fi, fl, ffi, ffl
These are single glyphs for multiple characters
Had to add them to TTF library manually
Special characters: em-dash, quotes, bullets
Extended character set beyond basic ASCII
Matched against full Unicode range in Bookerly
Font variants: Bold, italic, bold-italic
Built separate libraries for each variant (sketched after this list)
Match against all libraries, pick best score
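Building one of those reference libraries from a TTF is straightforward with Pillow. The character set and canvas size here are my choices, not anything from Amazon's data:

```python
import string

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def build_reference_library(ttf_path: str, chars: str,
                            size: int = 512) -> dict[str, np.ndarray]:
    """Rasterize every character of one font variant onto the same canvas
    used for the unknown glyphs, so SSIM compares like with like."""
    font = ImageFont.truetype(ttf_path, int(size * 0.8))
    library = {}
    for ch in chars:
        img = Image.new("L", (size, size), color=255)
        ImageDraw.Draw(img).text(
            (size // 2, size // 2), ch, font=font, fill=0, anchor="mm"
        )
        library[ch] = np.asarray(img)
    return library

# One library per variant; ligatures like "fi" and "ffl" go in as extra entries.
chars = string.ascii_letters + string.digits + string.punctuation
normal = build_reference_library("Bookerly.ttf", chars)  # hypothetical font path
```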
Part 5: The Moment It All Worked
Final Statistics
```
=== NORMALIZATION PHASE ===
Total batches processed: 184
Unique glyphs found: 361
Total glyphs in book: 1,051,745

=== MATCHING PHASE ===
Successfully matched 361/361 unique glyphs (100.00%)
Failed to match: 0 glyphs
Average SSIM score: 0.9527

=== DECODED OUTPUT ===
Total characters: 5,623,847
Pages: 920
```
Perfect. Every single character decoded correctly.
EPUB Reconstruction With Perfect Formatting
The JSON includes positioning for every text run:
{ "glyphs": [24, 25, 74], "rect": {"left": 100, "top": 200, "right": 850, "bottom": 220}, "fontStyle": "italic", "fontWeight": 700, "fontSize": 12.5, "link": {"positionId": 7539} }
I used this to preserve:
Paragraph breaks (Y-coordinate changes; see the sketch after this list)
Text alignment (X-coordinate patterns)
Bold/italic styling
Font sizes
Internal links
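Paragraph detection, for example, can be as simple as watching the vertical gap between consecutive runs. The field names follow the JSON above; the 1.5x line-height threshold is my guess at a workable cutoff:

```python
def runs_to_paragraphs(runs: list[dict], gap_factor: float = 1.5) -> list[str]:
    """Group decoded text runs into paragraphs by Y-coordinate jumps.

    `runs` are assumed sorted by position, each carrying a "text" field
    holding the characters already decoded via the glyph map.
    """
    paragraphs, current, prev_bottom = [], [], None
    for run in runs:
        top, bottom = run["rect"]["top"], run["rect"]["bottom"]
        line_height = bottom - top
        # A vertical jump much bigger than one line means a new paragraph.
        if prev_bottom is not None and top - prev_bottom > gap_factor * line_height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(run["text"])
        prev_bottom = bottom
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```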
The final EPUB is nearly indistinguishable from the original!
The Real Conclusion
Amazon put real effort into their web DRM.
Was It Worth It?
To read one book? No.
To prove a point? Absolutely.
To learn about SVG rendering, perceptual hashing, and font metrics? Probably yes.
Use This Knowledge Responsibly
This is for backing up books YOU PURCHASED.
Don't get me sued into oblivion, thanks.