Suppose you have an appetite for tilting at windmills. Let's say you love pain. Well then, why not write a PDF parser today?
The ideal world: how the specification should work
Conceptually, parsing a PDF is fairly simple:
First, locate the version header comment at the start of the file.
Next, locate the pointer to the cross-reference table, which sits near the end of the file.
Then use the cross-reference table to find the offsets of all objects.
Finally, locate and build the trailer dictionary, which points to the catalog dictionary.
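To make those four steps concrete, here is a minimal Python sketch. It assumes a well-behaved file with a single classic cross-reference table (no cross-reference streams, no incremental updates), and it cheats on the last step by pulling the /Root reference out with a regular expression instead of building the full trailer dictionary. The function name parse_skeleton and the shape of the returned dictionary are made up for illustration.

```python
import re


def parse_skeleton(data: bytes) -> dict:
    """Walk the four 'ideal world' steps over the raw bytes of a PDF.

    Illustrative only: assumes one classic xref table and a trailer
    that contains a /Root reference.
    """
    # Step 1: the version header comment, e.g. %PDF-1.7, at the very start.
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")

    # Step 2: the last 'startxref' keyword is followed by the byte offset
    # of the cross-reference table.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("missing startxref")
    pointer = re.match(rb"startxref\s+(\d+)", data[pos:])
    if pointer is None:
        raise ValueError("malformed startxref")
    xref_offset = int(pointer.group(1))

    # Step 3: read the table. After the 'xref' keyword come subsections:
    # a 'first count' line, then one 'offset generation n|f' entry per object.
    tokens = data[xref_offset:].split()
    if not tokens or tokens[0] != b"xref":
        raise ValueError("no xref table at the startxref offset")
    offsets = {}
    i = 1
    while i + 1 < len(tokens) and tokens[i] != b"trailer":
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for n in range(count):
            offset, _generation, kind = tokens[i:i + 3]
            if kind == b"n":  # 'n' marks an object in use, 'f' a free entry
                offsets[first + n] = int(offset)
            i += 3

    # Step 4 (shortcut): find the catalog reference in the trailer.
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[xref_offset:])
    if root is None:
        raise ValueError("trailer has no /Root entry")

    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": xref_offset,
        "offsets": offsets,
        "root": (int(root.group(1)), int(root.group(2))),
    }
```

Calling parse_skeleton on the bytes of a simple file returns the version string, a mapping from object number to byte offset, and the object and generation number of the catalog. Anything that strays from the assumptions above raises a ValueError rather than trying to recover.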
Introduction to PDF objects
A PDF object wraps some valid PDF content (numbers, strings, dictionaries, etc.) and labels it with an object number and a generation number. The content is surrounded by the obj/endobj markers. For example, a simple number may have its own PDF object:
16 0 obj 620 endobj
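As a rough illustration of that wrapper, the following Python sketch splits an object like the one above into its object number, generation number, and raw content. The helper name split_indirect_object is hypothetical, and the regular expression is deliberately naive: it would be fooled by the word endobj appearing inside a string or stream, so treat it as a way to see the structure, not as a robust tokenizer.

```python
import re

# object number, generation number, 'obj', content, 'endobj'
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def split_indirect_object(data: bytes) -> tuple[int, int, bytes]:
    """Return (object number, generation number, raw content bytes)."""
    match = INDIRECT_OBJECT.search(data)
    if match is None:
        raise ValueError("no obj/endobj pair found")
    number, generation, content = match.groups()
    return int(number), int(generation), content.strip()


print(split_indirect_object(b"16 0 obj 620 endobj"))  # (16, 0, b'620')
```

The content comes back as uninterpreted bytes. Turning b'620' into a number, or a longer payload into a dictionary, is the job of a separate value parser.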