So you want to parse a PDF?

Suppose you have an appetite for tilting at windmills. Let's say you love pain. Well then, why not write a PDF parser today?

The ideal world: how the specification should work

Conceptually, parsing a PDF is fairly simple:

First, locate the version header comment at the start of the file.

Next, locate the pointer to the cross-reference table.

Then, use the cross-reference table to find all object offsets.

Finally, locate and build the trailer dictionary, which points to the catalog dictionary.
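The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the ideal case only, not a real parser: it assumes a classic (uncompressed) cross-reference table with a single subsection, `\n` line endings, and a `startxref` keyword that can be found by searching backwards from the end of the file. The names `build_tiny_pdf` and `parse_ideal_pdf` are hypothetical helpers invented for this example, and the tiny in-memory file is made-up test data.

```python
import re


def build_tiny_pdf() -> bytes:
    """Assemble a minimal, well-formed PDF in memory (made-up test data)."""
    header = b"%PDF-1.4\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    body, offsets = header, []
    for obj in objects:
        offsets.append(len(body))  # byte offset where this object starts
        body += obj
    xref_pos = len(body)
    # Entry 0 is the conventional free-list head; each entry is 20 bytes.
    xref = b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        xref += b"%010d 00000 n \n" % off
    trailer = b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    return body + xref + trailer + b"startxref\n%d\n%%%%EOF\n" % xref_pos


def parse_ideal_pdf(data: bytes) -> dict:
    # Step 1: the version header comment at the start of the file.
    m = re.match(rb"%PDF-(\d\.\d)", data)
    if m is None:
        raise ValueError("missing %PDF header")
    version = m.group(1).decode("ascii")

    # Step 2: the pointer to the cross-reference table follows 'startxref'.
    idx = data.rfind(b"startxref")
    m = re.match(rb"startxref\s+(\d+)", data[idx:]) if idx != -1 else None
    if m is None:
        raise ValueError("missing startxref pointer")
    xref_pos = int(m.group(1))

    # Step 3: read the table to find all object offsets.
    # Assumption: exactly one subsection and '\n' line endings.
    lines = data[xref_pos:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first, count = map(int, lines[1].split())
    offsets = {}
    for i in range(count):
        off, gen, kind = lines[2 + i].split()
        if kind == b"n":  # 'n' marks an in-use object, 'f' a free one
            offsets[(first + i, int(gen))] = int(off)

    # Step 4: the trailer dictionary points to the catalog via /Root.
    t = data.find(b"trailer", xref_pos)
    m = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[t:]) if t != -1 else None
    if m is None:
        raise ValueError("trailer has no /Root entry")
    root = (int(m.group(1)), int(m.group(2)))

    return {"version": version, "offsets": offsets, "root": root}


pdf = build_tiny_pdf()
info = parse_ideal_pdf(pdf)
print(info["version"], info["root"], info["offsets"])
```

Pulling `/Root` out with a regular expression stands in for actually building the trailer dictionary; a real parser would tokenize the dictionary properly.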

Introduction to PDF objects

A PDF object wraps some valid PDF content (numbers, strings, dictionaries, etc.) and labels it with an object number and a generation number. The content is surrounded by the obj/endobj markers. For example, a simple number may have its own PDF object:

16 0 obj 620 endobj
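As a rough illustration, the object above can be split into its object number, generation number, and content with a regular expression. This is a deliberately naive sketch: `parse_indirect_object` is a hypothetical helper, and the pattern assumes the content does not itself contain the text `endobj` (which a string or stream legitimately could).

```python
import re

# <object number> <generation number> obj <content> endobj
INDIRECT_OBJECT = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def parse_indirect_object(data: bytes) -> tuple[int, int, bytes]:
    """Return (object number, generation number, raw content bytes)."""
    m = INDIRECT_OBJECT.match(data)
    if m is None:
        raise ValueError("not an indirect object")
    return int(m.group(1)), int(m.group(2)), m.group(3).strip()


number, generation, content = parse_indirect_object(b"16 0 obj 620 endobj")
print(number, generation, content)  # 16 0 b'620'
```

The content is returned as raw bytes; interpreting it as a number, string, or dictionary is a separate step.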
