Reverse engineering iWork

The app I’m working on ingests a lot of files, and there’s no good solution for parsing .key, .numbers, or .pages files. Every existing approach requires you to first export your document to PDF (or some other format), then upload it for server-side processing. At that point, you’re either running it through a vision model or a PDF parser, both of which lose significant information or don’t work particularly well.

This isn’t my first time solving distribution problems by going directly to the source. I previously ported Perl to WebAssembly so ExifTool could run client-side for metadata extraction, avoiding the need to upload files or have Perl installed. Same principle applies here: if you want high-quality extraction from iWork files without round-tripping through export formats or sending data to a server, you need to parse the native format.

I am not held back by the conventional wisdom for the simple reason that I am completely unaware of it. So I decided to build a proper parser that keeps user files on their computer and produces the highest quality output possible.

A Brief History of iWork

In 2013, Apple switched the iWork document format from XML to a new binary format built on Google’s Protocol Buffers. The change affected Pages, Keynote, and Numbers, and coincided with iCloud support for iWork and the transition to 64-bit applications. Apple never publicly explained the decision, but the old XML format, which loaded entire documents and assets into memory at once, would have made it difficult to deliver a good experience on the early iPhone, iPad, and web.

Finding the Descriptors

Apple ships Pages, Keynote, and Numbers with their protobuf message descriptors preserved in the executables. These descriptors define the structure of every message type and can be recovered from the binaries.
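On macOS those executables are straightforward to get at. As a minimal sketch (the Keynote path below is just the usual install location, and the variable names are mine, not from the original code), you can read the binary into memory before scanning it:

import Foundation

// Load the raw bytes of an iWork executable so we can scan them for embedded
// descriptors. Path and names here are assumptions for illustration.
let keynoteBinary = URL(fileURLWithPath: "/Applications/Keynote.app/Contents/MacOS/Keynote")
if let data = try? Data(contentsOf: keynoteBinary) {
    print("Loaded \(data.count) bytes to scan for .proto descriptors")
}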

The recovery process works by scanning through the binary data looking for specific patterns. Protocol Buffer descriptors have a recognizable structure: they start with a length-delimited field (tag 0x0A) followed by a varint length and then the filename, which always ends in .proto. Once we find a potential descriptor, we validate it by reading through the protobuf wire format:

// Search for ".proto" filename suffix in binary data
let protoSuffix = ".proto".data(using: .utf8)!
let protoStartMarker: UInt8 = 0x0A // Protobuf wire format tag

// When we find ".proto", scan backwards for the start marker
guard let markerIndex = findMarkerBackwards(
    in: data,
    to: suffixRange.lowerBound,
    marker: protoStartMarker
) else { continue }

// Read the filename length as a varint and verify
var nameLength: UInt64 = 0
guard readVarint(&nameLength, from: data, offset: markerIndex + 1) else { continue }
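The snippet leans on two helpers it doesn’t show. Here is a rough sketch of what they might look like, assuming findMarkerBackwards returns nil when no marker byte precedes the filename; these implementations are my own guesses, not the article’s code:

import Foundation

// Scan backwards from `end`, returning the index of the nearest `marker` byte,
// or nil if none is found before the start of the buffer.
func findMarkerBackwards(in data: Data, to end: Data.Index, marker: UInt8) -> Data.Index? {
    var index = end
    while index > data.startIndex {
        index = data.index(before: index)
        if data[index] == marker { return index }
    }
    return nil
}

// Decode a protobuf varint starting at `offset` into `value`; returns false if
// the buffer ends mid-varint or the encoding exceeds the 10-byte maximum.
func readVarint(_ value: inout UInt64, from data: Data, offset: Data.Index) -> Bool {
    value = 0
    var shift: UInt64 = 0
    var index = offset
    while index < data.endIndex && shift < 64 {
        let byte = data[index]
        value |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return true }
        shift += 7
        index = data.index(after: index)
    }
    return false
}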

After validating the descriptor start, we need to find where it ends. Descriptors terminate with a null tag (a tag value of 0), so we read through the wire format until we hit that marker.
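The excerpt stops here, so the following is only a sketch of that termination scan, not the article’s implementation: it walks wire-format fields, skipping each payload according to its wire type, until it reads a zero tag. decodeVarint (a variant that also reports where the varint ends) and findDescriptorEnd are hypothetical names.

import Foundation

// Decode a protobuf varint and report the index just past it.
func decodeVarint(in data: Data, at offset: Data.Index) -> (value: UInt64, next: Data.Index)? {
    var value: UInt64 = 0
    var shift: UInt64 = 0
    var index = offset
    while index < data.endIndex && shift < 64 {
        let byte = data[index]
        value |= UInt64(byte & 0x7F) << shift
        index = data.index(after: index)
        if byte & 0x80 == 0 { return (value, index) }
        shift += 7
    }
    return nil
}

// Walk wire-format fields from `start`, skipping each payload by its wire
// type, until a zero tag marks the end of the serialized descriptor.
func findDescriptorEnd(in data: Data, from start: Data.Index) -> Data.Index? {
    var index = start
    while index < data.endIndex {
        if data[index] == 0 { return index } // null tag: descriptor ends here
        guard let (tag, afterTag) = decodeVarint(in: data, at: index) else { return nil }
        switch tag & 0x7 { // low three bits are the wire type
        case 0: // varint field: decode and discard the value
            guard let (_, next) = decodeVarint(in: data, at: afterTag) else { return nil }
            index = next
        case 1: // fixed 64-bit field
            index = afterTag + 8
        case 2: // length-delimited field: skip its payload
            guard let (length, afterLength) = decodeVarint(in: data, at: afterTag),
                  length <= UInt64(data.distance(from: afterLength, to: data.endIndex))
            else { return nil }
            index = afterLength + Int(length)
        case 5: // fixed 32-bit field
            index = afterTag + 4
        default: // groups and unknown wire types: treat as not a valid descriptor
            return nil
        }
    }
    return nil
}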
