The app I’m working on ingests a lot of files, and there’s no good solution for parsing .key , .numbers , or .pages files. Every existing approach requires you to first export your document to PDF (or some other format), then upload it for server-side processing. At that point, you’re either running it through a vision model or a PDF parser, both of which lose significant information or don’t work particularly well.
This isn’t my first time solving distribution problems by going directly to the source. I previously ported Perl to WebAssembly so ExifTool could run client-side for metadata extraction, avoiding the need to upload files or have Perl installed. Same principle applies here: if you want high-quality extraction from iWork files without round-tripping through export formats or sending data to a server, you need to parse the native format.
I am not held back by the conventional wisdom for the simple reason that I am completely unaware of it. So I decided to build a proper parser that keeps user files on their computer and produces the highest quality output possible.
A Brief History of iWork
In 2013, Apple switched the iWork document format from XML to a new binary format built on Google’s Protocol Buffers. The change affected Pages, Keynote, and Numbers, and coincided with iCloud support for iWork and the transition to 64-bit applications. Apple never publicly explained the decision, but the old XML format, which loaded entire documents and assets into memory at once, would have made it difficult to deliver a good experience on the early iPhone, iPad, and web.
Finding the Descriptors
Apple ships Pages, Keynote, and Numbers with their protobuf message descriptors preserved in the executables. These descriptors define the structure of every message type and can be recovered from the binaries.
The recovery process works by scanning through the binary data looking for specific patterns. Protocol Buffer descriptors have a recognizable structure: they start with a length-delimited field (tag 0x0A ) followed by a varint length and then the filename, which always ends in .proto . Once we find a potential descriptor, we validate it by reading through the protobuf wire format:
```swift
// Search for ".proto" filename suffix in binary data
let protoSuffix = ".proto".data(using: .utf8)!
let protoStartMarker: UInt8 = 0x0A // Protobuf wire format tag

// When we find ".proto", scan backwards for the start marker
let markerIndex = findMarkerBackwards(
    in: data,
    to: suffixRange.lowerBound,
    marker: protoStartMarker
)

// Read the filename length as a varint and verify
var nameLength: UInt64 = 0
guard readVarint(&nameLength, from: data, offset: markerIndex + 1) else {
    continue
}
```
After validating the descriptor start, we need to find where it ends. Descriptors terminate with a null tag (tag value of 0), so we read through the wire format until we hit that marker:
```swift
let stream = ProtobufInputStream(data: potentialDescriptorData)
let descriptorLength = stream.readUntilNullTag()
```
The `readUntilNullTag()` method handles the various protobuf wire types (varints, fixed-width integers, and length-delimited data), advancing through the stream until it encounters the terminating null tag.
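The core of that loop is plain protobuf wire-format walking. Here is a language-agnostic sketch in Python of the same idea; the names are my own, not WorkKit's API:

```python
def read_varint(data: bytes, offset: int) -> tuple[int, int]:
    """Read a base-128 varint; returns (value, new_offset)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset

def length_until_null_tag(data: bytes) -> int:
    """Advance through protobuf fields until a zero tag, returning bytes consumed."""
    offset = 0
    while offset < len(data):
        tag, offset = read_varint(data, offset)
        if tag == 0:          # terminating null tag
            return offset
        wire_type = tag & 0x07
        if wire_type == 0:    # varint
            _, offset = read_varint(data, offset)
        elif wire_type == 1:  # 64-bit fixed
            offset += 8
        elif wire_type == 2:  # length-delimited
            length, offset = read_varint(data, offset)
            offset += length
        elif wire_type == 5:  # 32-bit fixed
            offset += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return offset
```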
Converting Descriptors to Source
Swift’s protobuf library doesn’t include a built-in mechanism for converting descriptors back into human-readable .proto files. The C++ library has this feature, but we’re working with Swift here. Fortunately, SwiftProtobuf provides a visitor API that lets us traverse the descriptor structure and reconstruct the schema ourselves.
The visitor pattern walks through each field in the descriptor:
```swift
struct ProtoRenderer: SwiftProtobuf.Visitor {
    private var output: [String] = []

    mutating func visitRepeatedMessageField<M: Message>(
        value: [M],
        fieldNumber: Int
    ) throws {
        switch fieldNumber {
        case 4: // Message definitions
            for msg in value { try renderMessage(msg) }
        case 5: // Enum definitions
            for enumMsg in value { try renderEnum(enumMsg) }
        case 6: // Service definitions
            for svc in value { try renderService(svc) }
        default:
            break
        }
    }
}
```
When we visit string fields, we capture the syntax version and package name. When we visit repeated message fields with specific field numbers (4, 5, 6), we know we’re looking at message definitions, enum definitions, or service definitions. Each gets rendered with proper indentation and syntax:
```swift
private mutating func renderMessage(_ desc: DescriptorProto) throws {
    emit("message \(desc.name) {")
    indent += 1
    for field in desc.field {
        let label = fieldLabel(for: field.label) // optional, required, repeated
        let type = fieldType(for: field.type)    // int32, string, etc.
        emit("\(label)\(type) \(field.name) = \(field.number);")
    }
    indent -= 1
    emit("}")
}
```
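The rendering itself is mostly string assembly over the descriptor fields. A toy sketch of the idea, with invented names and a simplified field representation:

```python
def render_message(name, fields, indent=0):
    """Render a proto message body from (label, type, name, number) tuples."""
    pad = "  " * indent
    lines = [f"{pad}message {name} {{"]
    for label, ftype, fname, number in fields:
        lines.append(f"{pad}  {label} {ftype} {fname} = {number};")
    lines.append(f"{pad}}}")
    return "\n".join(lines)
```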
Apple’s Object-Oriented Protobuf System
Once we dump the schemas, we can see how Apple structured their system. In typical Apple fashion, they built an entire object-oriented layer on top of protobufs. The format includes inheritance hierarchies (messages have “supers”), reflection capabilities, and a type system that clearly took inspiration from the Objective-C runtime it was designed to interface with.
Consider how a typical message is defined:
```protobuf
message TSDDrawableArchive {
  optional .TSP.Reference super = 1;
  optional .TSD.FillArchive fill = 10;
  optional .TSD.StrokeArchive stroke = 11;
  // ... more fields
}
```
That super field isn’t standard protobuf; it’s Apple’s way of encoding inheritance. The TSP.Reference type points to another message, effectively creating a parent-child relationship between objects. This mirrors how Objective-C class hierarchies work, with each object maintaining a reference to its superclass.
Mapping Types to Prototypes
Having the schemas is only half the battle. Each message in an iWork Archive (IWA) includes a type identifier; an integer that tells the parser which protobuf message definition to use. We need to build a mapping from these type IDs to their corresponding Swift classes.
Apple stores this mapping in a runtime registry called TSPRegistry . We can extract it using Frida:
```javascript
var TSPRegistry = ObjC.classes.TSPRegistry;
var registry = TSPRegistry.sharedRegistry();

// The registry's description includes the type-to-prototype mappings
var description = registry.toString();
var results = {};
var lines = description.split("\n");
for (var line of lines) {
    if (line.includes(" -> ")) {
        var [typeNum, className] = line.split(" -> ");
        results[parseInt(typeNum)] = className.trim();
    }
}
```
This gives us a JSON mapping like:
{ "1": "TSPArchiveInfo", "2": "TSPDataInfo", "10000": "TSDDrawableArchive", "11000": "TSTTableDataList" }
With this mapping in hand, we can generate Swift decoder functions that translate type IDs to the appropriate message classes. We also need to handle extensions—some messages use protobuf extensions for additional fields. By scanning the generated Swift code, we build a secondary map of which extensions apply to which messages:
```python
# Find MessageExtension declarations in Swift files
extension_pattern = r'static\s+let\s+(\w+)\s+=\s+MessageExtension<[^,]+,\s*(\w+)>'
```

An example match in the generated code:

```swift
extension TSA_ThemePresetsArchive {
  enum Extensions {
    // First type parameter elided
    static let `extension` = SwiftProtobuf.MessageExtension<..., TSS_ThemeArchive>(
      _protobuf_fieldNumber: 210,
      fieldName: "TSA.ThemePresetsArchive.extension"
    )
  }
}
```
Then we generate decoders that automatically include the right extensions:
```swift
func decodeKeynote(type: UInt32, data: Data) throws -> SwiftProtobuf.Message {
    switch type {
    case 10000:
        return try TSDDrawableArchive(
            serializedBytes: data,
            extensions: SimpleExtensionMap([
                TSDArchive.Extensions.geometry,
                TSDArchive.Extensions.externalTextWrap
            ])
        )
    // ... more cases
    }
}
```
Parsing the IWA Format
With the schemas extracted, the type mappings generated, and the decoders built, we have everything needed to parse iWork documents.
Document Structure
An iWork document comes in two physical formats: a directory bundle or a ZIP archive. Both contain the same logical structure:
```
Document.pages/
├── Index.zip                  # Contains all .iwa files
├── Metadata/
│   ├── Properties.plist
│   ├── DocumentIdentifier
│   └── BuildVersionHistory.plist
├── Data/                      # Referenced media files
│   ├── image-123.jpeg
│   └── video-456.mp4
└── preview.jpg                # Document thumbnail
```
The Index.zip archive (or Index/ directory in bundle format) contains the actual document structure as a collection of iWork Archives. Each IWA holds a series of compressed protobuf messages that represent different parts of the document.
Snappy Compression
IWA files use Snappy compression, but Apple's implementation diverges from the standard Snappy framing format. Traditional Snappy framing includes CRC32 checksums for data integrity; Apple's version drops these entirely.
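For comparison, a standard Snappy block (what each chunk contains, minus Apple's custom header) is a varint uncompressed length followed by literal runs and back-reference copies. A minimal Python sketch of that baseline format, based on the public Snappy spec rather than WorkKit's code:

```python
def snappy_decompress(block: bytes) -> bytes:
    """Decode one raw Snappy block: varint uncompressed length, then tagged elements."""
    def varint(data, i):
        value = shift = 0
        while True:
            b = data[i]; i += 1
            value |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                return value, i

    expected, i = varint(block, 0)
    out = bytearray()
    while i < len(block):
        tag = block[i]; i += 1
        kind = tag & 0x03
        if kind == 0:                      # literal run
            length = (tag >> 2) + 1
            if (tag >> 2) >= 60:           # long literal: length in trailing bytes
                extra = (tag >> 2) - 59
                length = int.from_bytes(block[i:i + extra], "little") + 1
                i += extra
            out += block[i:i + length]
            i += length
        else:                              # copy with 1-, 2-, or 4-byte offset
            if kind == 1:
                length = ((tag >> 2) & 0x07) + 4
                offset = ((tag >> 5) << 8) | block[i]; i += 1
            elif kind == 2:
                length = (tag >> 2) + 1
                offset = int.from_bytes(block[i:i + 2], "little"); i += 2
            else:
                length = (tag >> 2) + 1
                offset = int.from_bytes(block[i:i + 4], "little"); i += 4
            for _ in range(length):        # byte-by-byte: copies may overlap the output
                out.append(out[-offset])
    assert len(out) == expected
    return bytes(out)
```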
Initially, I wrapped the C Snappy library, but this made Swift build times incredibly long. After examining the format more carefully, I realized Apple’s simplified approach meant I could write a pure Swift implementation:
```swift
private static func decompressSnappyChunks(_ data: Data) throws -> Data {
    var result = Data()
    var offset = 0

    while offset < data.count {
        // Apple's custom header: single byte (always 0) + 3-byte length
        let headerType = data[offset]
        guard headerType == 0 else {
            throw IWorkError.invalidIWAHeader(expected: 0, found: headerType)
        }
        offset += 1

        // Read 24-bit little-endian length
        let length = Int(data[offset])
            | (Int(data[offset + 1]) << 8)
            | (Int(data[offset + 2]) << 16)
        offset += 3

        let compressedChunk = data.subdata(in: offset..<(offset + length))
        result.append(try decompressSnappy(compressedChunk)) // helper name reconstructed
        offset += length
    }
    return result
}
```

Within each chunk, literal runs are decoded from their tag byte:

```swift
let tagByte = data[offset]          // low two bits: element type; 00 = literal
var len = Int(tagByte >> 2) + 1
offset += 1

// Handle variable-length encoding for longer literals
if len > 60 {
    let extraBytes = len - 59
    len = readMultiByteLength(from: data, at: offset, bytes: extraBytes)
    offset += extraBytes
}
chunks.append(data.subdata(in: offset..<(offset + len)))
```

Images

Images can be cropped by masks, which means computing the scale between the image's geometry and its mask's geometry:

```swift
CGSize(
    width: maskGeometry.size.width > 0
        ? CGFloat(imageGeometry.size.width / maskGeometry.size.width) : 1.0,
    height: maskGeometry.size.height > 0
        ? CGFloat(imageGeometry.size.height / maskGeometry.size.height) : 1.0
)
```
Before properly implementing protobuf extensions, I ran into some interesting discoveries. Web videos (embedded YouTube/Vimeo content) appeared to be just images with hyperlinks; I could see the thumbnail (because it was the image) and the URL in its parent, but none of the video-specific attributes. It turns out those attributes lived in protobuf extensions that I wasn't loading. Once I added extension support, all the remote video attributes suddenly appeared.
Media
One of my favorite quirks: audio files, video files, and 3D models all use the same TSD_MovieArchive message type. Apple distinguishes them through flags and extensions rather than separate message types:
```swift
private func parseMediaType(from movie: TSD_MovieArchive) -> MediaType {
    if movie.audioOnly { return .audio }

    if let dataID = parseMediaDataID(from: movie),
       let fileInfo = resolveFile(from: metadata, dataID: dataID),
       let filename = fileInfo.0 {
        if filename.lowercased().hasSuffix(".gif") { return .gif }
    }
    return .video
}

private func is3DObject(from movie: TSD_MovieArchive) -> Bool {
    return movie.hasTSA_Object3DInfo_object3DInfo
}
```
Audio, video, and GIF files share most of their metadata structure: duration, volume, loop options, poster images. 3D models get additional properties through an extension: pose information (yaw, pitch, roll), animation flags, bounding rectangles, and traced paths for text wrapping.
Video? TSD_MovieArchive. Music? TSD_MovieArchive. GIF?! TSD_MovieArchive. Photo Gallery? TSD_MovieArchive. 3D Model? Believe it or not, TSD_MovieArchive.
Equations
Equations are stored in TSD_ImageArchive messages; technically they render as images in the document. Initially, I didn't have protobuf extensions working, so I saw equation images but no source data. I assumed Apple stored the LaTeX or MathML source in the PDF metadata stream, so I wrote a PDF parser:
```swift
package func extractMetadataFromPDF(with id: UInt64, using data: Data) throws -> IWorkEquation {
    guard let dataProvider = CGDataProvider(data: data as CFData),
          let document = CGPDFDocument(dataProvider),
          let catalog = document.catalog else {
        throw IWorkError.equationReadFailed(id: id, reason: "Failed to create PDF document")
    }

    var metadataObjectRef: CGPDFObjectRef?
    guard CGPDFDictionaryGetObject(catalog, "Metadata", &metadataObjectRef) else {
        throw IWorkError.equationReadFailed(id: id, reason: "No metadata object found")
    }

    var metadataStream: CGPDFStreamRef?
    guard CGPDFObjectGetValue(metadataObjectRef!, .stream, &metadataStream) else {
        throw IWorkError.equationReadFailed(id: id, reason: "Failed to get metadata stream")
    }

    var format = CGPDFDataFormat.raw
    guard let rawData = CGPDFStreamCopyData(metadataStream!, &format) else {
        throw IWorkError.equationReadFailed(id: id, reason: "Failed to copy metadata stream")
    }

    let metadataData = Data(bytes: CFDataGetBytePtr(rawData), count: CFDataGetLength(rawData))
    guard let xmlString = String(data: metadataData, encoding: .utf8) else {
        throw IWorkError.equationReadFailed(id: id, reason: "Failed to decode metadata")
    }

    return try parseEquationFromXML(with: id, from: xmlString)
}

package func parseEquationFromXML(with id: UInt64, from xmlString: String) throws -> IWorkEquation {
    let regex = Regex {
        // Pattern elided: matches the equation source CDATA block and captures its contents
    }
    guard let match = xmlString.firstMatch(of: regex) else {
        throw IWorkError.equationReadFailed(id: id, reason: "No equation CDATA found")
    }

    let equationContent = String(match.1)
    if equationContent.contains("http://www.w3.org/1998/Math/MathML") {
        return .mathml(equationContent)
    } else {
        return .latex(equationContent)
    }
}
```
This worked because Apple does embed the source in PDF metadata. But once I got extensions working properly, I discovered the equation source is right there in the IWA file as an extension field. All that PDF parsing code reduced to:
if let equation = image.equation { return .equation(equation) }
Much better.
Tables
Tables are easily the most complex structures in the format. A TST_TableModelArchive contains the table’s structure and data, but the actual cell contents live in a compressed binary format within tile storage.
Tables are divided into tiles (typically 256x256 cells) to avoid loading massive spreadsheets entirely into memory. Each tile contains a packed representation of its cells:
```swift
for (tileIndex, tileInfo) in tableData.tiles.enumerated() {
    guard let tile: TST_Tile = document.dereference(tileInfo.tile) else { continue }

    for (rowIndex, rowInfo) in tile.rowInfos.enumerated() {
        let actualRowIndex = tileIndex * Int(tile.numrows) + rowIndex

        // Process cells in this row
        try await processCellRow(
            rowInfo: rowInfo,
            rowIndex: actualRowIndex,
            columnCount: Int(tableData.columnCount),
            stringMap: tableData.stringMap,
            richMap: tableData.richMap,
            coordinateSpace: coordinateSpace,
            tableModel: table
        )
    }
}
```
Each cell’s data is packed into a binary structure starting with a 12-byte header:
```
Offset 0:    Version (1 byte)
Offset 1:    Cell type (1 byte)
Offset 2-5:  Reserved
Offset 6-7:  Extras (16-bit flags)
Offset 8-11: Storage flags (32-bit bitfield)
```
The storage flags indicate which optional fields follow. A cell might have:
- Decimal128 value (16 bytes)
- Double value (8 bytes)
- Timestamp in seconds (8 bytes)
- String ID (4 bytes, indexes into string table)
- Rich text ID (4 bytes, indexes into rich text table)
- Cell style ID (4 bytes)
- Text style ID (4 bytes)
- Formula ID (4 bytes)
- Various format IDs (4 bytes each)
The parser reads these sequentially based on which flags are set:
```swift
var dataOffset = offset + 12

if flags.contains(.hasDecimal128) {
    if dataOffset + 16 <= buffer.count {
        decimal128 = unpackDecimal128(from: buffer, offset: dataOffset)
        dataOffset += 16
    }
}

if flags.contains(.hasDouble) {
    if dataOffset + 8 <= buffer.count {
        double = buffer.withUnsafeBytes { bytes in
            bytes.loadUnaligned(fromByteOffset: dataOffset, as: Double.self)
        }
        dataOffset += 8
    }
}
// and so on...
```
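The flag-driven layout can be sketched end to end in a few lines. Note the flag bit positions below are illustrative assumptions for the sketch, not Apple's actual values:

```python
import struct

# Flag bit positions are illustrative assumptions, not Apple's actual values
HAS_DECIMAL128 = 1 << 0
HAS_DOUBLE     = 1 << 1
HAS_STRING_ID  = 1 << 2

def parse_cell(buffer: bytes) -> dict:
    """Parse the 12-byte header, then whichever optional fields the flags announce."""
    version = buffer[0]
    cell_type = buffer[1]
    # bytes 2-7 (reserved + extras) skipped; storage flags at bytes 8-11
    flags = struct.unpack_from("<I", buffer, 8)[0]

    cell = {"version": version, "type": cell_type}
    offset = 12
    if flags & HAS_DECIMAL128:
        cell["decimal128"] = buffer[offset:offset + 16]
        offset += 16
    if flags & HAS_DOUBLE:
        (cell["double"],) = struct.unpack_from("<d", buffer, offset)
        offset += 8
    if flags & HAS_STRING_ID:
        (cell["string_id"],) = struct.unpack_from("<I", buffer, offset)
        offset += 4
    return cell
```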
Decimal128 values use a custom format with a bias constant for the exponent:
```swift
private func unpackDecimal128(from buffer: Data, offset: Int) -> Double {
    let byte15 = UInt16(buffer[offset + 15])
    let byte14 = UInt16(buffer[offset + 14])

    let expBits = ((byte15 & 0x7F) << 7) | (byte14 >> 1)
    let exp = Int(expBits) - IWorkConstants.decimal128Bias // Bias = 0x1820

    var mantissa: UInt64 = UInt64(byte14 & 1)
    for i in (0..<14).reversed() {
        mantissa = mantissa * 256 + UInt64(buffer[offset + i])
    }

    let sign = (byte15 & 0x80) != 0
    var value = Double(mantissa) * pow(10.0, Double(exp))
    if sign { value = -value }
    return value
}
```
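The same unpacking is easy to sanity-check in Python (bias 0x1820 = 6176):

```python
def unpack_decimal128(buf: bytes) -> float:
    """Port of the Decimal128 unpacking above: sign bit, 14-bit exponent, 113-bit mantissa."""
    byte15, byte14 = buf[15], buf[14]
    exp = (((byte15 & 0x7F) << 7) | (byte14 >> 1)) - 0x1820
    mantissa = byte14 & 1
    for i in reversed(range(14)):
        mantissa = mantissa * 256 + buf[i]
    value = mantissa * 10.0 ** exp
    return -value if byte15 & 0x80 else value
```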
Cell borders require special handling. Rather than storing borders on individual cells (which would be wasteful), tables use stroke layers that define continuous border runs:
```swift
private func parseBorderForSide(
    layers: [TSP_Reference],
    row: Int,
    column: Int,
    isHorizontal: Bool
) -> Border? {
    var bestBorder: (border: Border, order: Int)?

    for (layerIndex, layerRef) in layers.enumerated() {
        guard let strokeLayer: TST_StrokeLayerArchive = document.dereference(layerRef) else { continue }

        for strokeRun in strokeLayer.strokeRuns {
            let origin = Int(strokeRun.origin)
            let length = Int(strokeRun.length)
            let order = Int(strokeRun.order)

            let intersects: Bool
            if isHorizontal {
                intersects = (layerIndex == row) && (column >= origin) && (column < origin + length)
            } else {
                intersects = (layerIndex == column) && (row >= origin) && (row < origin + length)
            }
            guard intersects else { continue }

            let border = createBorderFromStroke(strokeRun)
            // Higher order wins when strokes overlap
            if bestBorder == nil || order > bestBorder!.order {
                bestBorder = (border, order)
            }
        }
    }
    return bestBorder?.border
}
```
This stroke layer system means a 1000x1000 table with full borders only needs to store 1000 horizontal stroke runs and 1000 vertical stroke runs, rather than 4 million border definitions.
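Resolving the border for one cell then reduces to scanning the runs that cover its coordinate and keeping the highest order. A stripped-down sketch:

```python
def resolve_border(runs: list[tuple[int, int, int]], column: int):
    """runs: (origin, length, order) tuples; return the winning run's order, or None."""
    best = None
    for origin, length, order in runs:
        if origin <= column < origin + length:   # run covers this cell
            if best is None or order > best:     # higher order wins on overlap
                best = order
    return best
```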
Currency cells get special treatment with their own cell type and format information:
```swift
case CellType.currency.rawValue:
    if let value = decimal128 ?? double, let format = currencyFormat {
        return .currency(value, format: format, metadata: metadata)
    }
    return .empty
```
The currency format includes the ISO 4217 code, decimal places (or 253 for automatic), whether to show the symbol, and whether to use accounting style (parentheses for negatives).
Fortunately, the computed values from cells containing formulas are cached in the document, so we don’t need an engine or separate formula parser.
Shapes and Vector Graphics
Shapes use a path system that supports multiple geometry types. The PathSource enum represents all possible shape types:
```swift
public enum PathSource {
    case point(PointPathSource)                   // Arrows, stars, plus signs
    case scalar(ScalarPathSource)                 // Rounded rectangles, polygons
    case bezier(BezierPath)                       // Standard vector paths
    case callout(CalloutPathSource)               // Speech bubbles
    case connectionLine(ConnectionLinePathSource) // Lines between shapes
    case editableBezier(EditableBezierPathSource) // Paths with editable nodes
}
```
Each type stores its geometry differently. Point-based shapes define their form through a single control point:
```swift
if pathSource.hasPointPathSource {
    let pointSource = pathSource.pointPathSource

    let type: PointPathSource.PointType
    switch pointSource.type {
    case .kTsdleftSingleArrow: type = .leftSingleArrow
    case .kTsdrightSingleArrow: type = .rightSingleArrow
    case .kTsddoubleArrow: type = .doubleArrow
    case .kTsdstar: type = .star
    case .kTsdplus: type = .plus
    }

    return .point(
        PointPathSource(
            type: type,
            point: PathPoint(x: Double(pointSource.point.x), y: Double(pointSource.point.y)),
            naturalSize: CGSize(
                width: CGFloat(pointSource.naturalSize.width),
                height: CGFloat(pointSource.naturalSize.height)
            )
        ))
}
```
Bezier paths store sequences of path elements (moveTo, lineTo, curveTo, closeSubpath):
```swift
let elements = bezierSource.path.elements.map { element -> PathElement in
    let type: PathElementType
    switch element.type {
    case .moveTo: type = .moveTo
    case .lineTo: type = .lineTo
    case .quadCurveTo: type = .quadCurveTo
    case .curveTo: type = .curveTo
    case .closeSubpath: type = .closeSubpath
    }

    let points = element.points.map { point in
        PathPoint(x: Double(point.x), y: Double(point.y))
    }
    return PathElement(type: type, points: points)
}
```
Shapes can contain text; they have an optional ownedStorage field referencing a text storage. Text boxes are just shapes with specific styling.
Charts
Charts combine multiple systems: data grids, axes, series styling, and legends. The chart grid stores values in a row/column structure with a direction flag indicating whether series run by row or by column:
```swift
let rows = grid.gridRow.map { gridRow in
    let values = gridRow.value.map { tschValue -> ChartGridValue in
        if tschValue.hasNumericValue { return .number(tschValue.numericValue) }
        if tschValue.hasDateValue { return .date(tschValue.dateValue) }
        if tschValue.hasDurationValue { return .duration(tschValue.durationValue) }
        return .empty
    }
    return ChartGridRow(values: values)
}
```
Each axis gets its own style and non-style archives. The style archive controls visual appearance (visibility, colors, line widths). The non-style archive controls data properties (minimum/maximum values, number formats, scale type):
```swift
if let firstNonStyleRef = axisNonStyles.first,
   let axisNonStyle = document.dereference(firstNonStyleRef) as? TSCH_ChartAxisNonStyleArchive {
    let props = axisNonStyle.TSCH_Generated_ChartAxisNonStyleArchive_current

    if isValueAxis {
        if props.hasTschchartaxisdefaultusermin, props.tschchartaxisdefaultusermin.hasNumberArchive {
            minimumValue = props.tschchartaxisdefaultusermin.numberArchive
        }
        if props.hasTschchartaxisdefaultusermax, props.tschchartaxisdefaultusermax.hasNumberArchive {
            maximumValue = props.tschchartaxisdefaultusermax.numberArchive
        }
    }
}
```
Series get individual styling, allowing mixed chart types (a bar series and a line series on the same chart):
```swift
for seriesIndex in 0..<seriesCount {
    // ... per-series style resolution elided
}
```

Text Styles

Paragraph styles inherit through their `super` references, so resolving a style starts by collecting its inheritance chain:

```swift
// Function name reconstructed; the body is as extracted
private func styleChain(for style: TSWP_ParagraphStyleArchive?) -> [TSWP_ParagraphStyleArchive] {
    guard let style = style else { return [] }

    var chain: [TSWP_ParagraphStyleArchive] = []
    var current: TSWP_ParagraphStyleArchive? = style

    while let s = current {
        chain.append(s)
        if s.super.hasParent,
           let parentRef = s.super.parent as TSP_Reference?,
           let parent = document.dereference(parentRef) as? TSWP_ParagraphStyleArchive {
            current = parent
        } else {
            current = nil
        }
    }
    return chain.reversed() // Root to leaf
}
```
Properties resolve by walking the chain from root to leaf, with later styles overriding earlier ones:
```swift
for style in chain {
    let props = style.paraProperties
    if props.hasAlignment {
        alignment = StyleConverters.convertTextAlignment(props.alignment)
    }
    if props.hasLeftIndent {
        leftIndent = Double(props.leftIndent)
    }
    // ... more properties
}
```
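The cascade behaves like a sequence of dictionary overrides applied root to leaf. A minimal sketch:

```python
def resolve_properties(chain: list[dict]) -> dict:
    """Walk root-to-leaf; later styles override earlier ones, property by property."""
    resolved = {}
    for style in chain:
        resolved.update(style)  # only the properties each style actually sets
    return resolved
```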
Spatial Information and Reading Order
Every positioned element includes complete spatial information: frame, rotation, z-index, and coordinate space. This matters because iWork documents use an infinite canvas model; elements can be anywhere, overlapping and rotated freely.
When converting to linear formats like Markdown or analyzing document structure, we need to establish a natural reading order. For Keynote slides and Numbers sheets, we sort drawables by spatial position (top-to-bottom, then left-to-right):
```swift
private func sortDrawablesByPosition(_ drawables: [TSP_Reference]) -> [TSP_Reference] {
    return drawables.sorted { refA, refB in
        guard let drawableA = document.dereference(refA),
              let drawableB = document.dereference(refB) else { return false }

        let frameA = parseFrameFromDrawable(drawableA)
        let frameB = parseFrameFromDrawable(drawableB)

        let centerYA = frameA.midY
        let centerYB = frameB.midY
        if centerYA != centerYB {
            return centerYA < centerYB
        }
        return frameA.midX < frameB.midX
    }
}
```
This produces text that reads naturally rather than jumping around the canvas randomly. For Pages documents, we separate inline content (flows with text) from floating content (positioned absolutely), processing them in separate passes.
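The sort reduces to comparing center points. A small sketch of the same ordering over plain (x, y, width, height) frames:

```python
def reading_order(frames):
    """Sort frames by vertical center, then horizontal center (top-to-bottom, left-to-right)."""
    def center(frame):
        x, y, w, h = frame
        return (y + h / 2, x + w / 2)
    return sorted(frames, key=center)
```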
The Visitor Protocol
Rather than exposing the raw protobuf structure, the parser uses a visitor pattern. Implementing the IWorkDocumentVisitor protocol gives you callbacks for each document element in traversal order:
```swift
public protocol IWorkDocumentVisitor {
    func willVisitDocument(type: IWorkDocument.DocumentType, layout: DocumentLayout?, pageSettings: PageSettings?) async
    func willVisitParagraph(style: ParagraphStyle, spatialInfo: SpatialInfo?) async
    func visitInlineElement(_ element: InlineElement) async
    func didVisitParagraph() async
    func willVisitTable(name: String?, rowCount: UInt32, columnCount: UInt32, spatialInfo: SpatialInfo) async
    func visitTableCell(row: Int, column: Int, content: TableCellContent) async
    func didVisitTable() async
    func visitImage(info: ImageInfo, spatialInfo: SpatialInfo, ocrResult: OCRResult?, hyperlink: Hyperlink?) async
    // ...
}
```
All inline elements within a paragraph arrive through a single method in document order:
```swift
func visitInlineElement(_ element: InlineElement) async {
    switch element {
    case .text(let text, let style, let hyperlink):
        // Process text run
    case .image(let info, let spatialInfo, let ocrResult, let hyperlink):
        // Process inline image
    case .footnoteMarker(let footnote):
        // Process footnote reference
    // ...
    }
}
```
This preserves the exact reading order: text before shape, shape content, text after shape. You don’t need to manually reconstruct the sequence from separate arrays and position calculations.
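A visitor that flattens these events into Markdown can be sketched in a few lines. This is a toy with invented event names, not the WorkKit API:

```python
class MarkdownVisitor:
    """Toy event-driven visitor: inline elements arrive in document order."""

    def __init__(self):
        self.paragraphs = []

    def will_visit_paragraph(self):
        self.paragraphs.append([])

    def visit_inline_element(self, kind, payload):
        if kind == "text":
            self.paragraphs[-1].append(payload)
        elif kind == "image":
            self.paragraphs[-1].append(f"![]({payload})")

    def did_visit_paragraph(self):
        pass  # paragraph already closed by the next will_visit_paragraph

    def render(self):
        return "\n\n".join("".join(p) for p in self.paragraphs)
```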
The visitor receives fully resolved styles (inheritance chains already processed), decoded table cells (binary format already parsed), and proper coordinate transformations (mask transforms already calculated). I'm very happy with the end result.
And just to further validate the spatial information, I wrote a naive PDF visitor to export an iWork document to PDF.
Wrapping up
All the code discussed here is available as a Swift package at github.com/6over3/WorkKit. The repository includes the protobuf schema extraction tools, type mapping generators, and the full parser implementation. Documentation covers the visitor protocol and common use cases.
A few areas aren't implemented yet: the legacy (A)XML format, formula evaluation, and some of the more obscure shape types. Pull requests welcome if you need these features. I believe this is the only real parser for iWork to date.