A note on “we”: Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively — human intuition driving the exploration, AI reasoning through the data and writing the analysis. We think this kind of human–AI collaboration is a new and natural way to do systems research: one partner as the architect with intuition, the other as the engineer writing the code and crafting experiments.
This whole thing started with a simple question: can you train a model on Apple’s Neural Engine?
Apple doesn’t want you to know the answer. They don’t publish the ANE’s ISA. They don’t document its internal architecture. They don’t even give you a way to program it directly — everything goes through CoreML, which adds layers of abstraction, optimization passes, and overhead that make it nearly impossible to understand what the hardware is actually doing.
So we reverse-engineered it.
Over several days, we mapped the entire software stack from CoreML down to the IOKit kernel driver, discovered how to compile and execute programs on the ANE without CoreML, cracked the binary format, measured the true peak performance (spoiler: Apple’s “38 TOPS” number is misleading), and ultimately got a neural network training on a chip designed exclusively for inference.
This is Part 1 of a three-part series. Here we cover the reverse engineering — how we peeled back the layers to understand what the M4 Neural Engine actually is and how to talk to it directly.
What is the Neural Engine?
The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine — a fixed-function accelerator that takes a compiled neural network graph and executes the entire thing as one atomic operation. You don’t issue individual multiply-accumulate instructions. You submit a compiled program describing an entire computation graph, and the hardware executes it end-to-end.
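To make the contrast concrete, here is a toy sketch of the two programming models. This is not an ANE API — Apple doesn’t expose one, and every name below is invented purely for illustration: a GPU/CPU-style path where the host issues operations one at a time, versus a graph-engine path where the whole network is compiled once and submitted as a single atomic request.

```python
# Toy contrast between per-op dispatch and whole-graph submission.
# All names are hypothetical -- these are NOT real ANE or CoreML APIs.

def eager_relu_linear(x, w, b):
    """GPU/CPU style: the host drives each operation individually."""
    y = [sum(xi * wij for xi, wij in zip(x, row)) + bi  # one MAC loop per output
         for row, bi in zip(w, b)]
    return [max(0.0, yi) for yi in y]                   # separate activation op

def compile_graph(w, b):
    """ANE style: compile the WHOLE graph up front; the result is an
    opaque program the host can only submit, not steer mid-flight."""
    def program(x):  # stands in for the compiled binary blob
        return eager_relu_linear(x, w, b)
    return program

program = compile_graph(w=[[1.0, -1.0], [0.5, 0.5]], b=[0.0, -1.0])
out = program([2.0, 1.0])   # one submission executes the full graph end-to-end
print(out)                  # -> [1.0, 0.5]
```

The point of the second path is the one that matters for the rest of this series: once the graph is submitted, there is no per-instruction control — which is exactly why training (which normally wants fine-grained control) is an awkward fit for this hardware.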
Apple introduced the Neural Engine in the A11 (2017) as a 2-core design, and each generation since has scaled it up.
The M4’s ANE (codename H16G) is what we’re working with: 16 cores, a queue depth of 127 evaluation requests, independent DVFS (dynamic voltage/frequency scaling), and hard power gating that drops it to exactly 0 milliwatts when idle.