
Efficient Computer's Electron E1 CPU – 100x more efficient than Arm?


We've been building general-purpose CPUs wrong for decades, apparently. That's the bold claim from Efficient Computer, a new startup stepping into the embedded market with their Electron E1 chip. For too long, our processors have been stuck in a control flow model, constantly shuffling data back and forth between caches, memory, and compute units – a process that, I think we can all agree, burns significant energy at every step. Efficient's approach is to statically schedule the program and let the data flow directly through the hardware: don't buffer, just run. No caches, no out-of-order design – but it's also not a VLIW or DSP design. It's a general-purpose processor.

The Electron E1 is designed as a 'clean sheet processor' boasting a custom Instruction Set Architecture and a 'smart' compiler stack. Now, typically, when someone says they have a smart compiler, it's a red flag – this industry has seen so many instances of smart auto-vectorizing compilers being absolute 💩 for performance. But Efficient Computer claims their chip is built upon a spatial data flow architecture – not just another AI accelerator in disguise, but a "very general purpose CPU" designed specifically for power-constrained systems like those found in the embedded market.

The numbers they're throwing around are, frankly, somewhat insane. Efficient Computer claims up to 100 times better energy efficiency than Arm's best embedded cores. My ears perked up when they told me they now have working silicon, soon to be available to developers. The fact that this is a seed-stage startup with initial funding from Eclipse Ventures suggests it isn't just another academic exercise, even if a number of its executives hail from academic backgrounds.

Thanks for reading More Than Moore!

But the embedded market, let's be honest, isn't known for architectural breakthroughs. Most chips in this space are small, low-power microcontrollers built on designs that are decades old. This is partly due to the "tried and true" philosophy and the need for replaceable parts far into the future - being able to have the same chip in use in 20 years is often an absolute requirement. However, the increasing demand for new features, particularly with the rise of AI, and the ever-present power constraints, are pushing these old designs to their limits. We're seeing more features crammed into smaller devices like robotics and wearables, where power consumption is king. Developers want more performance and local AI inference, but power budgets just aren't changing.

My own Huawei Watch GT 2, for example, from 2019, can still last a week on its battery with the screen on. In most of my posts, I'd say this could be improved with software optimization or shrinking silicon nodes, like a 3nm design. But Efficient Computer, with their E1 chip, is betting that's not enough. They believe the fundamental architecture is what's holding us back.

The Data Flow Paradigm: A Deeper Dive

CPUs, as they stand, spend a disproportionate amount of energy just moving data around – sometimes more than they spend computing with it. Traditional architectures focus on performance, power, or power per operation, often ignoring the data movement overhead. This is precisely the bottleneck in power-constrained embedded systems or systems that run on small custom batteries or cells.

When most people hear "low power chip" or "embedded CPU," they think of an in-order Arm Cortex-M, or perhaps something slightly above it with a few out-of-order elements, paired with enough on-chip memory or some off-chip DRAM. The model is simple: a small processor fetching, decoding, scheduling, executing, and then retiring pipelined instructions step by step, moving data to and from memory as needed.
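To make that conventional model concrete, here's a toy sketch of a control-flow machine: a program counter steps through instructions one at a time, and every operand is explicitly shuffled between memory and registers. The opcodes and the little program are invented for illustration – this doesn't model any real ISA.

```python
# Toy control-flow machine: a program counter steps sequentially through
# instructions, moving data between memory and registers at every step.
# Opcodes and program are invented for illustration; no real ISA is modeled.

def run_control_flow(program, memory):
    regs = {}
    pc = 0  # the centralized program counter
    while pc < len(program):
        op, *args = program[pc]              # fetch + decode
        if op == "load":                     # move data: memory -> register
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "mul":                    # compute in the register file
            dst, a, b = args
            regs[dst] = regs[a] * regs[b]
        elif op == "store":                  # move data: register -> memory
            addr, src = args
            memory[addr] = regs[src]
        pc += 1                              # retire, step to next instruction
    return memory

# mem[2] = mem[0] * mem[1]
mem = [6, 7, 0]
prog = [
    ("load", "r0", 0),
    ("load", "r1", 1),
    ("mul", "r2", "r0", "r1"),
    ("store", 2, "r2"),
]
run_control_flow(prog, mem)  # mem becomes [6, 7, 42]
```

Note how even this trivial multiply spends three of its four instructions on data movement rather than compute – the overhead the data flow camp is targeting.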

Efficient’s architecture, called simply "Fabric," is based on a spatial data flow model. Instead of instructions flowing through a centralized pipeline, the E1 pins instructions to specific compute nodes called tiles and then lets the data flow between them. A node, such as a multiply, processes its operands when all the operand registers for that tile are filled. The result then travels to the next tile where it is needed. There's no program counter, no global scheduler. This native data-flow execution model supposedly cuts a huge amount of the energy overhead typical CPUs waste just moving data.
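The fire-when-operands-ready idea can be sketched in software. The tiles, wiring, and driver loop below are my own invention for illustration – this is not Efficient's actual Fabric, just the general spatial data flow principle: each operator is pinned to a node, and a node fires as soon as all of its operand slots are filled, with no program counter anywhere.

```python
# Minimal sketch of the fire-when-ready data flow idea: operators are pinned
# to tiles, and a tile fires as soon as all its operand slots are filled.
# Tiles and wiring are invented for illustration, not Efficient's Fabric.
import operator

class Tile:
    """One compute node: fires when all of its operand slots are filled."""
    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.slots = [None] * n_inputs   # operand registers for this tile
        self.consumers = []              # (downstream tile, slot) fed by result

    def receive(self, slot, value, in_flight, outputs):
        self.slots[slot] = value
        if all(s is not None for s in self.slots):      # operands ready: fire
            result = self.fn(*self.slots)
            self.slots = [None] * len(self.slots)       # clear for next wave
            if not self.consumers:
                outputs.append(result)                  # fabric edge: emit
            for tile, dst in self.consumers:
                in_flight.append((tile, dst, result))   # result flows onward

def run(tokens):
    """No program counter: just deliver tokens until nothing is in flight."""
    outputs, in_flight = [], list(tokens)
    while in_flight:
        tile, slot, value = in_flight.pop()
        tile.receive(slot, value, in_flight, outputs)
    return outputs

# Wire (a + b) * (c - d) as three tiles; operators stay put, data moves.
add, sub, mul = Tile(operator.add, 2), Tile(operator.sub, 2), Tile(operator.mul, 2)
add.consumers = [(mul, 0)]
sub.consumers = [(mul, 1)]
print(run([(add, 0, 3), (add, 1, 4), (sub, 0, 10), (sub, 1, 2)]))  # [56]
```

The point of the sketch: execution order is driven entirely by operand arrival, so there's nothing to fetch, decode, or schedule at runtime – the "program" is the wiring between tiles, which is exactly what a static compiler can lay down ahead of time.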
