
NVIDIA DGX Spark™ + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0

We recently received early access to 2 NVIDIA DGX Spark™ units. NVIDIA calls it the world's smallest AI supercomputer. It has ~100 TFLOPs of FP16 performance with 128GB of CPU-GPU coherent memory at 273 GB/s.

With EXO, we've been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips. The Mac Studio has 512GB of unified memory at 819 GB/s, but the GPU only has ~26 TFLOPs of FP16 performance.

The DGX Spark has roughly 4x the compute; the Mac Studio has 3x the memory bandwidth.
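Both ratios follow directly from the spec-sheet numbers above; a quick sanity check in Python (the exact compute ratio is ~3.8x, rounded to 4x in the text):

```python
# FP16 compute (TFLOPS) and memory bandwidth (GB/s) quoted above for each machine.
dgx_spark = {"tflops_fp16": 100, "bandwidth_gbps": 273}
mac_studio = {"tflops_fp16": 26, "bandwidth_gbps": 819}

# 100 / 26 ≈ 3.8x compute advantage for the DGX Spark (rounded to ~4x in the text).
print(f"compute:   {dgx_spark['tflops_fp16'] / mac_studio['tflops_fp16']:.1f}x in favor of DGX Spark")
# 819 / 273 = 3.0x bandwidth advantage for the Mac Studio.
print(f"bandwidth: {mac_studio['bandwidth_gbps'] / dgx_spark['bandwidth_gbps']:.1f}x in favor of Mac Studio")
```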

What if we combined them? What if we used DGX Spark for what it does best and Mac Studio for what it does best, in the same inference request?
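One way to read "in the same inference request", sketched below purely as an assumption rather than a description of EXO's implementation: hand the compute-heavy prompt-processing (prefill) phase to the DGX Spark and the bandwidth-heavy token-generation (decode) phase to the Mac Studio, passing intermediate state between them. The `SparkPrefill`, `MacDecode`, and `KVCache` names are hypothetical stand-ins for illustration, not EXO's actual API.

```python
"""Illustrative sketch only: the device classes and hand-off below are
hypothetical stand-ins, not EXO's actual API or wire protocol."""

import time
from dataclasses import dataclass


@dataclass
class KVCache:
    """Intermediate attention state handed from the prefill device to the decode device."""
    tokens: list[str]
    state: dict


class SparkPrefill:
    """Stand-in for the compute-bound prompt phase, which favors the DGX Spark's ~100 TFLOPS."""

    def prefill(self, prompt: str) -> KVCache:
        tokens = prompt.split()                 # toy "tokenizer"
        time.sleep(0.01 * len(tokens))          # pretend work scales with prompt length
        return KVCache(tokens=tokens, state={"len": len(tokens)})


class MacDecode:
    """Stand-in for the bandwidth-bound generation phase, which favors the M3 Ultra's 819 GB/s."""

    def decode(self, cache: KVCache, max_new_tokens: int):
        for i in range(max_new_tokens):
            time.sleep(0.02)                    # pretend each step re-reads the model weights
            yield f"<tok{cache.state['len'] + i}>"


def run_request(prompt: str, prefill_device: SparkPrefill, decode_device: MacDecode):
    """One inference request, split across two machines: prefill on one, decode on the other."""
    cache = prefill_device.prefill(prompt)      # compute-bound: send to the FLOPS-rich box
    yield from decode_device.decode(cache, max_new_tokens=5)  # bandwidth-bound: stream from the other


if __name__ == "__main__":
    for token in run_request("explain unified memory", SparkPrefill(), MacDecode()):
        print(token, end=" ")
    print()
```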

[Photo: NVIDIA DGX Spark™ early access units (with quality control supervisor)]
[Photo: Mac Studio M3 Ultra stack used for LLM inference with EXO]

What Determines LLM Inference Performance?

What you see as a user boils down to two numbers:

TTFT (time‑to‑first‑token): delay from sending a prompt to seeing the first token.

TPS (tokens per second): cadence of tokens after the first one appears.
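For concreteness, here is one way to measure both numbers from any streaming generation client. The `fake_stream` helper is a hypothetical stand-in used so the example runs on its own; swap in a real token stream from whatever inference server you use.

```python
import time
from typing import Iterable, Iterator


def measure(stream: Iterable[str]) -> tuple[float, float]:
    """Return (TTFT in seconds, TPS after the first token) for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now                 # TTFT: prompt sent -> first token visible
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    # TPS is the cadence of tokens *after* the first one appears.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps


def fake_stream(n: int = 20) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming client; replace with your own."""
    time.sleep(0.5)                              # pretend prefill takes 0.5 s
    for i in range(n):
        time.sleep(0.05)                         # pretend each decode step takes 50 ms
        yield f"<tok{i}>"


if __name__ == "__main__":
    ttft, tps = measure(fake_stream())
    print(f"TTFT: {ttft:.2f} s, TPS: {tps:.1f} tok/s")
```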
