A new post on Apple’s Machine Learning Research blog shows how much the M5 chip improves over the M4 when running a local LLM. Here are the details.
A bit of context
A couple of years ago, Apple released MLX, which the company describes as “an array framework for efficient and flexible machine learning on Apple silicon”.
In practice, MLX is an open-source framework that helps developers build and run machine learning models natively on their Apple silicon Macs, supported by APIs and interfaces that are familiar to the AI world.
Here’s Apple again on MLX:
MLX is an open source array framework that is efficient, flexible, and highly tuned for Apple silicon. You can use MLX for a wide variety of applications ranging from numerical simulations and scientific computing to machine learning. MLX comes with built-in support for neural network training and inference, including text and image generation. MLX makes it easy to generate text with or fine-tune large language models on Apple silicon devices. MLX takes advantage of Apple silicon’s unified memory architecture. Operations in MLX can run on either the CPU or the GPU without needing to move memory around. The API closely follows NumPy and is both familiar and flexible. MLX also has higher level neural net and optimizer packages along with function transformations for automatic differentiation and graph optimization.
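To give a sense of what that NumPy-like, unified-memory model looks like in practice, here is a minimal sketch (the values are arbitrary) of running operations on either the CPU or the GPU without copying data between them:

```python
import mlx.core as mx

# Arrays live in unified memory, so the same buffers are visible to CPU and GPU
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])

# Pick the device per operation; no explicit data transfer is needed
c = mx.add(a, b, stream=mx.gpu)
d = mx.exp(c, stream=mx.cpu)

# MLX evaluates lazily; mx.eval forces the computation
mx.eval(d)
print(d)
```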
One of the MLX packages available today is MLX LM, which is meant for generating text and for fine-tuning language models on Apple silicon Macs.
With MLX LM, developers and users can download most models available on Hugging Face, and run them locally.
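As a rough illustration, downloading a model from Hugging Face and generating text with the MLX LM Python API looks something like the sketch below. The model name is only an example, and parameter names can vary between mlx-lm versions:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Download (or reuse a cached copy of) an MLX-compatible model from Hugging Face
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Run inference entirely on-device
prompt = "Explain unified memory in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```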
This framework even supports quantization, a compression technique that lets large models run in less memory. That in turn speeds up inference, the step in which the model produces a response to a prompt.
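Quantization is typically applied when converting a model to MLX format. A sketch of that step, assuming MLX LM's convert helper and using an illustrative model name and output path:

```python
from mlx_lm import convert

# Fetch a full-precision model from Hugging Face and write a quantized MLX copy
# (model name, output path, and bit width are illustrative; defaults may differ by version)
convert(
    "mistralai/Mistral-7B-Instruct-v0.3",
    mlx_path="mlx_model_4bit",
    quantize=True,
    q_bits=4,
)
```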
M5 vs. M4