How AI on Microcontrollers Actually Works: Operators and Kernels

The buzz around “edge AI”, a term that means something slightly different to almost everyone you talk to, has long since reached a fever pitch. Regardless of what edge AI means to you, the one commonality is typically that the hardware on which inference is performed is constrained in one or more dimensions, whether that be compute, memory, or network bandwidth. Perhaps the most constrained of these platforms are microcontrollers.

I have found that, while there is much discourse around “running AI” (i.e. performing inference) on microcontrollers, there is a general lack of information about what these systems are actually capable of, and how new hardware advancements impact that equation. It is my hope with this series to peel back some of the layers of terminology and explore what actually happens between supplying inputs to a model and receiving outputs. Along the way, we’ll ground our exploration in performing inference with real models on real constrained hardware.

While “weights” get the majority of the attention with AI models, they alone are not sufficient for performing inference. Depending on how a model is distributed and what runtime is used, additional data or metadata may be supplied alongside the model, or may be defined explicitly in software that interacts with the weights. The most popular runtime for microcontrollers is Tensorflow Lite for Microcontrollers (tflite-micro), which is an optimized version of Tensorflow Lite.

Note: Google recently rebranded Tensorflow Lite to LiteRT, and tflite-micro to LiteRT for Microcontrollers.

tflite-micro uses the .tflite file format, which encodes data using FlatBuffers. Unlike some other model file formats, .tflite files include not only the tensors that encapsulate model weights, but also the computation graph, which tells the runtime which operations to use when performing inference. For that to work, there needs to be a defined set of operators. This is somewhat analogous to the instructions defined in an instruction set architecture (ISA) for a processor: a compiler takes a higher-level programming language and maps its behavior onto the instructions available in the ISA. Tensorflow supports an extensive set of built-in operators, while Tensorflow Lite, and thus tflite-micro, supports only a subset.
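
To make this concrete, the operator list can be read straight back out of a .tflite file with the schema-generated C++ accessors that ship with Tensorflow Lite. The sketch below is my own illustration rather than code from the article, and it assumes the flatbuffer has already been loaded into memory (exact accessor names can vary slightly between schema versions):

#include <cstdio>

#include "tensorflow/lite/schema/schema_generated.h"

// Walk the operator codes recorded in a .tflite flatbuffer, e.g. one that has
// been compiled into the firmware as a C array.
void ListOperators(const void* model_data) {
  const tflite::Model* model = tflite::GetModel(model_data);
  const auto* op_codes = model->operator_codes();
  for (unsigned int i = 0; i < op_codes->size(); ++i) {
    // EnumNameBuiltinOperator maps the enum value to a readable name, e.g. "ADD".
    printf("operator %u: %s\n", i,
           tflite::EnumNameBuiltinOperator(op_codes->Get(i)->builtin_code()));
  }
}

Every operator listed this way must have a matching kernel available in the runtime, or inference cannot proceed.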

Continuing the analogy, many processors implement specific versions of the ARM architecture, but that doesn’t mean that processors implementing the same ISA are equivalent. Every supported instruction has to be implemented in hardware, and decisions about how the processor is designed can impact performance along multiple dimensions. Similarly, while Tensorflow Lite defines a set of operators, the implementations of those operators, which are referred to as kernels, may vary. Kernels are implemented in software, but depending on the underlying hardware, a kernel might take many instructions to execute, or may be optimized to leverage dedicated hardware support.
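
As a rough illustration (an assumed application snippet, not code from the article), the op resolver is where this operator-to-kernel mapping happens: the application registers a kernel for each operator that appears in the model’s graph, and which implementation that registration pulls in, the portable reference kernel or an optimized one such as CMSIS-NN, is determined by how tflite-micro was built:

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// The op resolver maps operators named in the model's graph to the kernels
// compiled into this binary. The template argument is the number of operators
// the application expects to register.
tflite::MicroMutableOpResolver<2> op_resolver;

void RegisterKernels() {
  op_resolver.AddAdd();             // ADD operator -> the add kernel
  op_resolver.AddFullyConnected();  // FULLY_CONNECTED -> its kernel
}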

A simple example is the addition operator (TFL::AddOp). We’ll cover how operators and kernels are registered and invoked in a future post, but let’s start by taking a look at the default tflite-micro addition operator logic.

tensorflow/lite/micro/kernels/add.cc

TfLiteStatus AddEval(TfLiteContext* context, TfLiteNode* node) {
  auto* params = reinterpret_cast<TfLiteAddParams*>(node->builtin_data);
  TFLITE_DCHECK(node->user_data != nullptr);
  const OpDataAdd* data = static_cast<const OpDataAdd*>(node->user_data);

  const TfLiteEvalTensor* input1 =
      tflite::micro::GetEvalInput(context, node, kAddInputTensor1);
  const TfLiteEvalTensor* input2 =
      tflite::micro::GetEvalInput(context, node, kAddInputTensor2);
  TfLiteEvalTensor* output =
      tflite::micro::GetEvalOutput(context, node, kAddOutputTensor);

  if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
    TF_LITE_ENSURE_OK(
        context, EvalAdd(context, node, params, data, input1, input2, output));
  } else if (output->type == kTfLiteInt8 || output->type == kTfLiteInt16) {
    TF_LITE_ENSURE_OK(context, EvalAddQuantized(context, node, params, data,
                                                input1, input2, output));
  } else {
    MicroPrintf("Type %s (%d) not supported.",
                TfLiteTypeGetName(output->type), output->type);
    return kTfLiteError;
  }

  return kTfLiteOk;
}

TFLMRegistration Register_ADD() {
  return tflite::micro::RegisterOp(AddInit, AddPrepare, AddEval);
}

As can be observed in AddEval(), the type of output we are expecting determines which implementation of the operator is invoked. To illustrate how the underlying hardware impacts performance, let’s focus on the case in which we expect kTfLiteInt8 (signed 8-bit integer) or kTfLiteInt16 (signed 16-bit integer) output, meaning that we’ll call EvalAddQuantized().
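
To give a feel for what EvalAddQuantized() has to do, here is a deliberately naive sketch of quantized addition. It is my own illustration rather than the tflite-micro kernel, which performs the same rescaling with integer-only fixed-point arithmetic instead of floats. Each int8 tensor represents real values as scale * (q - zero_point), so the inputs have to be brought to a common scale, summed, and requantized into the output’s scale and zero point:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Naive element-wise quantized add: dequantize, add, requantize, saturate.
// The production kernel avoids floating point entirely, which is exactly
// where per-platform optimization and hardware support enter the picture.
void NaiveAddInt8(const int8_t* in1, float scale1, int zp1,
                  const int8_t* in2, float scale2, int zp2,
                  int8_t* out, float out_scale, int out_zp, int n) {
  for (int i = 0; i < n; ++i) {
    const float a = scale1 * (in1[i] - zp1);
    const float b = scale2 * (in2[i] - zp2);
    const int q = static_cast<int>(std::lround((a + b) / out_scale)) + out_zp;
    out[i] = static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }
}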
