ONNX Runtime & CoreML May Silently Convert Your Model to FP16 (And How to Stop It)

TLDR

Running an ONNX model in ONNX Runtime (ORT) with the CoreMLExecutionProvider may silently change the predictions your model makes, and you may observe differences compared with running the model in PyTorch on MPS or in ONNX on CPU. This is because the default arguments ORT uses when converting your model to CoreML cast the model to FP16. The fix is to use the following setup when creating the InferenceSession:

import onnxruntime as ort

ort_session = ort.InferenceSession(
    onnx_model_path,
    providers=[("CoreMLExecutionProvider", {"ModelFormat": "MLProgram"})],
)

This ensures your model remains in FP32 when running on a Mac GPU.
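As a quick sanity check (my addition, not code from the original post), you can compare the CoreML execution provider's output against the CPU provider after applying the fix; the model path, input shape and tolerance below are placeholders you would swap for your own:

import numpy as np
import onnxruntime as ort

# Hypothetical model path and input shape, for illustration only.
onnx_model_path = "model.onnx"
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

cpu_session = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])
coreml_session = ort.InferenceSession(
    onnx_model_path,
    providers=[("CoreMLExecutionProvider", {"ModelFormat": "MLProgram"})],
)

input_name = cpu_session.get_inputs()[0].name
cpu_out = cpu_session.run(None, {input_name: dummy_input})[0]
coreml_out = coreml_session.run(None, {input_name: dummy_input})[0]

# With FP32 kept end to end, the difference should sit around 1e-7, not 1e-4.
print("max abs diff:", np.abs(cpu_out - coreml_out).max())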

Uncovering an Issue in ONNX Runtime - Benchmarking the EyesOff Model

Having trained the EyesOff model, I began evaluating the model and its runtime. I was looking into the ONNX format and using it to run the model efficiently. I set up a little test bench in which I ran the model using PyTorch and using ONNX with ONNX Runtime (ORT), each on both MPS and CPU. While checking the outputs, I noticed that the metrics from the model run with ONNX on MPS differed from those on ONNX CPU and on PyTorch CPU and MPS. Note that the metrics from PyTorch on CPU and MPS were the same. When I say ORT and MPS, this is achieved through ORT's execution providers: to run an ONNX model on the Mac GPU you have to use the CoreMLExecutionProvider (more on this to come).

Now in Figures 1 and 2, observe the metric values - the PyTorch ones (Figure 1) are the same across CPU and MPS, but that isn't the case for ONNX (Figure 2):

Figure 1 - PyTorch CPU & MPS metrics output
Figure 2 - ORT CPU & MPS metrics output

Wow, look at the diff in Figure 2! When I saw this it was quite concerning. Floating point math can lead to differences between calculations carried out on the GPU and the CPU, but the values here don't appear to be a result of floating point math - the differences are too large. Given the difference in metrics, I was worried that running the model with ORT was changing the output of the model, and hence its behaviour.

The reason the metrics change is that some of the model predictions around the threshold (which is 0.5) flipped to the opposite side of it. This can be seen in the confusion matrices for the ONNX CPU run and the MPS run:

FP32 Confusion Matrix

                  Predicted Negative   Predicted Positive
Actual Negative   207 (TN)             24 (FP)
Actual Positive   69 (FN)              164 (TP)

FP16 Confusion Matrix

                  Predicted Negative   Predicted Positive
Actual Negative   206 (TN)             25 (FP)
Actual Positive   68 (FN)              165 (TP)

So two predictions flipped from negative to positive. Having said that, the first thing I did was to make my life easier by simplifying the scenario from the large EyesOff model to a simple one layer MLP, and using that to run the experiments.
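To see which predictions flip, a small sketch like the following (my own illustration, not from the post) compares per-sample predictions from two runs and counts the label changes at the 0.5 threshold; prob_cpu and prob_mps stand in for the predicted probabilities from the ORT CPU and CoreML runs:

import numpy as np

# Hypothetical probabilities from the two runs; the first sample sits just under the threshold.
prob_cpu = np.array([0.49998, 0.73, 0.12, 0.501], dtype=np.float32)
prob_mps = np.array([0.50000, 0.73, 0.12, 0.501], dtype=np.float32)

pred_cpu = (prob_cpu >= 0.5).astype(int)
pred_mps = (prob_mps >= 0.5).astype(int)

flipped = np.where(pred_cpu != pred_mps)[0]
print("flipped sample indices:", flipped)
print("flipped from -> to:", list(zip(pred_cpu[flipped], pred_mps[flipped])))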

Why am I Using ONNX and ONNX Runtime?

Before going on it's worth discussing what ONNX and ORT are, and why I'm even using them in the first place.

ONNX [1]

ONNX stands for Open Neural Network Exchange. It can be thought of as a common programming language in which to describe ML models. Under the hood, ONNX models are represented as graphs; these graphs outline the computation path of a model, showing the operators and transformations required to get from input to prediction. These graphs are called ONNX graphs. The use of a common language to describe models makes deployment easier and in some cases adds efficiency in terms of resource usage or inference speed. Firstly, the ONNX graph itself can be optimised. Take PyTorch for example: you train the model in it, and sure, PyTorch is very mature and extremely optimised, but it's such a large package that some things can be overlooked or difficult to change. By converting the model to ONNX, you take advantage of the fact that ONNX was built specifically with inference in mind, and with that come optimisations which the PyTorch team could implement but have not yet. Furthermore, ONNX models can be run cross-platform in specialised runtimes. These runtimes are optimised for different architectures and add another layer of efficiency gains.

ONNX Runtime (ORT) [2]

ORT is one of these runtimes. ORT actually runs the model; it can be thought of as an interpreter: it takes the ONNX graph, implements the operators and runs them on the specified hardware. ORT has a lot of magic built into it - the operators are extremely optimised, and through the use of execution providers they target a wide range of hardware. Each execution provider is optimised for the specific hardware it refers to, which enables the ORT team to implement extremely efficient operators, giving us another efficiency gain.

CoreML [3]

As mentioned before, I used the CoreMLExecutionProvider to run the model on a Mac GPU. This execution provider tells ORT to make use of CoreML. CoreML is an Apple-developed framework which lets models (neural networks and classical ML models) run on Apple hardware: CPU, GPU and ANE. ORT's job in this phase is to take the ONNX graph and convert it to a CoreML model. CoreML is Apple's answer to running efficient on-device models on Apple hardware.

Note that none of this always means the model will run faster. Some models may run faster in PyTorch, TensorRT or any other framework. This is why it is important to benchmark and test as many approaches as makes sense.
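To make the workflow concrete, here is a minimal sketch (my addition, not the post's code) of exporting a toy PyTorch model to ONNX and running it with ORT on the CPU execution provider; the model, file name and shapes are placeholders:

import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# A toy model standing in for the real EyesOff model.
model = nn.Sequential(nn.Linear(4, 3))
model.eval()

dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "toy_model.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported graph with ONNX Runtime on the CPU execution provider.
session = ort.InferenceSession("toy_model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(None, {"input": dummy_input.numpy()})[0]

# Compare against PyTorch to confirm the export is faithful.
with torch.no_grad():
    torch_out = model(dummy_input).numpy()
print("max abs diff vs PyTorch:", np.abs(onnx_out - torch_out).max())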

Finding the Source of the CPU vs MPS Difference - With an MLP

The MLP used is very simple: it has a single layer, with 4 inputs, 3 outputs and the bias turned off. So I pretty much created a fancy matrix multiplication. To understand where the issue was coming from I ran this MLP through some different setups:

- PyTorch CPU
- PyTorch MPS
- ORT CPU
- ORT MPS
- CoreML FP32
- CoreML FP16

The goal of this exercise is to find out 1 - whether the difference in outputs is seen in a simple model, and 2 - where exactly the issue arises.

Before showing the full results, I want to explain why I included the CoreML FP16 and FP32 runs - specifically why the FP16 one. When I initially ran the MLP experiment I only ran PyTorch, ORT and CoreML FP32, but the output numbers of ORT MPS looked like FP16 numbers. So I tested whether they were, and also whether the outputs from the other runs were true FP32 numbers. You can do this with a "round trip" test, by converting a number to FP16 and back to FP32. If after this process the number is unchanged then it is an FP16 number, but if it changes then it was a true FP32 number. The number changes because FP16 can represent fewer floating point numbers than FP32. It's a very simple check to carry out:

import numpy as np

onnx_cpu = np.array([0.6480752, -0.34015813, 1.4329923], dtype=np.float32)
# We cast the ORT MPS numbers up to FP32; if they were FP16 this has no effect
onnx_coreml = np.array([0.6484375, -0.34033203, 1.4326172], dtype=np.float32)

cpu_roundtrip = onnx_cpu.astype(np.float16).astype(np.float32)
coreml_roundtrip = onnx_coreml.astype(np.float16).astype(np.float32)

print("ORT CPU values:")
print("  Original:", onnx_cpu)
print("  fp16 roundtrip:", cpu_roundtrip)
print("  Changed?", not np.allclose(onnx_cpu, cpu_roundtrip, atol=0))

print("\nORT CoreML values:")
print("  Original:", onnx_coreml)
print("  fp16 roundtrip:", coreml_roundtrip)
print("  Changed?", not np.allclose(onnx_coreml, coreml_roundtrip, atol=0))

The output of this is:

Figure 3 - Roundtrip FP16 Test

The CPU values change and the MPS values don't! Now it's getting interesting - perhaps when using the CoreML execution provider the output is FP16? This prompted adding the CoreML direct run in FP16 precision, and I tested the theory with an experiment. Originally the benchmarking was all about inference speed; now it's about floating point precision and figuring out where the diffs come from. Running on PyTorch CPU and MPS gives a strong baseline - PyTorch is a very mature ecosystem and I used its results as my ground truth. The PyTorch results being so close together across devices is what drove me to understand what caused the ORT runs on different hardware to differ. Using CoreML FP32 and FP16 directly aimed to show whether the issue was an ONNX one or a CoreML one. Check Figure 4 for the outputs and Figure 5 for the differences between them:

Figure 4 - MLP Output in Different Scenarios
Figure 5 - MLP Output Diffs

Wow, would you look at that - once again PyTorch and ORT CPU match, and so do PyTorch CPU and CoreML FP32. Also note that CoreML FP16 and ORT MPS match! This is a big insight into what is happening and why the metrics output differed before. Along with the round trip experiment, this proves our model is being run in FP16 when using the CoreML execution provider in ORT!

A Refresher on Floating Points (FP)

Floating point numbers are defined by three values:

Sign: 1 bit to define if the number is positive or negative
Significand: Contains the number's digits
Exponent: Says where the decimal point should be placed relative to the beginning of the significand

Floating point numbers are often expressed in scientific notation, e.g.:

Figure 8 - Table showing Significand, Exponent and scientific representation [6]

FP16 and FP32 specifically have the following specification:

Format             Total bits   Significand bits   Exponent bits   Smallest number       Largest number
Single precision   32           23 + 1 sign        8               \(1.2 * 10^{-38}\)    \(3.4 * 10^{38}\)
Half precision     16           10 + 1 sign        5               \(5.96 * 10^{-8}\)    \(6.55 * 10^{4}\)

As FP16 is half the size, it affords a couple of benefits: it requires half the memory to store, and it can be quicker to do computations with. However, this comes at a cost of precision - FP16 cannot represent very small numbers, or the distances between small numbers, as accurately as FP32.
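You can read these limits straight out of NumPy, which is a handy way to double-check the table above (this check is my addition, not from the post):

import numpy as np

for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    # max is the largest representable value; tiny is the smallest *normal* positive value,
    # and smallest_subnormal is the smallest representable positive value overall.
    print(dtype.__name__, "max:", info.max, "tiny:", info.tiny,
          "smallest subnormal:", info.smallest_subnormal)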

An example of FP16 vs FP32 - The Largest Number Below 1

FP32: 0.999999940395355225 [4]
FP16: 0.99951172 [5]

As you can see, FP32 can represent a value much closer to 1.

The Link to the ONNX Issue

Having said all that, going back to the issue at hand: we observe a ~\(1.17 * 10^{-7}\) error between PyTorch and CoreML FP32, which is typical of FP32. But ORT and CoreML, when run on MPS, have a difference of ~\(3.7 * 10^{-4}\), which is much more representative of FP16 - this is what prompted the round trip experiment.

So Why Do the Predictions Flip - A Slightly Deeper Look Into FP16 Values

If you need a quick refresher on FP values, see the refresher above. If you already read that, or you know enough about FP already, let's look at why some predictions flip. In my model the base threshold for a 0 or 1 class is 0.5. Both FP16 and FP32 can represent 0.5 exactly:

fp_32_05 = np.array([0.5], dtype=np.float32)
fp_16_05 = np.array([0.5], dtype=np.float16)
fp_32_05.item(), fp_16_05.item()
# (0.5, 0.5)

But we know that FP representations cannot represent every single number, so there will be some values around 0.5 which cannot be represented and hence will get rounded either up or down. Let's look into that and find the threshold; this will show why some predictions of the EyesOff model flipped when the model was run in FP16. Also, note that by flipped we mean they go from a negative (0) prediction to a positive (1) class prediction - the rounding means a value would have to be below 0.5 and then be rounded up to cross the threshold boundary. Any other scenario would keep the label the same, i.e. if it's above 0.5 and gets rounded to 0.5, that's fine as the predicted class is still the same. The first step is to find the next representable number below 0.5:

# Show the representable values just below 0.5
fp32_below = np.nextafter(np.float32(0.5), np.float32(0.0))
fp16_below = np.nextafter(np.float16(0.5), np.float16(0.0))

fp32_gap = 0.5 - fp32_below
fp16_gap = 0.5 - fp16_below

print(f"\nClosest value BELOW 0.5:")
print(f"FP32: {fp32_below:.20f}")
print(f"FP16: {fp16_below:.20f}")

... continue reading