A blog series recounting our adventures in the quest to port the BEAM JIT to the 32-bit ARM architecture. This work is made possible thanks to funding from the Erlang Ecosystem Foundation and the ongoing support of its Embedded Working Group.

The Erlang ARM32 JIT is born!

This week we finally achieved our first milestone in developing the ARM32 JIT. We executed our first Erlang function through JITted ARM32 machine code!

shell
~/arm32-jit$ qemu-arm -L /usr/arm-linux-gnueabihf ./otp/RELEASE/erts-15.0/bin/beam.smp -S 1:1 -SDcpu 1:1 -SDio 1 -JDdump true -JMsingle true -- -root /home/arm32-jit/otp/RELEASE -progname erl -home /home
~/arm32-jit$ echo $?
42

The BEAM successfully runs and terminates with exit code 42! That 42 comes from an Erlang function, just-in-time compiled by our ARM32 JIT!

Announcement is done! All code is available at https://github.com/stritzinger/otp/tree/arm32-jit

Keep reading for a lot of interesting details!

The first piece of Erlang code

erlang
-module(hello).
-export([start/2]).

start(_BootMod, _BootArgs) ->
    halt(42, [{flush, false}]).

This is hello.erl, which contains a start/2 function. The function head mimics the erl_init:start/2 function, which is the entry point of the first Erlang process. We replaced erl_init:start/2 with hello:start/2 in the erl_init.c module of the BEAM VM. This way, we forced the runtime to execute this Erlang function.

hello:start/2 is very simple: it just calls erlang:halt/2. This function is a BIF (Built-in Function), implemented in C as part of the BEAM VM. It performs an orderly shutdown of the BEAM and lets us choose the exit code, in this case 42.

(Why {flush, false}? At the time of writing, leaving it as true causes a segmentation fault, hehe.)

Obviously, we need to compile this Erlang module, but I will also generate the BEAM assembly so we can have a look at what we will have to deal with.

erlang
{module, hello}.
%% version = 0
{exports, [{module_info,0},{module_info,1},{start,2}]}.
{attributes, []}.
{labels, 7}.

{function, start, 2, 2}.
  {label,1}.
    {line,[{location,"erts/preloaded/src/hello.erl",74}]}.
    {func_info,{atom,hello},{atom,start},2}.
  {label,2}.
    {move,{literal,[{flush,false}]},{x,1}}.
    {move,{integer,42},{x,0}}.
    {line,[{location,"erts/preloaded/src/hello.erl",76}]}.
    {call_ext_only,2,{extfunc,erlang,halt,2}}.

{function, module_info, 0, 4}.
  {label,3}.
    {line,[]}.
    {func_info,{atom,hello},{atom,module_info},0}.
  {label,4}.
    {move,{atom,hello},{x,0}}.
    {call_ext_only,1,{extfunc,erlang,get_module_info,1}}.

{function, module_info, 1, 6}.
  {label,5}.
    {line,[]}.
    {func_info,{atom,hello},{atom,module_info},1}.
  {label,6}.
    {move,{x,0},{x,1}}.
    {move,{atom,hello},{x,0}}.
    {call_ext_only,2,{extfunc,erlang,get_module_info,2}}.

You can spot the start function and the two standard module_info functions that all Erlang modules have. We do not care much about those right now as we discovered that they are not executed and are not required to work, for now.

We can see that the core of the start function is just two move operations and one call_ext_only. But bear in mind that the BEAM loader will transmute these Generic BEAM Operations into Specific operations. More complexity will pop up!

We are using qemu-arm to emulate Arm32 and we are directly using beam.smp to run the BEAM.

shell
~/arm32-jit$ qemu-arm -L /usr/arm-linux-gnueabihf ./otp/RELEASE/erts-15.0/bin/beam.smp -S 1:1 -SDcpu 1:1 -SDio 1 -JDdump true -JMsingle true -- -root /home/vagrant/arm32-jit/otp/RELEASE -progname erl -home /home/vagrant

JIT initialization

At boot, the BEAM initializes the JIT if enabled. The JIT leverages the AsmJit library to emit all machine code instructions.
Emission of all global shared fragments

There are 90+ code snippets that are shared among all modules. The JIT loads them only once and sets up jumps to them in every other module. It is like a global library for all modules. We skipped most of these because only the shared fragments involved in executing hello:start/2 were needed.

Emission of the erts_beamasm module

As part of the JIT initialization, erts_beamasm is emitted. This is an internal, hardcoded module that exists only when the BEAM is using the JIT. It holds 7 fundamental instructions used to manage the execution of Erlang processes:

run_process - The main process execution entry point
normal_exit - Normal process termination
continue_exit - Continue after exit handling
exception_trace - Exception tracing functionality
return_trace - Return value tracing
return_to_trace - Return to tracing state
call_trace_return - Call tracing return handling

Preloaded modules

The hello.erl module has been compiled and placed as the first and only Erlang module in the list of preloaded modules.

Preloaded modules are fundamental Erlang modules that are always loaded by the BEAM before the first Erlang process can start. They implement, in Erlang, the core features of the Erlang Runtime System (ERTS). The OTP build scripts group all ebin files into a single C header that is then linked into the executable. This makes the Erlang binaries available as a static C array in the BEAM source code. These are then loaded one by one after the BEAM VM is initialized.

Cool, let's nuke all these modules and leave just our hello.erl. It does not need many BEAM instructions and we can easily verify that it executes.

To do the substitution we just need to change this build variable in otp/erts/emulator/Makefile.in

We are running BEAMASM with -JDdump true so asmjit will dump all ARM32 assembly for each module!
This is incredibly useful if monitored while executing with a debugger, as we can see the assembler being printed line by line by our code. shell ~ /arm32-jit$ cat hello.asm L6: .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 # i_flush_stubs # func_line_I # aligned_label_Lt label_1: # i_func_info_IaaI # hello:start/2 blx L8 .byte 0x00, 0x00, 0x00, 0x00 .byte 0x0B, 0x4F, 0x00, 0x00, 0x0B, 0xA4, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00 # aligned_label_Lt start/2: # i_breakpoint_trampoline str lr, [r7, -4 ] ! b L9 bl L11 L9: # i_test_yield adr r2, start/2 subs r9, r9, 1 b.le L13 # i_move_sd ldr r12, [L14] str r12, [r4, 68] # i_move_sd movw r12, 687 str r12, [r4, 64] # line_I # allocate_tt # call_light_bif_be L15: ldr r3, [L16] movw r1, 10188 movt r1, 16432 adr r2, L15 # BIF: erlang:halt/2 sub r12, r7, 4 cmp r10, r12 b.ls L17 udf 48879 L17: movw r12, 12424 add r12, r4, r12 ldr r12, [r12] cmp sp, r12 b.eq L18 udf 57005 L18: bl L20 # deallocate_t movw r0, 64676 movt r0, 16480 blx L22 # return movw r0, 61636 movt r0, 16480 blx L22 # i_flush_stubs # func_line_I # aligned_label_Lt label_3: # i_func_info_IaaI # hello:module_info/0 blx L8 .byte 0x00, 0x00, 0x00, 0x00 .byte 0x0B, 0x4F, 0x00, 0x00, 0x4B, 0x6B, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 # aligned_label_Lt module_info/0: # i_breakpoint_trampoline str lr, [r7, -4 ] ! 
b L23 bl L11 L23: # i_test_yield adr r2, module_info/0 subs r9, r9, 1 b.le L13 # i_move_sd movw r12, 20235 str r12, [r4, 64] # allocate_tt # call_light_bif_be L24: ldr r3, [L25] movw r1, 4772 movt r1, 16425 adr r2, L24 # BIF: erlang:get_module_info/1 sub r12, r7, 4 cmp r10, r12 b.ls L26 udf 48879 L26: movw r12, 12424 add r12, r4, r12 ldr r12, [r12] cmp sp, r12 b.eq L27 udf 57005 L27: bl L20 # deallocate_t movw r0, 64676 movt r0, 16480 blx L22 # return movw r0, 61636 movt r0, 16480 blx L22 # i_flush_stubs # func_line_I # aligned_label_Lt label_5: # i_func_info_IaaI # hello:module_info/1 blx L8 .byte 0x00, 0x00, 0x00, 0x00 .byte 0x0B, 0x4F, 0x00, 0x00, 0x4B, 0x6B, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00 # aligned_label_Lt module_info/1: # i_breakpoint_trampoline str lr, [r7, -4 ] ! b L28 bl L11 L28: # i_test_yield adr r2, module_info/1 subs r9, r9, 1 b.le L13 # i_move_sd ldr r12, [r4, 64] str r12, [r4, 68] # i_move_sd movw r12, 20235 str r12, [r4, 64] # allocate_tt # call_light_bif_be L29: ldr r3, [L30] movw r1, 4868 movt r1, 16425 adr r2, L29 # BIF: erlang:get_module_info/2 sub r12, r7, 4 cmp r10, r12 b.ls L31 udf 48879 L31: movw r12, 12424 add r12, r4, r12 ldr r12, [r12] cmp sp, r12 b.eq L32 udf 57005 L32: bl L20 # deallocate_t movw r0, 64676 movt r0, 16480 blx L22 # return movw r0, 61636 movt r0, 16480 blx L22 # int_code_end L33: movw r0, 18576 movt r0, 16480 blx L22 L13: L12: movw r12, 1968 movt r12, 14656 blx r12 L22: L21: movw r12, 29192 movt r12, 16399 blx r12 L11: L10: movw r12, 1752 movt r12, 14656 blx r12 L20: L19: movw r12, 680 movt r12, 14656 blx r12 L8: L7: movw r12, 1824 movt r12, 14656 blx r12 # Begin stub section L14: .xword 0x000000007FFFFFFF L16: .xword 0x000000007FFFFFFF L25: .xword 0x000000007FFFFFFF L30: .xword 0x000000007FFFFFFF # End stub section L34: .section .rodata {#1} md5: .byte 0x6D, 0xC4, 0x1E, 0xF1, 0x13, 0x1E, 0xBF, 0xF2, 0x4B, 0xF5, 0xC0, 0x41, 0x57, 0x86, 0xDF, 0xD5 .section .text {#0} ; CODE_SIZE: 632 Bear in mind, this assembler is not 
what hello should look like. We are missing a lot of things. You can spot many sequences like:

asm
movw r0, 64676
movt r0, 16480
blx L22 # <---- branch to NYI

This is a call to the nyi (Not Yet Implemented) function, and the argument loaded into r0 is a pointer to a string that contains the name of the BEAM instruction that should have been emitted instead. You can spot many of these since we are only emitting the code needed to reach halt. Everything after that is not important now, as halt will never return!

There are many more comments we could make about the details in this assembler dump, but let's move on.

Jumping into JITted code!

Later in the BEAM initialization the first Erlang process will be allocated and started. We swap the module and function with hello in erts/emulator/beam/erl_init.c

cpp
erl_spawn_system_process(&parent, am_hello, am_start, args, &so);

One BEAM scheduler thread will jump to the process_main function. You can find it here in the source code. This is emitted by our JIT and is the first emitted code that will run. Here we need to handle the scheduling of Erlang processes by calling the BEAM routines that implement the algorithms of Erlang concurrency, like erts_schedule.

erts_schedule will return the pointer to the Process C structure that holds all information about the process that is going to execute. We then load all necessary data into registers and branch to the exact point where the program execution stopped.

The first Erlang function call

In this case we are calling hello:start/2, so the first instruction to execute is apply_only, which does a few things but ends up calling the C apply routine. The routine processes the Module-Function-Arity information to get the address where the function code resides in memory.

What follows is the Erlang function prologue. You can see it in the assembler code section above.
For example, all functions have these instructions in their prologue:

i_breakpoint_trampoline: handles breakpoints for the debugger app
i_test_yield: checks if the function should yield and go back to the scheduler

We have minimal or partial implementations of these since we do not really need them. We have to emit them though, as the generated C++ loader functions of the BEAM expand the Erlang function-call operation into a more specific and complex function prologue sequence.

After that, we added support for the call_light_bif operation that precedes the call to the halt_2 BIF routine. This implementation is also minimal.

Question for later: did you notice that we put a 42 as a number in the code? Numeric constants are printed as decimals in the dump, but we cannot spot any 42!?

After the call, we see two other operations:

deallocate
return

These are just calls to NYI as we will never reach this code! So for now, we can skip them...

Let's roll the JIT!

shell
~/arm32-jit$ qemu-arm -L /usr/arm-linux-gnueabihf ./otp/RELEASE/erts-15.0/bin/beam.smp -S 1:1 -SDcpu 1:1 -SDio 1 -JDdump true -JMsingle true -- -root /home/arm32-jit/otp/RELEASE -progname erl -home /home
~/arm32-jit$

Impressive, the program returns immediately without even saying "Hi" ... and without a Segmentation Fault!! But let's check the program return code!

~/arm32-jit$ echo $?
42

We can safely say that number is not there by accident!

This is a great achievement, as from now on we will be able to add Erlang instructions incrementally. Every Erlang line we add will trigger new opcodes. By emitting them and running the code we will have immediate feedback on everything. The next goal is to extend the hello module until it hosts all possible BEAM instructions!

Hey where is 42???

One interesting thing I spotted looking at the assembly: you cannot find the number 42 in there. Or actually, you can, it is just hidden in plain sight. To understand, you need to know how we are using ARM32 registers.
In particular the register r4, a callee-saved register. We are using it to store the pointer to the ErtsSchedulerRegisters struct. The ErtsSchedulerRegisters struct contains the X register array. When a function is called, X registers are used to store the arguments of the call. This becomes more obvious if we compare the Erlang assembly to the Arm32 assembly.

asm
# i_move_sd <---- {move,{literal,[{flush,false}]},{x,1}}. % List at X[1]
ldr r12, [L14]
str r12, [r4, 68]
# i_move_sd <---- {move,{integer,42},{x,0}}. % 42 at X[0]
movw r12, 687
str r12, [r4, 64]
# line_I
# allocate_tt
# call_light_bif_be
L15:
ldr r3, [L16]
movw r1, 10188
movt r1, 16432
adr r2, L15
# BIF: erlang:halt/2
# ...

42 is stored at r4 + 64.
r4: pointer to the ErtsSchedulerRegisters struct
64: base offset from the beginning of the struct to the beginning of the x_reg_array

The list is stored at r4 + 68.
68: the base offset + the size of one Eterm (4 bytes on ARM32)

But why in assembly do we see 687 and not 42? Converting both numbers to hex we get:

42 -> 2A
687 -> 2AF !!

Yep, this is an example of a Tagged Value. If we consult the BEAM book we can learn about the Tagging Scheme:

00 11 Pid
01 11 Port
10 11 Immediate 2
11 11 Small integer