Emulating aarch64 in software using JIT compilation and Rust

by Manos Pitsidianakis on 2025-08-25

I was able to write a simple just-in-time compiled emulator for the aarch64 ISA (Arm A-profile A64 Instruction Set Architecture). The Armv8-A/Armv9-A specs are massive in size, so the initial scope is for basic functionality and almost no optional architectural features such as SIMD.

I wrote the emulator as an exercise in understanding how QEMU’s TCG (Tiny Code Generator) software emulation works in principle. I did not follow the C code implementation, but rather implemented the same concepts from scratch in Rust, leveraging other libraries for the heavy lifting (disassembly and JIT compilation).

In this article we’ll go through what is needed to go from a virtual machine’s instructions to native code execution.

Repository: https://github.com/epilys/simulans

    $ cargo run --release -- \
        --memory 4GiB \
        --generate-fdt \
        --entry-point-address 0x40080000 \
        test_kernel.bin
        Finished `release` profile [optimized] target(s) in 0.06s
         Running `target/release/simulans --memory 4GiB --generate-fdt --entry-point-address 0x40080000 test_kernel.bin`
    Hello world!
    Parsed 6 devicetree nodes!
    /: Some(Some("linux,dummy-virt"))
    chosen: None
    memory@0: None
    0x0000000000000000, length Some(4294967296)
    cpus: None
    cpu@0: Some(Some("arm,arm-v8"))
    0x0000000000000000, length None
    psci: Some(Some("arm,psci-0.2"))
    Halting the machine.
    $

Translating an ISA to native code

The emulation is performed in these steps:

1. Disassemble aarch64 binary code using binja
2. Translate each instruction with Cranelift’s JIT backend

Note: QEMU TCG uses its own JIT implementation, as well as its own instruction decoding (see decodetree documentation).
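To make the disassembly step concrete: every A64 instruction is a fixed-width 32-bit little-endian word, so decoding comes down to masking bitfields. The sketch below hand-decodes a single encoding, MOVZ (move wide with zero), purely for illustration. The emulator delegates this work to a disassembler library; the `Movz` type and the `decode_movz` and `movz_value` functions are hypothetical names that do not come from the repository.

```rust
/// Decoded fields of a MOVZ (move wide with zero) instruction.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct Movz {
    /// `true` for a 64-bit (Xn) destination, `false` for a 32-bit (Wn) one.
    pub is_64bit: bool,
    /// Destination register number, 0..=31.
    pub rd: u8,
    /// 16-bit immediate.
    pub imm16: u16,
    /// Left shift applied to the immediate: 0, 16, 32 or 48.
    pub shift: u8,
}

/// Decode one little-endian instruction word.
/// Returns `None` if the word is not a MOVZ.
pub fn decode_movz(bytes: [u8; 4]) -> Option<Movz> {
    let word = u32::from_le_bytes(bytes);
    // Bits 30..=23 are fixed for MOVZ: opc = 0b10 followed by 0b100101.
    if (word >> 23) & 0xff != 0b1010_0101 {
        return None;
    }
    let is_64bit = (word >> 31) == 1;
    let hw = ((word >> 21) & 0b11) as u8;
    // The 32-bit variant only allows shifts of 0 and 16.
    if !is_64bit && hw > 1 {
        return None;
    }
    Some(Movz {
        is_64bit,
        rd: (word & 0x1f) as u8,
        imm16: ((word >> 5) & 0xffff) as u16,
        shift: hw * 16,
    })
}

/// The value MOVZ writes to its destination register.
pub fn movz_value(insn: &Movz) -> u64 {
    u64::from(insn.imm16) << insn.shift
}
```

Feeding it the bytes 49 00 80 52, which is the mov w9, #2 word from the sdiv test input shown in the Testing section, yields destination register 9 and immediate 2. A real decoder has to cover hundreds of encodings, which is why a library or a generated decoder is the practical choice.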
The translation logic performs a big match on the instruction operation and emits (hopefully!) equivalent JIT operations that Cranelift then compiles to native code. It must also appropriately update machine state such as condition flags. Example translation of the bitwise OR instruction:

    fn translate_instruction(
        &mut self,
        instruction: &bad64::Instruction,
    ) -> ControlFlow<Option<Value>> {
        let op = instruction.op();
        macro_rules! unexpected_operand {
            ($other:expr) => {{
                let other = $other;
                panic!("unexpected lhs in {op:?}: {other:?}. Instruction: {instruction:?}")
            }};
        }
        match op {
            Op::ORR => {
                // Bitwise OR
                // This instruction performs a bitwise (inclusive) OR of a register value and an
                // immediate value, and writes the result to the destination register.
                let target = match instruction.operands()[0] {
                    bad64::Operand::Reg {
                        ref reg,
                        arrspec: None,
                    } => *self.reg_to_var(reg, true),
                    other => unexpected_operand!(other),
                };
                let a = self.translate_operand(&instruction.operands()[1]);
                let b = self.translate_operand(&instruction.operands()[2]);
                let value = self.builder.ins().bor(a, b);
                self.builder.def_var(target, value);
            }
            ...
        }
        ControlFlow::Continue(())
    }

The Arm A-profile A64 Instruction Set Architecture specification describes the exact operation of each instruction in detail.
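The remark about condition flags can be illustrated with plain Rust. Flag-setting instructions such as ADDS update the N, Z, C and V bits of PSTATE from their result, and a translated block has to emit the equivalent computation. The sketch below computes them for a 64-bit addition using ordinary integer arithmetic instead of Cranelift IR; `Nzcv` and `adds64` are hypothetical names chosen for illustration and are not taken from the emulator.

```rust
/// The four condition flags held in PSTATE.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Nzcv {
    /// Negative: the result's top bit is set.
    pub n: bool,
    /// Zero: the result is zero.
    pub z: bool,
    /// Carry: the unsigned addition wrapped around.
    pub c: bool,
    /// Overflow: the signed addition overflowed.
    pub v: bool,
}

impl Nzcv {
    /// Pack the flags into bits 31..=28, the layout of the NZCV system register.
    pub fn to_bits(self) -> u64 {
        ((self.n as u64) << 31)
            | ((self.z as u64) << 30)
            | ((self.c as u64) << 29)
            | ((self.v as u64) << 28)
    }
}

/// Emulate a 64-bit ADDS: return the sum and the flags it sets.
pub fn adds64(a: u64, b: u64) -> (u64, Nzcv) {
    // Unsigned wraparound gives the carry flag.
    let (result, carry) = a.overflowing_add(b);
    // Reinterpreting the same bits as signed gives the overflow flag.
    let (_, signed_overflow) = (a as i64).overflowing_add(b as i64);
    let flags = Nzcv {
        n: (result >> 63) == 1,
        z: result == 0,
        c: carry,
        v: signed_overflow,
    };
    (result, flags)
}
```

For instance, adding 1 to u64::MAX wraps to zero and sets Z and C but not V, while adding 1 to i64::MAX sets N and V but not C. In a JIT the same four values would be computed with IR instructions and written back to the machine state in the block epilogue.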
The instructions are organised in translation blocks: at the entry of a block, all aarch64 architectural registers are loaded in JIT variables (the prologue) and when execution does not continue to the next instruction, disassembly must stop, and all registers are updated with their final values (the epilogue). That means that a single translation block can emulate more than one instruction at a time. This is what makes JIT emulation faster than interpreted emulation.

Note: QEMU performs an additional optimization at this point; if the next instruction is already translated, it goes directly to the next block and skips the epilogue. Otherwise, when the next instruction’s block is translated, it patches the previous block to skip the epilogue. This is called direct block chaining and makes things faster, since register state save/load is expensive.

A translation block can call Rust helpers to access memory (including MMIO, that is, device access). These helpers are declared extern "C" so that they can be called from JIT’ed code. Exception handling (WIP) and MMU page table walks would also use Rust helpers.

Devices

An emulator is much more than ISA translation. Interrupt controllers, block devices, flash memory (for firmware) and timers are all complex on their own. For early-stage development we only need a way to print stuff out from the VM. Fortunately, I had written a PL011 (Arm UART peripheral) implementation in Rust for QEMU last year. I copy-pasted its code into my emulator, changed the output to be printed to stdout instead of QEMU’s character backends, and it just worked right away: the perks of writing Rust.

    struct PL011MemoryOps {
        device_id: u64,
        char_backend: Stdout,
        regs: Arc<Mutex<PL011Registers>>,
    }

    impl crate::memory::DeviceMemoryOps for PL011MemoryOps {
        fn id(&self) -> u64 {
            self.device_id
        }

        fn read(&self, offset: u64, width: Width) -> u64 {
            match RegisterOffset::try_from(offset) {
                Err(v) if (0x3f8..0x400).contains(&(v >> 2)) => {
                    let device_id = PL011State::DEVICE_ID;
                    u64::from(device_id[(offset - 0xfe0) >> 2])
                }
                Err(_) => {
                    log::error!("pl011_read: Bad offset 0x{:x} width {:?}", offset, width);
                    0
                }
                Ok(field) => {
                    let result = {
                        let mut regs = self.regs.lock().unwrap();
                        let (update_irq, result) = regs.read(field);
                        let remainder = offset - field as u64;
                        if update_irq {
                            regs.update();
                            drop(regs);
                        }
                        if remainder != 0 {
                            assert!(matches!(width, Width::_32 | Width::_16), "{width:?}");
                        }
                        result
                    };
                    result.into()
                }
            }
        }

        fn write(&self, offset: u64, value: u64, width: Width) {
            if let Ok(field) = RegisterOffset::try_from(offset) {
                let mut char_backend = self.char_backend.lock();
                if field == RegisterOffset::DR {
                    let ch: [u8; 1] = [value as u8];
                    char_backend.write_all(&ch).unwrap();
                    char_backend.flush().unwrap();
                }
                let mut regs = self.regs.lock().unwrap();
                let update_irq = regs.write(field, value as u32, char_backend);
                if update_irq {
                    regs.update();
                }
            } else {
                log::error!("write bad offset 0x{offset:x} value 0x{value:x}");
            }
        }
    }

A UART’s operation is simple: code writes to and reads from memory-mapped UART registers, and this MMIO triggers side effects like configuring the UART or printing characters.
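The register-driven side effects can be shown with a much smaller toy device. The sketch below models only two PL011 registers, the data register (DR, offset 0x000) and the flag register (FR, offset 0x018), and collects transmitted bytes in a buffer instead of printing them. `ToyUart` and its fields are hypothetical and far simpler than the real implementation above: there are no interrupts, no FIFO limits and no configuration registers.

```rust
use std::collections::VecDeque;

/// Data register offset: writes transmit a byte, reads receive one.
const DR: u64 = 0x000;
/// Flag register offset: read-only status bits.
const FR: u64 = 0x018;
/// FR bit 4: receive FIFO empty.
const FR_RXFE: u64 = 1 << 4;
/// FR bit 7: transmit FIFO empty.
const FR_TXFE: u64 = 1 << 7;

/// A toy UART with only DR and FR. Hypothetical type, for illustration only.
#[derive(Default)]
pub struct ToyUart {
    /// Bytes the guest has transmitted so far.
    pub output: Vec<u8>,
    /// Bytes waiting to be received by the guest.
    pub input: VecDeque<u8>,
}

impl ToyUart {
    /// Handle a guest load from `offset` inside the device's MMIO window.
    pub fn read(&mut self, offset: u64) -> u64 {
        match offset {
            // Reading DR pops one received byte (0 if nothing is pending).
            DR => self.input.pop_front().map_or(0, u64::from),
            FR => {
                // Transmission is instantaneous here, so TX is always empty.
                let mut flags = FR_TXFE;
                if self.input.is_empty() {
                    flags |= FR_RXFE;
                }
                flags
            }
            // Unmodelled registers read as zero.
            _ => 0,
        }
    }

    /// Handle a guest store of `value` to `offset`.
    pub fn write(&mut self, offset: u64, value: u64) {
        // Only the low byte of a DR write is transmitted.
        if offset == DR {
            self.output.push(value as u8);
        }
    }
}
```

A guest driver typically polls FR until the transmitter is ready and then stores each character to DR; with this toy model, two such stores of 'H' and 'i' leave b"Hi" in `output`.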
The machine

For simplicity we emulate only one core (Processing Element or PE in Arm terminology). The emulator can provide a memory-mapped region of configurable size as the VM’s RAM. Optionally, it can generate and load a simplistic device tree to the guest.

Note: On Linux, it is a good idea to madvise the memory chunk with MADV_DONTDUMP to prevent it from being included in the core dump when our emulator crashes.

Note 2: On macOS, we need to enable JIT support by calling pthread_jit_write_protect_np as well as pass the flag MAP_JIT to the mmap call.

To execute translation blocks, we keep track of the next program counter to execute. When a block finishes execution, it stores that value to the machine state: we use that to look up the next translation block to execute, which will be either cached or translated on demand.

    #[repr(transparent)]
    #[derive(Clone, Copy)]
    /// An "entry" function for a block.
    ///
    /// It can be either a JIT compiled translation block, or a special emulator
    /// function.
    pub struct Entry(pub extern "C" fn(&mut JitContext, &mut Armv8AMachine) -> Entry);

    /// Lookup [`machine.pc`] in cached entry blocks ([`Armv8AMachine::entry_blocks`]).
    #[no_mangle]
    pub extern "C" fn lookup_entry(context: &mut JitContext, machine: &mut Armv8AMachine) -> Entry {
        let pc: u64 = machine.pc;
        if context.single_step {
            // Do not cache single step blocks
            let (_, next_entry) = context.compile(machine, pc).unwrap();
            return next_entry;
        }
        if let Some(entry) = machine.entry_blocks.get(&pc) {
            log::trace!("lookup entry entry found for 0x{:x}-0x{:x}", pc, entry.0);
            return entry.1;
        }
        log::trace!("generating entry for pc 0x{:x}", pc);
        let (pc_range, next_entry) = context.compile(machine, pc).unwrap();
        machine.entry_blocks.insert(pc_range, next_entry);
        log::trace!("returning generated entry for pc 0x{:x}", pc);
        next_entry
    }

It’s important to invalidate translated blocks when the guest writes to the memory associated with them. In practice, kernels (should) use read-only memory protection for executable memory and don’t have a lot of self-modifying code; among the exceptions for Linux are tracepoints, which require patching specific areas of code. This allows translation block caching to persist even when a kernel schedules userspace processes that might overlap with already cached addresses, because it uses virtual memory and all memory accesses go through the MMU.

Memory reads and writes go through Rust helpers that determine which memory region the memory access refers to:

    /// A flattened memory map of the guest.
    pub struct MemoryMap {
        regions: Vec<MemoryRegion>,
        max_size: MemorySize,
    }

    impl MemoryMap {
        pub fn find_region(&self, addr: Address) -> Option<&MemoryRegion> {
            let index = match self.regions.binary_search_by_key(&addr, |x| x.phys_offset) {
                Ok(x) => Some(x),
                Err(x) if (x > 0 && addr.0 <= self.regions[x - 1].last_addr().0) => Some(x - 1),
                _ => None,
            };
            index.and_then(|x| self.regions.get(x))
        }
    }

The access then goes through the memory region’s specific read or write implementation (different for physical memory and MMIO).
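The region lookup above can be tried out in isolation. The sketch below reproduces the same binary-search idea over plain u64 addresses: an exact hit on a region's start address matches directly, and otherwise the candidate is the region just before the insertion point, provided the address does not lie past its last byte. `Region`, `example_map` and the addresses in it are made up for illustration and are not the emulator's actual types or memory layout.

```rust
/// A contiguous range of guest physical addresses.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct Region {
    pub name: &'static str,
    /// First address covered by the region.
    pub start: u64,
    /// Size in bytes; must be non-zero.
    pub len: u64,
}

impl Region {
    /// Last address covered by the region (inclusive).
    pub fn last_addr(&self) -> u64 {
        self.start + (self.len - 1)
    }
}

/// Find the region containing `addr`.
/// `regions` must be sorted by `start` and must not overlap.
pub fn find_region(regions: &[Region], addr: u64) -> Option<&Region> {
    match regions.binary_search_by_key(&addr, |r| r.start) {
        // `addr` is exactly the first byte of a region.
        Ok(i) => Some(&regions[i]),
        // Otherwise `i` is where `addr` would be inserted, so the only
        // region that can contain it is the one just before.
        Err(i) if i > 0 && addr <= regions[i - 1].last_addr() => Some(&regions[i - 1]),
        // Before the first region, or in a hole between regions.
        _ => None,
    }
}

/// A made-up two-region map: a 4 KiB device window and 4 GiB of RAM.
pub fn example_map() -> Vec<Region> {
    vec![
        Region { name: "uart", start: 0x0900_0000, len: 0x1000 },
        Region { name: "ram", start: 0x4000_0000, len: 4 << 30 },
    ]
}
```

With this map, 0x0900_0fff resolves to the device window, 0x0900_1000 falls into the hole after it and resolves to nothing, and 0x4008_0000 resolves to RAM. Keeping the regions sorted makes every lookup logarithmic in the number of regions, which matters because the lookup sits on the path of every guest load and store.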
Writing a single byte to a resolved region looks like this:

    pub extern "C" fn memory_region_write_8(
        mem_region: &mut MemoryRegion,
        address_inside_region: u64,
        value: u8,
    ) {
        match mem_region.backing {
            MemoryBacking::Mmap(ref mut map @ MmappedMemory { .. }) => {
                let destination =
                    // SAFETY: when resolving the guest address to a memory region, we
                    // essentially performed a bounds check so we know this offset is valid.
                    unsafe { map.as_mut_ptr().add(address_inside_region as usize) };
                // SAFETY: destination is a valid pointer
                unsafe { std::ptr::write_unaligned(destination.cast::<u8>(), value) };
            }
            MemoryBacking::Device(ref ops) => {
                ops.write(
                    address_inside_region,
                    u64::from(value),
                    Width::_8,
                );
            }
        }
    }

Machine state

All register state, as well as Processor State (PSTATE), is stored inside the machine struct. Some register state, such as the current exception level (ELx), affects the operation of instructions. For example, accessing system registers might cause an exception. aarch64 has many registers, and saving and restoring all of this state at translation block boundaries is expensive.

Debugging the guest with GDB

Using the excellent https://github.com/daniel5151/gdbstub Rust library, we can create a GDB server that provides a remote target for GDB to connect to, just like QEMU does. The emulator creates a socket that speaks the GDB Remote Serial Protocol. GDB can connect to it using the target remote path/to/socket command. The GDB server code drives the emulator itself according to the breakpoint / continue / step commands it receives.

Single stepping is implemented by forcing the emulator to limit blocks to one instruction at a time and by not re-using cached blocks that are more than one instruction long.
    $ cargo run -- --gdb-stub-path ./gdb ./test_kernel.bin
    [INFO simulans::gdb] Waiting for a GDB connection on ./gdb...

From another terminal:

    $ gdb-multiarch ./test_kernel
    Reading symbols from ./test_kernel....
    (gdb) target remote ./gdb
    Remote debugging using ./gdb
    0x0000000000000004 in ?? ()
    (gdb) disas $pc,+20
    Dump of assembler code from 0x4 to 0x18:
    => 0x0000000000000004: ldr x0, 0x1c
       0x0000000000000008: mov x1, xzr
       0x000000000000000c: mov x2, xzr
       0x0000000000000010: mov x3, xzr
       0x0000000000000014: ldr x4, 0x24
    End of assembler dump.
    (gdb) stepi
    0x0000000000000004 in ?? ()
    (gdb) stepi
    0x0000000000000008 in ?? ()
    (gdb) disas $pc,+4
    Dump of assembler code from 0x8 to 0x9:
    => 0x0000000000000008: mov x1, xzr
    End of assembler dump.

Debugging the emulator with GDB

This is less useful than seeing the guest execute. We can inspect the generated assembly for JIT compiled translation blocks. Personally I choose to do this on an aarch64 Ampere workstation since I’m more familiar with the aarch64 ISA than with other popular ISAs.

Testing

I used two approaches to test it:

- Unit tests: Under tests/, there are many small functions that create a tiny VM instance, map a few lines of assembly to its memory, run it and check the register state against the expected outcome. This is useful to check the result of standalone instructions. Writing them by hand is the biggest challenge, so I automated part of it in the emulator’s xtask crate.
Example usage:

    $ cat sdiv.S
    sub sp, sp, #0x10
    str w0, [sp, #8]
    ldr w8, [sp, #8]
    mov w9, #2
    sdiv w8, w8, w9
    $ cargo xtask compile-assembly-to-rust-slice sdiv.S
        Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.03s
         Running `xtask/target/debug/xtask compile-assembly-to-rust-slice test_sdiv.s`
    const TEST_INPUT: &[u8] = b"\xff\x43\x0\xd1\xe0\xb\x0\xb9\xe8\xb\x40\xb9\x49\x0\x80\x52\x8\xd\xc9\x1a";

- Running a simple bare metal program: I wrote a very simple binary for the aarch64-unknown-none-softfloat Rust target (which disables use of SIMD/NEON instructions and registers) that prints strings to the UART and also parses the flattened device tree passed through the x0 register.

The test kernel is useful in more than one way. Besides running it directly and seeing if it works as expected, we can run it in parallel with QEMU and observe any differences between the two. I wrote a simple Python script that connects to two remote GDB targets and single steps through them in parallel. After every step, it compares the register state of the two targets and reports any differences. Understandably it’s very slow. However, it helped find a large number of bugs that were difficult to spot otherwise!

What’s next

My eventual goal is of course to boot Linux, so we still need:

- Exception handling as well as switching between Exception Levels (currently work in progress)
- Timer support
- MMU/Virtual memory
- Interrupt controller, likely GICv2 for simplicity
- Incorporating rust-vmm components such as https://github.com/rust-vmm/vm-memory

I’m also particularly interested in finding a nice way to either generate codegen code or at least test cases from the SAIL specification of the Arm ISA, hopefully when I have time.
Resources

- Repository: https://github.com/epilys/simulans
- The emulator’s DEVELOPMENT.md documentation
- See cargo run -- --help for CLI usage information
- Join #simulans on IRC Libera.chat
