Following up on the Python JIT
Performance of Python programs has been a major focus of development for the language over the last five years or so; the Faster CPython project has been a big part of that effort. One of its subprojects is to add an experimental just-in-time (JIT) compiler to the language; at last year's PyCon US, project member Brandt Bucher gave an introduction to the copy-and-patch JIT compiler. At PyCon US 2025, he followed that up with a talk on "What they don't tell you about building a JIT compiler for CPython" to describe some of the things he wishes he had known when he set out to work on that project. There was something of an elephant in the room, however, in that Microsoft dropped support for the project and laid off most of its Faster CPython team a few days before the talk.
Bucher only alluded to that event in the talk, and elsewhere has made it clear that he intends to continue working on the JIT compiler whatever the fallout. When he gave the talk back in May, he said that he had been working with Python for around eight years, as a core developer for six, as part of the Microsoft CPython performance engineering team for four, and on the JIT compiler for the last two. While the team at Microsoft is often equated with the Faster CPython project, it is really just a part of it; "our team collaborates with lots of people outside of Microsoft".
Faster CPython results
The project has seen some great results over the last few Python releases. Its work first appeared in 2022 as part of Python 3.11, which averaged 25% faster than 3.10, depending on the workload; " no need to change your code, you just upgrade Python and everything works ". In the years since, there have been further improvements: Python 3.12 was 4% faster than 3.11, and 3.13 improved by 7% over 3.12. Python 3.14, which is due in October, will be around 8% faster than its predecessor.
In aggregate, that means Python has gotten nearly 50% faster in less than four years, he said. Around 93% of the benchmarks that the project uses have improved their performance over that time; nearly half (46%) are more than 50% faster, and 20% of the benchmarks are more than 100% faster. Those are not simply micro-benchmarks; they represent real workloads. Pylint has gotten 100% faster, for example.
All of those increases have come without the JIT; they come from all of the other changes that the team has been working on, while " taking a kind of holistic approach to improving Python performance ". Those changes have a meaningful impact on performance and were done in such a way that the community can maintain them. " This is what happens when companies fund Python core development ", he said, " it's a really special thing ". On his slides, that was followed by the crying emoji 😢 accompanied by an uncomfortable laugh.
Moving on, he gave a "duck typing" example that he would refer to throughout the talk. It revolved around a duck simulator that would take an iterator of ducks and "quack" each one, then print the sound. As an additional feature, if a duck has an "echo" attribute that evaluates to true, it would double the sound:
    def simulate_ducks(ducks):
        for duck in ducks:
            sound = duck.quack()
            if duck.echo:
                sound += sound
            print(sound)
That was coupled with two classes that produced different sounds:

    class Duck:
        echo = False

        def quack(self):
            return "Quack!"

    class RubberDuck:
        echo = True

        def __init__(self, loud):
            self.loud = loud

        def quack(self):
            if self.loud:
                return "SQUEAK!"
            return "Squeak!"
He stepped through an example execution of the loop in simulate_ducks(). He showed the bytecode for the stack-based Python virtual machine that was generated by the interpreter and stepped through one iteration of the loop, describing the changes to the stack and to the duck and sound local variables. That process is largely unchanged "since Python was first created".
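For readers who want to see that bytecode for themselves, the standard dis module will display it; this snippet is illustrative rather than something from the talk, and the exact instructions it prints depend on the Python version.

    import dis

    # simulate_ducks() is the function defined above.  dis shows the
    # stack-based bytecode the interpreter executes for it; expect
    # instructions along the lines of FOR_ITER, LOAD_FAST, CALL, and
    # BINARY_OP, with names and layout varying between versions.
    dis.dis(simulate_ducks)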
Specialization
The 3.11 interpreter added specialized bytecode into the mix, where some of the bytecode operations are changed to assume they are using a specific type—chosen based on observing the execution of the code a few times. Python is a dynamic language, so the interpreter always needs to be able to fall back to, say, looking up the proper binary operator for the types. But after running the loop a few times, it can assume that " sound += sound " will be operating on strings so it can switch to a bytecode with a fast path for that explicit operation. " You actually have bytecode that can still handle anything, but has inlined fast paths for the shape of your actual objects and data structures and memory layout. "
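The specialization can be observed from Python itself. As a rough illustration (not from the talk), warming the function up on a list of plain Duck objects and then disassembling it with adaptive=True shows the quickened instructions that the interpreter has chosen; the warm-up thresholds and the specialized instruction names vary between versions.

    import contextlib
    import dis
    import io

    # Duck and simulate_ducks() are as defined above.  Run the function
    # enough times for the adaptive interpreter to specialize it; the
    # printed quacks are discarded just to keep the output quiet.
    ducks = [Duck() for _ in range(16)]
    with contextlib.redirect_stdout(io.StringIO()):
        for _ in range(20):
            simulate_ducks(ducks)

    # adaptive=True (available since Python 3.11) shows the specialized
    # bytecode, such as a string-specific variant of BINARY_OP standing
    # in for the generic "sound += sound" operation.
    dis.dis(simulate_ducks, adaptive=True)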
All of that underlies the JIT compiler, which uses the specialized bytecode interpreter, and can be viewed as being part of the same pipeline, Bucher said. The JIT compiler is not enabled by default in any build of Python, however. As he described in last year's talk, the specialized bytecode instructions get further broken down into micro-ops, which are " smaller units of work within an individual bytecode instruction ". The translation to micro-ops is completely automatic because the bytecodes are defined in terms of them, " so this translation step is machine-generated and very very fast ", he said.
The micro-ops can be optimized; that is basically the whole point of generating them, he said. Observing the different types and values that are being encountered when executing through the micro-ops will show optimizations that can be applied. Some micro-ops can be replaced with more efficient versions, while others can be eliminated because they "are doing work that is entirely redundant and that we can prove we can remove without changing the semantics". He showed a slide full of micro-ops that corresponded to the duck loop and slowly replaced and eliminated something approaching 25% of them, which corresponds to what the 3.14 version of the JIT does.
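As a toy illustration of that kind of redundancy elimination (with made-up micro-op names, not CPython's real ones or its real optimizer), a single pass over a trace can drop a type guard once an earlier guard in the same trace has already proven the type:

    # Deliberately simplified; the real optimizer works on CPython's
    # internal trace representation, not tuples like these.
    def remove_redundant_guards(trace):
        proven_str = set()                 # names already guarded as str
        optimized = []
        for op, name in trace:
            if op == "_GUARD_IS_STR":
                if name in proven_str:
                    continue               # redundant: already proven
                proven_str.add(name)
            optimized.append((op, name))
        return optimized

    trace = [
        ("_GUARD_IS_STR", "sound"),
        ("_ADD_STR", "sound"),
        ("_GUARD_IS_STR", "sound"),        # eliminated by the pass
        ("_ADD_STR", "sound"),
    ]
    print(remove_redundant_guards(trace))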
The JIT will then translate the micro-ops into machine code one-by-one, but it does so using the copy-and-patch mechanism. The machine-code templates for each of the micro-ops are generated at CPython compile time; it is somewhat analogous to the way the micro-ops themselves are generated in a table-driven fashion. Since the templates are not hand-written, fixing bugs in the micro-ops for the rest of the interpreter also fixes them for the JIT; that helps with the maintainability of the JIT, but also helps lower the barrier to entry for working on it, Bucher said.
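The copy-and-patch idea itself can be sketched in a few lines (with an invented template, not one of CPython's generated ones): each micro-op has a pre-built chunk of machine code containing placeholder "holes", and emitting code amounts to copying the template and patching the holes with run-time values such as constants and jump targets.

    import struct

    HOLE = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"    # 8-byte placeholder

    # A hypothetical template; the real ones are produced at CPython
    # build time from the micro-op definitions.  "\x48\xb8" happens to
    # be x86-64 "mov rax, <imm64>".
    TEMPLATES = {
        "_LOAD_CONST": b"\x48\xb8" + HOLE,
    }

    def copy_and_patch(micro_op, value):
        template = TEMPLATES[micro_op]
        code = bytearray(template)
        offset = template.index(HOLE)
        code[offset:offset + 8] = struct.pack("<Q", value)
        return bytes(code)

    print(copy_and_patch("_LOAD_CONST", 42).hex())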
Region selection
With that background out of the way, he moved on to some " interesting parts of working on a JIT compiler " that are often overlooked, starting with region selection. Earlier, he had shown a sequence of micro-ops that needed to be turned into machine code, but he did not describe how that list was generated; " how did we get there in the first place? "
The JIT compiler does not start off with such a sequence; it starts with code like that in his duck simulation. There are several questions that need to be answered about that code based on its run-time activity. The first is: "what do we want to compile?" If something is running only a few times, it is not a good candidate for JIT compilation, but something that is running a lot is. Another question is where it should be compiled: a function can be compiled in isolation, or it can be inlined into its callers and those can be compiled instead.
When should the code be compiled? There is a balance to be struck between compiling things too early, wasting that effort because the code is not actually running all that much, and too late, which may not actually make the program any faster. The final question is "why?", he said; it only makes sense to compile code if it is clear that compiling will make the code more efficient. " If they are using really dynamic code patterns or doing weird things that we don't actually compile well, then it's probably not worth it. "
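A minimal sketch of the "when" decision, using a made-up threshold rather than CPython's actual counters, is simply to count how often a location executes and hand it to the compiler once it crosses a hotness threshold:

    from collections import Counter

    HOT_THRESHOLD = 16          # hypothetical; the real thresholds differ
    counts = Counter()
    compiled = set()

    def maybe_compile(location):
        # "location" might be a (code object, bytecode offset) pair
        counts[location] += 1
        if counts[location] >= HOT_THRESHOLD and location not in compiled:
            compiled.add(location)
            print("compiling a trace starting at", location)

    for _ in range(32):
        maybe_compile(("simulate_ducks", 0))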
One approach that can be taken is to compile entire functions, which is known as "method at a time" or "method JIT". It "maps naturally to the way we think about compilers" because it is the way that many ahead-of-time compilers work. So, when the JIT looks at simulate_ducks(), it can just compile the entire function (the for loop) wholesale, but there are some other opportunities for optimization. If it recognizes that most of the time the loop operates on Duck objects, it can inline that class's quack() method:
    for duck in ducks:
        if duck.__class__ is Duck:
            sound = "Quack!"
        else:
            sound = duck.quack()
        ...
If there are lots of RubberDuck objects too, that class's quack() method could be inlined as well. Likewise, the attribute lookup for duck.echo could be inlined for one or both cases, but that all starts to get somewhat complicated, he said; "it's not always super-easy to reason about, especially for something that is running while you are compiling it".
Meanwhile, what if ducks is not a list, but is instead a generator? In simple cases, with a single yield expression, it is not that much different from the list case, but with multiple yield expressions and loops in the generator, it also becomes hard to reason about. That creates a kind of optimization barrier and that kind of code is not uncommon, especially in asynchronous programming contexts.
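For instance (an illustrative example rather than one from the talk), the ducks might come from a generator with multiple yield expressions and its own loop, which gives the compiler much more to thread its way through than a plain list:

    # Duck and RubberDuck are the classes defined earlier.
    def hatchery(n):
        for i in range(n):
            yield Duck()
            if i % 2:
                yield RubberDuck(loud=False)

    simulate_ducks(hatchery(10))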
Another technique, and the one that is currently used in the CPython JIT, is to use a "tracing JIT" instead of a method JIT. The technique takes linear traces of the program's execution, so it can use that information to make optimization decisions. If the first duck is a Duck, the code can be optimized as it was earlier, with a guard based on the class and inlining the sound assignment. Next up is a lookup for duck.echo, but the code in the guarded branch has perfect type information; it already knows that it is processing a Duck, so it knows echo is false, and that if statement can be removed, leaving:
    for duck in ducks:
        if duck.__class__ is Duck:
            sound = "Quack!"
            print(sound)
This is pretty efficient. If you have just a list of Ducks, you're going to be doing kind of the bare minimum amount of work to actually quack all those ducks.
The code still needs to handle the case where the duck is not a Duck, but it does not need to compile that piece; it can, instead, just send it back to the interpreter if the class guard is false. If the code is also handling RubberDuck objects, though, eventually that else branch will get "hot" because it is being taken frequently.
At that point, the tracing can be turned back on to see what the code is doing. If we assume that it mostly has non-loud RubberDuck objects, the resulting code might look like:
    elif duck.__class__ is RubberDuck:
        if duck.loud:
            ...
        sound = "Squeak!Squeak!"
        print(sound)
    else:
        ...
The two branches that are not specified would simply return to the regular interpreter when they are executed. Since the tracing has perfect type information, it knows that echo is true, so the sound should be doubled, but there is no need to actually use "+=" to get the result. So, now the function has the minimum necessary code to quack either a Duck or a non-loud RubberDuck. If those other branches start getting hot at some point, tracing can once again be used to optimize it further.
One downside of the tracing JIT approach is that it can compile duplicates of the same code, as with "print(sound)". In "very branchy code", Bucher said, "some things near the tail of those traces can be duplicated quite a bit". There are ways to reduce that duplication, but it is a downside to the technique.
Another technique for selecting regions is called "meta tracing", but he did not have time to go into it. He suggested that attendees ask their LLM of choice "about the 'first Futamura projection' and don't misspell it like me, it's not 'Futurama'", which drew some chuckles around the room.
Memory management
JIT compilers " do really weird things with memory ". C programmers are familiar with readable (or read-only) data, such as a const array, and data that is both readable and writable is the normal case. Memory can be dynamically allocated using malloc() , but that kind of memory cannot be executed; since a JIT compiler needs memory that it can read, write, and execute, it requires " the big guns ": mmap() . " If you know the right magic incantation, you can whisper to this thing with all these secret flags and numbers " to get memory that is readable, writable, and executable:
    char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
One caveat is that memory from mmap() comes in page-sized chunks, which is 4KB on most systems but can be larger. If the JIT code is, say, four bytes in length, that can be wasteful, so it needs to be managed carefully.

Once you have that memory, he asked, how do you actually execute it? It turns out that "C lets us do crazy things":

    typedef int (*function)(int);
    ((function)data)(42);

That first line creates a type definition named "function", which is a pointer to a function that takes an integer argument and returns an integer. The second line casts the data pointer to that type and then calls the function with an argument of 42 (and ignores the return value). "It's weird, but it works."
He noted that the term "executable data" should be setting off alarm bells in people's heads; "if you're a Rust programmer, this is what we call 'unsafe code'", he said to laughter. Being able to write to memory that can be executed is "a scary thing; at best you shoot yourself in the foot, at worst it is a major security vulnerability". For this reason, operating systems often require that memory not be in that state. He said that the memory should be mapped readable and writable, then filled in, and switched to readable and executable using mprotect(); if there is a need to modify the data later, it can be switched back and forth between the two states.
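That dance can be sketched from Python using the mmap module and a ctypes call into the C library's mprotect(); this is Unix-only and purely illustrative, since CPython's JIT does all of this in platform-specific C code.

    import ctypes
    import mmap

    libc = ctypes.CDLL(None, use_errno=True)
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]

    # Map one page readable and writable; it is never writable and
    # executable at the same time.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)

    # ... copy the generated machine code into buf here ...

    # Flip the page to readable and executable before running the code;
    # to patch it later, flip it back to readable and writable first.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mprotect(addr, mmap.PAGESIZE,
                     mmap.PROT_READ | mmap.PROT_EXEC):
        raise OSError(ctypes.get_errno(), "mprotect failed")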
Debugging and profiling
When code is being profiled using one of the Python profilers, code that has been compiled should call all of the same profiling hooks. The easiest way to do that, at least for now, is to not JIT code that has profiler hooks installed. In recent versions of Python, profiling is implemented by using the specializing adaptive interpreter to change certain bytecodes to other, instrumented versions of them, which will call the profiler hooks. If the tracing encounters one of these instrumented bytecodes, it can shut the JIT down for that part of the code, but it can still run in other, non-profiled parts of the code.
A related problem occurs when someone enables profiling for code that has already been JIT-compiled. In that case, Python needs to get out of the JIT code as quickly as possible. That is handled by placing special _CHECK_VALIDITY micro-ops just before " known safe points " where it can jump out of the JIT code and back to the interpreter. That micro-op checks a one-bit flag; if it is set, the execution bails out of the JIT code. That bit gets set when profiling is enabled, but it is also used when code executes that could change the JIT optimizations (e.g. a change of class attributes).
Something that just kind of falls out of that is the ability to support " the weirder features of Python debuggers ". The JIT code is created based on what the tracing has seen, but someone running pdb could completely upend that state in various ways (e.g. " duck = Goose() "). The validity bit can be used to avoid problems of that sort as well.
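The bail-out mechanism can be modeled in a few lines of Python (hypothetical code; the real _CHECK_VALIDITY micro-op tests a bit on the executor in C): the compiled trace checks the flag at its safe points and returns control to the interpreter once the flag has been cleared.

    # Duck is the class defined earlier.  The "valid" flag would be
    # cleared when profiling is enabled or when something invalidates
    # the trace's assumptions (a class attribute changing, say).
    class Executor:
        valid = True

    def run_trace(executor, ducks):
        for duck in ducks:
            if not executor.valid:              # known safe point
                return "bail out to the interpreter"
            if duck.__class__ is not Duck:      # class guard from the trace
                return "bail out to the interpreter"
            print("Quack!")
        return "finished in JIT-compiled code"

    print(run_trace(Executor(), [Duck(), Duck()]))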
For native profilers and debuggers, such as perf and GDB, there is a need to unwind the stack through JIT frames, and interact with JIT frames, but " the short answer is that it's really really complicated ". There are lots of tools of this sort, for various platforms, that all work differently and each has its own APIs for registering debug information in different formats. The project members are aware of the problem, but are trying to determine which tools need to be supported and what level of support they actually need.
Looking ahead
The current Python release is 3.13; the JIT can be built into it by using the --enable-experimental-jit flag. For Python 3.14, which is out in beta form and will be released in October, the Windows and macOS builds have the JIT built-in, but it must be enabled by setting PYTHON_JIT=1 in the environment. He does not recommend enabling it for production code, but the team would love to hear about any results from using it: dramatic improvements or slowdowns, bugs, crashes, and so on. Other platforms, or people creating their own binaries, can enable the JIT with the same flag as for 3.13.
For 3.15, which is in a pre-alpha stage at this point, there are two GitHub issues they are focusing on: "Supporting stack unwinding in the JIT compiler" and "Make the JIT thread-safe". The first he had mentioned earlier with regard to support for native debuggers and profilers. The second is important since the free-threaded build of CPython seems to be working out well and is moving toward becoming the default—see PEP 779 ("Criteria for supported status for free-threaded Python"), which was recently accepted by the steering council. The Faster CPython developers think that making the JIT thread-safe can be done without too much trouble; " it's going to take a little bit of work and there's kind of a long tail of figuring out what optimizations are actually still safe to do in a free-threaded environment ". Both of those issues are outside of his domain of expertise, however, so he hoped that others who have those skills would be willing to help out.
In addition, there is a lot of ongoing performance work that is going into the 3.15 branch, of course. He noted, pointedly, that fast progress, especially on larger projects, will depend on the availability of resources. The words on his slide saying that changed to bold and he gave a significant cough to further emphasize the point.
As he wrapped up, he suggested PEP 659 ("Specializing Adaptive Interpreter") and PEP 744 ("JIT Compilation") for further information. For those who would rather watch something, instead of reading about it, he recommended videos of his talks (covered by LWN and linked above) from 2023 on the specializing adaptive interpreter and from 2024 on adding a JIT compiler. The YouTube video of this year's talk is available as well.
[Thanks to the Linux Foundation for its travel sponsorship that allowed me to travel to Pittsburgh for PyCon US.]