In 2021 I found a huge memory leak in VS Code, totalling around 64 GB when I first saw it, but with no actual limit on how high it could go. I found this leak despite two obstacles that should have made the discovery impossible:

- The memory leak didn't show up in Task Manager – there was no process whose memory consumption was increasing.
- I had never used VS Code. In fact, I have still never used it.
So how did this work? How did I find an invisible memory leak in a tool that I have never used?
This was during lockdown and my whole team was working from home. To maintain connections between teammates, and to keep transferring knowledge from senior developers to junior developers, we were doing regular pair-programming sessions. I was watching a coworker use VS Code for… I don't remember what… and I noticed something strange.
So many of my blog posts start this way. “This doesn’t look right”, or “huh – that’s weird”, or some variation on that theme. In this case I noticed that the process IDs on her system had seven digits.
That was it. And as soon as I saw that I knew that there was a process-handle leak on her system and I was pretty sure that I would find it. Honestly, the rest of this story is pretty boring because it was so easy.
You see, Windows process IDs are just numbers. For obscure technical reasons they are always multiples of four. When a process goes away its ID is eligible for reuse immediately. Even if there is a delay before the process ID (PID) is reused there is no reason for the highest PID to be much more than four times the maximum number of processes that were running at one time. If we assume a system with 2,000 processes running (according to pslist my system currently has 261) then PIDs should be four decimal digits. Five decimal digits would be peculiar. But seven decimal digits? That implies at least a quarter-million processes. The PIDs I was seeing on her system were mostly around four million, which implies a million processes. Nope. I do not believe that there were that many processes.
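If you want to run this sanity check on your own system, the highest PID in a process listing is easy to compute. Here's a minimal sketch (my illustration, not anything from the original investigation) using EnumProcesses:

```cpp
#include <windows.h>
#include <psapi.h>  // Link with psapi.lib.
#include <stdio.h>

// Enumerate all running PIDs and report the highest one. Since PIDs are
// multiples of four, the maximum should be within a small multiple of
// 4x the process count. A seven-digit maximum suggests retained PIDs.
int main() {
  DWORD pids[16384];
  DWORD bytes_returned = 0;
  if (!EnumProcesses(pids, sizeof(pids), &bytes_returned)) {
    printf("EnumProcesses failed: %lu\n", GetLastError());
    return 1;
  }
  DWORD count = bytes_returned / sizeof(DWORD);
  DWORD max_pid = 0;
  for (DWORD i = 0; i < count; ++i) {
    if (pids[i] > max_pid)
      max_pid = pids[i];
  }
  printf("%lu processes, highest PID is %lu\n", count, max_pid);
  return 0;
}
```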
It turns out that “when a process goes away its ID is eligible for reuse” is not quite right. If somebody still has a handle to that process then its PID will be retained by the OS. Forever. So it was quite obvious what was happening. Somebody was getting a handle to processes and then wasn’t closing them. It was a handle leak.
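The retention is easy to demonstrate. Here's a small sketch (again mine, not from the bug report): launch a child process that exits immediately, and hold on to its handle. The child vanishes from Task Manager, but its PID stays reserved – and the handle stays usable – until CloseHandle is called.

```cpp
#include <windows.h>
#include <stdio.h>

// Launch a child that exits immediately, then hold on to its handle.
// The child disappears from Task Manager, but its PID stays reserved
// by the OS for as long as the handle remains open.
int main() {
  wchar_t cmd[] = L"cmd.exe /c exit";
  STARTUPINFOW si = { sizeof(si) };
  PROCESS_INFORMATION pi = {};
  if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0, nullptr,
                      nullptr, &si, &pi)) {
    printf("CreateProcess failed: %lu\n", GetLastError());
    return 1;
  }
  CloseHandle(pi.hThread);
  WaitForSingleObject(pi.hProcess, INFINITE);

  // The process is gone, but the handle still works and the PID is
  // still owned by the dead process.
  DWORD exit_code = 0;
  GetExitCodeProcess(pi.hProcess, &exit_code);
  printf("PID %lu has exited (code %lu) but is still reserved.\n",
         pi.dwProcessId, exit_code);

  CloseHandle(pi.hProcess);  // Only now can the PID be reused.
  return 0;
}
```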
The first time I dealt with a process-handle leak it was a complicated investigation as I learned the necessary techniques, and I only realized that it was a handle leak through pure luck. Since then I've shipped tools to find process-handle and thread-handle leaks, and have documented the techniques for investigating handle leaks of all kinds. Therefore this time I just followed my own recipe and had a call stack for the leaking code within the hour (this image stolen from the GitHub issue):
The bug was pretty straightforward. A call to OpenProcess was made with no corresponding call to CloseHandle. Because of this a boundless amount of memory – roughly 64 KiB for each missing CloseHandle call – was leaked. A tiny mistake, with consequences that could easily consume all of the memory on a high-end machine.
This is the buggy code (yay open source!):
```cpp
void GetProcessMemoryUsage(ProcessInfo process_info[1024], uint32_t* process_count) {
  DWORD pid = process_info[*process_count].pid;
  HANDLE hProcess;
  PROCESS_MEMORY_COUNTERS pmc;
  hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, false, pid);
  if (hProcess == NULL) {
    return;
  }
  if (GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc))) {
    process_info[*process_count].memory = (DWORD)pmc.WorkingSetSize;
  }
}
```
And this is the code with the fix – the final line (marked with a comment) was added to fix the leak:
```cpp
void GetProcessMemoryUsage(ProcessInfo& process_info) {
  DWORD pid = process_info.pid;
  HANDLE hProcess;
  PROCESS_MEMORY_COUNTERS pmc;
  hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, false, pid);
  if (hProcess == NULL) {
    return;
  }
  if (GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc))) {
    process_info.memory = (DWORD)pmc.WorkingSetSize;
  }
  CloseHandle(hProcess);  // This line was added to fix the leak.
}
```
That’s it. One missing line of code is all that it takes.
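It's also a line that C++ can write for you. A RAII wrapper makes this whole class of leak impossible; here's a minimal sketch using std::unique_ptr with a custom deleter (my suggestion, not part of the actual fix – GetWorkingSet is a hypothetical rewrite for illustration):

```cpp
#include <windows.h>
#include <psapi.h>  // Link with psapi.lib.
#include <memory>

// A deleter whose 'pointer' typedef lets std::unique_ptr hold a raw HANDLE.
struct HandleDeleter {
  using pointer = HANDLE;
  void operator()(HANDLE h) const { CloseHandle(h); }
};
using unique_handle = std::unique_ptr<HANDLE, HandleDeleter>;

// Hypothetical rewrite for illustration: every return path closes the
// handle automatically, so the leak cannot recur.
DWORD GetWorkingSet(DWORD pid) {
  unique_handle process(OpenProcess(
      PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid));
  if (!process)
    return 0;
  PROCESS_MEMORY_COUNTERS pmc;
  if (!GetProcessMemoryInfo(process.get(), &pmc, sizeof(pmc)))
    return 0;
  return (DWORD)pmc.WorkingSetSize;
}  // unique_handle's destructor calls CloseHandle on all paths out.
```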
The bug was found back when I still used Twitter so I reported my findings there (broken link) and somebody else then filed a GitHub issue based on my report. I stopped using Twitter a couple of years later, my account got banned (due to not being used?) and then deleted, and now that bug report, along with everything else I ever posted, is gone. That's pretty sad actually. Yet another reason for me to dislike the owner of Twitter.
It looks like the bug was fixed within a day or two of the report. Maybe The Great Software Quality Collapse hadn’t quite started then. Or maybe I got lucky.
Anyway, if you don't want me posting embarrassing stories about your software on my blog or on bsky then be sure to enable the Handles column in Task Manager and pay attention if you ever see it getting too high in a process that you are responsible for.
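If you'd rather not stare at Task Manager, GetProcessHandleCount reports the same number programmatically. A minimal watchdog sketch – the 10,000 threshold and ten-second poll are arbitrary choices of mine:

```cpp
#include <windows.h>
#include <stdio.h>

// Poll the current process's handle count and complain when it crosses
// a threshold - the programmatic version of watching Task Manager's
// Handles column. The threshold and interval are arbitrary examples.
int main() {
  const DWORD kHandleLimit = 10000;
  for (;;) {
    DWORD handle_count = 0;
    if (GetProcessHandleCount(GetCurrentProcess(), &handle_count) &&
        handle_count > kHandleLimit) {
      printf("Handle count %lu exceeds %lu - probable leak!\n",
             handle_count, kHandleLimit);
      // In a debug build this could be DebugBreak() or a crash dump.
    }
    Sleep(10000);  // Check every ten seconds.
  }
}
```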
Sometimes I think it would be nice to have resource limits in order to find mistakes like this more automatically. If processes were automatically crashed (with crash dumps) whenever their memory or handle counts exceeded some limit then bugs like this would be found during testing. The limits could be set higher for software that needs it, but 10,000 handles and 4 GiB of RAM would be more than enough for most software when operating correctly. The tradeoff would be more crashes in the short term but fewer leaks in the long term. I doubt it will ever happen, but if this mode existed as a per-machine opt-in then I would enable it.
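Windows can actually enforce the memory half of this today: a job object with JOB_OBJECT_LIMIT_PROCESS_MEMORY makes allocations fail once a process exceeds its cap, which in most programs turns into a crash soon after (though not automatically a crash dump, and there is no equivalent built-in limit for handle counts). A minimal sketch of a process opting itself in, assuming a 64-bit build:

```cpp
#include <windows.h>
#include <stdio.h>

// Put the current process in a job object with a 4 GiB per-process
// memory cap. Once the cap is hit, further allocations fail, which in
// most programs leads quickly to a crash - a rough approximation of
// the "crash on excessive resource use" mode described above.
int main() {
  HANDLE job = CreateJobObjectW(nullptr, nullptr);
  if (!job) {
    printf("CreateJobObject failed: %lu\n", GetLastError());
    return 1;
  }
  JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
  limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
  limits.ProcessMemoryLimit = 4ull << 30;  // 4 GiB; assumes 64-bit SIZE_T.
  if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                               &limits, sizeof(limits)) ||
      !AssignProcessToJobObject(job, GetCurrentProcess())) {
    printf("Configuring the job failed: %lu\n", GetLastError());
    return 1;
  }
  // ... run the rest of the program under the cap ...
  return 0;
}
```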