Preparing for the .NET 10 GC

Maoni0

In .NET 9 we enabled DATAS by default. But .NET 9 is not an LTS release, so many people will be getting DATAS for the first time when they upgrade to .NET 10. This was a tough decision because GC features are usually the kind that don't require user intervention — but DATAS is a bit different. That's why this post is titled "preparing for" instead of just "what's new" 😊. If you're using Server GC, you might notice a performance profile that's more noticeably different than what you saw in previous runtime upgrades. Memory usage may look drastically different (very likely smaller) — and that may or may not be desirable. It all depends on whether the tradeoff is noticeable, and if it is, whether it aligns with your optimization goals. I'd recommend taking at least a quick look at your application performance metrics to see if you are happy with the results of this change. Many people will absolutely welcome it — but if you are not one of them, don't panic. I encourage you to read on to see whether it makes sense to simply turn DATAS off or if a bit of tuning could make it work in your favor. I'll talk about how we generally decide which performance features to add, why DATAS is so different from typical GC features, and the tuning changes introduced since my last DATAS blog post. I'll also share two examples of how I tuned DATAS in first-party scenarios. If you're mainly here to see which scenarios DATAS isn't designed for — to help decide whether to turn it off — feel free to skip ahead to this section.

General policies of adding GC performance features

Most GC performance features — whether it's a new GC flavor, a new mechanism that enables the GC to do something it couldn't before, or optimizations that improve an existing mechanism — are typically lit up automatically when you upgrade to a new runtime version. We don't require users to take action because these features are designed to improve a wide range of scenarios. In fact, that's often why we choose to implement them: we analyze many scenarios to understand the most common problems, figure out what it would take to solve them, and then prioritize which ones to design and implement. Of course, with any performance change there's always the risk of regressions — and for a framework used by millions, you're guaranteed to regress someone. These regressions can be especially visible in microbenchmarks, where the behavior is so extreme that even small changes can cause wild swings in results. A recent example is the change we made in how we handle the free regions for the UOH (i.e., LOH + POH) generations. We changed from a budget-based trimming policy to an age-based one because it's more robust in general (we neither decommit memory too quickly only to have to recommit it again, nor keep extra free regions around long after they're needed because we keep not consuming nearly all of the UOH budgets). But this can completely change a microbenchmark: one that used to observe memory drop to a very low value after a single GC.Collect() now requires 3 GC.Collect() calls (because we have to wait for the UOH free regions to age out over 2 gen2 GCs, and the 3rd one puts them on the decommit list). For DATAS, however, we knew it was by definition not aimed at a wide range of scenarios. As I mentioned in my last blog post, there were 2 specific kinds of scenarios that DATAS targeted. I'll reiterate them here –

1.
Bursty workloads running in memory-constrained environments. DATAS aims to shrink the heap back when the application doesn't require as much memory and grow it when the app requires more. This is especially important for apps running in containers with memory limits.

2. Small workloads using Server GC — for example, if someone wants to try out a small asp.net core app to see what the experience is like in .NET, DATAS aims to provide a heap size much more in line with what the small app actually needs.

I should give more explanation about 1). Bursty workloads are not uncommon at all. If you have an app that handles requests, which is completely common, you could naturally have many more users during a specific time of the day than during the rest of the day. However, the key here is the action that follows — if you have memory freed up during non-peak hours, what would you do with this memory? It turns out that sometimes folks don't actually know — they want to see the memory go down when the workload lightens, but they have no plans to do anything with that memory. And some teams don't need the memory usage to go down because they already budgeted all that memory to their apps. I was talking to a customer recently and asked them "if DATAS frees up memory for you, what would you use it for?". The answer was "that's a good question, we never thought about it". For folks who do want to make use of the freed-up memory, a common way is to use an orchestrated environment. DATAS makes this scenario more robust as heap sizes will be much more predictable, as I'll explain below, which helps with setting sensible memory limits. For example, in k8s, you can determine appropriate request and limit values for both non-peak and peak workloads to better leverage HPA. I have also seen teams that schedule tasks to run when the machines/VMs have free memory — this is more involved (and these teams are usually equipped with a team of dedicated perf engineers) but gives them more control. Then there are plenty of teams that have dedicated fleets of machines and want to maximize their throughput during peak hours as much as possible. They do not want to tolerate any type of slowdown. They are definitely not the target of DATAS, which will almost always regress their throughput — when it comes to perf it's rarely an all-or-nothing situation, and I will discuss below how to decide whether you should turn DATAS off. All of this made it difficult to make DATAS the default, because we know there are a lot of teams that don't want to sacrifice throughput at all or don't make use of freed-up memory. I will discuss in detail below how to look at the perf differences and decide whether DATAS is for you (maybe when you see the memory reduction you will have ideas for using the freed-up memory).

Performance differences between DATAS and the traditional Server GC

DATAS is a GC feature that I spent more time explaining to my coworkers than any other — being such a user-visible feature, it naturally attracted more questions than pretty much any other GC feature I've added. And there were lots of misconceptions. Some thought that DATAS only affected startup; some assumed it would just "reduce memory by x% and throughput by y%"; some expected it to "magically reduce memory without any other perf differences" (okay, I added the "magically" part 😆); etc. To understand the differences properly, we need to understand the difference in policies.
First and foremost, Server GC does not adapt to the application size — that was never a goal. Server GC looks mostly at the survival rate of each generation and does GCs based on that (there are a number of other factors that affect when GCs are triggered, but survival rate is one of the most significant). In the last DATAS post I talked about the number of heaps, which can affect the heap size significantly, especially in workloads that allocate a lot of temporary data. Since Server GC creates the same number of heaps as the number of cores the process is allowed to use, you can see very different heap sizes when running the same app with a different number of cores (by running it on a machine with a different core count, or by letting your process use a different number of cores on the same machine). DATAS, on the other hand, aims to adapt to the application size, which means you should see similar heap sizes even when the number of cores varies a lot. So there's no "DATAS will reduce memory by X%" compared to Server GC. If we look at the "Max heap size" metric for asp.net benchmarks, it's obvious that Server GC behaves very differently when running on a 28-core machine (28c) vs a 12-core machine (12c) –

[chart: "Max heap size" for asp.net benchmarks with Server GC, 28c vs 12c]

Careful readers will notice that the order of which color is on top is not consistent. For example, for MultipleQueriesPlatform, the max heap size is actually much larger for 12c than 28c. Looking at the data in more detail reveals that the max heap size happens at the very beginning of the test for the 12c case –

[chart: heap size (before) over time for MultipleQueriesPlatform, 28c vs 12c]

(Heap size (before) is the heap size right before a GC, before that GC could possibly shrink it. So "Max heap size" is the max of this metric.) This is because at the beginning, a lot more allocation happened before the first GC on 28c with 28 heaps. So after that GC, a smaller survival rate was observed, which caused the gen0 budget to be much smaller than on 12c. 12c quickly dropped to the steady state, which has a much lower heap size than 28c. In steady state, these benchmarks always exhibit a much higher heap size on 28c. This illustrates two points: first, if you just measure "max heap size", it can easily be affected by non-steady-state behavior; second, the heap size can vary a lot depending on the machine the test runs on. Note that these effects can be magnified because we are looking at small benchmarks, but the reasoning applies to real-world apps. With DATAS we see this picture –

[chart: "Max heap size" for asp.net benchmarks with DATAS, 28c vs 12c]

The max heap sizes are very similar on 28c and 12c, which is exactly what DATAS is for — it adapts to the application size.

Do I need to care if I'm using Workstation GC?

The answer depends on why you are using Workstation GC. If you are using Workstation GC because your workload simply does not call for using Server GC at all, then there's no need to change. This could be because your app is single threaded, or the allocation is simply not stressful and you are totally fine with having one thread doing the collection work, in which case Workstation GC not only suffices but is exactly the correct choice. But if you are using it because Server GC's memory usage was too large and you are just using Workstation GC to limit the memory usage, you could find DATAS very attractive because it can both limit the memory usage and lower GC pauses by having more GC threads doing the collection work.
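(As an aside: if you want a quick, in-process look at the kinds of numbers discussed in this post, such as heap size, fragmentation, committed memory and a rough stand-in for % pause time in GC, you don't need a trace. The sketch below uses the standard BCL APIs GC.GetGCMemoryInfo and GCSettings.IsServerGC; note that PauseTimePercentage is only an approximation of the TCP metric I'll describe below.)

using System;
using System.Runtime;

// Minimal sketch: print a few GC metrics for the current process.
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");

GCMemoryInfo info = GC.GetGCMemoryInfo();
Console.WriteLine($"Heap size (bytes):      {info.HeapSizeBytes}");
Console.WriteLine($"Fragmentation (bytes):  {info.FragmentedBytes}");
Console.WriteLine($"Committed (bytes):      {info.TotalCommittedBytes}");
Console.WriteLine($"Promoted (bytes):       {info.PromotedBytes}");
// Rough stand-in for "% pause time in GC" over the process's history so far.
Console.WriteLine($"Pause time %:           {info.PauseTimePercentage}");

(GC.GetGCMemoryInfo also takes a GCKind argument, so you can ask about the most recent full blocking GC specifically, which is handy for looking at the promoted size of full GCs as mentioned below.)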
How DATAS does its job

If you understand how DATAS does its job, it becomes natural to arrive at the recommendations below for deciding whether DATAS is for you. You could also skip this section, but I always like to understand how something works if I care about it, so I can come to my own conclusions instead of just memorizing some rules. In the last blog post I mentioned some details of DATAS at the time (.NET 8), noting that it would likely change dramatically — and it did, both in design and implementation. The implementation we had in .NET 8 was mostly about being functional — we spent very little time on tuning. The majority of the tuning work happened after .NET 8. The goal of DATAS is to adapt to the application size, or the LDS (Live Data Size). So there needs to be some way to adapt to it. Because the .NET GC is generational, we don't collect the whole heap often. And since most full GCs we do are background GCs, which don't compact, it's reasonable to approximate the LDS with the space objects take up in the old generations, i.e., (total size — fragmentation). Another convenient number to use in your perf investigations is the promoted size when a full GC is done. In the last blog post I mentioned that the conserve memory config is part of the DATAS implementation — that part did not change. But conserve memory only affects when full GCs are triggered. For apps that allocate very frequently, unless the allocations are temporary UOH objects, most of the GCs are ephemeral GCs. And ephemeral generation sizes can be a significant portion of the whole heap, especially for small heaps. After experimenting with various approaches, I settled on "adapting to the app size while maintaining reasonable performance", which consists of 2 key components –

1) I introduced the concept of a "Budget Computed via DATAS (BCD)", which is calculated based on the application size and gives us an upper bound on the gen0 budget for that size. The gen0 budget approximates the generation size for gen0 (since there's pinning, it may not be exactly the generation size for gen0).

2) Within this upper bound, we can further reduce memory if we can still maintain reasonable performance. We define this "reasonable performance" with a target Throughput Cost Percentage (TCP). This takes into consideration both GC pauses and how much allocating threads have to wait. But you can approximate TCP with % pause time in GC in steady state. The idea is to keep TCP around this target if we can, which means if the workload gets lighter, we adjust the gen0 budget downward. That in turn means gen0 will be smaller before the next GC, which translates to a smaller heap size. The default target TCP is 2%. This can be changed via the GCDTargetTCP config.

Let's look at 2 example scenarios to see how this manifests. For simplicity, I'm ignoring background GCs, and I'll use % pause time in GC to approximate TCP.

Scenario A — I have an e-commerce app which stores the whole catalog in memory, and this remains the same during the process lifetime. This is our LDS. Now the process starts to process requests, and for each request there's memory allocated and only used for the duration of that request. During peak hours, it processes many concurrent requests. We hit our max budget, which is our BCD. Let's say this is 1GB; it means we are doing a GC each time 1GB is allocated. If we use the % pause time in GC to approximate TCP, let's say during each second it allocates 1GB and observes one GC that has a 20ms pause.
So the % time in GC is 2%. And that's the same as our target TCP. Outside peak hours, when we're handling far fewer concurrent requests, let's say we allocate ~200MB per second. If we keep our 1GB budget, it means we are doing a GC every 5s. And our % time in GC would be (20ms / 5s = 0.4%), much lower than 2%. So to reach the target TCP we'd want to reduce the budget and trigger a GC much sooner. If we reduce the budget to 200MB, and we still use 20ms as our GC pause just to keep it simple (it'll likely be shorter, as pause time is roughly proportional to survival and there's likely less survival out of 200MB vs 1GB), we are now achieving 2% TCP again. So for this scenario, the heap size is reduced by ~800MB outside peak hours. Depending on your total heap size, this can be a very significant reduction.

Scenario B is built on top of A, but we'll throw in a cache that's part of the LDS and gets smaller during lighter workload as we don't need to cache as much. Because the LDS is smaller, your BCD will be smaller, as it's a function of LDS. So during the lighter workload, the gen0 budget will be further reduced, which again reflects the adapting-to-size nature. The conserve memory mechanism is still in effect too and adjusts the old generation budget and size accordingly.

Notice that so far I have not talked about the number of heaps at all! This is completely taken care of by DATAS itself, so you don't need to worry about it. Previously, some of our customers were using the GCHeapCount config to specify the number of heaps for Server GC. DATAS makes this more robust: it can take advantage of more heaps if needed (which usually means shorter individual pause times) and reduce the heap size when the LDS goes down, without you having to specify a heap count yourself.

DATAS has specific events that indicate the actual TCP and LDS, but they require you to programmatically get them via the TraceEvent library. The approximations I mentioned above are sufficient for almost all perf investigations.

When DATAS might not be applicable to your scenario

If you read the previous sections, what's listed below hopefully makes sense.

1) If you have no use for free memory, you don't need DATAS

This one should be obvious — why change anything if you don't have any use for the memory that gets freed up by DATAS anyway? You can turn DATAS off with the GCDynamicAdaptationMode config (for example, by setting the DOTNET_GCDynamicAdaptationMode environment variable or the System.GC.DynamicAdaptationMode runtimeconfig setting to 0). I've come across a few first-party teams who simply didn't need DATAS — they have dedicated machines to run their processes and no use for free memory as they don't plan to run anything else on those machines. So they have no use for DATAS. One team did say "now we probably want to think about taking advantage of free memory" (they were not thinking about it because Server GC isn't aggressive at reducing memory usage). So for them, they will disable DATAS for now but will enable it when they can take advantage of the memory during non-peak hours.

2) If startup perf is critical, DATAS is not for you

DATAS always starts with 1 heap. We cannot predict how stressful your workload will be, and since we are optimizing for size here, it starts with the smallest heap count, which is 1. So if your startup perf is critical, you will see a regression because it takes time to go from 1 heap to multiple.

3) If you do not tolerate any throughput regression, DATAS may not be for you

If this includes startup throughput, then as 2) also states, DATAS is not for you.
However, some scenarios aren't concerned with startup perf, so DATAS may or may not be desirable. Let's say your % pause time in GC is 1% with Server GC; you can just set the GCDTargetTCP config to 1. If you were restricting the heap count, you could very possibly see a perf improvement because pause times can be shorter with DATAS. If the adapting-to-size aspect is beneficial to you, using DATAS can be a much better choice. But as stated in 1), if you don't have any use for the freed-up memory anyway, it wouldn't justify spending time on using DATAS.

4) If you are doing mostly gen2 GCs, DATAS may not be for you

One case I haven't spent much time tuning is when your scenario mostly does gen2 GCs (this is almost always due to excessive allocation of temporary large objects). If this is the case for you, and you've tried DATAS and weren't happy with the results, I would suggest disabling DATAS. You could investigate whether you can make it work by following the tuning section, if spending the time is justified.

Tuning DATAS if necessary

I've tried DATAS on some first-party workloads and in general it worked out great. I'll show a couple of examples where the default parameters of DATAS weren't great but tuning one or two configs made it work.

Customer case 1

This is a server app running on dedicated machines. But the team is in the process of containerizing it, so there's definitely merit in using DATAS. With DATAS they observed a 6.8% regression in throughput with a 10% reduction in working set. For now they've disabled DATAS — I will explain how I debugged it and determined which DATAS config to use to make it work if/when they want to enable DATAS. Because DATAS limits the largest gen0 budget based on the LDS, we want to see if we are hitting that limit. It's easiest if you capture a GC trace with DATAS and one without DATAS. If you are seeing more GCs triggered, that most likely means you are hitting that limit. You can approximate the TCP with what's shown in the "% Pause Time" column, and the gen0 budget with the "Gen0 Alloc MB" column. And you'd want to find the phase when you have the highest % pause time and see if you are triggering more GCs. So for this particular customer, here are some excerpts of the GCs (I've trimmed down the columns of the GCStats view) –

[table: GCStats excerpt without DATAS]

[table: GCStats excerpt with DATAS]

Comparing their gen0 budget and % pause time in GC –

[table: gen0 budget and % pause time, with vs without DATAS]

So the gen0 budget without DATAS is 2.6x the budget with DATAS. Another useful thing we notice is that the % Pause Time with DATAS is basically exactly the target TCP — 2%. That tells us this is working exactly as designed from DATAS's point of view. But without DATAS we got 2.6x the budget, so naturally we triggered GCs less frequently and the % pause time is 1.2% instead of 2.1%. But if we want to enable DATAS and not regress throughput for this phase, we'd like DATAS to use a larger gen0 budget. To do that we should understand how DATAS determines the BCD. Since we are adapting to the size, we want to multiply the size by something. But this should not be a constant value, because when the size is very small, this multiplier should be quite large — if the LDS is only 2MB (which is totally possible for a tiny app), we wouldn't want to trigger a GC for every 0.2MB of allocation — the overhead would be too high. Let's say we want to allow 20MB of allocation before triggering a GC; that makes the multiplier 10. But if the LDS is 20GB, we wouldn't want to allocate 200GB before doing a GC, which means we want a much smaller multiplier.
This means a power function, but we also want to clamp it between a min and max value –

m = constant / sqrt (LDS);
// default for max_m is 10
m = min (max_m, m);
// default for min_m is 0.1
m = max (min_m, m);

The actual formula for the power function is

m = (20 - conserve_memory) / sqrt (LDS / 1000 / 1000);

which can be simplified to

m = (20 - conserve_memory) * 1000 / sqrt (LDS);
m = (20 - 5) * 1000 / sqrt (LDS);
m = 15000 / sqrt (LDS);

So the constant is 15000, or we could just say it's 15 if we use MB for the size. Here are some examples with different LDS values: with an LDS of 2MB, m = 15 / sqrt(2) ≈ 10.6, which gets clamped to max_m (10), so the gen0 budget cap is ~20MB; with an LDS of 225MB, m = 1, so the cap is 225MB; with an LDS of 22.5GB, m hits min_m (0.1), so the cap is ~2.25GB. This constant, max_m and min_m can all be adjusted by configs. Please see the config page for a detailed explanation. Now it's quite obvious why DATAS came up with the gen0 budget it did and how we can adjust it. If we want to bring it up to the same budget as without DATAS, we'd want to use the GCDGen0GrowthPercent config to increase the constant to 2.6x, and increase min_m with the GCDGen0GrowthMinFactor config so we're not clamped to 0.1 — you don't need to be very accurate since you just need to make sure it's not the limiting factor. So in this case, if we use 15GB to approximate the LDS (the "Promoted (mb)" column for both gen2 GCs says ~15GB), and without DATAS the gen0 budget is 4.22GB, then min_m should be around (4.22 / 15 = 0.28). We can just set min_m to 300, which translates to 0.3 of the LDS.

Customer case 2

This is an asp.net app on a staging server from the customer that represents one of their key scenarios. I used a load test tool to generate variable workloads. The team was already using some GC configs –

· GCHeapCount is set to 2 to use 2 heaps
· Affinity is turned off with the GCNoAffinitize config

If the GCHeapCount config is specified, DATAS is disabled, because it's telling the GC not to change the heap count. And since changing the heap count is one of the key mechanisms DATAS uses to adjust perf, specifying it is an indication to disable DATAS. Because this is a process that co-exists with many others on the same machine, before DATAS was available they chose to give it 2 heaps to limit the memory usage while still getting reasonable throughput. But this is not flexible — when the load becomes higher, the throughput can suffer with 2 heaps, and the GC pauses can be noticeably higher since there are only 2 GC threads collecting. They can adjust the number of GC heaps, but that means more work, and since Server GC isn't very aggressive at reducing memory usage, they can end up with a much bigger heap than desired when the load is lighter. I'll demonstrate how using DATAS makes this robust. When I made the load pretty high I could see that the % pause time in GC was quite high — not surprising with just 2 heaps. So I enabled DATAS by simply getting rid of the GCHeapCount config (I kept the GCNoAffinitize config as I still wanted the GC threads to not be affinitized). I could see that the % pause time in GC was still high, because even with the BCD we still ended up triggering GCs quite often. So I decided to make the BCD 2x the default value with the GCDGen0GrowthPercent config (I didn't need to use the GCDGen0GrowthMinFactor config since 2x is still well within our max_m/min_m clamping values). And now the process behaves in a much more desirable way, with the following characteristics –

· the % pause time is dramatically lower. With the default DATAS settings the % pause time is basically comparable and the heap size is noticeably lower. Depending on your optimization goal this could be exactly what you want.
DATAS is able to achieve this with smaller budgets and more GC threads doing the collection work. But I know that for this customer, they don't want the % pause time in GC to be that high, as it affects their throughput. I could also have made DATAS use a smaller target TCP, but in this case the default target TCP seems quite sufficient.

[chart: % pause time in GC]
[chart: heap size]

· individual GC pauses are a lot lower since we have a lot more GC threads collecting.

[chart: individual GC pause times]

· when the load becomes lighter (# of concurrent client threads went from 200 to 100), the heap also becomes smaller. And we are still maintaining a much lower % pause time in GC and much lower individual GC pauses.

[chart: heap size under lighter load]
[chart: % pause time in GC under lighter load]

I hope this helps with your DATAS tuning, if you need to do any.

DATAS Events

I expect most users will never need to look at these events, so I'll keep it brief. The approximations that I mentioned above should suffice. For the small number of folks who want to do a detailed analysis for whatever reason, DATAS fires an event that accurately represents the metrics we discussed. Note that we only use these events programmatically, so they are not surfaced in PerfView's Events view (all you'll see is the GC/DynamicTraceEvent, which shows you the name but not the individual fields of that event). See this blog article for an example of how to programmatically retrieve GC info as a list of TraceGC objects from a trace. LDS and TCP are indicated in the SizeAdaptationTuning event; assuming you have a gc object of the type TraceGC —

// LDS
gc.DynamicEvents().SizeAdaptationTuning?.TotalSOHStableSize

// TCP
gc.DynamicEvents().SizeAdaptationTuning?.TcpToConsider

This event is not fired on every GC, since we only check whether we need to change the tuning for DATAS every few GCs.
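To see it end to end, here is a rough sketch of what pulling these fields out of a trace could look like, following the general TraceEvent pattern from the blog article linked above. The trace file name is hypothetical, and the scaffolding (NeedLoadedDotNetRuntimes, Processes, LoadedDotNetRuntime) is the standard TraceEvent analysis pattern rather than anything DATAS-specific; treat this as a sketch, not authoritative API documentation.

// Sketch only: assumes the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package
// and an ETW trace file named "myapp.etl" (a hypothetical name).
using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Analysis;
using Microsoft.Diagnostics.Tracing.Analysis.GC;

using (var source = new ETWTraceEventSource("myapp.etl"))
{
    // Ask TraceEvent to build the per-process GC analysis (the list of TraceGC objects).
    source.NeedLoadedDotNetRuntimes();
    source.Process();

    foreach (var process in source.Processes())
    {
        var runtime = process.LoadedDotNetRuntime();
        if (runtime == null) continue;

        foreach (TraceGC gc in runtime.GC.GCs)
        {
            // SizeAdaptationTuning is only present on GCs where DATAS re-evaluated its tuning.
            var tuning = gc.DynamicEvents().SizeAdaptationTuning;
            if (tuning == null) continue;

            Console.WriteLine($"GC #{gc.Number}: LDS = {tuning.TotalSOHStableSize}, TCP = {tuning.TcpToConsider}");
        }
    }
}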