Applying the scaling laws to build VaultGemma

The Gemma models are designed with responsibility and safety at their core, which makes them a natural foundation for developing a production-quality, DP-trained model like VaultGemma.

Algorithmic advancements: Training at scale

The scaling laws we derived above represent an important first step towards training a useful Gemma model with DP. We used the scaling laws to determine both how much compute we needed to train a compute-optimal 1B-parameter, Gemma 2-based model with DP, and how to allocate that compute among batch size, iterations, and sequence length to achieve the best utility (a rough sketch of this bookkeeping appears at the end of this section).

One prominent gap between the research underlying the scaling laws and the actual training of VaultGemma was our handling of Poisson sampling, a central component of DP-SGD. We initially loaded data in uniform batches, but then switched to Poisson sampling to get the best privacy guarantees with the least amount of noise. This method posed two main challenges: it creates batches of varying sizes, and it requires a specific, randomized order for processing the data. We solved both by using our recent work on Scalable DP-SGD, which allows us to process data in fixed-size batches, padding short batches and trimming long ones, while still maintaining strong privacy protections.
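To make the compute-allocation step concrete, here is a minimal sketch of the kind of bookkeeping involved, assuming the common C ≈ 6·N·D approximation for transformer training FLOPs. Every number below is hypothetical; the actual budget and allocation for VaultGemma came from the DP scaling laws themselves, whose functional form is not reproduced here.

```python
# Rough compute bookkeeping under the standard C ≈ 6 * N * D approximation
# (C: training FLOPs, N: parameters, D: training tokens). All values are
# hypothetical and stand in for the outputs of the DP scaling laws.
N = 1e9                          # 1B-parameter model
C = 1e22                         # assumed FLOPs budget
D = C / (6 * N)                  # implied token budget

seq_len = 1024                   # assumed sequence length
batch_size = 500_000             # DP-SGD favors very large batches, since
                                 # averaging over more examples shrinks the
                                 # relative impact of the added noise
tokens_per_step = batch_size * seq_len
iterations = D / tokens_per_step

print(f"tokens: {D:.3g}, steps at B={batch_size:,}, S={seq_len}: {iterations:,.0f}")
```

The key DP-specific twist is the last trade-off: for a fixed token budget, the scaling laws push toward fewer iterations with much larger batches than in non-private training.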
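And for the fixed-size batch trick just described, a minimal sketch of the data-pipeline side, assuming the padding-and-trimming behavior outlined above. The function names, padding sentinel, and constants are all hypothetical, not the actual Scalable DP-SGD implementation.

```python
import numpy as np

def poisson_sample_batches(num_examples, q, num_steps, rng):
    """Poisson sampling: every example independently joins each batch
    with probability q, so batch sizes fluctuate around q * num_examples."""
    for _ in range(num_steps):
        mask = rng.random(num_examples) < q
        yield np.flatnonzero(mask)  # indices selected for this step

def to_fixed_size(batch, fixed_size, pad_value=-1):
    """Trim over-sized batches and pad under-sized ones with a sentinel,
    so every training step sees the same physical batch size. Padded
    slots must be masked out of the loss and contribute zero gradient."""
    batch = batch[:fixed_size]                   # trim if too large
    if batch.size < fixed_size:                  # pad if too small
        pad = np.full(fixed_size - batch.size, pad_value, dtype=batch.dtype)
        batch = np.concatenate([batch, pad])
    return batch

rng = np.random.default_rng(0)
n, q = 100_000, 0.005                 # hypothetical dataset size and rate
fixed = int(1.1 * q * n)              # physical size a bit above the mean

for step, idx in enumerate(poisson_sample_batches(n, q, num_steps=3, rng=rng)):
    batch = to_fixed_size(idx, fixed)
    print(f"step {step}: sampled {idx.size}, shipped {fixed} "
          f"({(batch >= 0).sum()} real, {(batch < 0).sum()} padding)")
```

The sketch only shows the pipeline mechanics; in the real method, the (rare) trimming of over-sized batches also has to be folded into the privacy accounting to preserve the DP guarantee.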