Optimizing FizzBuzz in Rust for Fun and Profit
My little cousin is learning Python, and wanted to know what kinds of questions he could expect in coding interviews. I started him off with good ol' FizzBuzz: obsolete for any actual coding interview today (at least, I hope), but good at his level. After a little back and forth getting reminded how to create range iterators and fixing the order of the checks, he landed on this implementation:
```python
def basic_fizzbuzz():
    for i in range(1, 101):
        if i % 15 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)
```
A good first shot: it'd get him to the next question without much fuss.
But then, by coincidence, I watched an old Prime video and decided to put the question to him: how would you extend this to 7 = "Baz"?
He expanded the if-else chain; I asked him to find a way to do it without the number of necessary checks exploding with each new term added. After some hints and more discussion, we arrived at this implementation:
```python
def enhanced_fizzbuzzbaz():
    for i in range(1, 101):
        s = ""
        if i % 3 == 0:
            s += "Fizz"
        if i % 5 == 0:
            s += "Buzz"
        if i % 7 == 0:
            s += "Baz"
        if not s:
            print(i)
        else:
            print(s)
```
Nice and straightforward. But this did get me thinking: how would we make this as performant and extensible as possible?
Some quick-and-dirty benchmarking shows that each fizzbuzz (give or take some loop overhead) takes around 105 microseconds. Slow, yes, but not tripping my alarm bells just yet. Python is slow, say less.
Adding 'Baz' takes us up to 110 microseconds. Switching to our composable version shaves off a little bit and gets us to around 107 microseconds. Easier on the developer AND marginally more performant, just what I want to see.
I'm sure there are further optimizations to be made in Python... but Python's not the tool I reach for when I want to explore optimization potential.
So I took this to the natural conclusion of all software, and rewrote it in Rust.
```rust
pub fn enhanced_fizzrustbaz() {
    for i in 1..=100 {
        let mut s = String::new();
        if i % 3 == 0 {
            // this is sugar for s.push_str("Fizz"),
            // cf. impl AddAssign<&str> for String
            s += "Fizz";
        }
        if i % 5 == 0 {
            s += "Buzz";
        }
        if i % 7 == 0 {
            s += "Baz";
        }
        if s.is_empty() {
            println!("{i}");
        } else {
            println!("{s}");
        }
    }
}
```
Thanks to some syntactic sugar, the implementation is virtually identical. How does this stack up to Python?
```
enhanced_fizzrustbaz    time:   [57.892 µs 58.948 µs 60.083 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```
Around 60 microseconds, not too shabby. Now, let's take a peek at what it's actually doing, and look at how we might speed it up:
First, it's iterating through an inclusive integer range: we could try swapping it for an exclusive range (1..101) or an array's .into_iter() for a small performance gain, since the inclusive range's iterator has to track an extra "exhausted" flag, while the alternatives have a size known at compile time.
Then it performs a modulo and numerical comparison, evaluates, and potentially executes the string addition. There might be a clever way to eliminate some of these checks: we know, after all, that if n is a multiple of 3, n+1 and n+2 will not be. But the obvious possibilities almost certainly won't be performant: integer modulo is a single CPU instruction, while a counter- or step-based method would require frequently referencing a different object for each branch, and we might end up taking a nice, regular pattern away from the branch predictor. We could try out manual loop unrolling, and that could be useful if, say, we were doing N=100000 on the original 3 and 5 cases: but at N=100 on 3, 5, and 7, we don't even complete a single period of the pattern.
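To make the "single period" point concrete: the FizzBuzzBaz output pattern repeats every lcm(3, 5, 7) = 105 numbers, which is longer than our entire run. A quick sketch (the gcd/lcm helpers here are mine, not from the post's code):

```rust
// Euclid's algorithm for the greatest common divisor
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

// Least common multiple via gcd
fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

// The full Fizz/Buzz/Baz pattern only repeats every 105 numbers,
// so at N = 100 an unrolled period would never even complete once.
pub fn fizzbuzzbaz_period() -> u64 {
    lcm(lcm(3, 5), 7)
}
```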
What about the actual string addition, then? We start each loop body by heap-initializing an empty String, and then either don't use it or extend it with the composed words. We know what these words are and their length at compile time... could we make this heapless?
```rust
pub fn heapless_fizzrustbaz() {
    for i in 1..=100 {
        // max length of the string, assuming we extended beyond 100,
        // would be 11 UTF-8 characters
        let mut buf: [char; 11] = [' '; 11];
        let mut start: usize = 0;
        if i % 3 == 0 {
            let end = start + 4;
            let target_slice = &mut buf[start..end];
            for (dest_char, src_char) in target_slice.iter_mut().zip("Fizz".chars()) {
                *dest_char = src_char;
            }
            start = end; // update the starting index
        }
        if i % 5 == 0 {
            let end = start + 4;
            let target_slice = &mut buf[start..end];
            for (dest_char, src_char) in target_slice.iter_mut().zip("Buzz".chars()) {
                *dest_char = src_char;
            }
            start = end;
        }
        if i % 7 == 0 {
            let end = start + 3;
            let target_slice = &mut buf[start..end];
            for (dest_char, src_char) in target_slice.iter_mut().zip("Baz".chars()) {
                *dest_char = src_char;
            }
            start = end;
        }
        if start == 0 {
            println!("{i}");
        } else {
            let res = buf[0..start].iter().collect::<String>();
            println!("{res}");
        }
    }
}
```
This isn't quite heapless... we do end up collecting into a String from a buffer at the very end, but it's immutable, and the rest of our operations should be happening entirely on the stack. Does this deliver any performance improvements?
```
heapless_fizzrustbaz    time:   [58.466 µs 59.442 µs 60.435 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```
Nope! A small performance regression, if anything, though the difference is small. If I had to guess why: all our nonsense slicing, zipping, and collecting iterators introduces more overhead than just allocating small Strings. We'd have been better off simply switching String::new() to String::with_capacity(11).
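A minimal sketch of that with_capacity variant (my version, and it collects lines into a Vec instead of printing, purely so the allocation behavior can be checked in isolation from I/O):

```rust
// Same logic as enhanced_fizzrustbaz, but each String is pre-sized:
// 11 bytes covers the longest possible "FizzBuzzBaz", so the pushes
// below never trigger a reallocation.
pub fn capacity_fizzrustbaz() -> Vec<String> {
    let mut lines = Vec::with_capacity(100);
    for i in 1..=100u32 {
        let mut s = String::with_capacity(11);
        if i % 3 == 0 { s += "Fizz"; }
        if i % 5 == 0 { s += "Buzz"; }
        if i % 7 == 0 { s += "Baz"; }
        if s.is_empty() {
            s = i.to_string();
        }
        lines.push(s);
    }
    lines
}
```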
But what's actually taking so long here? What actually needs to happen? 100 small heap allocations, 300 modulo operations and numerical comparisons, several dozen String additions, and then printing either a number or the string. The numerical operations should take much less than one microsecond altogether, as should the looping overhead. Heap allocations and string additions are expensive by comparison, but not that expensive, especially if they're small and occur regularly.
So it can only be the printing process.
The only thing slower than going to the heap is going to the disk... or to the screen. What if we just remove the print line?
```
enhanced_fizzrustbaz_no_print    time:   [888.64 ns 890.28 ns 891.88 ns]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```
Watch the units: nanoseconds, not microseconds. 98.5% of all runtime is spent printing. Without that step, the entire thing finishes in under a microsecond, string addition included.
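The no-print variant isn't shown above, so here's a sketch of the shape it plausibly takes: the result is fed through std::hint::black_box so the optimizer can't prove the loop is dead and delete it entirely. (I've also made it return the total output length, purely so there's something observable to assert on.)

```rust
use std::hint::black_box;

// Same computation as enhanced_fizzrustbaz, minus the I/O: each result
// goes through black_box instead of println!, so the compiler keeps the
// work without us paying for a write to the terminal.
pub fn enhanced_fizzrustbaz_no_print() -> usize {
    let mut total = 0usize;
    for i in 1..=100u32 {
        let mut s = String::new();
        if i % 3 == 0 { s += "Fizz"; }
        if i % 5 == 0 { s += "Buzz"; }
        if i % 7 == 0 { s += "Baz"; }
        if s.is_empty() {
            total += black_box(i).to_string().len();
        } else {
            total += black_box(s.as_str()).len();
        }
    }
    total
}
```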
In other words, fizzbuzz is overwhelmingly I/O bottlenecked.
But we're not taking our ball and going home, oh no sir. We have ways to deal with this.
Right now, we're printing one line at a time: what if we just saved all our outputs to a single buffer, and then wrote it all at once?
```rust
use std::fmt::Write as FmtWrite;
use std::io::{stdout, Write as IoWrite};

pub fn fizzrustbaz_buffered() {
    // set up a string buffer up front
    let mut output_buffer = String::new();
    for i in 1..=100 {
        // set up a boolean for the local changes, since
        // we're no longer writing to a local variable
        let mut written = false;
        if i % 3 == 0 {
            output_buffer.push_str("Fizz");
            written = true;
        }
        if i % 5 == 0 {
            output_buffer.push_str("Buzz");
            written = true;
        }
        if i % 7 == 0 {
            output_buffer.push_str("Baz");
            written = true;
        }
        if !written {
            write!(&mut output_buffer, "{i}").unwrap();
        }
        output_buffer.push('\n');
    }
    // get a lock on the stdout and print all at once from bytes
    let mut handle = stdout().lock();
    handle.write_all(output_buffer.as_bytes()).unwrap();
}
```
```
fizzrustbaz_buffered    time:   [45.809 µs 46.382 µs 46.955 µs]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
```
That shaved off around 14 microseconds, nearly a quarter of the runtime.
At this point, we're heavily, heavily bound not just by our I/O strategy, but by the program on the other side of the pipe: the terminal emulator itself, which is outside the control of our specific Rust program. To show just how bottlenecked we are by it, here is the most significant optimization I could find: eliminating newlines.
```
fizzrustbaz_buffered_no_newline    time:   [9.0496 µs 9.2078 µs 9.3825 µs]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
```
The simple act of removing newlines, printing the entire buffer as one continuous, illegible block instead of giving each number its own line, reduces our runtime by roughly 80%.
Homework: Implement an async variant that prints to stdout without blocking the computation loop. Is this faster? Why?
At this point, I'm out of ideas. The most impactful move would probably be to switch to a faster terminal... but I'm already running Ghostty! I thought it was a pretty performant terminal to begin with!
If each character is UTF-8 encoded and each 'line' averages roughly 5 characters, then at 9 microseconds per call we're putting text to screen at a rate on the order of 0.5 Gb/s, which seems... low? At the very least, the specs for my computer indicate 300 Gb/s of memory throughput, unless that spec is actually measuring something else entirely.
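Spelling out that back-of-the-envelope arithmetic (all the inputs here are the rough figures above, not measurements):

```rust
// Rough throughput estimate: ~5 bytes per line x 100 lines,
// pushed to the terminal in ~9 µs.
pub fn estimated_gbit_per_s() -> f64 {
    let bytes = 5.0_f64 * 100.0; // ~500 bytes of output per run
    let seconds = 9.0e-6;        // ~9 µs per run
    bytes * 8.0 / seconds / 1e9  // convert to gigabits per second
}
```

Which works out to about 0.44 Gb/s, matching the "on the order of 0.5 Gb/s" eyeball figure.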
I'm sure there are further optimizations to be made here, but I'm out of my depth and this has grown well beyond my original plans for a humorous, low-effort post. So let's leave I/O and throughput to the side, and ask the question you Rustaceans have been waiting on since we started: can we parallelize this?
Well yes, obviously.
```
serial_contrived_no_print_fizz_100        time:   [2.6293 µs 2.6392 µs 2.6546 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe

parallel_contrived_no_print_fizz_100      time:   [44.210 µs 44.497 µs 44.757 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild

serial_contrived_no_print_fizz_100000     time:   [2.9230 ms 2.9283 ms 2.9336 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

parallel_contrived_no_print_fizz_100000   time:   [1.4538 ms 1.4577 ms 1.4616 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) low mild
  4 (4.00%) high mild
```
That said, it doesn't do us any favors until we get well above N=100 (startup seems to be around 40µs, which sounds about right).
Note that the single-threaded version at N=100 is slower than the 900 ns we timed earlier: mainly because this one is actually saving the contents of each loop and collecting them into a vector.
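The parallel benchmark code isn't shown above, but here's a sketch of the same shape using only std::thread (the real version might well use rayon instead): split the range into chunks, compute each chunk on its own thread, then stitch the results back together in order.

```rust
use std::thread;

// Contrived parallel FizzBuzzBaz: each thread computes a contiguous
// chunk of lines, and the chunks are concatenated in order at the end.
pub fn parallel_contrived_no_print_fizz(n: u32, threads: u32) -> Vec<String> {
    let chunk = n.div_ceil(threads).max(1);
    let mut handles = Vec::new();
    for t in 0..threads {
        let lo = t * chunk + 1;
        let hi = ((t + 1) * chunk).min(n);
        handles.push(thread::spawn(move || {
            // lo..=hi is empty when lo > hi, so extra threads just idle
            (lo..=hi)
                .map(|i| {
                    let mut s = String::new();
                    if i % 3 == 0 { s += "Fizz"; }
                    if i % 5 == 0 { s += "Buzz"; }
                    if i % 7 == 0 { s += "Baz"; }
                    if s.is_empty() { i.to_string() } else { s }
                })
                .collect::<Vec<_>>()
        }));
    }
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```

The per-thread spawn cost is exactly the startup overhead the benchmark exposes: at N=100 it swamps the actual work, while at N=100000 the parallel split pays off.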
This is a bit of a low point to end on, so let's do one last thing to lift our spirits. Rather than emphasizing runtime performance, let's flex with some code golf. And what better way to play code golf than with an overcomplicated, Rube Goldberg driver?
Let's build a procedural macro for fizzbuzz.
First, we'll define a couple of structs to make our lives easier:
```rust
/// Represents a single FizzBuzz mapping, e.g. `(3, "Fizz")`
struct FizzBuzzRule {
    divisor: Expr,
    word: Expr,
}

/// Represents the full macro input, e.g. `100, [(3, "Fizz"), (5, "Buzz")]`
struct FizzBuzzInput {
    max: LitInt,
    rules: Vec<FizzBuzzRule>,
}
```
Then a quick little parser:
```rust
/// Attempts to parse the ParseStream input as a FizzBuzzInput
impl Parse for FizzBuzzInput {
    fn parse(input: ParseStream) -> Result<Self> {
        // take the first token in the parse stream and parse it as an integer literal
        let max: LitInt = input.parse()?;
        // TODO make sure that max is not 0

        // parse the comma
        input.parse::<Token![,]>()?;

        // parse the array of tuples containing the rules
        // there better not be any more tokens beyond this that we're missing!
        let rules_array;
        syn::bracketed!(rules_array in input);
        let rules_punctuated: Punctuated<(Expr, Token![,], Expr), Token![,]> =
            rules_array.parse_terminated(
                |stream| {
                    let tuple_content;
                    syn::parenthesized!(tuple_content in stream);
                    let divisor: Expr = tuple_content.parse()?;
                    let comma: Token![,] = tuple_content.parse()?;
                    let word: Expr = tuple_content.parse()?;
                    Ok((divisor, comma, word))
                },
                Token![,],
            )?;

        let rules = rules_punctuated
            .into_iter()
            .map(|(divisor, _, word)| FizzBuzzRule { divisor, word })
            .collect();

        Ok(FizzBuzzInput { max, rules })
    }
}
```
And finally, put it all together:
```rust
/// Produces a buffered FizzBuzz implementation at compile time from the input maximum and mappings
#[proc_macro]
pub fn fizzbuzz(input: TokenStream) -> TokenStream {
    // call the Parse implementation we wrote above
    let FizzBuzzInput { max, rules } = parse_macro_input!(input as FizzBuzzInput);

    // create the if checks programmatically
    let checks = rules.iter().map(|rule| {
        let divisor = &rule.divisor;
        let word = &rule.word;
        // the text inside the quote! macro is replicated verbatim (except for the
        // #divisor and #word variables) in the code generated at compile time
        quote! {
            if i % #divisor == 0 {
                write!(&mut output_buffer, "{}", #word).unwrap();
                written = true;
            }
        }
    });

    // assemble the final block of code we'll substitute for the macro
    let expanded_code = quote! {
        {
            use std::io::{stdout, Write as IoWrite};
            use std::fmt::Write as FmtWrite; // avoid name collision

            // coerce the maximum to at least 1
            let input_max = #max as usize;
            let max = input_max.max(1);

            // pre-allocate a buffer, 20 bytes per row (should be enough for small rules)
            let mut output_buffer = String::with_capacity(max * 20);

            for i in 1..=max {
                let mut written = false;
                // aka 'the checks token stream is going to be parsed as an iterator of
                // things which you should then also expand'
                // if an if statement had a type, this would be of type [IfStatement]
                #(#checks)*
                // because we're just writing everything to the buffer, we don't need to
                // care about distinguishing between the numbers and words
                if !written {
                    write!(&mut output_buffer, "{}", i).unwrap();
                }
                output_buffer.push('\n');
            }

            let mut handle = stdout().lock();
            handle.write_all(output_buffer.as_bytes()).unwrap();
        }
    };

    // convert back to a TokenStream to be expanded and compiled at compile time
    TokenStream::from(expanded_code)
}
```
And just like that, we have a nice, one-line, extensible, and fairly performant implementation of fizzbuzz built programmatically at compile time.
```rust
fizzbuzz!(105, [(3, "Elementary, "), (5, "my dear "), (7, "Watson")]);
```
Thank you for reading and following along! If you found this amusing and slightly educational, please give the repo a star and share it in your communities.
Join the conversation on Lobste.rs.
And if you're the kind of nerd who read all the way to the end... I'll see you at RustConf2025.