Optimi-Zi(n)g Sudoku-Solving
26 July 2025 , in Olivier's log
One of the first program that I wrote in Zig (in September 2023) was a Sudoku-Solver, implementing the dancing-links (DLX) algorithm. I decided to revisit this program recently to experiment with benchmarking and try to increase its speed.
Dancing-Links (DLX) algorithm applied to sudoku
The Dancing-Links algorithm is an efficient backtracking algorithm to solve "exact-cover" problems, by using a matrix of 0 and 1s.
Dancing Links on Wikipedia
In an exact-cover problem, there is a bunch of constraints (modelled as columns of a matrix) and many partial solutions (modelled as lines of a matrix). The goal is to pick some solutions, so that all constraints are met exactly once. In the matrix-modelisation a 1 in a cell means that this partial solution meets the given constraint. Solving the exact-cover problem transposes to selecting rows, so that each column as exactly one 1 in it.
To convert a sudoku to an exact-cover problem, we generate 4*9*9 columns to represent the following constraints:
a given number must appear exactly once in each sudoku-row (9*9 constraints)
a given number must appear exactly once in each sudoku-column (9*9 constraints)
a given number must appear exactly once in each 3*3-block (9*9 constraints)
there must be a number in each sudoku-cell (9*9 cells)
Assigning a number to a sudoku-cell satisfies exactly 4 constraints. For instance putting a 1 in the top-left sudoku-cell satisfies:
the need for a 1 in the topmost row
the need for a 1 in the leftmost column
the need for a 1 in the top-left 3*3-block
the need for a number in the top-left cell
Since there are 81 cells we get 9*81 lines with exactly 4 1s per line.
Out of the 236.196 cells only 2.916 are 1s (sparse-matrix). Instead of having the whole matrix in memory, only the "1-cell" are instanciated, with a pointer to its "neighbors" (top, right, bottom, left).
Doing some napkin-math, the whole-matrix as bit-set array should take about 29KB of space. Using 4*64-bits pointers in the sparse version takes more than 3 times the space (~92KB). So for the standard 9*9 sudoku, the sparse matrix is not space-efficient. Maybe the space-usage could be improved by using indexes instead of 64-bits pointers?
At the beginning some cells already have assigned values, so we know that those lines must be part of the solution. Selecting a line for a solution means that its 4 constraints are met and that all other lines which have 1s in common columns with the selected line must be ignored. In the example above of the 1 in the top-left cell, we must now ignore:
all other 8 lines which set a 1 in the topmost row somewhere else
all other 8 lines which set a 1 in the leftmost column somewhere else
all other 8 lines which set a 1 in the top-left 3*3-block somewhere else (note that some of those lines were also matched by the previous descriptions)
all other 8 lines which set a number in the top-left cell
Solving the whole exact-cover problem corresponds to selecting lines until all constraints are met.
If a constraint cannot be met because all lines are already ignored, it means that there is no solution (so we likely made a wrong choice some steps earlier).
We now have the steps to solve any puzzle!
1. Setup the matrix, with all constraints (columns) and partial solutions (lines) 2. Select the pre-filled lines (its constraints are met and the other lines are ignored) 3. If all constraints are satisfied, SUCCESS! (return true) 4. Of the unsatisfied constraints, take the one with the fewest 1s. 5. If there is a constraint without any 1, FAILURE (return false to backtrack) 6. Otherwise consider this constraint met and loop over the lines with a 1: a) Consider the line to be part of the solution b) Ignore the other lines which have a 1 in common in any column with the current line c) Call 3. recursively. On failure, undo b) and try with the next line in a) d) If no line was suitable, FAILURE (return false to backtrack)
And this is how the whole implementation is done. The sudoku solver instanciates a DLX solver with the right number of columns, adds all partial solutions as lines, select the solutions corresponding to the pre-filled cells and calls the `solve` method. The list of partial solutions can then be translated back to sudoku cells before returning.
Optimizing allocations
Unfortunately I realized far too late that proper benchmarking needs a stable representative input corpus and precise tools (like hyperfine or poop).
Not being used to manually manage memory allocations (Go, Python, JavaScript, PHP... plenty of options!), I initially called `allocator.create()` for each needed cell (using an arena to deallocate them all at once at the end). It turns out that allocating 3000 times is reasonbally fast and the program was able to solve puzzles in 1ms on average.
A simple optimization consisted in allocating all the 4 cells of a line at once (so about 750 allocations per puzzle). This brought a nice speedup: 0.54ms on average.
But we can go further and allocate all the cells at once (so less than 100 allocations per puzzle), allowing to reach 0.46ms on average.
The remaning allocation were due to the ArrayList holding the solution lines. Since we know how many cells are empty at the start, we can pre-allocate its size (with `initCapacity`) and arrive at 3 allocations per puzzle, 0.30ms on average:
one allocation for the column headers
one allocation for the line cells
one allocation for the solution
Using the right tool
At that point I was stuck and tried to remove the recursion. The code was harder to understand and not faster. I tried other various blind adjustments, but they did not yield anything.
I then gave Hotspot a try and immediately saw that a modulo operation was suspiciously slow (`...%line_len' was slower than the instructions around, which was strange, because `line_len` is 4, so this should be very-fast bit-shifting). I added a `comptime` to the `line_len` argument and this line perfomance impact dropped. The overall performance did not improve significantly though.
Hotspot - The Linux perf GUI for performance analysis
At some point, I set the zig flag `-Doptimize=ReleaseFast` which gave an immediate boost: 0.13ms per puzzle.
In Hotspot the few allocations still accounted for a significant part of the flamegraph. So the next step was to remove those allocations as well.
Inspired by the standard library, I turned the DLX struct to an "unmanaged" usage: the caller must take care to provide memory for the Cells. This works fine for the sudoku solver, since the number of cells is puzzle-independant. Sudoku solving doesn't need the heap anymore and uses exclusively the stack.
The flamegraph profile now clearly shows two parts: the matrix preparation and the actual solving. Uninspired to further tweak the solving part (removing the recursion was not helpful) I attempted to optimize the preparation part.
The sudoku preparartion consists of two step: setting up the cells and consider the pre-filled numbers. The naive way is to set up all the cells and the eliminate the pre-filled lines. However since there are many such cells, setting up and then eliminating many lines is quite costly. An optimizition consists in skipping the setup for the partial solutions which are not needed.
After trying a couple of convoluted ways, I found a more straightforward techhique: mark some columns as already covered and skip the line addition, when such line contains an already covered column (eventually going back to the beginning of the line).
After an upgrade to zig 0.14.1, the program takes 0.061ms on average per puzzle (yes, my benchmarks are not really comparable, sorry!).
Conclusion
Zig code of the DLX-based sudoku solver
This was a fun project, where I learned quite a lot about zig (my main goal), how to measure and improve the performance of my software.
The current throughput can be improved by using multithreading. The puzzle-time stays the same, but solving 250k puzzles goes from 15s to 1.8s (on my machine, with 20 threads and 128 puzzles per chunk). So about 138k puzzles per second (or 500 million per hour), which seems pretty good compared to other approaches:
Python 3 Jupyter notebook, by Peter Norvig
In swift, by Mia Koring
I enjoyed revisting this DLX sudoku solver. I think that the logic could be adapted to solve the N-Queens problem (and the LinkedIn variant when removing the comptime type). Maybe for another post (hopefully with proper benchmarking).