Calculating the Fibonacci numbers on GPU
21 Jun, 2025
In this blogpost we will show how to calculate the Fibonacci sequence very fast using GPU programming. We will employ Thrust, an NVIDIA library which uses concepts from modern C++ to make GPU programming easy.
Introduction
Scan is one of the fundamental examples of a parallelizable algorithm. If you are interested in the foundations of the algorithm, I refer you to a previous blogpost where I implemented scan in CUDA.
Scan is the operation which transforms
x = [x_1, ..., x_N]
into
y = [y_1, ..., y_N].
We call the scan inclusive iff
y_i = x_1 ⊕ ... ⊕ x_i
and exclusive iff
y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_{i-1},
where ⊕ is an associative binary operator and x_0 is an initial element.
Simple exclusive scan in Thrust
This is how to perform an exclusive scan using Thrust:
```cpp
thrust::device_vector<int> in{1, 2, 3};
thrust::device_vector<int> out(3);
```
By default the plus operator is applied. Here we use 0 as the initial element x_0.
```cpp
thrust::exclusive_scan(thrust::device, in.begin(), in.end(), out.begin(), 0);
```
This will output
out: 0; 1; 3;
We can similarly apply the multiply operator:
```cpp
thrust::exclusive_scan(
    thrust::device, in.begin(), in.end(), out.begin(), 1,
    [=] __host__ __device__ (int cur, int prev) { return cur * prev; });
```
where we use a simple lambda function to define the binary operation. This will output
out: 1; 1; 2;
Matrix scan operation
Thrust is very flexible, and matrix multiplication is associative. This allows us to perform a scan over matrices!
```cpp
// 2x2 matrix
// a00, a01, a10, a11
using Matrix = thrust::tuple<int, int, int, int>;
...
Matrix scale_two{thrust::make_tuple(2, 0, 0, 2)};
Matrix identity_matrix{thrust::make_tuple(1, 0, 0, 1)};
...
thrust::exclusive_scan(thrust::device, in.begin(), in.end(), out.begin(),
                       identity_matrix, matmul_lambda);
...
```
This will output
out: [1, 0, 0, 1] [2, 0, 0, 2] [4, 0, 0, 4] [8, 0, 0, 8] [16, 0, 0, 16]
At each step our scan operation is simply a multiplication by the diagonal matrix with 2 on the diagonal. The result at step n therefore contains the corresponding power of 2. It is easy to verify on paper that this is what we expect.
The matmul can be easily defined as a lambda as follows:
```cpp
// Multiply the current matrix from the left onto the previous one as scan op.
auto matmul_lambda = [=] __host__ __device__ (const Matrix& x, const Matrix& y) {
    auto a00 = thrust::get<0>(x);
    auto a01 = thrust::get<1>(x);
    auto a10 = thrust::get<2>(x);
    auto a11 = thrust::get<3>(x);
    auto b00 = thrust::get<0>(y);
    auto b01 = thrust::get<1>(y);
    auto b10 = thrust::get<2>(y);
    auto b11 = thrust::get<3>(y);
    auto c00 = a00 * b00 + a01 * b10;
    auto c01 = a00 * b01 + a01 * b11;
    auto c10 = a10 * b00 + a11 * b10;
    auto c11 = a10 * b01 + a11 * b11;
    return thrust::make_tuple(c00, c01, c10, c11);
};
```
Fibonacci in terms of matrices
We are now ready to calculate the Fibonacci numbers on the GPU. Note the following identity, which you can convince yourself of by calculating it on a piece of paper:

Q^n = [[1, 1], [1, 0]]^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]]
That means we can simply calculate F_n by performing the scan with the matrix Q and obtain the Fibonacci number by reading off one of the antidiagonal entries.
```cpp
const int N = 32;
...
Matrix fibonacci{thrust::make_tuple(1, 1, 1, 0)};
Matrix identity_matrix{thrust::make_tuple(1, 0, 0, 1)};
thrust::device_vector<Matrix> in(N, fibonacci);
thrust::device_vector<Matrix> out(N);
...
thrust::exclusive_scan(thrust::device, in.begin(), in.end(), out.begin(),
                       identity_matrix, matmul_lambda);
...
```
This will output
out: 0;1;1;2;3;5;8;13;21;34;55;89;144;233;377;610;987;1597;2584;4181;6765;10946;17711;28657;46368;75025;121393;196418;317811;514229;832040;1346269;
which are indeed the correct results, as we can verify for example using Wolfram Alpha.
Going large
To calculate extremely large Fibonacci numbers we need to avoid integer overflow. A simple way to avoid that is to take everything modulo a given number inside the matrix multiplication. We can then verify that we obtain the correct result modulo this number using Wolfram Alpha. Using the power of GPUs we are able to calculate F_99999999 in only 17 milliseconds on my consumer GPU, an NVIDIA GeForce RTX 3060 Mobile. You can verify the correctness yourself by checking out my code on GitHub. I obtain as a result
Last element of out (F(99999999) mod 9837): 7558
which can again be verified using Wolfram Alpha.
Conclusion
I hope this blogpost illustrated what a powerful operation scan is and how easy it is to implement quite complicated operations using Thrust. The idea of calculating the Fibonacci numbers using scan comes from an exercise given by Guy Blelloch. I suggest checking this paper out for a deeper dive into the scan operation.