Calculating 30000 Pi digits in 10 seconds using multi-threaded programming
Photo by Lyuboslav Karev
One way is to use the Ramanujan series (a member of the Ramanujan–Sato family of series), which gives a close approximation of the number pi after only a few terms:

1/pi = (2 * sqrt(2) / 9801) * Σ (n = 0 to ∞) of (4n)! * (1103 + 26390n) / ((n!)^4 * 396^(4n))
Prerequisites
Although I’ve tried to keep this article simple to understand, it expects you to have some basic knowledge of threads and multi-threading, as well as of the Rust language syntax.
A simple program
simple.rs (click here to view the code on GitHub)
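The core of the program can be sketched like this (a minimal reconstruction, not the exact contents of simple.rs; the `max_elements` name follows the article):

```rust
// n! as a 64-bit integer -- this is the part that will overflow later.
fn factorial(n: u64) -> u64 {
    (1..=n).product()
}

fn ramanujan_pi(max_elements: u64) -> f64 {
    let mut sum = 0.0_f64;
    for n in 0..max_elements {
        // (4n)! * (1103 + 26390n)
        let num = factorial(4 * n) as f64 * (1103.0 + 26390.0 * n as f64);
        // (n!)^4 * 396^(4n)
        let den = (factorial(n) as f64).powi(4) * 396.0_f64.powi(4 * n as i32);
        sum += num / den;
    }
    // 1/pi = (2*sqrt(2)/9801) * sum  =>  pi = 9801 / (2*sqrt(2) * sum)
    9801.0 / (8.0_f64.sqrt() * sum)
}

fn main() {
    // With only 2 elements, f64 is already saturated at ~16 digits.
    println!("{:.16}", ramanujan_pi(2));
}
```

Note that `factorial(4 * n)` already has to compute 36! once `max_elements` reaches 10, while 20! is the largest factorial that fits in a `u64`.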
Our program has calculated the first 16 digits of Pi. The real value of Pi, however, is 3.1415926535897932, from which our answer differs. Let's try to sum the first 10 elements of the series, to see if we get better accuracy. We can do that by changing the `max_elements` variable to 10.
… and it seems that our program has crashed. The reason is: `attempt to multiply with overflow`. When we try to multiply large numbers while computing the factorials, the result no longer fits in the 64-bit integer variables we use (this is called an overflow). How can we get past this?
A simple program, with big numbers
When we need to support bigger numbers (or more precise ones), we can use an external library that provides integers and floating-point numbers with more precision than the default types. We will be using the rug library. From it, we will use the `Float` type, which offers a lot more precision and space. One thing to note is that when using `Float`, we need to set the precision we require; it is specified in bits, and 10 000 bits corresponds to roughly 3 000 decimal digits. For now, we will set it to 10 000.
The code looks a lot uglier now, but at least it can calculate around 3 000 digits of Pi in an instant:
Let’s try to increase the amount of elements in the sum to 1000. I’ll show only the first 5 digits of the calculated result, to save space in the output. I’ll also add a timer to measure how long the program takes to execute, so that we can compare the performance of the different versions of the code.
simple_big_nums.rs (GitHub link here)
By setting the precision to 100 000 bits and the amount of elements to 5000, we can calculate around 30 000 digits of Pi. However, the time needed to calculate them is around 3 minutes.
Recurrent calculations
Before we try to run this task on multiple threads, we should analyze what is taking so long to compute. In our case, the factorials are really slow, and they take up most of the time. On top of that, we don’t always need to compute a factorial whose value we already know. We should aim to reduce the amount of factorials we calculate. One thing I noticed is that we can transform this series into a recurrent one.
Let’s try and calculate the value of two consecutive elements of the sum, to see if we get any meaningful relation between them. Let those elements be n = k and n = k + 1:

a_k = (4k)! * (1103 + 26390k) / ((k!)^4 * 396^(4k))

a_{k+1} = (4(k+1))! * (1103 + 26390(k+1)) / (((k+1)!)^4 * 396^(4(k+1)))

We can divide them, to see if we can get anything meaningful. The ratio a_{k+1} / a_k cancels out almost all of the factorials, and finally we get:

a_{k+1} = a_k * (4k+1)(4k+2)(4k+3)(4k+4) * (26390k + 27493) / ((k+1)^4 * 396^4 * (26390k + 1103))

The whole formula now looks like this:

1/pi = (2 * sqrt(2) / 9801) * (a_0 + a_1 + a_2 + …), with a_0 = 1103 and every following element computed from the previous one
The program using these recurrent calculations now looks like this:
recurrent_single_thread.rs (View on Github)
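To make the recurrence concrete, here is a sketch of it using plain f64 for brevity rather than rug's `Float`, so it demonstrates only the recurrence, not the big precision of recurrent_single_thread.rs:

```rust
// a_{k+1} = a_k * (4k+1)(4k+2)(4k+3)(4k+4)(26390k + 27493)
//                / ((k+1)^4 * 396^4 * (26390k + 1103))
fn next_element(a_k: f64, k: f64) -> f64 {
    let num = (4.0 * k + 1.0) * (4.0 * k + 2.0) * (4.0 * k + 3.0)
        * (4.0 * k + 4.0) * (26390.0 * k + 27493.0);
    let den = (k + 1.0).powi(4) * 396.0_f64.powi(4) * (26390.0 * k + 1103.0);
    a_k * num / den
}

fn ramanujan_pi_recurrent(max_elements: u64) -> f64 {
    let mut a = 1103.0_f64; // a_0 -- no factorials needed any more
    let mut sum = a;
    for k in 0..max_elements.saturating_sub(1) {
        a = next_element(a, k as f64);
        sum += a;
    }
    9801.0 / (8.0_f64.sqrt() * sum)
}

fn main() {
    println!("{:.16}", ramanujan_pi_recurrent(3));
}
```

Each step is now a handful of multiplications and one division, instead of recomputing ever-growing factorials from scratch.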
This way, we again calculate 30 000 digits of Pi from 5000 elements of the series, but this time the calculations took 8.73 s.
Let’s increase the amount of elements we are calculating to 100 000, to see how long it will take.
Multi-threading introduction
Design approach
Before starting to write any code, we have to think about how we will split the work, so that the amount of time needed to complete it is even. We don’t want to be in a situation where one of the threads is doing heavy computational work, while the other threads are just sitting idle.
One way to build a multi-threaded application is to use the Master-Worker pattern (you can learn more about it here). We have a main thread, which hands out tasks, and workers, which, as the name suggests, do the actual work. This way, we can split the calculation of the elements of the series into equal parts: each thread gets some part of the sum to calculate. Once it’s done, the worker thread reports back to the main thread. The main thread waits for all threads to finish their work and sums up the different parts of the series.
In theory this seems like a good approach: each thread does some work, those threads work in parallel, and we should get really fast results. However, there are two main problems that need to be solved before coding the final solution: balancing the work, and the way the recurrent calculations work.
Splitting the work — how?
In the single-threaded version of our code, the first thing we do is calculate a_0, and from there we can just use the formula we have: a_{k+1} = a_k * C_k. For the multi-threaded version, let’s say that we have to calculate 100 elements of the series and we have two threads — `t1`, which will calculate from the 0th to the 50th element, and `t2`, which will calculate from the 51st to the 100th. The problem now becomes that `t2` needs the 50th element already calculated in order to start its work. However, `t1` and `t2` are going to run in parallel, and the 50th element will not have been calculated until `t1` finishes its part. `t2` can't just wait for `t1` to finish before it starts working (that would put us right back in the situation we have currently, with only one thread working at a time).
A solution to this is to calculate each starting element not with the recurrent formula, but with the initial one that we have, the same way we calculate a_0 in the single-threaded version. This creates another problem, however…
A balancing act
Continuing with the scenario above, `t1` has to initially compute a_0 with the original formula, while `t2` has to compute a_50. A quick reminder of how we calculate those initial elements:

a_0 = (4 * 0)! * (1103 + 26390 * 0) / ((0!)^4 * 396^0) = 1103

a_50 = (4 * 50)! * (1103 + 26390 * 50) / ((50!)^4 * 396^200)

As you can see, computing a_0 requires `t1` to calculate the factorial of 4 * 0 = 0, while `t2` has to deal with the factorial of 200, which is really slow to compute. From this, we can see that `t1` has a lot less work than `t2`, and our overall performance will be limited by the slowest thread (in this case, `t2`). One thing to consider when dividing the work is to not split the elements of the series into `t` parts (where `t` is the amount of threads we will be creating), but to divide them into smaller tasks. Each thread then gets assigned a task, and when it has completed its work, it can take the next available task from the main thread.
In this scenario, we can divide the 100 elements of the series into batches of 10 elements. `t1` gets 0–9, while `t2` gets 10–19. While `t2` is sweating trying to calculate the big factorials, `t1` can just go and take the next task, which in our case will be 20–29. This continues until there are no more tasks to complete.
Final solution
In the final solution, the logic is split into two files: `main.rs` and `calc.rs`.
Let’s quickly go over the other notable changes in the code.
The first thing we notice is the addition of a command-line interface for setting the amount of threads to start and the amount of elements in the sum to calculate. I’ve also added some logging, to signal when a thread has started and stopped working.
The communication between the threads is done using the channels provided by Rust (you can find more about channels here and about inter-process communication here). As you can see in `calc.rs`, in the `Task` struct we keep the sender part of the channel, so that we can send data to the main thread. Apart from that, we keep the start and end indexes, which designate which part of the sum should be calculated. In the `work` function, we calculate the initial element, and then we use `calculate_a_n_from_previous()` to calculate all the remaining elements. Once we are done with that, we send the result over to the main thread.
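The shape of `Task` and `work` can be sketched as follows. This uses f64 instead of rug's `Float` to stay self-contained, and `initial_element` is my name for the direct-formula computation; only `Task`, `work`, `calculate_a_n_from_previous` and the start/end/sender fields come from the article's description:

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// a_n straight from the original formula (only valid for small n with
// f64 -- for large n this overflows, which is why the real code uses rug).
fn initial_element(n: u64) -> f64 {
    let factorial = |m: u64| (1..=m).map(|x| x as f64).product::<f64>();
    factorial(4 * n) * (1103.0 + 26390.0 * n as f64)
        / (factorial(n).powi(4) * 396.0_f64.powi(4 * n as i32))
}

// a_{k+1} from a_k, using the recurrence derived earlier.
fn calculate_a_n_from_previous(a_k: f64, k: u64) -> f64 {
    let k = k as f64;
    let num = (4.0 * k + 1.0) * (4.0 * k + 2.0) * (4.0 * k + 3.0)
        * (4.0 * k + 4.0) * (26390.0 * k + 27493.0);
    let den = (k + 1.0).powi(4) * 396.0_f64.powi(4) * (26390.0 * k + 1103.0);
    a_k * num / den
}

struct Task {
    start: u64,
    end: u64, // exclusive
    sender: Sender<f64>,
}

impl Task {
    fn work(self) {
        // The first element of the range uses the original formula...
        let mut a = initial_element(self.start);
        let mut partial_sum = a;
        // ...and every following element reuses the previous one.
        for k in self.start..self.end - 1 {
            a = calculate_a_n_from_previous(a, k);
            partial_sum += a;
        }
        // Report the partial sum back to the main thread.
        self.sender.send(partial_sum).unwrap();
    }
}

fn main() {
    let (tx, rx) = channel();
    // Two workers covering elements 0..2 and 2..4 of the series.
    for (start, end) in [(0, 2), (2, 4)] {
        let task = Task { start, end, sender: tx.clone() };
        thread::spawn(move || task.work());
    }
    drop(tx); // close our copy of the sender so rx.iter() can end
    let sum: f64 = rx.iter().sum();
    println!("{:.15}", 9801.0 / (8.0_f64.sqrt() * sum));
}
```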
In the `main` function of the program, we set up our result variable, as well as a double-ended queue (also known as a "deque") that holds the start and end indexes of the different parts of the sum. Each entry in this queue is a tuple containing the start and end indexes of one task. When a new thread needs to be started, we can quickly pop the first element off the queue. The amount of elements in those smaller sums is equal to `(amount_of_elements / thread_count) / amount_divisor`; by default, `amount_divisor` is set to 2.
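The queue set-up described above can be sketched as follows (a simplified version; the variable names follow the article, but the actual main.rs may slice the work slightly differently):

```rust
use std::collections::VecDeque;

// Build the deque of (start, end) tuples described above.
fn build_tasks(
    amount_of_elements: u64,
    thread_count: u64,
    amount_divisor: u64,
) -> VecDeque<(u64, u64)> {
    // Each task covers (amount_of_elements / thread_count) / amount_divisor
    // elements of the series (at least one).
    let batch = ((amount_of_elements / thread_count) / amount_divisor).max(1);
    let mut tasks = VecDeque::new();
    let mut start = 0;
    while start < amount_of_elements {
        let end = (start + batch).min(amount_of_elements);
        tasks.push_back((start, end));
        start = end;
    }
    tasks
}

fn main() {
    // 100 elements, 2 threads, divisor 2 -> four tasks of 25 elements each.
    for (start, end) in build_tasks(100, 2, 2) {
        println!("task: elements {start}..{end}");
    }
}
```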
After the preparation is done, we start the first threads: each thread gets a task from the deque and starts calculating its part of the sum. When a thread is ready with its calculations, its result is taken from the channel and added to the total result of the sum. The thread is stopped, and a new one is spawned in its place with a new task. If there are no new tasks, we wait for all threads to finish their work and collect their results.
Once all elements of the sum have been calculated, we print out the result to the standard output.
Testing the code
When benchmarking a multi-threaded application, we take two values: the execution time of the program when a single thread is used (let's call it T1), and when `k` threads are used (let's call it Tk). The speed-up Sk of the program is equal to T1/Tk. We can also measure the efficiency, which is calculated by dividing Sk by `k`. The closer this coefficient is to one, the better our multi-threaded program is at utilizing the resources it has.
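In code, the two metrics are just (a trivial helper, not taken from the repository; the timing numbers in the usage example are hypothetical):

```rust
// Speed-up S_k = T1 / Tk; efficiency = S_k / k, as defined above.
fn speedup_and_efficiency(t1_seconds: f64, tk_seconds: f64, k: f64) -> (f64, f64) {
    let speedup = t1_seconds / tk_seconds;
    (speedup, speedup / k)
}

fn main() {
    // Hypothetical numbers: 1 thread takes 40 s, 8 threads take 5.5 s.
    let (s, e) = speedup_and_efficiency(40.0, 5.5, 8.0);
    println!("speed-up: {s:.2}, efficiency: {e:.2}");
}
```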
For testing, I used an AWS EC2 m5.8xlarge instance with a 16-core processor and hyper-threading disabled. For the measurements, I built the `release` version of the program, rather than the debug one; debug builds usually have some debugging tools built into them, which can slow down performance. Building the `release` target is done with `cargo build --release`. Testing was done via the `time_run.sh` script, which runs the program `n` times and prints out how long each run took:
Results of the 5 runs with different amounts of threads
Ideally, the speed-up should be equal to `k`. As you can see, we are really close to the ideal speed-up when running up to 8 threads. When running on all 16 cores, we don't get the same speed-up, but it's still a really good result. The efficiency also shows that our program is good at using the available resources.
Speed-up chart
Conclusion
We’ve created a program that calculates the sum of a series using multi-threaded programming. The program was written in the Rust programming language, which is famed for its good multi-threading support. With a bit of optimization, we managed to calculate 30 000 digits of Pi in around 3 seconds!
The full code can be found here:
[GitHub - lyubolp/multithreaded-pi-calculations-rust: Program that calculates the number Pi with big precision](https://github.com/lyubolp/multithreaded-pi-calculations-rust)
I’d love to hear your feedback in the comments below, as this is the first article I have ever written.