Calculating 30000 Pi digits in 10 seconds using multi-threaded programming

Calculating 30000 Pi digits in 10 seconds using multi-threaded programming

Photo by Lyuboslav Karev

One way is to use the Ramanujan–Sato series, which can give a close approximation of the number pi:

Ramanujan-Sato series

A simple program

Prerequisites

Although I’ve tried to keep this article simple to understand, this article expects you to have some basic knowledge about what threads are, what multi-threading is and basic knowledge on the Rust language syntax.

A simple program

simple.rs (click here to view the code in Github)

Our program has calculated the first 16 digits of Pi. The real value of Pi however is: 3.1415926535897932, from which our answer differs. Let's try to sum the first 10 elements of the series, to see if we get better accuracy. We can do it, by changing the max_elements variable to be equal to 10.

… and it seems that our program has crashed. The reason seems to be: attempt to multiply with overflow. It seems like when we try to multiply large numbers, the result doesn't fit in the 64-bit float variable that we have (this is called an overflow). How can we get past this ?

A simple program, with big numbers

When needing to support bigger numbers (or smaller numbers, with a lot more precision), we can use an external library, that provides integers and floating-point numbers with bigger precision, compared to the default ones. We will be using the rug library. From it, we will use the Float type, which offers a lot more precision and space. One thing to note, is that when using Float, we need to set the precision (the amount of digits after the decimal point) we require. For now, we will set it to 10 000.

The code looks a lot uglier now, but at least it can calculate around 3 000 digits of Pi in a instant:

Let’s try to increase the amount of elements in the sum to a 1000. I’ll show only the first 5 digits of the calculated result, to save space in the output. Also I’ll be adding a timer, to measure how much time it takes to execute the program, so that we can compare the performance of the different versions of the code.

simple_big_nums.rs (Github link here)

By setting the precision to 100 000 and the amount of elements to 5000, we can calculate around 30 000 digits of Pi.

However, the time needed to calculate them, is around 3 minutes.

Recurrent calculations

Before we try and run this task on multiple threads, we could analyze what is taking too long to compute — in our case, the factorials are really slow to compute, and they take a lot of time. On top of that, we don’t always need to compute a factorial that we already know the value of. We should aim to reduce the amount of factorials we calculate. One thing I noticed, is that we could transform this series into a recurrent one.

Let’s try and calculate the value of two consecutive elements of the sum, to see if we get any meaningful relation between them. Let those elements be n=k and n=k+1

We can divide them, to see if we can get anything meaningful.

Finally, we get:

The whole formula now looks like this:

The program using this recurrent calculations will now look like this:

recurrent_single_thread.rs (View on Github)

With this way, we again calculate 30 000 digits of Pi, from 5000 elements of the series, but this time, the calculations took 8.73s

Let’s increase the amount of elements we are calculating to 100 000, to see how long it will take.

Multi-threading introduction

Design approach

Before starting to write any code, we have to think about how we will be splitting the work, so that the amount of time needed to complete the work is even. We don’t want to be in a situation where one of the threads is doing heavy computational work, while the other threads are just sitting in idle.

One way to have a multi-threaded application, is to use the Master Worker pattern (you can learn more about it here). We have a main thread, which will hand out task, and workers, which as the name suggests, will do the actual work. This way, we can split the calculations of the elements of the series into equal parts — each thread gets some part of the sum to calculate. Once it’s done, the worker thread will report back to the main thread. The main thread will wait out for all threads to finish their work, and sum up the different parts of the series.

In theory this seems like a good approach — each thread does some work, those threads work in parallel, and we should be getting really fast results. However, there are two main problems that need to be resolved before starting to code the final solution — the balancing of the work & the way recurrent calculations work.

Splitting the work — how ?

In the single-threaded version of our code, the first thing we do, is to calculate a0 first, and from there, we could just use the formula we have: ak+1 = ak * Ck. For the multi-threaded version, let’s say that we have to calculate 100 elements of the series, and we have two threads — t1, which will calculate from the 0th to 50th element, and t2 which will calculate from the 51st element to the 100th. The problem now becomes that t2 has to have the 50th element calculate, in order to start the calculations. However, t1 and t2 are going to run in parallel and the 50th element will not have been calculated until t1 finishes it's work. t2 can't wait for t1 to finish up, before it can start working (this will be the same situation we have currently - only one thread working at a time).

A solution to this, is to calculate the starting element without the recurrent formula, but with the initial one that we have — the same way we calculate a0 in the single-threaded version. This will create another problem, however…

A balancing act

Continuing with the scenario above, t1 has to initially compute a0 with the original formula, while t2 has to compute a50. A quick reminder of how we calculate those initial elements:

As you can see, computing a0 requires t1 to calculate the factorial of 4 times 0, which is 0, while t2 has to deal with 200, which is really slow to compute. From this, we can see that t1 has a lot less work than t2, and our performance will be limited to the slowest thread currently (in this case, t2). One thing to consider is, when dividing the tasks (or the work that will be done by each of the threads), is to not split the elements of the series into t parts (where t will be the amount of threads that we will be creating), but to divide it into smaller tasks. Then each thread get assigned a task and when the thread has completed it's work, it can take the next available task from the main thread.

In this scenario, we can divide the 100 elements of the series, into batches of 10 elements. t1 gets 0-9, while t2 gets 10-19. While t2 is sweating trying to calculate the big factorials, t1 can just go and take the next task, which in our case will be 20-29. This will continue until there are no more task to complete.

Final solution

In the final solution, the logic is split into two files — main.rs and calc.rs

main.rs (View on Github)

calc.rs (View on Github)

Let’s quickly go over the other notable changes in the code.

First thing we notice, is the addition of a command-line interface, for setting the amount of threads to start, and the amount of elements in the sum to be calculated. I’ve also added some logging, to signal when a thread has started and stopped working.

The communication between the threads is done using channels provided by Rust (you can find more about channels here and about inter-process communication here) . As you can notice, in calc.rs, in the Task struct, we keep the sender part of the channel, so that we can send data to the main thread. Apart from that, we keep the start and end indexes, which are used to designate which part of the sum should be calculated. In the work function, we calculate the initial element, and then we use calculate_a_n_from_previous() to calculate all the remaining elements. Once we are done with that, we send over the result to the main thread.

In the main function of the program, we set up our result variable, as well as a double-ended queue (also known as a "deque") to keep the start and end indexes of the different parts of the sum. In this queue, we keep a tuple, containing the start and end indexes of the task. When a new thread needs to be started, we can quickly get the first element of the queue. The amount of elements in those smaller sums is equal to (amount_of_elements / thread_count) / amount_divisor - by default, the amount_divisor is set to 2.

After the preparation is done, we start the first threads — each thread gets a task from the “deque” and stars calculating it’s part of the sum. When a thread is ready with its calculations, its result is taken from the channel, and added to the total result of the sum. The thread is stopped, and a new one is spawned after it, with a new task. If there are no new tasks, we wait for all threads to finish work and collect their results.

Once all elements of the sum have been calculated, we print out the result to the standard output.

Testing the code

When benchmarking a multi-threaded application, we take two values — the execution time of the program when a single thread is used (let’s call it T1) and when k amount of threads are used (let's call it Tk). The speed-up Sk of the program is equal to T1/Tk. We could also measure the efficiency, which is calculated by dividing Sk to k. The closer this coefficient is to one, the better our multi-threader program is at utilizing the resources it has.

For testing, I used an AWS EC2 m5.8xlarge instance, with a 16-core processor, with hyper-threading disabled. When testing, I’ve build the release version of the program, rather than the debug one. The debug builds usually contain some debugging tools built-in them, which could slow down performance. Building the release target is done using cargo build --release. Testing was done via the time_run.sh script, which runs the program n amount of times, and prints out the time it took to complete the program for each of the runs:

time_run.sh (View on Github)

Results of the 5 runs with different amount of threads

Ideally, the speed-up should we equal to k. As you can see, we are really close to the ideal speed-up when running up to 8 threads. When running on all 16 cores, we don't have the same speed-up, but still it's a really good result to have. The efficiency also shows that our program is good at using up the available resources.

Speed-up chart

Conclusion

We’ve created a program that calculates the sum of series using multi-threading programming. The program was written in the Rust programming language, which is famed for having good multi-threading support. With a bit of optimizations, we managed to calculate 30 000 digits of Pi in around 3 seconds !

The full code can be found here:

[GitHub - lyubolp/multithreaded-pi-calculations-rust: Program that calculates the number Pi with big…
Program that calculates the number Pi with big precision - GitHub - lyubolp/multithreaded-pi-calculations-rust: Program…github.com](https://github.com/lyubolp/multithreaded-pi-calculations-rust "github.com/lyubolp/multithreaded-pi-calcula..")

I’d love you hear you feedback in the comments bellow, as this is my first article that I have ever written.