It’s Pi day (14th of March, or 3.14 in American), so it’s time for a Pi special! Let’s calculate Pi with a network of Raspberry Pis…
My previous post covered building a physical cluster of Raspberry Pis and making sure they could communicate with one another. I also discussed what parallel processing is and how it can be used to speed up certain types of program.
In this post, we’ll write such a program and see if we can get it to work on the cluster.
Embarrasingly parallel problems
I mentioned ’embarrasingly parallel’ problems in my last post. They are problems which run significantly faster if computed in parallel than in series. Such problems have the following characteristics:
- They can be split into a large number of tasks.
- The tasks are not dependent on each other – ie they do not need to communicate with each other.
- Splitting the overall program into tasks is easy.
Here are a few examples of such problems. A little bit of thought and it is easy to see why they can be easily split up into smaller tasks which are executed simultaneously:
- Brute force attacks. If we are trying a large number of potential solutions, we can share all the solutions between multiple processors.
- Searching text for specific words or terms – just send sections of the text to different processors.
- Graphics – the individual pixels of a frame can be rendered independent of other pixels. This is why GPUs consist of thousands of independent processors.
In this post we’ll be looking at a simple Monte Carlo simulation. I’m starting with a really simple program so we can demonstrate the cluster working before moving onto something more complicated in a later post.
Monte Carlo simulations.
Engineers love these. A Monte Carlo simulation is simply a simulation run multiple times. If sufficient runs are made, the average result is likely to reflect the result in real life. They are highly useful since, using a simulation, it is possible to get an estimate on something without having to do complex probabilistic calculations.
The worst way of calculating Pi
There are numerous ways of calculating Pi. Doing it by Monte Carlo simulation is one of the slowest. But, it’s a dead-simple program to write, and it’s a great intro to Monte Carlo simulations. Here’s how to do it:
- Draw a circle of radius R.
- Draw a square that just encloses it (the side will be 2R in length).
- Draw a dot at a random point on the square.
- Repeat step 3 multiple times.
- Count the number of dots inside and outside the circle.
- Refer to the diagram below to calculate Pi:
The Monte Carlo aspect here is drawing the random dots. We can get a computer to do that millions of times. If we wanted to draw 10 million dots, it would be computationally simple to tell multiple processors to draw a fraction of this 10 million, and then count how many dots were in the circle for each processor. So, this is a really simple problem that can be done in parallel.
It is easy to see why this is a terrible way to calculate Pi. It takes 10 times more dots to get another decimal point of accuracy. In other words, if we use the simulation to calculate Pi to 3 decimal places, we’d have to run it for ten times longer to get 4 decimal places!
Although this isn’t great to calculate Pi, it is a fantastically useful method to calculate the area under the curve if you can’t work out how to integrate the curve’s function. Since I learned to program, I don’t really integrate anymore (there is a joke in there).
Here is the progression I followed, I used Python throughout for the code (scripts available for download at the end of this post):
Simulation in serial on a single node
First let’s write the simulation in ‘normal mode’. I wrote a script to run the Monte Carlo simulation with 16 million points (more on why it’s 16 million later). Whenever you write a program and don’t tell it to execute in parallel, this is how it runs – on a single core of your CPU. I logged into one of the Pis in my cluster via Putty and ran the program. Here are the results:
This is our benchmark time – we are going to try and improve it by re-running the program in parallel. It is worth noting the result was only accurate to 4 decimal places. You could calculate that quicker by hand. I said this is a terrible way to calculate Pi!
Simulation in parallel on a single node
Now let’s use just one computer but use all four cores of the CPU in parallel. Python has a built in library to do just this – called Multiprocessing.
Multiprocessing is designed to run programs in parallel across the cores of the CPU. The Raspberry Pi 3 has 4 cores, so we can alter the code to split the program into tasks and run it across the 4 cores (like using 4 fighter jets instead of one in the analogy in part 1).
Let’s see how quickly the process runs now:
Not bad – that’s almost one quarter of the time. But, we are using four times the processing power of the first example. Why isn’t it exactly a quarter of the time?
Remember the analogy in section 1 where the tasks were represented by boxes that went through a sorting depot and into vans? That is like the serial part of our program. Although most of the program can be run in parallel, some of it still needs to be run in serial on a single CPU. In our case here’s how the program is broken down:
- Get the required Python packages (SERIAL)
- Establish the size of the square and the circle (SERIAL)
- Split the ‘random dot placement’ part (SERIAL)
- Place the random dots – this part takes the longest (PARALLEL)
- Compile the results from all the CPUs (SERIAL)
- Add up the number of dots inside the circle (SERIAL)
- Calculate Pi from this (SERIAL)
So there are a few tasks that must be performed in serial. They are all very computationally simple tasks in this case, but they mean the whole program can never be run in parallel.
In addition to this, it takes time to physically send the results of each task to the ‘header’ CPU.
The expected speedup from parallelisation can be estimated by using Amdahl’s law (link at the end)
Simulation in parallel on multiple cores of multiple computers (AKA a supercomputer)
This is the exciting bit. We’ll do the above but send the tasks between all four of the Pis. Now we can see why my simulation uses 16 million iterations – 1 million per core. I picked this to avoid any weird rounding issues. Apparently Python deals with this kind of thing automatically but I wanted to be safe.
Unfortunately the Multiprocessing package doesn’t support parallel processing across a network so I had to use something else. I went for parallel Python because it looked the most simple to use.
I altered my script accordingly to split the process into tasks to be passed to the 4 Pis on the network.
Here’s how to use it:
- Download and install the Parallel Python package on each node.
- On the three compute nodes, run the PPServer program. This is a Python program which is part of the Parallel Python package. It makes the node act as a server ready to receive and process tasks from a parallel process.
- On the header node, run the script.
This is what my laptop screen looked like with the script running. You know things are serious when you see this:
The Parallel Python Server program is pretty neat. You can sit and watch the tasks arriving and being processed on each of the compute nodes. They are then sent back to the header node via the ethernet links between the 4 Pis.
It is worth noting that the header node was also functioning as a compute node – it’s CPU was running tasks too.
Let’s zoom into the header node and take a look at the result:
Not bad at all. But, it’s a little slower than we may first expect – more than one third of the time to run on a single machine in parallel. This is easily explained.
As discussed in the first post, the header node is a Raspberry Pi 2. Since each CPU took only one task, the three Pi3s had finished their tasks and sent them back, but the Pi2 has a slower processor so the process could not be completed until it had finished its tasks. In other words, no surprises here!
The best way to speed things up here would be to split the process into a larger number of tasks. That way, less of them would be sent to the Pi2. Parallel Python automatically distributes tasks according to processor availability. Of course, splitting the process into too many tasks would slow it down again since it would take additional time to send the tasks between the nodes. This is why actual supercomputers use fibre-optic links rather than ethernet connections.
Conclusions so far
We have successfully built a cluster of independent computers and run a Monte Carlo Simulation in parallel across all their CPUs. Here are the results:
|Program Configuration||Time to Run (seconds)|
|Series on one CPU core||410.3|
|Parallel across 4 CPU cores||109.1|
|Parallel across all CPU cores of a network of 4 Raspberry Pis||38.8|
So, we definitely have a ‘baby supercomputer’ here! By optimising the distribution of tasks we will be able to improve the runtime further in the future.
Now that we know the cluster is capable of functioning as intended, I’d like to use it to run some more complex programs. I’ve already used it for web-trawling applications (post coming up soon). I’d also like to use it to run the Quadratic Sieve algorithm, which is used to find prime factors which are essential for cryptography. I’d also eventually like to try some machine learning programming.
Parallel Python is great, but huge clusters need robust fault tolerance and need to be able to deal with large amounts of data reliably. Although my tiny cluster will never do that, I’d like to use it as a testbed. To do so, I’ll need to install Hadoop.
All the above are going to take a bit of reading up. The internet is full of wonderful communities that can help with any project, but sometimes you just need to do things the old fashioned way:
http://demonstrations.wolfram.com/MonteCarloEstimateForPi/ The Monte Carlo process without all the hassle of having to understand my code
http://www.parallelpython.com/ Parallel Python. The documentation is a bit vague but there are plenty examples floating around online
https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing Python multiprocessing. If anything I found this more difficult to use than Parallel Python.
https://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl’s law – the expected speedup by parallising a process.
Anyone wanting to run the simulation across a cluster needs to use the ‘parallel nodes’ script and run ppserver on each of the computer nodes.
single core in serial:
Multiple cores in parallel (program automatically detects the number of cores in your CPU):
Multiple nodes in parallel: