Parallel Programming / Introduction to MPI#

Today’s assignment.

Motivation for parallel programming#

A couple of references:

Evolution of processor performance#

When it comes to performance, one often hears about Moore's Law, which in its original form refers to the number of transistors on a chip doubling about every 18 to 24 months. The term is used more broadly these days, sometimes applied to pretty much any exponential growth law.

For the longest time, Moore's Law also described a doubling of (single-core) processor performance about every 18 months. That trend has since saturated (mostly for power reasons), though performance per processor has kept growing in a similar fashion thanks to the introduction of multi-core processors. The performance of top supercomputers has been growing even faster, by using more and more cores. Exaflops computers – the frontier today – have millions of cores (though with the widespread use of GPUs, the definition of a "core" is evolving).

Unfortunately, all of this makes the programmer's / code developer's life harder: they can no longer keep using the same code and expect performance to increase the way it did when processors simply got faster and faster.

Review shared memory vs distributed memory#

During the last couple of weeks, we learned about OpenMP, which works with shared memory – every thread can access all of the memory of the program (though you can have private variables, which are per-thread, so one thread can do something different from the next).

Parallelizing code with OpenMP turned out to be fairly simple in common cases: just add some #pragma omp parallel for pragmas to the code, and the compiler does most of the work. We have already encountered two ways to parallelize a program: (1) exploiting parallelism in the algorithm (the quadrature, a.k.a. integrate example), where the work of calculating pi/4 gets split amongst processors and can be performed in parallel, and (2) exploiting data parallelism (the node-centered to cell-centered averaging example), where every processor works on a subset of the data.

You may also have encountered at least one limitation of shared memory systems: the averaging for loop example didn't scale all that well on the machine I ran it on, even though the algorithm is perfectly parallelizable and there are no load balancing issues, simply because all the cores, well, share the memory. The other, related disadvantage is that it is very difficult to scale shared memory systems to a large number of processors, because all of the processors need to be able to access the same physical memory, and that is difficult to achieve in hardware.
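As a reminder, the data-parallel pattern looked roughly like this (an illustrative sketch, not the exact class example; the function and variable names are made up):

// Average node-centered values onto cell centers (1D).
// Illustrative sketch; assumes cell.size() == node.size() - 1.
#include <vector>

void average(const std::vector<double>& node, std::vector<double>& cell) {
    // OpenMP splits the iterations of this loop amongst the threads;
    // 'node' and 'cell' are shared, the loop index 'i' is per-thread.
    #pragma omp parallel for
    for (long i = 0; i < (long)cell.size(); ++i)
        cell[i] = 0.5 * (node[i] + node[i + 1]);
}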

In a distributed memory model, every processor accesses its own private memory, and making the system larger just means adding nodes, so that should scale perfectly. In reality, however, just having a lot of nodes often does not solve a problem, because the nodes have to work on it collaboratively. That requires them to talk to each other over some kind of network or interconnect. And as you use more and more nodes, this communication may become more and more of a bottleneck, or may force you to invest more and more money in a powerful network that can handle the load.

In a distributed memory system, nodes need to talk to each other, and the simplest conceptual way of doing so is by sending and receiving messages. And this is where MPI comes in.

MPI#

MPI stands for Message Passing Interface, and it is the specification for an API (Application Programming Interface) that standardizes sending and receiving messages, as well as associated housekeeping tasks. There are multiple software packages that implement MPI. They all present the same interface to the outside, though the actual implementations differ, and hence performance may vary too; in practice the differences are usually small. A commonly used implementation is Open MPI, which should not be confused with OpenMP, which as we know by now deals with a totally different kind of parallelization.
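To give a flavor of what this looks like in practice, here is a minimal sketch of an MPI program (an assumed example for illustration; the actual test_mpi.cxx used in class may look different, and we'll go through the API properly next class):

// hello_mpi.cxx -- minimal MPI sketch (illustrative example)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                // start up MPI

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes in total?

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        // shut down MPI
    return 0;
}

Every process runs the same program; they differ only in the rank MPI assigns them, which is what lets them divide up the work.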

Installing MPI#

You may already have MPI installed, in which case you’ll probably be able to just use it. You can try this:

[kai@macbook class20 (main)]$ mpicxx
clang: error: no input files

That indicates that the MPI "compiler" mpicxx exists (it is a wrapper that calls the underlying compiler, here clang, behind the scenes), but since I didn't tell it what to compile, it's none too happy.

If you get something like this:

[kai@macbook class20 (main)]$ mpicxx
-bash: mpicxx: command not found

then you'll need to install MPI. There are different implementations of MPI; I recommend Open MPI. On Linux, use whatever package manager your distro provides. If you're using Ubuntu / WSL2, this should work:

sudo apt install libopenmpi-dev

On a Mac with MacPorts installed, there are many options, depending on which underlying compiler you want to use.

[kai@macbook class20 (main)]$ sudo port install openmpi

should work, or you can specifically ask for something like openmpi-clang90.

With Homebrew, you should be able to do

brew install open-mpi

Homework#

  • Get MPI installed on your machine, so that you can compile and run the test_mpi.cxx example. You should be able to run it with mpirun -n 4 ./test_mpi (or similar, depending on your system); see the example after this list.

  • Nothing to hand in – we’ll pick up “MPI in 6 functions” next class.
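For reference, compiling and running typically looks something like this (the prompt and exact invocation will depend on your setup):

[kai@macbook class20 (main)]$ mpicxx test_mpi.cxx -o test_mpi
[kai@macbook class20 (main)]$ mpirun -n 4 ./test_mpi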