
# Parallel Programming / Introduction to MPI

Today's [assignment](https://classroom.github.com/a/ZdoLF1s3).

## Motivation for parallel programming

A couple of references:

- [Some slides from talks I gave on this topic](hpc.pdf)
- [Ulrich Drepper: Moore's Law and Dumb Programmers](parallelprog.pdf)
- [Kathy Yelick: 10 ways to waste a parallel computer](ISCA09-WasteParallelComputer.pdf)

### Evolution of processor performance

When it comes to performance, one often hears about
[Moore's Law](https://en.wikipedia.org/wiki/Moore%27s_law), which in its
original form refers to the number of transistors on a chip doubling about
every 18-24 months. The term is used more broadly these days, sometimes applied
to pretty much any exponential growth law.

For the longest time, Moore's Law also described a doubling of (single core)
processor performance about every 18 months. However, this law has since
saturated (mostly for power reasons), though in general performance per
processor has kept growing in a similar fashion as before thanks to the
introduction of multi-core processors. The performance of top supercomputers
has been growing even faster, due to using more and more cores. Exaflops
computers -- the frontier today -- have millions of cores (though due to the
use of GPUs, the definition of "core" is evolving).

Unfortunately, this all makes the programmer's / code developer's life harder,
since they cannot just keep using the same code and expect performance to
increase in a similar way as it used to when processors just got faster and
faster.

## Review shared memory vs distributed memory

During the last couple of weeks, we learned about OpenMP, which works on shared
memory -- every processor can access all of the memory of the program (though
you can have private variables which are per-thread, so one thread can do
something different from the next).

Parallelizing code with OpenMP turned out to be fairly simple in common cases -
just add some `#pragma omp for` pragmas to the code, and the compiler pretty
much does all the work. We've already encountered two ways to parallelize a
program: (1) exploiting parallelism in the algorithm (the quadrature, a.k.a.
integrate example), where the work of calculating pi/4 gets split amongst
processors and can be performed in parallel, and (2) exploiting data parallelism
(the node-centered to cell-centered averaging example), where every processor
works on a subset of the data. Previously, you may also have encountered at
least one limitation of shared memory systems: the averaging `for` loop example
didn't scale all that well on the machine I ran it on, even though the
algorithm is perfectly parallelizable and there are no load balancing issues,
due to the fact that all the cores, well, _share_ the memory (and in particular
its bandwidth). The other, related disadvantage is that it is very difficult to
scale shared memory systems to a large number of processors, because all of the
processors need to be able to access the same physical memory, and that is
difficult to achieve in hardware.

Going to a distributed memory model, every processor accesses its own private
memory, and making the system larger just involves adding additional nodes to
the system, so that should scale perfectly. In reality, however, just having a
lot of nodes often does not help to solve a problem, because the nodes have to
work on the problem collaboratively. That requires them to talk to each other
over some kind of network or interconnect. And as you use more and more nodes,
this communication may become more and more limiting, or may force you to
invest more and more money in a powerful network that can handle the load.

In a distributed memory system, nodes need to talk to each other, and the
simplest conceptual way of doing so is by sending and receiving messages. And
this is where MPI comes in.

## MPI

MPI stands for **Message Passing Interface**, and it is the specification for an
API (Application Programming Interface) that standardizes sending and receiving
messages, as well as associated housekeeping tasks. There are multiple software
packages that implement MPI. They all present the same interface to the
outside, though the actual implementation varies, and hence performance may
vary too, though in practice the differences are usually not very noticeable. A
commonly used implementation of MPI is called Open MPI, which should not be
confused with OpenMP, which as we know by now deals with a totally different
kind of parallelization (shared memory threads rather than message passing).



### Installing MPI

You may already have MPI installed, in which case you'll probably be able to
just use it. You can try this:

```sh
[kai@macbook class20 (main)]$ mpicxx
clang: error: no input files
```

That indicates that the MPI "compiler" `mpicxx` exists, but since I didn't tell
it what to compile, it's none too happy.

If you get something like this:

```sh
[kai@macbook class20 (main)]$ mpicxx
-bash: mpicxx: command not found
```

then you'll need to install MPI. There are different implementations of MPI; I
recommend `openmpi`. On Linux, use whatever package manager your distro uses.
If you're using Ubuntu / WSL2, this should work:

```sh
apt install libopenmpi-dev
```

On a Mac with MacPorts installed, there are many options, depending on which
underlying compiler you want to use.

```sh
[kai@macbook class20 (main)]$ sudo port install openmpi
```

should work, or you can specifically ask for something like `openmpi-clang90`.

With Homebrew, you should be able to do

```sh
brew install open-mpi
```

## Homework

<!-- - As usual, finish the "Your turn" steps above, make commits and add comments to
  the Feedback PR. -->

  - Get MPI installed on your machine, so that you can compile and run the `test_mpi.cxx` example. You should be able to run it with `mpirun -n 4 ./test_mpi` (or similar, depending on your system).
  - Nothing to hand in -- we'll pick up "MPI in 6 functions" next class.
