# MPI I: MPI in 6 functions, sending and receiving messages

We'll keep using last class's [assignment repo](https://classroom.github.com/a/ZdoLF1s3).

## MPI in 6 functions

Let's start with the universal "Hello world" example. Actually, we've already
dealt with that in OpenMP, and it's worthwhile to compare OpenMP and MPI, as
there are many things in common, but also important differences. Sign up for
today's assignment. You'll find a very simple `test_mpi.cxx`:

```cpp
#include <iostream>

int main()
{
  std::cout << "Hello, non-MPI!\n";
  return 0;
}
```

This program does not have any MPI in it yet, but you can still run it in
parallel, the same way that all MPI programs are run:

```sh
[kai@macbook ~]$ mpirun -n 4 ./test_mpi
...
```

#### Your turn

Make a guess what you think will happen, and then try and see.

`mpirun` is a tool that comes with MPI that will run a program in parallel, and
the number after `-n` is how many instances of the program you want to run. In
this case, we ran 4 instances of the program at the same time, and they, not too
surprisingly, all did the same thing, that is, each printed out the message.

### `MPI_Comm_rank()` and `MPI_Comm_size()`

Our OpenMP hello program at least told us which thread is running, and how many
total we have -- and MPI can do that, too. Here are our first two MPI functions,
pretty similar to OpenMP's `omp_get_thread_num()` and `omp_get_num_threads()`:

```cpp
#include <iostream>
#include <mpi.h>

int main()
{
  int rank, size;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  return 0;
}
```

There are basically just two differences from the OpenMP version:

- (Almost) all MPI functions return an error code, by convention often called
  `ierr`. Because of that, they cannot actually return, e.g., the rank (which
  corresponds to the thread_id, just enumerating the processes), but rank gets
  passed by reference (or pointer, to be exact) and will be set by
  `MPI_Comm_rank()`.

- Most MPI functions take an argument called a "communicator". We'll get to
  them later; for now we just always use `MPI_COMM_WORLD`.
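
For example, if you do want to check the error code (most programs don't
bother, since with the default error handler MPI aborts on errors anyway), it
might look like this sketch:

```c
  int rank;
  int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (ierr != MPI_SUCCESS) {
    /* something went seriously wrong -- though with the default error
       handler, MPI would have aborted before we even get here */
  }
```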

However, it is important to note that MPI does not use **threads** at all: the term **thread** refers to multiple threads of execution that share access to the same memory. Instead, MPI uses multiple processes, called **MPI processes**, each with its own private memory.

Well, so far, everything looks pretty similar to OpenMP, but we're actually
still a bit away from even getting the simple example to work. In the following,
it'll probably be clear that MPI was designed to provide access to certain
functionality -- not necessarily to make everything easy.

All MPI functions are declared in the header `<mpi.h>`. They are all documented
by man pages, that is, `man MPI_Comm_rank` will tell you more about that
function, and also show you how to call it from Fortran -- though you may prefer
to google / ask ChatGPT for that information (see also the references below).

To compile the code, as you may have noticed, we're not using `g++` anymore, but
`mpicxx` instead. People refer to `mpicxx` as the MPI C++ compiler, which makes
it sound like it's really a special compiler. However, `mpicxx` is just a
wrapper around GNU's C++ compiler `g++` (or potentially other compilers). When
you're using MPI, you're still writing standard C++ (or Fortran -- `mpifort` --
or C -- `mpicc`), as opposed to OpenMP, which actually needs specific compiler
support. MPI is really just a C (or Fortran) library which adds a bunch of
functions you can call, written by other people for you, and declared in the
`<mpi.h>` header. But the language is still plain C / C++ / Fortran.

cmake has its own way of dealing with MPI -- similar to OpenMP, in fact, by
using `find_package(MPI)`. cmake may or may not end up using the compiler
wrappers -- it usually goes and calls the original compilers directly, adding
include paths and libraries for MPI just as it does for other libraries one may
be using (like googletest or xtensor).

#### Your turn

- Change `test_mpi.cxx` to the above and run it. You may find it still doesn't
  work right. (What's the error message?)

The program complains that MPI was not initialized properly. Before we can use
any other MPI functions, we need to initialize MPI, and we should shut it down
in the end, too.

### `MPI_Init()` and `MPI_Finalize()`

There's really not too much to say about these two functions -- you should call
`MPI_Init` near the beginning of your program, and `MPI_Finalize()` near the
end.

```cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  MPI_Finalize();
  return 0;
}
```

#### Your turn

* Run it -- what's the output?

```sh
[kai@lolcat mpi]$ mpirun -n 4 ./test_mpi
[...]
```

Four MPI functions later, and our "hello world" finally works!

Just to reiterate, `mpirun` runs the whole program `n` times in parallel. In
OpenMP, only parallel sections were run in parallel, in MPI it's always the
entire program, and `MPI_Init()`/`MPI_Finalize()` don't change that.

#### Your turn

I've made a more explicit example of the fork-join concept in OpenMP vs. having
MPI processes run concurrently throughout the entire lifetime of the program: in
`omp/test_openmp.cxx` you'll find the basic demonstration of OpenMP's parallel
sections from a previous class. Execution forks into multiple threads, which
then join again into the serial / single-threaded part of the program.

- Change your `test_mpi.cxx` to follow the `test_openmp.cxx` example, and compare the output of the two programs. 

- You can run the MPI program with `mpirun -n 4 ./test_mpi`, and the OpenMP program with `./test_openmp` (or `OMP_NUM_THREADS=4 ./test_openmp` to specify the number of threads). 
  Compare the output of the two programs -- what's similar, what's different? Discuss!

To do something useful, we want to be able to do what MPI stands for, that is,
pass messages -- which means sending and receiving messages:

### `MPI_Send()` and `MPI_Recv()`

Let's first demonstrate the need for this. I've added an `omp/test_message.cxx`, which is an OpenMP program where one thread sets a variable `test` to 99, and that value is available to all other threads, since it's a shared variable.

```cpp
  int test = 0;
#pragma omp parallel shared(test)
  {
    int thread_id = omp_get_thread_num();
    int n_threads = omp_get_num_threads();
    if (thread_id == 0) {
      test = 99;
    }
#pragma omp barrier
    printf("Hello %d/%d test = %d\n", thread_id, n_threads, test);
  }
```

#### Your turn

- Think about the output you expect from `omp/test_message`, then run it and
  check whether the results make sense.

As a side remark, this example uses `#pragma omp barrier`, which we didn't
really cover before. What it does is synchronize the threads at this point --
threads will in general be executed concurrently, but not synchronously, so
without the barrier it would be possible for thread 1 to print out its message,
including the value of `test`, before thread 0 has set it to 99. The barrier
makes sure that all threads have completed the part before the barrier before
any of them continues past it.

- Let's do the same thing in MPI. Create a `message_mpi.cxx` with the following
  code in `main()` (you'll need to do some more work to turn this into a
  compilable program):

```c
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int test = 0;
  if (rank == 0) {
    test = 99;
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  // or if you like it C++-y
  // std::cout << "Hello " << rank << "/" << size << " test = " << test << "\n";
```

#### Your turn

- Run `message_mpi` and compare it to `message_openmp`. Is the variable `test` in
the MPI code shared or private?

Since, as it turns out, all variables in an MPI program are private, `test` is
private on each process: if process 0 sets it, no other process will know about
the new value unless the change is explicitly communicated using `MPI_Send()`
and `MPI_Recv()`.

#### Your turn

Change the code as follows:

```cpp
  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  }
```

In this example, processor 0 sets `test` to 99. It then sends the value of `test`
to processor 3. `MPI_Send()` takes quite a few arguments, which makes it a
little difficult to digest. The first argument is an array of data. Since we
have only a single value, we need to use `&` to make it appear like an array
(of length 1). Next, we have to pass the number of elements in the array,
i.e., 1. Then comes the type of the data, in this example `MPI_INT`, since we're
passing an integer. (There's also `MPI_DOUBLE`, etc.) Of course MPI needs to know
where the message should be sent, and in this case I decided to send it
to processor 3. Finally, we have a tag (123) and a communicator. All you need to
know about the tag for now is that it's an integer, and the corresponding
`MPI_Recv()` will need to have that same tag.

The `MPI_Recv()` is fortunately pretty much the same. The received message will
be written into the space indicated by the first argument, in this case also the
`test` variable (processor 3's version of it). In `MPI_Recv()` you don't need to
say what the target processor is (it's the one that is running the `MPI_Recv()`
call), but you do need to say where the message comes from -- in this example it
was processor 0 that sent it. There's an additional status argument;
`MPI_STATUS_IGNORE` is good enough for that.
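
For reference, here are the signatures of the two functions, as you'd find them
via `man MPI_Send` / `man MPI_Recv`:

```c
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);
```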

#### Your turn

- You know the spiel: think about what you expect from the code above, then run it
and compare.

```
kai@macbook mpi $ mpirun -n 4 message_mpi
[...]
```

Processor 0 has test = 99, since this is where we set the value in the first
place. It was then sent to processor 3, which therefore also knows it's 99. And
the other processors didn't take part in the communication, so they know nothing
about it...

And that's it. You've learned about the 6 functions, and everything can be done
in terms of them, though not necessarily in the best or fastest way -- so there
are some more functions we'll need to learn about, too.

#### Your turn

- Change `message_mpi.cxx` so that the value of `test` on rank 0 gets
  communicated to all other processes. You can start by assuming that the code
  is run with 4 processes total, but it'd be nice if it worked for any number of
  processes, too.

## More on Using `MPI_Send()` and `MPI_Recv()`

For reference, here is my `message_mpi.cxx`, which only sends the `test`
value to proc 3, not to all procs yet:

```cxx
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  MPI_Finalize();
  return 0;
}
```

### Pairing `MPI_Send()` and `MPI_Recv()`

As you've seen, `MPI_Send()` and `MPI_Recv()` are fundamental -- really the most
important functions in MPI -- as they allow you to send a message from one
process to another; that's what "message passing" is all about. They need to be
paired properly, i.e., if proc `a` is passing a message to proc `b`, proc `a`
needs to `MPI_Send()` to `b`, and proc `b` needs to have a corresponding
`MPI_Recv()` from `a`. The datatypes need to match, as do the tags. It's okay to
have a "short" receive, i.e., you may ask to receive, e.g., 20 integers, but if
the sender only sends 10, that's okay and the `MPI_Recv()` will succeed --
however, the message can't be longer than the receive buffer you supply.

In the example so far, we've sent a single value. In the general case, one often
wants to send more than one value, i.e., a whole array of values, e.g.:

```c
  int a[5] = { 0, 1, 2, 3, 4 };
  MPI_Send(a, 5, MPI_INT, rank_dest, tag, MPI_COMM_WORLD);
```

and similarly on the receiving end.
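
The receiving end might then look like this sketch, where `rank_src` and `tag`
are placeholders that need to match what the sender used:

```c
  int a[5];  /* receive buffer -- must be at least as large as the message */
  MPI_Recv(a, 5, MPI_INT, rank_src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```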

#### Your turn

- Change your `message_mpi` code to send and receive large arrays, say 10000
  values, instead of the single integer "test".

### MPI debugging

As you may have noticed when doing the work from last class, MPI programs often
show an unfortunate characteristic of hanging forever, which is typically due
to bugs in the code you (or I) wrote. One debugging technique is to write down a
diagram showing the messages as arrows as they go from one process to the next.
Then compare this to what you wrote in the code. This is particularly useful in
conjunction with a second method (which could be automated, too): Before and
after you `MPI_Send()` or `MPI_Recv()`, print out a note saying what you're
going to do. That gives you a way to see where the code hangs, and also to
compare the logic of how messages are passed by your code to how you intended
them to be passed (the diagram).
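
For example, around a send you might do something like this sketch, where
`dest` and `tag` stand for whatever your code actually uses:

```c
  printf("[%d] sending test to %d...\n", rank, dest);
  MPI_Send(&test, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
  printf("[%d] sending test to %d done\n", rank, dest);
```

If the program hangs, the last message printed tells you which send or receive
it got stuck in.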

<!--
#### MPE (outdated)

*The following does work on fishercat, where I've installed [MPE](https://www.mcs.anl.gov/research/projects/perfvis/download/index.htm#MPE). It's a pretty outdated and rarely used tool, so I kept this section below in case you're interested, but hopefully later in this class we'll get to look at some more modern performance tools instead.*

If you don't want to put all the print statements into your code,
there's an automated way, if you're working on fishercat: Try the
following in the class repo, which contains the examples from last
class):

```sh
[kai@fishercat mpi]$ module load mpe2
[kai@fishercat mpi]$ cmake -DCMAKE_C_COMPILER="mpecc" -DCMAKE_C_FLAGS="-mpitrace" ..
[kai@fishercat mpi]$ make
[kai@fishercat mpi]$ mpirun -n 4 ./message_mpi_2
```

If you aren't running on fishercat, you may not have easy access to
`mpe2`, and you probably don't have to worry about it too much -- it
can be useful, but in general it prints out so much that one quickly
gets lost. You can even look at things graphically, we'll (hopefully)
get back to that in more detail later.
-->

### MPI debugging with print statements

Here's another low-tech way I use a lot for debugging MPI code -- I put these
helpers into a header file (e.g. `mpi_debug.h`) for easy access.

```cxx
#include <mpi.h>
#include <stdio.h>

#define mprintf(fmt...) do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] ", __rank); printf(fmt); } while (0)

#define MHERE do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] HERE: in %s() at %s:%d\n", __rank, __FUNCTION__, __FILE__, __LINE__); } while (0)
```

You can then just sprinkle `MHERE;` around your code, and it's going to print
out a message saying where this MHERE happened (or not, if the code hangs before
even getting to it).

And `mprintf()` works just like `printf()`, but it'll prefix the message with
`[###]`, where `###` is the process rank that printed the message, which can be
useful.
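
For example (the outputs shown in the comments are just illustrations of the
format, with made-up values):

```c
  MHERE;                         /* e.g. "[2] HERE: in main() at message_mpi.cxx:17" */
  mprintf("test = %d\n", test);  /* e.g. "[2] test = 99" */
```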

### Prefixing output with the MPI process rank

The above macros have the advantage that they work with every MPI
implementation, and always give you the same formatting. Various MPI
implementations also support prefixing
output with where it came from, without having to change the code, but the
details vary.

For example, with OpenMPI, you can use

```
mpirun --tag-output -n 4 ...
```

### Your turn: the integrate example

- Now we're in a position to go back to the examples we previously used to
  introduce OpenMP. Take the `test_integration` (pi/4) example, and parallelize
  it with MPI, using the basic message passing functions we learned about so
  far. It may be helpful to make it run on two processes first, and from there
  go to an arbitrary number.

  You should only use the 6 basic MPI functions that we learned about (and maybe
  `MPI_Wtime()`, see below), i.e., don't use `MPI_Reduce()` even if you know
  about it.

  Ideally, the code should work correctly if it's run on an arbitrary number of
  procs -- at the very least, it should complain if, e.g., you're calculating
  the sum of 100 trapezoids on 3 procs and it doesn't know how to handle that.
  Failing explicitly is much better than just giving an incorrect result.

  Run your parallel program with a varying number of processes and time its
  execution. `time` might be good enough for that, though of course it is
  usually better to add timing directly into the program, too. There's an MPI
  function to help with that, very similar to the `Wtime()` one I provided
  earlier in this class.

```c
  double my_time = MPI_Wtime();
```
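
`MPI_Wtime()` returns the wall-clock time in seconds since some arbitrary point
in the past, so a typical pattern is to take the difference of two calls around
the region you want to time (a sketch):

```c
  double t_start = MPI_Wtime();
  /* ... the work you want to time ... */
  double t_elapsed = MPI_Wtime() - t_start;
  printf("[%d] elapsed: %g s\n", rank, t_elapsed);
```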

- Analyze the parallel scaling of the code.

### Some references on MPI

- <https://mpitutorial.com>
- <https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html>

## Homework

- Finish the "Your turn" steps above, make commits and add comments to the Feedback PR.
<!-- 
- Since we didn't get all that far in Class 22, the combined homework for Class
  21 and Class 22 is the "Your turn" from class 21, and the first "Your turn"
  here, that is:

> Change your `message_mpi` code to send and receive large arrays, say 10000
> values, instead of the single integer "test".

- Reading the notes about MPI debugging may come in handy, though. -->
