# MPI I: MPI in 6 functions, sending and receiving messages

We'll keep using last class's [assignment repo](https://classroom.github.com/a/ZdoLF1s3).

## MPI in 6 functions

Let's start with the universal "Hello world" example. Actually, we've already
dealt with that in OpenMP, and it's worthwhile to compare OpenMP and MPI, as
there are many things in common, but also important differences. Sign up for
today's assignment. You'll find a very simple `test_mpi.cxx`:

```cpp
#include <iostream>

int main()
{
  std::cout << "Hello, non-MPI!\n";
  return 0;
}
```

This program does not have any MPI in it yet, but you can still run it in
parallel, the same way that all MPI programs are run:

```sh
[kai@macbook ~]$ mpirun -n 4 ./test_mpi
...
```

#### Your turn

Make a guess what you think will happen, and then try and see.

`mpirun` is a tool that comes with MPI that will run a program in parallel, and
the number after `-n` is how many instances of the program you want to run. In
this case, we ran 4 instances of the program at the same time, and they, not too
surprisingly, all did the same thing, that is, each printed out the message.

### `MPI_Comm_rank()` and `MPI_Comm_size()`

Our OpenMP hello program at least told us which thread is running, and how many
total we have -- and MPI can do that, too. Here are our first two MPI functions,
pretty similar to OpenMP's `omp_get_thread_num()` and `omp_get_num_threads()`:

```cpp
#include <iostream>
#include <mpi.h>

int main()
{
  int rank, size;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  return 0;
}
```

There are basically just two differences from the OpenMP version:

- (Almost) all MPI functions return an error code, by convention often called
  `ierr`. Because of that, they cannot actually return, e.g., the rank (which
  corresponds to the thread_id, just enumerating the processes), but rank gets
  passed by reference (or pointer, to be exact) and will be set by
  `MPI_Comm_rank()`.

- Most MPI functions take an argument called a "communicator". We'll get to
  them later; for now we just always use `MPI_COMM_WORLD`.
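
For example, if you do want to check the error code (most programs don't
bother, since with the default error handler MPI aborts on errors anyway), it
might look like this sketch:

```c
  int rank;
  int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (ierr != MPI_SUCCESS) {
    /* something went seriously wrong -- though with the default error
       handler, MPI would have aborted before we even get here */
  }
```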

However, it is important to note that MPI does not use **threads** at all: the term **thread** refers to multiple threads of execution that share access to the same memory. Instead, MPI uses multiple processes, called **MPI processes**, each with its own private memory.

Well, so far, everything looks pretty similar to OpenMP, but we're actually
still a bit away from even getting the simple example to work. In the following,
it'll probably be clear that MPI was designed to provide access to certain
functionality -- not necessarily to make everything easy.

All MPI functions are declared in the header `<mpi.h>`. They are all documented
by man pages, that is, `man MPI_Comm_rank` will tell you more about that
function, and also show you how to call it from Fortran -- though you may prefer
to google / ask ChatGPT for that information (see also the references below).

To compile the code, as you may have noticed, we're not using `g++` anymore, but
`mpicxx` instead. People refer to `mpicxx` as the MPI C++ compiler, which makes
it sound like it's really a special compiler. However, `mpicxx` is just a
wrapper around GNU's C++ compiler `g++` (or potentially other compilers). When
you're using MPI, you're still writing standard C++ (or Fortran -- `mpifort` --
or C -- `mpicc`), as opposed to OpenMP, which actually needs specific compiler
support. MPI is really just a C (or Fortran) library which adds a bunch of
functions you can call, written by other people for you, and declared in the
`<mpi.h>` header. But the language is still plain C / C++ / Fortran.

cmake has its own way of dealing with MPI -- similar to OpenMP, in fact, by
using `find_package(MPI)`. cmake may or may not end up using the compiler
wrappers -- it usually goes and calls the original compilers directly, adding
include paths and libraries for MPI just as it does for other libraries one may
be using (like googletest or xtensor).

#### Your turn

- Change `test_mpi.cxx` to the above and run it. You may find it still doesn't
  work right. (What's the error message?)

The program complains that MPI was not initialized properly. Before we can use
any other MPI functions, we need to initialize MPI, and we should shut it down
in the end, too.

### `MPI_Init()` and `MPI_Finalize()`

There's really not too much to say about these two functions -- you should call
`MPI_Init` near the beginning of your program, and `MPI_Finalize()` near the
end.

```cpp
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  MPI_Finalize();
  return 0;
}
```

#### Your turn

* Run it -- what's the output?

```sh
[kai@lolcat mpi]$ mpirun -n 4 ./test_mpi
[...]
```

Four MPI functions later, and our "hello world" finally works!

Just to reiterate, `mpirun` runs the whole program `n` times in parallel. In
OpenMP, only parallel sections were run in parallel, in MPI it's always the
entire program, and `MPI_Init()`/`MPI_Finalize()` don't change that.

#### Your turn

I've made a more explicit example of the fork-join concept in OpenMP vs. having
MPI processes run concurrently throughout the entire lifetime of the program: in
`omp/test_openmp.cxx` you'll find the basic demonstration of OpenMP's parallel
sections from a previous class. Execution forks into multiple threads, which
then join again into the serial / single-threaded part of the program.

- Change your `test_mpi.cxx` to follow the `test_openmp.cxx` example, and compare the output of the two programs. 

- You can run the MPI program with `mpirun -n 4 ./test_mpi`, and the OpenMP program with `./test_openmp` (or `OMP_NUM_THREADS=4 ./test_openmp` to specify the number of threads). 
  Compare the output of the two programs -- what's similar, what's different? Discuss!

To do something useful, we want to be able to do what MPI stands for, that is,
pass messages -- which means sending and receiving messages:

### `MPI_Send()` and `MPI_Recv()`

Let's first demonstrate the need for this. I've added an `omp/test_message.cxx`, which is an OpenMP program where one thread sets a variable `test` to 99, and that value is available to all other threads, since it's a shared variable.

```cpp
  int test = 0;
#pragma omp parallel shared(test)
  {
    int thread_id = omp_get_thread_num();
    int n_threads = omp_get_num_threads();
    if (thread_id == 0) {
      test = 99;
    }
#pragma omp barrier
    printf("Hello %d/%d test = %d\n", thread_id, n_threads, test);
  }
```

#### Your turn

- Think about the output you expect from `omp/test_message`, then run it and
  check whether the results make sense.

As a side remark, this example uses `#pragma omp barrier`, which we didn't
really cover before. What it does is synchronize the threads at this point --
threads will in general be executed concurrently, but not synchronously, so
without the barrier it would be possible for thread 1 to print out its message,
including the value of `test`, before thread 0 has set it to 99. The barrier
makes sure that all threads have completed the part before the barrier before
any of them continues past it.

- Let's do the same thing in MPI. Create a `message_mpi.cxx` with the following
  code in `main()` (you'll need to do some more work to turn this into a
  compilable program):

```c
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int test = 0;
  if (rank == 0) {
    test = 99;
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  // or if you like it C++-y
  // std::cout << "Hello " << rank << "/" << size << " test = " << test << "\n";
```

#### Your turn

- Run `message_mpi` and compare it to `message_openmp`. Is the variable `test` in
the MPI code shared or private?

Since, as it turns out, all variables in an MPI program are private, `test` is
private on each process: if process 0 sets it, no other process will know about
the new value unless the change is explicitly communicated using `MPI_Send()`
and `MPI_Recv()`.

#### Your turn

Change the code as follows:

```cpp
  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  }
```

In this example, processor 0 sets `test` to 99. It then sends the value of `test`
to processor 3. `MPI_Send()` takes quite a few arguments, which makes it a
little difficult to digest. The first argument is an array of data. Since we
have only a single value, we need to use `&` to make it appear like an array
(of length 1). Next, we have to pass the number of elements in the array,
i.e., 1. Then comes the type of the data, in this example `MPI_INT`, since we're
passing an integer. (There's also `MPI_DOUBLE`, etc.) Of course MPI needs to know
where the message should be sent, and in this case I decided to send it
to processor 3. Finally, we have a tag (123) and a communicator. All you need to
know about the tag for now is that it's an integer, and the corresponding
`MPI_Recv()` will need to have that same tag.

The `MPI_Recv()` is fortunately pretty much the same. The received message will
be written into the space indicated by the first argument, in this case also the
`test` variable (processor 3's version of it). In `MPI_Recv()` you don't need to
say what the target processor is (it's the one that is running the `MPI_Recv()`
call), but you do need to say where the message comes from -- in this example it
was processor 0 that sent it. There's an additional status argument;
`MPI_STATUS_IGNORE` is good enough for that.
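
For reference, here are the signatures of the two functions, as you'd find them
via `man MPI_Send` / `man MPI_Recv`:

```c
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);
```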

#### Your turn

- You know the spiel: think about what you expect from the code above, then run it
and compare.

```
kai@macbook mpi $ mpirun -n 4 message_mpi
[...]
```

Processor 0 has test = 99, since this is where we set the value in the first
place. It was then sent to processor 3, which therefore also knows it's 99. And
the other processors didn't take part in the communication, so they know nothing
about it...

And that's it. You've learned about the 6 functions, and everything can be done
in terms of them, though not necessarily in the best or fastest way -- so there
are some more functions we'll need to learn about, too.

#### Your turn

- Change `message_mpi.cxx` so that the value of `test` on rank 0 gets
  communicated to all other processes. You can start by assuming that the code
  is run with 4 processes total, but it'd be nice if it worked for any number of
  processes, too.

## More on Using `MPI_Send()` and `MPI_Recv()`

For reference, here is my `message_mpi.cxx`, which only sends the `test`
value to proc 3, not to all procs yet:

```cxx
#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  MPI_Finalize();
  return 0;
}
```

### Pairing `MPI_Send()` and `MPI_Recv()`

As you've seen, `MPI_Send()` and `MPI_Recv()` are fundamental -- really the most
important functions in MPI -- as they allow you to send a message from one
process to another; that's what "message passing" is all about. They need to be
paired properly, i.e., if proc `a` is passing a message to proc `b`, proc `a`
needs to `MPI_Send()` to `b`, and proc `b` needs to have a corresponding
`MPI_Recv()` from `a`. The datatypes need to match, as do the tags. It's okay to
have a "short" receive, i.e., you may ask to receive, e.g., 20 integers, but if
the sender only sends 10, that's okay and the `MPI_Recv()` will succeed --
however, the message can't be longer than the receive buffer you supply.

In the example so far, we've sent a single value. In the general case, one often
wants to send more than one value, i.e., a whole array of values, e.g.:

```c
  int a[5] = { 0, 1, 2, 3, 4 };
  MPI_Send(a, 5, MPI_INT, rank_dest, tag, MPI_COMM_WORLD);
```

and similarly on the receiving end.
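
The receiving end might then look like this sketch, where `rank_src` and `tag`
are placeholders that need to match what the sender used:

```c
  int a[5];  /* receive buffer -- must be at least as large as the message */
  MPI_Recv(a, 5, MPI_INT, rank_src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```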

#### Your turn

- Change your `message_mpi` code to send and receive large arrays, say 10000
  values, instead of the single integer "test".

### MPI debugging

As you may have noticed when doing the work from last class, MPI programs often
show an unfortunate characteristic of hanging forever, which is typically due
to bugs in the code you (or I) wrote. One debugging technique is to write down a
diagram showing the messages as arrows as they go from one process to the next.
Then compare this to what you wrote in the code. This is particularly useful in
conjunction with a second method (which could be automated, too): Before and
after you `MPI_Send()` or `MPI_Recv()`, print out a note saying what you're
going to do. That gives you a way to see where the code hangs, and also to
compare the logic of how messages are passed by your code to how you intended
them to be passed (the diagram).
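
For example, around a send you might do something like this sketch, where
`dest` and `tag` stand for whatever your code actually uses:

```c
  printf("[%d] sending test to %d...\n", rank, dest);
  MPI_Send(&test, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
  printf("[%d] sending test to %d done\n", rank, dest);
```

If the program hangs, the last message printed tells you which send or receive
it got stuck in.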

<!--
#### MPE (outdated)

*The following does work on fishercat, where I've installed [MPE](https://www.mcs.anl.gov/research/projects/perfvis/download/index.htm#MPE). It's a pretty outdated and rarely used tool, so I kept this section below in case you're interested, but hopefully later in this class we'll get to look at some more modern performance tools instead.*

If you don't want to put all the print statements into your code,
there's an automated way, if you're working on fishercat: Try the
following in the class repo, which contains the examples from last
class):

```sh
[kai@fishercat mpi]$ module load mpe2
[kai@fishercat mpi]$ cmake -DCMAKE_C_COMPILER="mpecc" -DCMAKE_C_FLAGS="-mpitrace" ..
[kai@fishercat mpi]$ make
[kai@fishercat mpi]$ mpirun -n 4 ./message_mpi_2
```

If you aren't running on fishercat, you may not have easy access to
`mpe2`, and you probably don't have to worry about it too much -- it
can be useful, but in general it prints out so much that one quickly
gets lost. You can even look at things graphically, we'll (hopefully)
get back to that in more detail later.
-->

### MPI debugging with print statements

Here's another low-tech way I use a lot for debugging MPI code -- I put these
helpers into a header file (e.g. `mpi_debug.h`) for easy access.

```cxx
#include <mpi.h>
#include <stdio.h>

#define mprintf(fmt...) do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] ", __rank); printf(fmt); } while (0)

#define MHERE do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] HERE: in %s() at %s:%d\n", __rank, __FUNCTION__, __FILE__, __LINE__); } while (0)
```

You can then just sprinkle `MHERE;` around your code, and it's going to print
out a message saying where this MHERE happened (or not, if the code hangs before
even getting to it).

And `mprintf()` works just like `printf()`, but it'll prefix the message with
`[###]`, where `###` is the process rank that printed the message, which can be
useful.
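
For example (the outputs shown in the comments are just illustrations of the
format, with made-up values):

```c
  MHERE;                         /* e.g. "[2] HERE: in main() at message_mpi.cxx:17" */
  mprintf("test = %d\n", test);  /* e.g. "[2] test = 99" */
```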

### Prefixing output with the MPI process rank

The above macros have the advantage that they work with every MPI
implementation, and always give you the same formatting. Various MPI
implementations also support prefixing
output with where it came from, without having to change the code, but the
details vary.

For example, with OpenMPI, you can use

```
mpirun --tag-output -n 4 ...
```

### Your turn: the integrate example

- Now we're in a position to go back to the examples we previously used to
  introduce OpenMP. Take the `test_integration` (pi/4) example, and parallelize
  it with MPI, using the basic message passing functions we learned about so
  far. It may be helpful to make it run on two processes first, and from there
  go to an arbitrary number.

  You should only use the 6 basic MPI functions that we learned about (and maybe
  `MPI_Wtime()`, see below), i.e., don't use `MPI_Reduce()` even if you know
  about it.

  Ideally, the code should work correctly if it's run on an arbitrary number of
  procs -- at the very least, it should complain if, e.g., you're calculating
  the sum of 100 trapezoids on 3 procs and it doesn't know how to handle that.
  Failing explicitly is much better than just giving an incorrect result.

  Run your parallel program with a varying number of processes and time its
  execution. `time` might be good enough for that, though of course it is
  usually better to add timing directly into the program, too. There's an MPI
  function to help with that, very similar to the `Wtime()` one I provided
  earlier in this class.

```c
  double my_time = MPI_Wtime();
```
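
`MPI_Wtime()` returns the wall-clock time in seconds since some arbitrary point
in the past, so a typical pattern is to take the difference of two calls around
the region you want to time (a sketch):

```c
  double t_start = MPI_Wtime();
  /* ... the work you want to time ... */
  double t_elapsed = MPI_Wtime() - t_start;
  printf("[%d] elapsed: %g s\n", rank, t_elapsed);
```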

- Analyze the parallel scaling of the code.

### Some references on MPI

- <https://mpitutorial.com>
- <https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html>

## Homework

- Finish the "Your turn" steps above, make commits and add comments to the Feedback PR.
<!-- 
- Since we didn't get all that far in Class 22, the combined homework for Class
  21 and Class 22 is the "Your turn" from class 21, and the first "Your turn"
  here, that is:

> Change your `message_mpi` code to send and receive large arrays, say 10000
> values, instead of the single integer "test".

- Reading the notes about MPI debugging may come in handy, though. -->
