MPI I: MPI in 6 functions, sending and receiving messages#

We’ll keep using last class’s assignment repo.

MPI in 6 functions#

Let’s start with the universal “Hello world” example. Actually, we’ve already dealt with that in OpenMP, and it’s worthwhile to compare OpenMP and MPI, as there are many things in common, but also important differences. Sign up for today’s assignment. You’ll find a very simple test_mpi.cxx:

#include <iostream>

int main()
{
  std::cout << "Hello, non-MPI!\n";
  return 0;
}

This program does not have any MPI in it yet, but you can still run it in parallel, the same way that all MPI programs are run:

[kai@macbook ~]$ mpirun -n 4 ./test_mpi
...

Your turn#

Make a guess what you think will happen, and then try and see.

mpirun is a tool that comes with MPI that will run a program in parallel, and the number after -n is how many instances of the program you want to run. In this case, we ran 4 instances of the program at the same time, and they, not too surprisingly, all did the same thing, that is, each printed the same message.

MPI_Comm_rank() and MPI_Comm_size()#

Our OpenMP hello program at least told us which thread is running, and how many total we have – and MPI can do that, too. Here are our first two MPI functions, pretty similar to OpenMP’s omp_get_thread_num() and omp_get_num_threads():

#include <iostream>
#include <mpi.h>

int main()
{
  int rank, size;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  return 0;
}

There are basically just two differences to the OpenMP versions:

  • (Almost) all MPI functions return an error code, by convention often called ierr. Because of that, they cannot actually return, e.g., the rank (which corresponds to the thread_id, just enumerating the processes); instead, rank gets passed by reference (or pointer, to be exact) and will be set by MPI_Comm_rank().

  • Most MPI functions take an argument called a “communicator”. We’ll get to those later; for now, we just always use MPI_COMM_WORLD.

However, it is important to note that MPI does not use threads at all, since the term “thread” refers to multiple threads of execution that have access to the same shared memory. Instead, MPI runs multiple separate processes, called MPI processes, each with its own private memory.

Well, so far, everything looks pretty similar to OpenMP, but we’re actually still a bit away from even getting the simple example to work. In the following, it’ll probably be clear that MPI was designed to provide access to certain functionality – not necessarily to make everything easy.

All MPI functions are declared in the header <mpi.h>. They are all documented by man pages, that is man MPI_Comm_rank will tell you more about that function, and also show you how to call it from Fortran. Although you may prefer to google/chatgpt that information (see also the references below).

To compile the code, as you may have noticed, we’re not using g++ anymore, but mpicxx instead. People refer to mpicxx as the MPI C++ compiler, which makes it sound like it’s really a special compiler. However, mpicxx is really just a wrapper around GNU’s C++ compiler g++ (or potentially other compilers); depending on your MPI implementation, you can see the underlying compiler command with mpicxx -show (MPICH) or mpicxx --showme (OpenMPI). When you’re using MPI, you’re still using just standard C++ (or Fortran – mpifort – or C – mpicc –), as opposed to OpenMP, which actually needs specific compiler support. MPI is really just a C (or Fortran) library which adds a bunch of functions, written by other people for you, that you can call, and those are declared in the <mpi.h> header. But the language is still plain C / C++ / Fortran.

cmake has its own way of dealing with MPI – similar to OpenMP, in fact, by using find_package(MPI). cmake may or may not end up using the compiler wrappers – it usually goes and calls the original compilers directly, adding include paths and libraries for MPI just as it does for other libraries one may be using (like googletest or xtensor).
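For reference, a minimal CMakeLists.txt along those lines might look like the following sketch (the project and target names here are just placeholders):

```cmake
cmake_minimum_required(VERSION 3.10)
project(mpi_intro LANGUAGES CXX)

# Locate the MPI installation; fail with an error if none is found
find_package(MPI REQUIRED)

add_executable(test_mpi test_mpi.cxx)
# The imported target adds MPI's include paths and libraries,
# so no compiler wrapper is needed
target_link_libraries(test_mpi PRIVATE MPI::MPI_CXX)
```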

Your turn#

  • Change test_mpi.cxx to the above and run it. You may find it still doesn’t work right. (What’s the error message?)

The program complains that MPI was not initialized properly. Before we can use any other MPI functions, we need to initialize MPI, and we should shut it down in the end, too.

MPI_Init() and MPI_Finalize()#

There’s really not too much to say about these two functions – you should call MPI_Init near the beginning of your program, and MPI_Finalize() near the end.

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hi there " << rank << "/" << size << "\n";

  MPI_Finalize();
  return 0;
}

Your turn#

  • Run it – what’s the output?

[kai@lolcat mpi]$ mpirun -n 4 ./test_mpi
[...]

Four MPI functions later, and our “hello world” finally works!

Just to reiterate, mpirun runs the whole program n times in parallel. In OpenMP, only parallel sections were run in parallel, in MPI it’s always the entire program, and MPI_Init()/MPI_Finalize() don’t change that.

Your turn#

I’ve made a more explicit example of the fork-join concept in OpenMP versus MPI processes running concurrently throughout the entire lifetime of the program: In omp/test_openmp.cxx you’ll find the basic demonstration of how OpenMP has parallel sections from a previous class. Execution forks into multiple threads, which join again into the serial / single-threaded part of the program.

  • Change your test_mpi.cxx to follow the test_openmp.cxx example, and compare the output of the two programs.

  • You can run the MPI program with mpirun -n 4 ./test_mpi, and the OpenMP program with ./test_openmp (or OMP_NUM_THREADS=4 ./test_openmp to specify the number of threads). Compare the output of the two programs – what’s similar, what’s different? Discuss!

To do something useful, we want to be able to do what MPI stands for, that is, pass messages. Which means, sending and receiving messages:

MPI_Send() and MPI_Recv()#

Let’s first demonstrate the need for this. I’ve added an omp/test_message.cxx, which is an OpenMP program where one thread sets a variable test to 99, and that value is available to all other threads, since it’s a shared variable.

  int test = 0;
#pragma omp parallel shared(test)
  {
    int thread_id = omp_get_thread_num();
    int n_threads = omp_get_num_threads();
    if (thread_id == 0) {
      test = 99;
    }
#pragma omp barrier
    printf("Hello %d/%d test = %d\n", thread_id, n_threads, test);
  }

Your turn#

  • Run omp/test_message. Think about the output you expect, and then run it and think whether the results make sense.

As a side remark, this example uses #pragma omp barrier, which we didn’t really cover before. What it does is synchronize the threads at this point: threads will in general be executed concurrently, but not synchronously, so it would otherwise be possible that thread 1 already printed out the message saying what test is before thread 0 set the value to 99. The barrier makes sure that all threads have completed the part before the barrier before any of them continues past it, so no thread prints out the message before thread 0 is done setting test to 99.

  • Let’s do the same thing in MPI. Create a message_mpi.cxx with the following code in main() (you’ll need to do some more work to turn this into a compilable program):

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {
    test = 99;
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  // or if you like it C++-y
  // std::cout << "Hello " << rank << "/" << size << " test = " << test << "\n";

Your turn#

  • Run message_mpi and compare it to omp/test_message. Is the variable test in the MPI code shared or private?

Since, as it turns out, all variables in an MPI program are private, test is private on each process: if process 0 sets it, no other process will know about the new value unless the change is explicitly communicated using MPI_Send() and MPI_Recv().

Your turn#

Change the code as follows:

  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  }

In this example, processor 0 sets test to 99. It then sends the value of test to processor 3. MPI_Send() takes quite a few arguments, which makes it a little difficult to digest. The first argument is an array of data; since we have only one single value, we need to use & to make it appear like an array (of length 1). Next, we have to pass the number of elements in the array, ie., 1. Then comes the type of the data, in this example MPI_INT, since we’re passing an integer (there’s also MPI_DOUBLE, etc.). Of course MPI needs to know where the message should be sent, and in this case I decided I want to send it to processor 3. Finally, we have a tag (123) and a communicator. All you need to know about the tag for now is that it’s an integer, and the corresponding MPI_Recv() will need to have that same tag.

MPI_Recv() is fortunately pretty much the same. The received message will be written into the space indicated by the first argument, in this case also the test variable (processor 3’s version of it). In MPI_Recv() you don’t need to say what the target processor is (it’s the one that is running the MPI_Recv() call), but rather where the message came from; in this example, it was processor 0 that sent it. There’s an additional status argument, for which MPI_STATUS_IGNORE is good enough for now.

Your turn#

  • You know the spiel: think about what you expect from the code above, then run it and compare.

kai@macbook mpi $ mpirun -n 4 ./message_mpi
[...]

Processor 0 has test = 99, since this is where we set the value in the first place. It was then sent to processor 3, which therefore also knows it’s 99. And the other processors didn’t take part in the communication, so they know nothing about it…

And that’s it. You’ve learned about the 6 functions, and everything can be done in terms of them, though not necessarily in the best or fastest way. So there are some more functions we’ll really need to learn about, too.

Your turn#

  • Change message_mpi.cxx so that the value of test on rank 0 gets communicated to all other processes. You can start by assuming that the code is run with 4 processes total, but it’d be nice if it worked for any number of processes, too.

More on Using MPI_Send() and MPI_Recv()#

For reference, here is my message_mpi.cxx, which only sends the test value to proc 3, not to all procs yet:

#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int test = 0;

  if (rank == 0) {
    test = 99;
    MPI_Send(&test, 1, MPI_INT, 3, 123, MPI_COMM_WORLD);
  } else if (rank == 3) {
    MPI_Recv(&test, 1, MPI_INT, 0, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  printf("Hello %d/%d test = %d\n", rank, size, test);
  MPI_Finalize();
  return 0;
}

Pairing MPI_Send() and MPI_Recv()#

As you’ve seen, MPI_Send() and MPI_Recv() are fundamental, and really the most important, functions in MPI, as they allow you to send a message from one process to another; that’s what “message passing” is all about. They need to be paired properly, ie., if proc a is passing a message to proc b, proc a needs to MPI_Send() to b, and proc b needs to have a corresponding MPI_Recv() from a. The datatypes need to match, as do the tags. It’s okay to have a “short” receive, ie., you may ask to receive, e.g., 20 integers, but if the sender only sends 10, that’s okay and the MPI_Recv() will succeed – however, the message can’t be longer than the receive buffer you supply.

In the example so far, we’ve sent a single value; in the general case, one often wants to send more than one value, ie., a whole array of values, e.g.:

  int a[5] = { 0, 1, 2, 3, 4 };
  MPI_Send(a, 5, MPI_INT, rank_dest, tag, MPI_COMM_WORLD);

and similarly on the receiving end.

Your turn#

  • Change your message_mpi code to send and receive large arrays, say 10000 values, instead of the single integer “test”.

MPI debugging#

As you may have noticed when doing the work from last class, MPI programs often show the unfortunate characteristic of hanging forever, which is typically due to bugs in the code you (or I) wrote. One debugging technique is to write down a diagram showing the messages as arrows as they go from one process to the next, and then compare this to what you wrote in the code. This is particularly useful in conjunction with a second method (which could be automated, too): before and after you MPI_Send() or MPI_Recv(), print out a note saying what you’re going to do. That gives you a way to see where the code hangs, and also to compare the logic of how messages are passed by your code to how you intended them to be passed (the diagram).

MPI debugging with print statements#

Here’s another low-tech way I use a lot for debugging MPI code – I put these helpers into a header file (e.g. mpi_debug.h) for easy access.


#include <mpi.h>
#include <stdio.h>

#define mprintf(fmt...) do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] ", __rank); printf(fmt); } while (0)

#define MHERE do { int __rank; MPI_Comm_rank(MPI_COMM_WORLD, &__rank); printf("[%d] HERE: in %s() at %s:%d\n", __rank, __FUNCTION__, __FILE__, __LINE__); } while (0)

You can then just sprinkle MHERE; around your code, and it’s going to print out a message saying where this MHERE happened (or not, if the code hangs before even getting to it).

And mprintf() works just like printf(), but it’ll prefix the message with [###], where ### is the process rank that printed the message, which can be useful.

Prefixing output with the MPI process rank#

The above macros have the advantage that they work with every MPI implementation and always give you the same formatting. Various MPI implementations also support prefixing output with where it came from, without having to change the code, but the details vary.

For example, with OpenMPI, you can use

mpirun --tag-output -n 4 ...

Your turn: the integrate example#

  • Now we’re in a position to go back to the examples we previously used to introduce OpenMP. Take the test_integration (pi/4) example, and parallelize it using MPI using the basic message passing functions we learned about so far. It may be helpful to make it first run on two processes, and from there go to an arbitrary number.

    You should only use the 6 basic MPI functions that we learned about (and maybe MPI_Wtime(), see below), ie., don’t use MPI_Reduce() even if you know about it.

    Ideally, the code should work correctly if it’s run on an arbitrary number of procs – at the very least, it should complain if, e.g., you’re calculating the sum of 100 trapezoids on 3 procs and it doesn’t know how to handle that. Failing explicitly is much better than just giving an incorrect result.

    Run your parallel program with a varying number of processes and time its execution. time might be good enough for that, though of course it is usually better to add timing directly into the program, too. There’s an MPI function to help with that, very similar to the Wtime() one I provided earlier in this class:

      double my_time = MPI_Wtime();

  • Analyze the parallel scaling of the code.

Some references on MPI#

Homework#

  • Finish the “Your turn” steps above, make commits and add comments to the Feedback PR.