MPI IV: MPI pitfalls and non-blocking communication#
Today’s assignment.
The assignment repo is based on last class’s repo and contains the additional series of commits I talked about last class on organizing code (in this case, the code for MPI 1-d domain decomposition) to avoid duplication and to separate computational/parallelization aspects from the numerical implementation.
MPI pitfalls#
You may well already have noticed that when using MPI communications, it’s not too difficult to get our programs to misbehave (of course, it’s not fair to blame only MPI; typically the problem is how we use it). This may take the form of your program crashing with an error message, or it may hang indefinitely, doing nothing (nothing useful, anyway). In the latter case, Ctrl-C will interrupt and kill it, by the way.
Sometimes the problems can be quite subtle, and code that works just fine on one machine, or for one case, stops working when you move to another machine or change parameters, like the size of your domain.
#include <mpi.h>

#include <cstdlib>
#include <iostream>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Make sure we're being run on (exactly) 2 procs
  if (size != 2) {
    if (rank == 0) {
      std::cerr << "This program needs to be run on exactly 2 MPI processes.\n";
    }
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  if (argc != 2) {
    // This is a nice way of checking for an error
    if (rank == 0) {
      std::cerr << "Usage: " << argv[0] << " <N>\n"
                << "where <N> is the size of the message to exchange.\n";
    }
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  int N = std::atoi(argv[1]);

  std::vector<double> buf_send(N), buf_recv(N);
  if (rank == 0) {
    MPI_Send(buf_send.data(), N, MPI_DOUBLE, 1, 1234, MPI_COMM_WORLD);
    MPI_Recv(buf_recv.data(), N, MPI_DOUBLE, 1, 1234, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    MPI_Send(buf_send.data(), N, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD);
    MPI_Recv(buf_recv.data(), N, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  }

  MPI_Finalize();
  return 0;
}
Your turn#
This program demonstrates one of the problems you may encounter. It must be run on exactly two processes; otherwise it complains right away. It simply exchanges messages between the two processes, and it makes it easy to change the size of the message on the command line. You may find it quite similar in principle to the code you (and I) wrote to fill ghost points, though it sends more than just a single ghost-point value.
Add the code above to the repo by copy & pasting it into a new file
mpi_dl_test.cxx in the assignment repo and adding it to the build. Run it like
[kai@mbpro mpi]$ mpirun -n 2 ./mpi_dl_test 10
In this case, 10 double-precision numbers will be sent between the two procs.
Try different message sizes by changing the ‘10’ on the command line to other numbers. What do you observe? Why? (If you don’t find anything weird, you may have to change your message size more substantially.) You may want to put in some print statements to see what’s up.
How can you fix the problem?
Your turn#
Fix the
mpi_dl_test code so that it’s always correct.
UNH’s “Premise” supercomputer#
UNH operates a cluster called “Premise”, which is more or less a traditional cluster of commodity hardware. It is available for use by UNH students, faculty and staff, and is a good place to get a feel for using a real supercomputer, to learn how to use a batch system, to have a system on which one can actually perform parallel scaling analysis without interference, or to use for your final project.
It’s optional, but if you want more information, including how to request an account, check out https://premise.sr.unh.edu/. If you do request an account, please let me know as well, so I can help coordinate with the Research Computing Center on how to make that work efficiently.
Homework#
As usual, follow the steps above, make commits and add comments in the feedback pull request. After the solutions are posted, add a reflection statement.