OpenMP III: Private vs Shared Variables#

Today’s assignment contains my (serial) implementation of the trapezoidal rule integration example as the starting point.

Parallelizing with OpenMP#

Distributed memory#

Nodes in a supercomputer or cluster are basically independent computers which are connected via some kind of network / interconnect. As such, if one runs a program in parallel, every node will run that same program (or sometimes a different one) independently. The program on node001 has its own memory, variables, arrays, etc., and no way to directly see the variables or arrays that the program on node002 works with. This kind of system is referred to as “distributed memory” – each node has its own memory, and only that node can access it. The only way to solve a problem collaboratively is for the nodes to explicitly talk to each other via the network, sending messages across the wire (or fiber). The standard API (Application Programming Interface) for doing so is MPI (Message Passing Interface), which is basically a library that abstracts away the specifics of the network hardware, the network protocol, the operating system’s networking API, etc., and simply allows you to send and receive messages (and a bunch of other things, of course…)

Shared memory#

The situation is different if you have, e.g., a dual processor system, or even multiple cores in a single processor. For those kinds of systems, each processor / core has access to all of the system’s memory. For example, our new Cray “Marvin” has 2 AMD EPYC 7702 64-Core processors per node.

Now, even if all processors/cores can access all of the RAM, that doesn’t actually mean that one instance of your code can access the data of a second instance of your code running on the same machine. Modern operating systems have the concept of processes. Each process is basically one instance of a program together with its data. You can have a lot of processes running on a system, more than the number of cores, in which case they will share the available CPU time by being assigned timeslices. Processes are isolated from each other – normally a process cannot spy on another process’s memory, and if one process crashes, the rest of the system should be unaffected.

A newer concept is that of a thread. A process may create multiple threads, which can execute in parallel – given CPU availability. Having multiple threads basically means that multiple pieces of code execute at the same time. However, threads do share data – each thread can access the entire memory of the process. So, multiple threads can work on the same data at the same time – which is good, because together they may be able to get the work done faster, but it also opens the door for a lot of subtle bugs which occur only once in a while depending on specific timing.

Probably the most commonly used interface for using threads in your program is the so-called pthreads or POSIX threads API. However, it is fairly involved to use and very generic in purpose, so people came up with a much simpler approach geared at scientific applications – OpenMP, where the compiler does most of the work for you; you just have to guide it using #pragmas, or special comments in Fortran.

The basic model for OpenMP is called “fork-join”. A program runs single-threaded until a parallel section is encountered. The execution flow then forks into multiple threads which all do their own thing, typically each solving part of the problem, and when all the parts are done, the threads join and execution continues as a single thread.

Motivation: test_integration#

You previously implemented the trapezoidal rule example, which calculated the area under \(\sqrt{1-x^2}\); in any case, I put my version into today’s assignment repo. By the way, what’s the right answer?

Parallelizing it is easy – just add #pragma omp parallel for before the for-loop.

Your turn#

  • Add timing to the test_integration example, and adjust the problem size such that it takes a measurable amount of time to run.

  • Parallelize the integration code and run it with different numbers of threads. What do you observe? (Hint: make sure it actually runs in parallel.)

Presumably, you will observe that two things aren’t so great about this:

  • We were hoping to get our answer twice as fast each time we double the number of threads. That didn’t quite happen – but at least things actually did get faster. So that’s a somewhat minor problem.

  • Unfortunately, our answer changed, i.e., became incorrect. That’s obviously a major problem.

OpenMP programming model#

In the following, you’ll look at some examples, which you can add to your repo and try out. To avoid compiler optimizations interfering, it may be helpful to also (or maybe first) look at what happens without optimization, i.e.:

cmake -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . # rebuild after changing the build type

or using VS Code’s “CMake: Select Variant” command to choose the Debug build.

Your turn#

Besides making sure that OpenMP really works, the test_openmp program is also useful to understand the basics of the fork-join model used in OpenMP.

  • Compile the “test_openmp” program and run it. What is its output?

#pragma omp parallel#

In C, the block (or single line) following the #pragma omp parallel executes as multiple threads; the code that follows afterwards continues single-threaded. As you can see, there is also a bunch of special omp functions that allow you to get more info, for example the current thread id. In most cases, those won’t even be needed – but they’re used here for demonstration purposes.

In Fortran, things are pretty similar: the pragmas are replaced by !$omp directives, and you need additional end markers since there is no concept of blocks in Fortran.

As we already saw last time, some environment variables can be used to change OpenMP behavior, e.g., to specify the number of threads the parallel section should run with:

[kai@fishercat build]$ OMP_NUM_THREADS=2 ./test_openmp
Hi, just starting.
Hello 0/2
Hello 1/2
just about done...

Private vs shared variables#

So far, we learned that OpenMP will fork a single-threaded program into multiple threads of execution within a parallel region (#pragma omp parallel), see also the diagram on [http://en.wikipedia.org/wiki/OpenMP wikipedia]. OpenMP is said to follow a “shared memory” paradigm, i.e., every thread shares the whole process’s memory, so if one thread modifies a variable and the next thread then looks at its value, it will see the modified value.

Now, if all variables were like that, it’d be difficult to get much work done, because threads would keep scribbling over each other’s variables. So OpenMP actually knows about two kinds of variables, private and shared ones. This is pretty similar to how automatic (local) variables in a function are specific to a particular instance of the function being called: if a function is called again, e.g., recursively, its local variables will exist a second time, with potentially different values. On the other hand, after a function returns, its variables are all gone; they don’t have any memory, so the next time the function is called, the values will be undefined (unless, of course, they are declared “static”, which changes the behavior).

Variables that are declared within a parallel section in OpenMP will automatically be private, like thread and n_threads in the example above. Private variables will not retain their values after a parallel section is done executing.

Your turn#

  • You can try declaring those variables outside of the parallel section and see what happens. You can also experiment with the code below, changing between shared and private. What do you expect? Is that what happens?

#include <stdio.h>
#include <omp.h>

int
main(int argc, char **argv)
{
  printf("Hi, just starting\n");

  int test = -1;
#pragma omp parallel shared(test)
  {
    int thread = omp_get_thread_num();
    int n_threads = omp_get_num_threads();
    printf("Hi, I'm thread %d of %d test=%d\n", thread, n_threads, test);
    test = thread;
  }
  printf("test = %d\n", test);
  printf("I'm about to be done.\n");
  return 0;
}

When you do this investigation, turn off optimization in the compiler (see above) to get consistent results.

If a variable like test above is listed as neither private nor shared, what happens? How about thread and n_threads? Explicitly declare them shared / private and see what happens.

Does the behavior change with optimization enabled? If so, what might be the reason?

Your turn#

I’ve rewritten the avg() function from the class repo slightly to look as follows (but no, I didn’t check it in, so to try it, you’ll need to change it yourself):

xt::xtensor<double, 1> avg(const xt::xtensor<double, 1>& f_nc)
{
  int i;
  double val1, val2, sum;

  xt::xtensor<double, 1> f_cc = xt::empty<double>({N});

#pragma omp parallel for
  for (i = 0; i < N; i++) {
    val1 = f_nc(i);
    val2 = f_nc(i + 1);
    sum = .5 * (val1 + val2);
    f_cc(i) = sum;
  }

  return f_cc;
}

The only difference is that I introduced a couple of temporary variables, and declared all variables at the beginning of the function, as one had to do in standard C until C99 came out.

  • How do you think these variables should be declared for OpenMP – shared or private? What would happen if you picked the other kind?

  • What do you think we should use for the test_integration example? Should the sum there be calculated as a private variable? Or would a shared variable work?

Race conditions#

It’s quite easy to parallelize programs using OpenMP. An important reason why this is the case is the shared-memory paradigm, i.e., all threads can access all of the memory freely, so the programmer doesn’t need to make any special provisions. However, this programming model also creates a hazard for subtle (and sometimes not so subtle) bugs in your code, as you’ve already seen for the integration example, which quickly calculates the wrong answer.

This happens because different threads may access the same memory location at the same time. This is not usually a problem for read-only access – all threads will just see the same data values. That’s what happened in the vector average case: two threads may be reading the same input value at the same time, as needed to calculate their results. Each result, however, was then written to a unique memory location, i.e., one thread would calculate the ith average for one particular value of i and write it into f_cc(i), and no other thread would ever write into that same location. So the write accesses are not a problem there either, since no concurrent writes to the same location occur.

In the integration example, however, multiple threads add their trapezoid results to the shared sum variable, which clearly involves writing to it. The “add” operation actually happens as a read-modify-write sequence, and if that is done concurrently, things may (and eventually will) go wrong.

Let’s say the value in sum is 0.4 at some point. Two threads want to add their trapezoid area, thread 0 calculated 0.1 and thread 1 calculated 0.2 for their respective trapezoid. The resulting new value of sum should, of course, be 0.7. However, what really happens might be the following:

Thread 0 reads the old value of sum (0.4). It then adds its value to it, obtaining 0.4 + 0.1 = 0.5, and writes the new value back, so sum is now 0.5. If thread 1 started its business now, everything would be great. But the point of doing this with multiple threads is of course to do it concurrently, not sequentially. So thread 1 may in fact run at the same time and also read the old value of sum (0.4). It then does its add, obtaining 0.4 + 0.2 = 0.6, and writes its answer back.

The final result in sum depends on which thread wins the “race”. It may be 0.5 or 0.6, but either way, that’s wrong. Code where the actual result depends on the timing of thread execution is said to have a “race condition”, and that pretty much means it’s buggy. It’s possible that things will work out right almost all the time, which leads to really evil bugs… (very hard to find)

Homework#

  • Finish the in-class exercises (“Your turn”) and write a report in the usual fashion (describe what you did, copy&paste the output you got, try to make sense of what happened).

Further Reading#