# OpenMP III: Private vs Shared Variables

Today's [assignment](https://classroom.github.com/a/Rmr-NWbz) contains my (serial) implementation of the trapezoidal rule integration example as the starting point.

## Parallelizing with OpenMP

### Distributed memory

Nodes in a supercomputer or cluster are basically independent computers which
are connected via some kind of network / interconnect. As such, if one runs a
program in parallel, every node will run that same program (or sometimes a
different one) independently. The program on node001 has its own memory,
variables, arrays, etc., and no way to directly see the variables or arrays
that the program on node002 works with. This kind of system is referred to as
"distributed memory" -- each node has its own memory, and only that node can
access it. The only way to solve a problem collaboratively is for the nodes to
explicitly talk to each other via the network, sending messages across the wire
(or fiber). The standard API (Application Programming Interface) for doing so is
MPI (Message Passing Interface), which is basically a library that abstracts
away the specifics of the kind of network, the network protocol, the operating
system's API for network communication, etc., and simply lets you send and
receive messages (and do a bunch of other things, of course...)

### Shared memory

The situation is different if you have, e.g., a dual processor system, or even
multiple cores in a single processor. For those kinds of systems, each processor
/ core has access to all of the system's memory. For example, our new Cray
"Marvin" has 2 AMD EPYC 7702 64-Core processors per node.

Now, even if all processors/cores can access all of the RAM, that doesn't
actually mean that one instance of your code can access the data of a second
instance of your code running on the same machine. Modern operating systems have
the concept of processes. Each process is basically one instance of a program
together with its data. You can have a lot of processes running on a system, more
than the number of cores, in which case they will share the available CPU time
by being assigned timeslices. Processes are isolated from each other -- normally
a process cannot spy at another process's memory, and if one process crashes,
the rest of the system should be unaffected.

A newer concept is that of a thread. A process may create multiple threads,
which can execute in parallel -- given CPU availability. Having multiple threads
basically means that multiple pieces of code execute at the same time. However,
threads do share data -- each thread can access the entire memory of the
process. So, multiple threads can work on the same data at the same time --
which is good, because together they may be able to get the work done faster,
but it also opens the door for a lot of subtle bugs which occur only once in a
while depending on specific timing.

Probably the most commonly used interface for using threads in your program is
the so-called pthreads (POSIX threads) API. However, it is fairly involved to use
and very generic in purpose, so people came up with a much simpler way geared at
scientific applications -- OpenMP, where the compiler does most of the work for
you; you just have to guide it using `#pragma`s in C/C++ or special comments in
Fortran.

The basic model for OpenMP is called "fork-join". A program runs
single-threaded until a parallel section is encountered. The execution flow then
forks into multiple threads which all do their own thing, typically each solving
part of the problem, and when all the parts are done, the threads join and
execution continues as a single thread.

## Motivation: test_integration

You previously implemented the trapezoidal rule example, which calculates the area under
$\sqrt{1-x^2}$; in any case, I put it into today's assignment repo. By the way, what's the right answer?

Parallelizing it is easy -- just add `#pragma omp parallel for` before the
for-loop.

#### Your turn

* Add timing to the test_integration example, and adjust the problem size such that it takes a measurable amount of time to run.

* Parallelize the integration code and run it with different numbers of threads. What do you observe? (Hint: make sure it actually runs in parallel.)

<!-- ```sh
$ OMP_NUM_THREADS=1 ./test_integration
took 0.330129 sec.
integral is approximately 0.785398
$ OMP_NUM_THREADS=2 ./test_integration
took 0.243158 sec.
integral is approximately 0.384251
$ OMP_NUM_THREADS=4 ./test_integration
took 0.143603 sec.
integral is approximately 0.212945
``` -->

Presumably, you will observe that two things aren't so great about this:

- We were hoping to get our answer twice as fast each time we double the number
  of threads. That didn't quite happen -- but at least things actually did get
  faster. So that's a somewhat minor problem.

- Unfortunately, our answer changed, i.e., became incorrect. That's obviously a
  major problem.

## OpenMP programming model

In the following, you'll look at some examples, which you can add to your repo
and try out. To avoid compiler optimizations interfering, it may be helpful to
also (or maybe first) look at what happens without optimization, i.e.:

```sh
cmake -DCMAKE_BUILD_TYPE=Debug ..
```

or using VS Code's "CMake: Select Variant" command to choose the Debug build.

#### Your turn

Besides making sure that OpenMP really works, the `test_openmp` program is also
useful for understanding the basics of the fork-join model used in OpenMP.

- Compile the `test_openmp` program and run it. What is its output?

### `#pragma omp parallel`

In C, the block (or single line) following a `#pragma omp parallel` executes
on multiple threads; the code that follows afterwards continues single-threaded.
As you can see, there are also a bunch of special omp functions that allow you
to get more info, for example the current thread id. In most cases, those won't
even be needed -- but they're used here for demonstration purposes.

In Fortran, things are pretty similar: the pragmas are replaced by `!$omp`
directives, and you need additional end markers since there is no concept of
blocks in Fortran.

As we already saw last time, some environment variables can be used to change
OpenMP behavior, e.g., to specify the number of threads the parallel section
should run with:

```sh
[kai@fishercat build]$ OMP_NUM_THREADS=2 ./test_openmp
Hi, just starting.
Hello 0/2
Hello 1/2
just about done...
```

### Private vs shared variables

So far, we learned that OpenMP forks a single-threaded program into multiple
threads of execution within a parallel region (`#pragma omp parallel`); see also
the diagram on [Wikipedia](http://en.wikipedia.org/wiki/OpenMP). OpenMP is said
to follow a "shared memory" paradigm, i.e., every thread shares the whole
process's memory, so if one thread modifies a variable and then another thread
looks at its value, it will see the modified value. Now, if all variables were
like that, it'd be difficult to get much work done, because threads would keep
scribbling over each other's variables. So OpenMP actually knows about two kinds
of variables, private and shared ones. This is pretty similar to how automatic
(local) variables in a function are specific to a particular instance of the
function being called: If a function is called again, e.g., recursively, its
local variables will exist a second time, with potentially different values. On
the other hand, after a function returns, its variables are all gone; they don't
have any memory, so the next time the function is called, the values will be
undefined (unless, of course, they are declared `static`, which changes the
behavior). Variables that are declared within a parallel section in OpenMP are
automatically private, like `thread` and `n_threads` in the example above.
Private variables do not retain their values after a parallel section is done
executing.

#### Your turn

- You can try declaring those variables outside of the parallel section and see
  what happens. You can also experiment with the code below, changing between
  `shared` and `private`. What do you expect? Is that what happens?

```c
#include <stdio.h>
#include "omp.h"

int
main(int argc, char **argv)
{
  printf("Hi, just starting\n");

  int test = -1;
#pragma omp parallel shared(test)
  {
    int thread = omp_get_thread_num();
    int n_threads = omp_get_num_threads();
    printf("Hi, I'm thread %d of %d test=%d\n", thread, n_threads, test);
    test = thread;
  }
  printf("test = %d\n", test);
  printf("I'm about to be done.\n");
  return 0;
}
```

When you do this investigation, turn off optimization in the compiler (see
above) to get consistent results.

If a variable, like `test` above, is listed as neither `private` nor `shared`,
what happens? How about `thread` and `n_threads`? Explicitly declare them
`shared` / `private` and see what happens.

Does the behavior change with optimization enabled? If so, what might be the
reason?

#### Your turn

I've rewritten the `avg()` function from the class repo slightly to look as
follows (but no, I didn't check it in, so to try it, you'll need to change it
yourself):

```cpp
xt::xtensor<double, 1> avg(const xt::xtensor<double, 1>& f_nc)
{
  int i;
  double val1, val2, sum;

  xt::xtensor<double, 1> f_cc = xt::empty<double>({N});

#pragma omp parallel for
  for (i = 0; i < N; i++) {
    val1 = f_nc(i);
    val2 = f_nc(i + 1);
    sum = .5 * (val1 + val2);
    f_cc(i) = sum;
  }

  return f_cc;
}
```

The only difference is that I introduced a couple of temporary variables and
declared all variables at the beginning of the function, as one had to do in
standard C until C99 came out.

- How do you think these variables should be declared for OpenMP -- shared or
  private? What would happen if you picked the other kind?

- What do you think we should use for the `test_integrate` example? Should the
  sum there be calculated as a private variable? Or would a shared variable
  work?

## Race conditions

It's quite easy to parallelize programs using OpenMP. An important reason why
this is the case is the shared-memory paradigm, i.e., all threads can access all
of the memory freely, so the programmer doesn't need to make any special
provisions. However, this programming model also creates a hazard for subtle and
sometimes not-so-subtle bugs in your code, as you've already seen for the
integration example, which quickly calculates the wrong answer.

This happens because different threads may access the same memory location at
the same time. This is not usually a problem for read-only access -- all threads
will just see the same data values. That's what happened in the vector average
case: two threads may read the same input value at the same time as needed to
calculate their results. Each result, however, was then written to a unique
memory location, i.e., one thread would calculate the ith average for one
particular value of `i` and write it into `f_cc(i)`, and no other thread would
ever write into that same location. So the write accesses are not a problem
there either, since no concurrent writes to the same location occur.

In the integration example, however, multiple threads would add their trapezoid
result to the shared `sum` variable, which clearly involves writing the result
there. The entire "add" operation in fact actually happens as read-modify-write,
and if that is done concurrently, things may (and eventually will) go wrong.

Let's say the value in `sum` is `0.4` at some point. Two threads want to add
their trapezoid area, thread 0 calculated `0.1` and thread 1 calculated `0.2`
for their respective trapezoid. The resulting new value of sum should, of
course, be `0.7`. However, what really happens might be the following:

Thread 0 reads the old value of `sum` (`0.4`). It then adds its value to it,
obtaining `0.4 + 0.1 = 0.5`. It then writes the new value back, so sum is now
`0.5`. If thread 1 started its business now, everything would be great. But the
point of doing this with multiple threads is of course to do it concurrently,
not sequentially. So thread 1 may in fact run at the same time, and read the old
value of sum (`0.4`), too. It then does the add, obtaining `0.4 + 0.2 = 0.6`. It
now writes its answer back. The final result in `sum` depends on which thread
wins the "race". It may be `0.5` or `0.6`, but either way, that's wrong. Having
code where the actual result depends on the timing of thread execution is never
a good thing; such code is said to have a "race condition", and that pretty much
means it's buggy. It's possible that things will work out right almost all the
time, which leads to really evil bugs... (very hard to find)

<!-- ### First attempt to avoid the race condition

#### Your turn

Now that we understand that updating the shared variable `sum` is the root of
the problem, we can update the code to calculate the partial sums in a private
variable, and then only calculate the whole sum in the end:

```c
  double t1 = WTime();
  double sum = 0.;
#pragma omp parallel
  {
    double local_sum = 0.;
#pragma omp for
    for (int i = 0; i < N; i++) {
      double x0 = i * dx;
      double x1 = (i+1) * dx;
      local_sum += .5 * (f(x0) + f(x1)) * dx;
    }
    sum += local_sum;
  }
  double t2 = WTime();
```

The `#pragma omp parallel for` has been split into the two primitives it
actually contains, `#pragma omp parallel` to start a parallel region executing
with multiple threads, and then `#pragma omp for` to distribute the following
for loop onto those threads.

- Does this code work right? Can you fix it by adding the `local_sum`s to `sum`?
  Does it give the right answer then? Is it correct? How's the performance?

### Critical sections

A way to deal with concurrent writes / updates to the same memory location is to
actually avoid updating them concurrently. OpenMP provides a way to serialize a
section within a parallel region, using `#pragma omp critical`:

```c
  double t1 = WTime();
  double sum = 0.;
#pragma omp parallel for
  for (int i = 0; i < N; i++) {
    double x0 = i * dx;
    double x1 = (i+1) * dx;
    double val = .5 * (f(x0) + f(x1)) * dx;
#pragma omp critical
    sum += val;
  }
  double t2 = WTime();
```

The code following `#pragma omp critical` is executed in a special way: (In this
case, "the special way" just applies to the one line following the pragma. To
have multiple instructions in the critical section, enclose them with
`{ ... }`). OpenMP guarantees that at most one thread will execute the code in
the critical section at a time. So if two threads want to execute `sum += val;`,
one will win and run first. The other thread has to wait until the first thread
is all the way done, only then will it execute.

#### Your turn

- Does the code above work right? Is it correct? How's the performance? If it
  requires more patience than you have, be creative.

### Atomic operations

Atomic operations are simple operations that can be executed as one complete
unit, essentially having the same effect as doing that operation in a critical
section. Processors provide certain operations as atomics, typically simple
operations like adding two integers or doing logical and/or. `sum += val;` can
be preceded be `#pragma omp atomic` instead of `#pragma omp critical`, having
exactly the same effect (guaranteeing correctness of the code), but better
performance.

#### Your turn

- With (**only!**) what we've learned so far, create a version of the
  integration code that's actually correct but still provides good (parallel)
  performance.

### OpenMP reductions

One of the recurring themes in parallel programming is that you want to reduce
what's happening across all of your threads (or processors) to a single number.
For example, each processor may measure how long it takes to do its share of the
work. In the end, you want to know the total time that all processors spent, so
you have to reduce the many numbers to just one, the sum of them all. OpenMP
supports some basic reductions, like, e.g., the sum.

Here is an example on how to use it:

```c
#include <stdio.h>
#include "omp.h"

int
main(int argc, char **argv)
{
  // single-threaded
  printf("Hi, just starting.\n");

  int sum = 0;
  int num_threads;

  // start running multiple threads
#pragma omp parallel reduction(+:sum)
  {
    int thread_id = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    printf("My number is %d\n", thread_id);
    sum = thread_id;
  }
  // back to single-threaded
  printf("The sum across all threads is %d, expected: %d\n",
         sum, num_threads * (num_threads - 1) / 2);

  return 0;
}
```

Instead of doing timing, where it's hard to predict what exactly the numbers
will be, I chose to do something deterministic: As we already know, each thread
has a unique id: 0, 1, 2, ... The variable `sum` is declared as reduction
variable, where the operation is '+'. This means in the parallel part of the
program, `sum` will be treated as a private variable: Every thread has its own
`sum` variable, which it can do with whatever it wants, in this case it gets
assigned the thread id. At the end of the parallel section, the actual reduction
comes into action: OpenMP will take all the individual `sum` variables, and add
their values to the pre-existing `sum` variable, that had been initialized to 0.
So after the end of the parallel section of the program, we expect that the
`sum` variable now contains the sum of all of the thread ids, and you get to try
whether it really works that way. A related question: What's up with the formula
I'm printing as the "expected" result?

#### Your turn

- Fix up the `test_integrate` program to be correct, that is, to always give the
  same, correct, result, using an OpenMP reduction. -->

### Homework

- Finish the in-class exercises ("Your turn") and write a report in the usual fashion
  (describe what you did, copy&paste the output you got, try to make sense of
  what happened).
<!-- - Investigate the performance of the different approaches to parallelize the integration code, and write about your results in your report. -->

## Further Reading

- [OpenMP tutorial resources](https://www.openmp.org/resources/tutorials-articles/)
- [OpenMP data sharing attributes](http://jakascorner.com/blog/2016/06/omp-data-sharing-attributes.html)
