
# OpenMP IV: critical sections, reductions

You can continue to work in the assignment repo from Tuesday.

## Fixing the race condition in `test_integration`

### First attempt to avoid the race condition

#### Your turn

Now that we understand that updating the shared variable `sum` is the root of
the problem, we can update the code to calculate the partial sums in a private
variable, and then only calculate the whole sum in the end:

```c
  double t1 = Wtime();
  double sum = 0.;
#pragma omp parallel
  {
    double local_sum = 0.;
#pragma omp for
    for (int i = 0; i < N; i++) {
      double x0 = i * dx;
      double x1 = (i+1) * dx;
      local_sum += .5 * (f(x0) + f(x1)) * dx;
    }
    printf("Thread %d has local sum %f\n", omp_get_thread_num(), local_sum);
  }
  double t2 = Wtime();
```

The `#pragma omp parallel for` has been split into the two primitives it
actually contains, `#pragma omp parallel` to start a parallel region executing
with multiple threads, and then `#pragma omp for` to distribute the following
for loop onto those threads.

- Does this code work right? Can you fix it by adding the `local_sum`s to `sum`?
  Does it give the right answer then? Is it correct? How's the performance?

### Critical sections

A way to deal with concurrent writes / updates to the same memory location is to
actually avoid updating it concurrently. OpenMP provides a way to serialize a
section within a parallel region, using `#pragma omp critical`. Going back to
the original code, we can just put the update of `sum` into a critical section:

```c
  double t1 = Wtime();
  double sum = 0.;
#pragma omp parallel for
  for (int i = 0; i < N; i++) {
    double x0 = i * dx;
    double x1 = (i+1) * dx;
    double val = .5 * (f(x0) + f(x1)) * dx;
#pragma omp critical
    sum += val;
  }
  double t2 = Wtime();
```

The code following `#pragma omp critical` is executed in a special way: OpenMP
guarantees that at most one thread will execute the code in the critical section
at a time. (In this case, the critical section is just the one line following
the pragma. To have multiple statements in the critical section, enclose them
with `{ ... }`.) So if two threads want to execute `sum += val;`, one will win
and run first. The other thread has to wait until the first thread is all the
way done; only then will it execute.

#### Your turn

- Does the code above work right? Is it correct? How's the performance? If it
  requires more patience than you have, be creative.

### Atomic operations

Atomic operations are simple operations that can be executed as one complete
unit, essentially having the same effect as doing that operation in a critical
section. Processors provide certain operations as atomics, typically simple
operations like adding two integers or doing logical and/or. `sum += val;` can
be preceded by `#pragma omp atomic` instead of `#pragma omp critical`, with
exactly the same effect (guaranteeing correctness of the code), but potentially
better performance, if the processor hardware supports this operation as an
atomic.

#### Your turn

- With (**only!**) what we've learned so far, create a version of the
  integration code that's actually correct but still provides good (parallel)
  performance.

### OpenMP reductions

One of the recurring themes in parallel programming is that you want to reduce
what's happening across all of your threads (or processors) to a single number.
For example, each processor may measure how long it takes to do its share of the
work. In the end, you want to know the total time that all processors spent, so
you have to reduce the many numbers to just one, the sum of them all. OpenMP
supports some basic reductions, like, e.g., the sum.

Here is an example on how to use it:

```c
#include <stdio.h>
#include <omp.h>

int
main(int argc, char **argv)
{
  // single-threaded
  printf("Hi, just starting.\n");

  int sum = 0;
  int num_threads;

  // start running multiple threads
#pragma omp parallel reduction(+:sum)
  {
    int thread_id = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    printf("My number is %d\n", thread_id);
    sum = thread_id;
  }
  // back to single-threaded
  printf("The sum across all threads is %d, expected: %d\n",
         sum, num_threads * (num_threads - 1) / 2);

  return 0;
}
```

Instead of doing timing, where it's hard to predict what exactly the numbers
will be, I chose to do something deterministic: As we already know, each thread
has a unique id: 0, 1, 2, ... The variable `sum` is declared as a reduction
variable, where the operation is '+'. This means that in the parallel part of
the program, `sum` is treated as a private variable: Every thread has its own
`sum` variable, which it can do whatever it wants with; in this case, it gets
assigned the thread id. At the end of the parallel section, the actual reduction
comes into play: OpenMP takes all the individual `sum` variables and adds
their values to the pre-existing `sum` variable, which had been initialized to 0.
So after the end of the parallel section of the program, we expect that the
`sum` variable now contains the sum of all of the thread ids, and you get to try
whether it really works that way. A related question: What's up with the formula
I'm printing as the "expected" result?

#### Your turn

- Fix up the `test_integration` program to be correct, that is, to always give the
  same, correct, result, using an OpenMP reduction.

## Work sharing

As we've learned before, a simple `#pragma omp parallel` forks the execution
into multiple threads, so that the code that follows will execute n times in
parallel. If I just did the following

```cpp
#pragma omp parallel
{
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}

```

and I ran this with `OMP_NUM_THREADS=2`, I would have two threads both doing all
10 iterations of the loop from `i = 0` to `i = 9`. So I'd just be doing
everything twice, which is probably not what I want. I'd rather have one thread
do half of the work, and the other thread do the other half. That's what
`#pragma omp for` does, it uses the available threads and splits up the work
between the threads:

```cpp
#pragma omp parallel
{
#pragma omp for
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}

```

(and one can combine both into a single `#pragma omp parallel for` statement).

### OpenMP parallel sections

`omp for` is by far the most commonly used way to distribute work between the
available threads. There is of course always the possibility to implement a
custom way of doing one's work sharing (i.e., inside a parallel region, one
could get the current thread's number using `omp_get_thread_num()`, and then
write custom code for any given thread).

For completeness, OpenMP also provides the concepts of "sections". Here is an
example:

```cpp
#pragma omp parallel sections
{
#pragma omp section
  solve_ocean();
#pragma omp section
  solve_air();
#pragma omp section
  write_output();
}
```

This example assumes that the three functions (`solve_ocean()`, etc.) can all
run concurrently and independently (which is probably not realistic, but let's
assume so). OpenMP will attempt to run each section concurrently in its own
thread (if there are not enough available threads, it may not be able to do so).

Talking about completeness: in this class we have only encountered the basic
features of OpenMP, which have been around for a long time. In recent years,
OpenMP has evolved a lot to better support SIMD vectorization, GPUs, memory
spaces, etc. We don't have enough time in this class to get into this very much,
but it's worth being aware of.

## Loop Scheduling

Let's get back to the `#pragma omp for` parallel loops:

```cpp
#pragma omp parallel for
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
```

Again, if we have two available threads, we'd expect half of the iterations to
be executed by one thread, and the other half by the other.

What isn't clear, though, is how the work is split up. Does thread #0 do
`i = 0, 1, 2, 3, 4` and thread #1 do `i = 5, 6, 7, 8, 9`? Or maybe one does the
even `i`'s, and the other one the odd ones? What happens if the number of
iterations isn't evenly divisible by the number of threads?

Of course, one could go try googling these questions. But we might as well do
some experiments ourselves.

#### Your turn

- Write a little test that runs a loop like the one above. You don't have to do
  real work, but you do want to know which thread is doing what. So why not just
  print it? We've encountered `omp_get_thread_num()` before, so let's put that
  to use.

- Try figuring out the answers to the questions I posed above.

### Loop scheduling options

Now that you've found out what OpenMP does by default, let's change it up. It
turns out that one can change the "loop scheduling" using

```
#pragma omp parallel for schedule(kind [,chunk size])
```

where `kind` is one of `static`, `dynamic`, `guided`, ... The options are
described in, e.g.,
<http://jakascorner.com/blog/2016/06/omp-for-scheduling.html>.

- Do some experimenting with `static` and different chunk sizes to figure out
  what's going on. You can try `dynamic` or `guided`, too, but it's probably not
  too meaningful in your simple dummy loop, where no actual work happens.

## Homework

- Finish the in-class exercises ("Your turn") and write a report in the usual fashion
  (describe what you did, copy&paste the output you got, try to make sense of
  what happened).

- Investigate the performance of the different approaches to parallelizing the
  integration code, and write about your results in your report.


