OpenMP IV: critical sections, reductions#

You can continue to work in the assignment repo from Tuesday.

Fixing the race condition in test_integration#

First attempt to avoid the race condition#

Your turn#

Now that we understand that updating the shared variable sum is the root of the problem, we can update the code to calculate the partial sums in a private variable, and then only calculate the whole sum in the end:

  double t1 = Wtime();
  double sum = 0.;
#pragma omp parallel
  {
    double local_sum = 0.;
#pragma omp for
    for (int i = 0; i < N; i++) {
      double x0 = i * dx;
      double x1 = (i+1) * dx;
      local_sum += .5 * (f(x0) + f(x1)) * dx;
    }
    printf("Thread %d has local sum %f\n", omp_get_thread_num(), local_sum);
  }
  double t2 = Wtime();

The #pragma omp parallel for has been split into the two primitives it actually contains, #pragma omp parallel to start a parallel region executing with multiple threads, and then #pragma omp for to distribute the following for loop onto those threads.

  • Does this code work right? Can you fix it by adding the local_sums to sum? Does it give the right answer then? Is it correct? How’s the performance?

Critical sections#

A way to deal with concurrent writes / updates to the same memory location is to actually avoid updating them concurrently. OpenMP provides a way to serialize a section within a parallel region, using #pragma omp critical. Going back to the original code, we can just put the update of sum into a critical section:

  double t1 = Wtime();
  double sum = 0.;
#pragma omp parallel for
  for (int i = 0; i < N; i++) {
    double x0 = i * dx;
    double x1 = (i+1) * dx;
    double val = .5 * (f(x0) + f(x1)) * dx;
#pragma omp critical
    sum += val;
  }
  double t2 = Wtime();

The code following #pragma omp critical is executed in a special way: OpenMP guarantees that at most one thread will execute the code in the critical section at a time. So if two threads want to execute sum += val;, one will win and run first. The other thread has to wait until the first thread is all the way done; only then will it execute. (In this case, the critical section is just the one line following the pragma. To have multiple instructions in the critical section, enclose them with { ... }.)

Your turn#

  • Does the code above work right? Is it correct? How’s the performance? If it requires more patience than you have, be creative.

Atomic operations#

Atomic operations are simple operations that can be executed as one complete unit, with essentially the same effect as performing that operation in a critical section. Processors provide certain operations as atomics, typically simple ones like adding two integers or doing logical and/or. So sum += val; can be preceded by #pragma omp atomic instead of #pragma omp critical, with exactly the same effect (guaranteeing the correctness of the code), but potentially better performance, if the processor hardware supports this operation as an atomic.

Your turn#

  • With (only!) what we’ve learned so far, create a version of the integration code that’s actually correct but still provides good (parallel) performance.

OpenMP reductions#

One of the recurring themes in parallel programming is that you want to reduce what’s happening across all of your threads (or processors) to a single number. For example, each processor may measure how long it takes to do its share of the work. In the end, you want to know the total time that all processors spent, so you have to reduce the many numbers to just one, the sum of them all. OpenMP supports some basic reductions, such as the sum.

Here is an example on how to use it:

#include <stdio.h>
#include <omp.h>

int
main(int argc, char **argv)
{
  // single-threaded
  printf("Hi, just starting.\n");

  int sum = 0;
  int num_threads;

  // start running multiple threads
#pragma omp parallel reduction(+:sum)
  {
    int thread_id = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    printf("My number is %d\n", thread_id);
    sum = thread_id;
  }
  // back to single-threaded
  printf("The sum across all threads is %d, expected: %d\n",
         sum, num_threads * (num_threads - 1) / 2);

  return 0;
}

Instead of doing timing, where it’s hard to predict what exactly the numbers will be, I chose to do something deterministic: as we already know, each thread has a unique id: 0, 1, 2, … The variable sum is declared as a reduction variable, where the operation is ‘+’. This means that in the parallel part of the program, sum will be treated as a private variable: every thread has its own sum variable, which it can do with whatever it wants; in this case, it gets assigned the thread id.

At the end of the parallel section, the actual reduction comes into action: OpenMP will take all the individual sum variables and add their values to the pre-existing sum variable, which had been initialized to 0. So after the end of the parallel section of the program, we expect that the sum variable now contains the sum of all of the thread ids, and you get to try whether it really works that way. A related question: what’s up with the formula I’m printing as the “expected” result?

Your turn#

  • Fix up the test_integration program to be correct, that is, to always give the same, correct, result, using an OpenMP reduction.

Work sharing#

As we’ve learned before, a simple #pragma omp parallel forks the execution into multiple threads, so that the code that follows will execute n times in parallel. If I just did the following

#pragma omp parallel
{
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}

and I ran this with OMP_NUM_THREADS=2, I would have two threads both doing all 10 iterations of the loop from i = 0 to i = 9. So I’d just be doing everything twice, which is probably not what I want. I’d rather have one thread do half of the work, and the other thread do the other half. That’s what #pragma omp for does, it uses the available threads and splits up the work between the threads:

#pragma omp parallel
{
#pragma omp for
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}

(and one can combine both into a single #pragma omp parallel for statement).

OpenMP parallel sections#

omp for is by far the most commonly used way to distribute work between the available threads. There is of course always the possibility of implementing one’s own custom work sharing (i.e., inside a parallel region, one could get the current thread’s number using omp_get_thread_num(), and then write custom code for any given thread).

For completeness, OpenMP also provides the concepts of “sections”. Here is an example:

#pragma omp parallel sections
{
#pragma omp section
  solve_ocean();
#pragma omp section
  solve_air();
#pragma omp section
  write_output();
}

This example assumes that the three functions (solve_ocean(), etc.) can all run concurrently and independently (which is probably not realistic, but let’s assume so). OpenMP will attempt to run each section concurrently in its own thread (if there are not enough available threads, it may not be able to do so).

Talking about completeness: in this class we have only encountered the basic features of OpenMP, which have been around for a long time. In recent years, OpenMP has evolved a lot to better support SIMD vectorization, GPUs, memory spaces, etc. We don’t have enough time in this class to get into this very much, but it’s worth being aware of.

Loop Scheduling#

Let’s get back to the #pragma omp for parallel loops:

#pragma omp parallel for
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }

Again, if we have two available threads, we’d expect half of the iterations to be executed by one thread, and the other half by the other.

What isn’t clear, though, is how the work is split up. Does thread #0 do i = 0, 1, 2, 3, 4 and thread #1 do i = 5, 6, 7, 8, 9? Or maybe one does the even i’s, and the other one the odd ones? What happens if the number of iterations isn’t evenly divisible by the number of threads?

Of course, one could go try googling these questions. But we might as well do some experiments ourselves.

Your turn#

  • Write a little test that runs a loop like the one above. You don’t have to do real work, but you do want to know which thread is doing what. So why not just print it? We’ve encountered omp_get_thread_num() before, so let’s put that to use.

  • Try figuring out the answers to the questions I posed above.

Loop scheduling options#

Now that you’ve found what OpenMP does by default, let’s change it up. It turns out that one can change the “loop scheduling” using

#pragma omp parallel for schedule(kind [,chunk size])

where kind is one of [static,dynamic,guided,…]. The options are described in, e.g., http://jakascorner.com/blog/2016/06/omp-for-scheduling.html.

  • Do some experimenting with static and different chunk sizes to figure out what’s going on. You can try dynamic or guided, too, but it’s probably not too meaningful in your simple dummy loop, where no actual work happens.

Homework#

  • Finish the in-class exercises (“Your turn”) and write a report in the usual fashion (describe what you did, copy&paste the output you got, try to make sense of what happened).

  • Investigate the performance of the different approaches to parallelizing the integration code, and write about your results in your report.