OpenMP IV: critical sections, reductions#
You can continue to work in the assignment repo from Tuesday.
Fixing the race condition in test_integration#
First attempt to avoid the race condition#
Now that we understand that updating the shared variable sum is the root of
the problem, we can update the code to calculate the partial sums in a private
variable, and then only calculate the whole sum in the end:
```c
double t1 = Wtime();
double sum = 0.;

#pragma omp parallel
{
  double local_sum = 0.;
  #pragma omp for
  for (int i = 0; i < N; i++) {
    double x0 = i * dx;
    double x1 = (i+1) * dx;
    local_sum += .5 * (f(x0) + f(x1)) * dx;
  }
  printf("Thread %d has local sum %f\n", omp_get_thread_num(), local_sum);
}
double t2 = Wtime();
```
The `#pragma omp parallel for` has been split into the two primitives it actually contains: `#pragma omp parallel` to start a parallel region executing with multiple threads, and then `#pragma omp for` to distribute the following for loop onto those threads.
Your turn#
Does this code work right? Can you fix it by adding the `local_sum`s to `sum`? Does it give the right answer then? Is it correct? How’s the performance?
Critical sections#
One way to deal with concurrent writes / updates to the same memory location is to simply avoid updating it concurrently. OpenMP provides a way to serialize a section within a parallel region, using `#pragma omp critical`. Going back to the original code, we can just put the update of `sum` into a critical section:
```c
double t1 = Wtime();
double sum = 0.;

#pragma omp parallel for
for (int i = 0; i < N; i++) {
  double x0 = i * dx;
  double x1 = (i+1) * dx;
  double val = .5 * (f(x0) + f(x1)) * dx;
  #pragma omp critical
  sum += val;
}
double t2 = Wtime();
```
The code following `#pragma omp critical` is executed in a special way. (In this case, the “special way” applies just to the one line following the pragma; to put multiple statements into the critical section, enclose them with `{ ... }`.) OpenMP guarantees that at most one thread executes the code in the critical section at a time. So if two threads want to execute `sum += val;`, one will win and run first; the other thread has to wait until the first thread is all the way done, and only then will it execute.
Your turn#
Does the code above work right? Is it correct? How’s the performance? If it requires more patience than you have, be creative.
Atomic operations#
Atomic operations are simple operations that can be executed as one complete unit, essentially having the same effect as performing that operation in a critical section. Processors provide certain operations as atomics, typically simple ones like adding two integers or performing logical and/or. `sum += val;` can be preceded by `#pragma omp atomic` instead of `#pragma omp critical`, with exactly the same effect (guaranteeing correctness of the code), but potentially better performance, if the processor hardware supports this operation as an atomic.
Your turn#
With (only!) what we’ve learned so far, create a version of the integration code that’s actually correct but still provides good (parallel) performance.
OpenMP reductions#
One of the recurring themes in parallel programming is that you want to reduce what’s happening across all of your threads (or processors) to a single number. For example, each processor may measure how long it takes to do its share of the work. In the end, you want to know the total time that all processors spent, so you have to reduce the many numbers to just one, the sum of them all. OpenMP supports some basic reductions, like, e.g., the sum.
Here is an example on how to use it:
```c
#include <stdio.h>
#include <omp.h>

int
main(int argc, char **argv)
{
  // single-threaded
  printf("Hi, just starting.\n");

  int sum = 0;
  int num_threads;

  // start running multiple threads
  #pragma omp parallel reduction(+:sum)
  {
    int thread_id = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    printf("My number is %d\n", thread_id);
    sum = thread_id;
  }
  // back to single-threaded

  printf("The sum across all threads is %d, expected: %d\n",
         sum, num_threads * (num_threads - 1) / 2);
  return 0;
}
```
Instead of doing timing, where it’s hard to predict what exactly the numbers will be, I chose to do something deterministic: As we already know, each thread has a unique id: 0, 1, 2, … The variable `sum` is declared as a reduction variable, where the operation is ‘+’. This means that in the parallel part of the program, `sum` is treated as a private variable: Every thread has its own `sum` variable, with which it can do whatever it wants; in this case it gets assigned the thread id. At the end of the parallel section, the actual reduction comes into action: OpenMP takes all the individual `sum` variables and adds their values to the pre-existing `sum` variable, which had been initialized to 0. So after the end of the parallel section of the program, we expect that the `sum` variable now contains the sum of all of the thread ids, and you get to try whether it really works that way. A related question: What’s up with the formula I’m printing as the “expected” result?
Your turn#
Fix up the `test_integration` program to be correct, that is, to always give the same, correct result, using an OpenMP reduction.
Work sharing#
As we’ve learned before, a simple `#pragma omp parallel` forks the execution into multiple threads, so that the code that follows will execute n times in parallel. If I just did the following
```c
#pragma omp parallel
{
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}
```
and I ran this with `OMP_NUM_THREADS=2`, I would have two threads both doing all 10 iterations of the loop from `i = 0` to `i = 9`. So I’d just be doing everything twice, which is probably not what I want. I’d rather have one thread do half of the work, and the other thread do the other half. That’s what `#pragma omp for` does: it uses the available threads and splits up the work between them:
```c
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 10; i++) {
    do_something(i);
  }
}
```
(and one can combine both into a single `#pragma omp parallel for` statement).
OpenMP parallel sections#
`omp for` is by far the most commonly used way to distribute work between the available threads. There is of course always the possibility to implement a custom way of doing one’s work sharing (i.e., inside a parallel region, one could get the current thread’s number using `omp_get_thread_num()`, and then write custom code for any given thread).
For completeness, OpenMP also provides the concept of “sections”. Here is an example:
```c
#pragma omp parallel sections
{
  #pragma omp section
  solve_ocean();
  #pragma omp section
  solve_air();
  #pragma omp section
  write_output();
}
```
This example assumes that the three functions (`solve_ocean()`, etc.) can all run concurrently and independently (which is probably not realistic, but let’s assume so). OpenMP will attempt to run each section concurrently in its own thread (if there are not enough available threads, it may not be able to do so).
Talking about completeness: in this class we have only encountered the basic features of OpenMP, which have been around for a long time. In recent years, OpenMP has evolved a lot to better support SIMD vectorization, GPUs, memory spaces, etc. We don’t have enough time in this class to get into this very much, but it’s worth being aware of.
Loop Scheduling#
Let’s get back to the `#pragma omp for` parallel loops:
```c
#pragma omp parallel for
for (int i = 0; i < 10; i++) {
  do_something(i);
}
```
Again, if we have two available threads, we’d expect half of the iterations to be executed by one thread, and the other half by the other.
What isn’t clear, though, is how the work is split up. Does thread #0 do
i = 0, 1, 2, 3, 4 and thread #1 do i = 5, 6, 7, 8, 9? Or maybe one does the
even i’s, and the other one the odd ones? What happens if the number of
iterations isn’t evenly divisible by the number of threads?
Of course one could go try googling these questions. But we might as well do some experiments ourselves.
Your turn#
Write a little test that runs a loop like the one above. You don’t have to do real work, but you do want to know which thread is doing what. So why not just print it? We’ve encountered `omp_get_thread_num()` before, so let’s put that to use. Try figuring out the answers to the questions I posed above.
Loop scheduling options#
Now that you’ve found out what OpenMP does by default, let’s change it up. It turns out that one can change the “loop scheduling” using
```c
#pragma omp parallel for schedule(kind [,chunk_size])
```
where `kind` is one of `static`, `dynamic`, `guided`, … The options are described in, e.g., http://jakascorner.com/blog/2016/06/omp-for-scheduling.html.
Do some experimenting with `static` and different chunk sizes to figure out what’s going on. You can try `dynamic` or `guided`, too, but it’s probably not too meaningful in your simple dummy loop, where no actual work happens.
Homework#
Finish the in-class exercises (“Your turn”) and write a report in the usual fashion (describe what you did, copy&paste the output you got, try to make sense of what happened).
Investigate the performance of the different approaches to parallelize the integration code, and write about your results in your report.