# OpenMP V: Work Sharing, Performance Example


Today's assignment: <https://classroom.github.com/a/aAuBgeSH>

## Work sharing

As we've seen before, a simple `#pragma omp parallel` forks execution into multiple threads, so that the code that follows executes n times in parallel. Now, in order to actually get a benefit out of having multiple threads running, we need to distribute the work between the threads. We've seen two ways of doing that before: `omp for` takes the iterations of a for loop and splits them up between the available threads, while `omp sections` allows you to split up different sections of code between the threads.

### Parallelizing nested loops

By far the most commonly used way of sharing work is `omp for`. It is worth noting that numerical codes often don't just have single loops; they frequently have nested loops, like this one:

```cpp
for (int iy = 0; iy < MY; iy++) {
  for (int ix = 0; ix < MX; ix++) {
    // ...
  }
}
```

In that case, one can still use `omp for` to parallelize just one of the loops, but it is also possible to parallelize both loops at the same time with `omp for collapse`.

### Your turn

* Try out the different ways of parallelizing one vs the other loop, and both loops at the same time. Similar to the previous assignment, you can add some print statements to see which threads are doing which iterations.

* What do you think is the best way to parallelize nested loops? Does it depend on the situation?

### Other work sharing constructs

It is worth noting that one can always implement one's own way of sharing work between threads by using `omp_get_thread_num()` to get the current thread's number and then writing custom code for each thread. This is more flexible than the built-in constructs, but it also means more work, and it's easier to make mistakes. Also, if one uses OpenMP only via pragmas, one gets the advantage that the code runs in parallel when compiled with OpenMP, but still compiles and runs (albeit in serial) when OpenMP is not available. If one writes custom code for each thread, one has to take extra care that the code also compiles and runs without OpenMP, which can be a bit more work.

Finally, "modern" Fortran (i.e., Fortran 90 and later) supports various constructs that, e.g., operate on a whole array at a time. So instead of writing a loop to do a vector addition like this:

```fortran
do i = 1, N
  c(i) = a(i) + b(i)
end do
```

one can just write

```fortran
c = a + b
```

The former could be parallelized with `!$omp parallel do`, but for the latter there is a special OpenMP construct called `!$omp workshare`, which will automatically parallelize the array operation; `do concurrent` loops and other similar constructs can be handled, too. There's really no equivalent to this in plain C/C++, but for some students who might encounter Fortran codes, this might be good to know.

For C++, we've been using xtensor, which allows us to write code that looks a lot like the Fortran array operations, e.g., we can write `c = a + b` for vector addition, and xtensor will take care of doing the right thing. xtensor actually does support sharing the work in such operations using OpenMP as well -- it's enabled by defining `XTENSOR_USE_OPENMP`.
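As a rough sketch of how one might enable this in a CMake build (the target name `myprog` is just a placeholder; check the xtensor documentation for your version's exact requirements):

```cmake
# find OpenMP and enable xtensor's parallelized assignment loops
find_package(OpenMP REQUIRED)
target_compile_definitions(myprog PRIVATE XTENSOR_USE_OPENMP)
target_link_libraries(myprog PRIVATE OpenMP::OpenMP_CXX)
```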

## Performance / Scalability Example: Mandelbrot set

In today's assignment, you'll find a program `mandelbrot.cxx`, which computes
the "Mandelbrot set". The goal is to parallelize this code with OpenMP and look
at how well that works. The version I provided is serial, and it's always a good
idea to start by checking that it works correctly. If you build it (the usual cmake
deal) and run it, it should produce a file `mandelbrot.asc`, which you can plot
with gnuplot:

```sh
$ gnuplot
gnuplot> set pm3d map
gnuplot> set palette defined (0 0 0 .5, 1 1 1 1, 2 1 1 0, 3 0 0 0, 4 0 0 0)
gnuplot> splot "mandelbrot.asc"
```

Actually, I provided a script to make plotting even easier:

```sh
gnuplot ../mandelbrot.gpl
```

will generate the image `mandelbrot.png`. (Or you can use your own favorite
plotting tool -- the data is in ASCII format.)

![Mandelbrot](mandelbrot.png)

We don't really care about the mathematics of the Mandelbrot set here, but it
certainly makes for pretty pictures; see, as usual,
[wikipedia](http://en.wikipedia.org/wiki/Mandelbrot_set). Nevertheless, some
basic understanding of how the algorithm works will be helpful. Since the
algorithm is in the code, I'll let you read the code to figure out how the
color at each pixel is determined.

### Your turn

- Add timing code, so that you can measure how long the actual calculation of
  the Mandelbrot set takes (as opposed to writing the calculated data to the
  file). If you don't get repeatable numbers, you may have to choose a larger
  resolution (change `MX` and `MY`).

- If you double the resolution in both x and y direction, how much slower do you
  expect the calculation to run? Try it out, present your results, and discuss.

- Parallelize the code using OpenMP. Which loops are good / not so good
  candidates for parallelization?
  Make sure that the code still works correctly after parallelization.

- Perform a scaling study of your parallelized code. You should go from 1 up to
  at least 8 threads, and in order to get meaningful results, ideally you'd use a
  computer that has at least 8 cores. If you don't have at least 4 cores, work
  together with someone on your team who does, or ask me for help.

  Create a plot with the scaling results, and compare to what you'd expect for
  perfect scalability. Discuss -- in particular, can you explain the reason why
  your results deviate from perfect scalability?

- This will probably depend on how far you got with explaining the scalability
  you observed above: try to improve the scalability of this code.

- This assignment requires some amount of coding, but also making plots and
  explaining results. The coding part obviously should be checked into git. For
  the rest, you can include the written parts and plots in the repo, but as
  usual I prefer that you do the discussion in your pull request on github,
  which will allow you to include images as well (just drag&drop).

