# OpenMP I: Parallel Loops

## Today's assignment(s)

There are actually two GitHub Classroom assignments for today. The [main one](https://classroom.github.com/a/1zFy0cup) I really should have provided last week. Anyway, since I didn't, I've created it now to cover classes 14 and 15, and it'll only be due after spring break. It's the place to put the "Prepare for Parallel Programming" part from last class, as well as the work for today's class. If you have already done the work from last class, you can just push it to the new repo and then add the work for today's class on top of that.

However, to make sure that everyone is on the same page for today's class, I put up a [second assignment](https://classroom.github.com/a/0wcyx2_c), where I've done the "Prepare for Parallel Programming" work to make sure we're all ready for today (you'll find it in your repo once you sign up, or [here](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-15-in-class-only-class-15-inclass)). This repo is just for temporary use during class today, and you don't need to do any work in it outside of class. The homework for today's class should be done in the first repo, which may include redoing (or merging) things you do in class today.

## HPC / Parallel Programming

### Motivation for parallel programming

We'll do this in more detail after the break.

Parallelism is ubiquitous in High Performance Computing now (and actually in
computing in general -- even your cell phone has multiple processor cores). It
occurs at a number of different levels:

- Between nodes in a supercomputer, each node having one or more multi-core
  processors.

- Within a node, that is using the cores of one or more multi-core processors.

- SIMD (Single Instruction Multiple Data) instructions and registers allow to
  perform operations on small "vectors" of numbers simultaneously.

- Instruction-level parallelism / superscalar architectures: Modern processors
  can actually execute multiple instructions at the same time using different
  on-chip units.

In this class, we will focus on the first two kinds of parallelism, that is
between nodes and between cores on a chip, starting with the second.

## Parallelizing the "averaging" example

[If you have the code from last class ready, you can use it to do today's
in-class work. If not, just work in today's assignment repo.]

Our goal in class today is for you to parallelize your code (or the code I'm
providing in the class repo).

The class repo contains two versions: `test_average` and
`test_average_fortran`. `test_average` is based on xtensor, so that we can get
used to working with it -- quite frankly, basic C arrays would have worked just
fine here. I also made a Fortran90 version, so that I'll be able to talk about
how to do things in Fortran as well. However, if you don't use Fortran otherwise and anticipate you never will, you can just ignore the Fortran version and focus on the C++ version.

```{note}

The starter code includes a Fortran version of the example, and the build is set up accordingly. This can cause trouble if you don't have a working
Fortran compiler (like `gfortran`) installed. Unless you care about Fortran, a workaround is to remove `Fortran` from the `project` statement in `CMakeLists.txt` (i.e., `project(OpenMP LANGUAGES CXX)`), and then also to comment out the `add_executable(test_average_fortran ...)` statement.

```
The Fortran90 version is actually ahead of the other versions, i.e., it contains
part of the work that you should do for your version in class today. In
particular, it already contains simple timing code.

By the way, the code actually works, too:

![test_average.png](test_average.png)

You want to be able to check whether things work correctly as you go. I still like
[gnuplot](http://www.gnuplot.info/) for quick and simple plotting; in this
case you just need to do:

```sh
[kai@mbpro openmp]$ gnuplot
[...]
gnuplot> plot "f_nc.asc", "f_cc.asc"
```

Unfortunately, with WSL2, running gnuplot in Linux tends to not be as quick and
easy as it should be, since it doesn't know how to pop up a window to show the
plot. More generally, while gnuplot can be used to make publication-quality
figures, I pretty much only use it as a "quick 'n dirty" way to
throw up a plot, and I'd recommend using python/matplotlib or Matlab, or
whatever your favorite plotting tool is, in any case.

### Timing

Pretty much the whole reason for parallelizing code is to improve performance,
so we want to quantify that. As I mentioned, I already put in a timing function
in the Fortran code. For your code, you can use the `Wtime()` function, which I
put into this repo, too -- we've seen it before, it's now in `wtime.h`.

#### Your turn

- Add timing statements to measure how long the actual averaging takes. You will
  likely find that it's really fast, so you don't get good measurements.
  In order to get reproducible numbers, you probably want to increase your
  number of grid points, and if that's still not enough, just write a loop
  around the call, so that you call the averaging function many times instead of
  just once. The loop is generally a good idea anyway, since it gets you multiple
  timing numbers, so you can see how consistent they are and whether there are any
  special start-up costs.

## OpenMP

OpenMP (Open Multi-Processing) is a standard that extends C, C++, and Fortran to
support some easy-to-use shared-memory parallelism. We'll get into what exactly
that means later.

OpenMP doesn't change the programming language at all; it only provides
directives telling the compiler that certain parts of the code should be executed in
parallel. These directives come in the form of pragmas (C/C++) or special comments
(Fortran), which means that you can still compile the code with a regular
compiler, and it will still work, though of course it won't be parallel then.

### OpenMP: Parallelizing "for" loops

In our vector average example, almost all of the time we care about is spent in
the for loop that calculates the averages one by one. There's no reason why this
couldn't be done in parallel (other than that a regular compiler wouldn't know
how to), so all we need to do is tell an OpenMP compiler to parallelize that
loop. This is done by adding a single directive line, e.g.:

```diff
diff --git a/class16/test_average.cxx b/class16/test_average.cxx
index f7bd89b..fbe4bda 100644
--- a/class16/test_average.cxx
+++ b/class16/test_average.cxx
@@ -31,6 +31,7 @@ vector avg(const vector& f_nc)
 {
   auto f_cc = xt::empty<double>({N});

+#pragma omp parallel for
   for (int i = 0; i < N; i++) {
     f_cc(i) = .5 * (f_nc(i) + f_nc(i + 1));
   }

```

In Fortran:

```diff
diff --git a/test_average_fortran.f90 b/test_average_fortran.f90
index de09ff8..94b5d3d 100644
--- a/test_average_fortran.f90
+++ b/test_average_fortran.f90
@@ -10,6 +10,7 @@ subroutine vector_average(cc, nc, n)

   integer :: i

+!$omp parallel do
   do i = 0, n-1
      cc(i) = .5 * (nc(i) + nc(i+1))
   end do
+!$omp end parallel do
```

(As you can see, for Fortran I should say "do" loops rather than "for" loops, but
other than that everything's the same.)

In Fortran 77, the directive would be "`c$omp parallel do`". In both cases, the
`c` or `!` just starts a comment, and the `$omp` is a special keyword for the
compiler to actually consider this to be an OpenMP directive rather than a
regular comment.

#### Your turn

- Parallelize your code, or the sample code, by adding the requisite
  `#pragma omp ...`. Run it again. How does the timing change?

### Enabling OpenMP compiler support

In order to use OpenMP, you need a compiler that supports it. Fortunately,
our standard compilers (`gcc`, `gfortran`) on Linux already have built-in
support, but it needs to be enabled. I believe the compilers that come with modern
MacOS now also have built-in support for OpenMP, though for the longest time
they were lacking in that regard. If you are calling the compiler by hand,
enabling OpenMP simply requires adding the flag `-fopenmp` to the command line.
When using cmake, as we do, I've already put what's needed into
`CMakeLists.txt`, though I left it commented out, so really, all you need to do
is to activate those lines:

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index a2ac45a..c09d8a7 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -28,9 +28,14 @@ CPMAddPackage(
   VERSION 0.24.5
 )

+#find_package(OpenMP REQUIRED)
+
 add_executable(test_integration test_integration.cxx)
+#target_link_libraries(test_integration OpenMP::OpenMP_CXX)

 add_executable(test_average test_average.cxx)
 target_link_libraries(test_average xtensor)
+#target_link_libraries(test_average OpenMP::OpenMP_CXX)

 add_executable(test_average_fortran test_average_fortran.F90 vector_average_fortran.F90)
+#target_link_libraries(test_average_fortran OpenMP::OpenMP_Fortran)
```

Sometimes, you want to support OpenMP optionally, i.e., use it if it's available
but otherwise just build the code without parallelization. So here's an example of
how to do that with cmake:

```diff
diff --git a/class15/CMakeLists.txt b/class15/CMakeLists.txt
index dce5d85..f581ad2 100644
--- a/class15/CMakeLists.txt
+++ b/class15/CMakeLists.txt
@@ -5,6 +5,8 @@ project(LinearAlgebra)

 enable_testing()

+find_package(OpenMP)
+
 # If we knew GoogleTest is installed, we can just use the following single line
 find_package(GTest QUIET)

@@ -63,3 +65,17 @@ add_executable(test_integration test_integration.cxx)

 add_executable(test_average test_average.cxx)
 target_link_libraries(test_average PRIVATE xtensor)
+if (OpenMP_CXX_FOUND)
+  target_link_libraries(test_average PRIVATE OpenMP::OpenMP_CXX)
+endif()
+
[....]
```

The difference here is that there is no `REQUIRED`, as opposed to the
`find_package(OpenMP REQUIRED)` in the previous snippet. That way, cmake will
be fine with not being able to find OpenMP, but then we can't "link" to
OpenMP unconditionally; we can only do so if it's available.

In the assignment repo, I used `find_package(OpenMP REQUIRED)`, which will make
cmake abort with an error if OpenMP is not detected. I did this because in the
next couple of classes, we really do require OpenMP to work, and if your
compiler doesn't support it, you'll need to install one that does. This
shouldn't happen on Linux/WSL2, but on a Mac it might. If you (like me) use a
Mac and Macports, you can do `sudo port install gcc13` to install, e.g., GCC 13
(pretty much any version should do, and using clang instead should work as
well).

In general, if you're on a Mac I'd recommend using some kind of package manager (like Macports, Homebrew or Conda) to install compilers and other tools, since the ones that come with MacOS are often pretty old and/or missing features like OpenMP support. Using a package manager helps make sure that you get a consistent set of tools that work together, and also makes it easier to update things when needed.

If you want cmake to use a non-default compiler you can do so by saying, e.g.,

```sh
[kai@macbook build (master)]$ cmake -DCMAKE_CXX_COMPILER=/path/to/g++ ..
```

Though if you're using VS Code, it will offer you a choice of "kits" that it
detects automatically, which should include the compilers you have
available.

#### Your turn

- Now enable OpenMP parallelization in the `CMakeLists.txt`.

- Run the program again. How would you expect the timing to change? What do you
  observe?

You can tell OpenMP how many threads (typically one per processor core) you want to use by saying

```sh
[kai@macbook linear_algebra]$ OMP_NUM_THREADS=<n> ./test_average # or ./test_average_fortran
```

where you replace `<n>` by the number of cores you want to use (of course, you
can do this with your own program, too). Vary `<n>`, and plot the execution time
vs `<n>`. What do you observe, and why? Can you learn anything about the CPU on your laptop?

### Homework

- If you have worked in today's assignment repo with my starter code, start over
  in the main Class 15 repo, following the steps there.

- Finish the in-class exercises above, and as usual write a brief report
  (including copy&paste) about it.

- Parallelize your `test_integrate` example using OpenMP. Measure how the
  timing changes. Do you notice anything else?

### Notes on gnuplot

TL;DR: No matter how one does it, it's never as easy as it should be...

- First of all, I used gnuplot here because it's quick and easy (and free). It
  is fairly powerful by now, but I also think there are better tools out there
  (like matlab, python/matplotlib and others). However, a lot of the problems
  persist no matter what tool you want to use for plotting.

- Second, while it's not essential to the main goal of running in parallel, it'd
  be good to add a plot for the homework. It's a good opportunity to figure out
  how you want to do your plotting, and to get it set up. It's something you'll
  definitely need in the future, so it's worth the practice.

- One way to deal with visualization/plotting is to run the plotting tool where
  the data is. That's actually often not the best way, in particular when the
  data is at a remote computing center. Even when the data is on your laptop,
  but in WSL2 or some docker container, running a tool like gnuplot inside of
  WSL2/... is problematic because those are typically not designed to be able to
  open graphical windows on your desktop. For WSL2 in particular, though, back
  when we started this class I linked to a youtube video and Medium notes that
  describe how to set things up so that GUI apps work in WSL2. So that's
  definitely one option: set things up accordingly, then do
  `apt install gnuplot` in WSL2 and use it right there.

- Other ways typically involve copying data (or creating the plots as image
  files and copying those image files). That's always an option, but of course it
  can be a bit of a hassle, too. You can copy the data files out of WSL2 into
  Windows and then do the plotting there (there should be gnuplot for Windows,
  too, if you want to use that). In the case of WSL2, you can also take
  advantage of the fact that you can access the Linux filesystems from Windows
  and vice versa. That's actually a pretty good option.

- If you want to plot with gnuplot on WSL2 but graphical windows don't work, you
  can make, e.g., png files directly by saying

```
> set term png
> set output "plot.png"
> plot ...
```
