OpenMP I: Parallel Loops#

Today’s assignment(s)#

There are actually two GitHub Classroom assignments for today. The main one I really should have provided last week; since I didn’t, I’ve created it now to cover classes 14 and 15, and it’ll only be due after spring break. It’s the place to put the “Prepare for Parallel Programming” part from last class, as well as the work for today’s class. If you have already done the work from last class, you can just push it to the new repo and then add today’s work on top of that.

However, to make sure that everyone is on the same page for today’s class, I put up a second assignment, in which I’ve already done the “Prepare for Parallel Programming” work so that we’re all ready for today (you’ll find it in your repo once you sign up, or here). This repo is just for temporary use during class today, and you don’t need to do any work in it outside of class. The homework for today’s class should be done in the first repo, which may include redoing (or merging) things you do today in class.

HPC / Parallel Programming#

Motivation for parallel programming#

We’ll do this in more detail after the break.

Parallelism is ubiquitous in High Performance Computing now (and actually in computing in general – even your cell phone has multiple processor cores). It occurs at a number of different levels:

  • Between nodes in a supercomputer, each node having one or more multi-core processors.

  • Within a node, that is using the cores of one or more multi-core processors.

  • SIMD (Single Instruction Multiple Data) instructions and registers allow performing operations on small “vectors” of numbers simultaneously.

  • Instruction-level parallelism / superscalar architectures: Modern processors can actually execute multiple instructions at the same time using different on-chip units.

In this class, we will focus on the first two kinds of parallelism, that is between nodes and between cores on a chip, starting with the second.

Parallelizing the “averaging” example#

[If you have the code from last class ready, you can use it to do today’s in-class work. If not, just work in today’s assignment repo.]

Our goal in class today is for you to parallelize your code (or the code I’m providing in the class repo).

The class repo contains two versions: test_average, and test_average_fortran. test_average is based on xtensor, so that we can get used to working with it. Quite frankly, basic C arrays would have worked just fine here. I also made a Fortran90 version, so that I’ll be able to talk about how to do things in Fortran as well – however, if you don’t use Fortran otherwise and anticipate you never will, you can just ignore the Fortran version and focus on the C++ version.

Note

The starter code includes a Fortran version of the example, and the build is set up accordingly. This can cause trouble if you don’t have a working Fortran compiler (like gfortran) installed. Unless you care about Fortran, a workaround is to remove the “Fortran” from the project statement in CMakeLists.txt: project(OpenMP LANGUAGES CXX), and then also to comment out the add_executable(test_average_fortran ...) statement.

The Fortran90 version is actually ahead of the C++ version, i.e., it contains part of the work that you should do for your version today in class. In particular, it already contains simple timing code.

By the way, the code actually works, too:

[Figure: test_average.png]

You want to be able to check whether things work correctly as you go. I still like gnuplot for quick and simple plotting; in this case you just need to do:

[kai@mbpro openmp]$ gnuplot
[...]
gnuplot> plot "f_nc.asc", "f_cc.asc"

Unfortunately, with WSL2, running gnuplot in Linux tends not to be as quick and easy as it should be, since it doesn’t know how to pop up a window to show the plot. More generally, while gnuplot can be used to make publication-quality figures, I pretty much only use it as a “quick ‘n dirty” way to throw up a plot, and I’d recommend using python/matplotlib or Matlab, or whatever your favorite plotting tool is, in any case.

Timing#

Pretty much the whole reason for parallelizing code is to improve performance, so we want to quantify that. As I mentioned, I already put in a timing function in the Fortran code. For your code, you can use the Wtime() function, which I put into this repo, too – we’ve seen it before, it’s now in wtime.h.

Your turn#

  • Add timing statements to measure how long the actual averaging takes. You will likely find that it’s really fast and so you don’t get good measurements. In order to get reproducible numbers, you probably want to increase your number of grid points, and if that’s still not enough, just write a loop around the call, so that you call the averaging function many times instead of just once. The loop is generally a good idea in any case, so that you can get multiple timing numbers and see how consistent they are, and whether there are any special start-up costs.

OpenMP#

OpenMP (Open Multi-Processing) is a standard that extends C, C++, and Fortran to support some easy-to-use shared-memory parallelism. We’ll get into what exactly that means later.

OpenMP doesn’t change the programming language itself; it only provides directives telling the compiler that certain parts of the code should be executed in parallel. These directives come in the form of pragmas (C/C++) or special comments (Fortran), which means that you can still compile the code with a regular compiler and it will still work, though of course it won’t be parallel then.

OpenMP: Parallelizing “for” loops#

In our vector average example, almost all of the time we care about is spent in the for loop that calculates the averages one by one. There’s no reason why this couldn’t be done in parallel (other than that a regular compiler wouldn’t know how to), so all we need to do is tell the OpenMP compiler to parallelize that loop. This is done by adding a single directive line, e.g.:

diff --git a/class16/test_average.cxx b/class16/test_average.cxx
index f7bd89b..fbe4bda 100644
--- a/class16/test_average.cxx
+++ b/class16/test_average.cxx
@@ -31,6 +31,7 @@ vector avg(const vector& f_nc)
 {
   auto f_cc = xt::empty<double>({N});

+#pragma omp parallel for
   for (int i = 0; i < N; i++) {
     f_cc(i) = .5 * (f_nc(i) + f_nc(i + 1));
   }

In Fortran:

diff --git a/test_average_fortran.f90 b/test_average_fortran.f90
index de09ff8..94b5d3d 100644
--- a/test_average_fortran.f90
+++ b/test_average_fortran.f90
@@ -10,6 +10,8 @@ subroutine vector_average(cc, nc, n)

   integer :: i

+!$omp parallel do
   do i = 0, n-1
      cc(i) = .5 * (nc(i) + nc(i+1))
   end do
+!$omp end parallel do

(For Fortran, of course, I shouldn’t say “for” loops but “do” loops, but other than that everything’s the same.)

In Fortran 77, the directive would be “c$omp parallel do”. In both cases, the c or ! just starts a comment, and the $omp is a special keyword for the compiler to actually consider this to be an OpenMP directive rather than a regular comment.

Your turn#

  • Parallelize your code, or the sample code, by adding the requisite #pragma omp .... Run it again. How does the timing change?

Enabling OpenMP compiler support#

In order to use OpenMP, you need a compiler that supports it. Fortunately, our standard compilers (gcc, gfortran) on Linux already have built-in support, but it needs to be enabled. I believe the compilers that come with modern MacOS now also have built-in support for OpenMP, though for the longest time they were lacking in that regard. If one is calling the compiler by hand, enabling OpenMP simply requires adding the flag -fopenmp to the command line. When using cmake, as we do, I’ve already put what’s needed into CMakeLists.txt, though I left it commented out, so really all you need to do is activate those lines:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a2ac45a..c09d8a7 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -28,9 +28,14 @@ CPMAddPackage(
   VERSION 0.24.5
 )

+#find_package(OpenMP REQUIRED)
+
 add_executable(test_integration test_integration.cxx)
+#target_link_libraries(test_integration OpenMP::OpenMP_CXX)

 add_executable(test_average test_average.cxx)
 target_link_libraries(test_average xtensor)
+#target_link_libraries(test_average OpenMP::OpenMP_CXX)

 add_executable(test_average_fortran test_average_fortran.F90 vector_average_fortran.F90)
+#target_link_libraries(test_average_fortran OpenMP::OpenMP_Fortran)

Sometimes, you want to support OpenMP optionally, i.e., use it if it’s available but otherwise just build the code without parallelization. Here’s an example of how to do that with cmake:

diff --git a/class16/CMakeLists.txt b/class16/CMakeLists.txt
index dce5d85..f581ad2 100644
--- a/class16/CMakeLists.txt
+++ b/class16/CMakeLists.txt
@@ -5,6 +5,8 @@ project(LinearAlgebra)

 enable_testing()

+find_package(OpenMP)
+
 # If we knew GoogleTest is installed, we can just use the following single line
 find_package(GTest QUIET)

@@ -63,3 +65,17 @@ add_executable(test_integration test_integration.cxx)

 add_executable(test_average test_average.cxx)
 target_link_libraries(test_average PRIVATE xtensor)
+if (OpenMP_CXX_FOUND)
+  target_link_libraries(test_average PRIVATE OpenMP::OpenMP_CXX)
+endif()
+
[....]

The difference here is that there is no REQUIRED, as opposed to the find_package(OpenMP REQUIRED) in the previous snippet. That way cmake will be fine with not being able to find OpenMP, but in that case we can’t “link” to OpenMP unconditionally; we can only do so if it’s available.

In the assignment repo, I used find_package(OpenMP REQUIRED), which will make cmake abort with an error if OpenMP is not detected. I did this because in the next couple of classes, we really do require OpenMP to work, and if your compiler doesn’t support it, you’ll need to install a compiler that does support OpenMP. This shouldn’t happen on Linux/WSL2, but on a Mac, it might happen. If you (like me) use a Mac and Macports, you can do sudo port install gcc13 to install, e.g., GCC 13 (pretty much any version should do, and using clang instead should work as well).

In general, if you’re on a Mac I’d recommend using some kind of package manager (like Macports, Homebrew or Conda) to install compilers and other tools, since the ones that come with MacOS are often pretty old and/or missing features like OpenMP support. Using a package manager helps make sure that you get a consistent set of tools that work together, and also makes it easier to update things when needed.

If you want cmake to use a non-default compiler you can do so by saying, e.g.,

```sh
[kai@macbook build (master)]$ cmake -DCMAKE_CXX_COMPILER=/path/to/g++ ..
```

Though if you’re using VS Code, it will automatically detect available “kits”, which should give you a choice of the compilers you have installed.

Your turn#

  • Now enable OpenMP parallelization in the CMakeLists.txt.

  • Run the program again. How would you expect the timing to change? What do you observe?

You can tell OpenMP how many processor cores you want to use by saying

[kai@macbook linear_algebra]$ OMP_NUM_THREADS=<n> ./test_average # or ./test_average_fortran

where you replace <n> by the number of cores you want to use (of course, you can do this with your own program, too). Vary <n>, and plot the execution time vs <n>. What do you observe, and why? Can you learn anything about the CPU on your laptop?

Homework#

  • If you have worked in today’s assignment repo with my starter code, start over in the main Class 15 repo, following the steps there.

  • Finish the in-class exercises above, and as usual write a brief report (including copy&paste) about it.

  • Parallelize your test_integrate example using OpenMP. Measure how the timing changes. Do you notice anything else?

Notes on gnuplot#

TL;DR: No matter how one does it, it’s never as easy as it should be…

  • First of all, I used gnuplot here because it’s quick and easy (and free). It is fairly powerful by now, but I also think there are better tools out there (like Matlab, python/matplotlib, and others). However, a lot of the problems persist no matter what tool you use for plotting.

  • Second, for the homework: while it’s not essential to the main goal of running in parallel, it’d be good to add a plot. It’s a good opportunity to figure out how you want to do your plotting and to get it set up. It’s something you’ll definitely need in the future, so it’s worth the practice.

  • One way to deal with visualization/plotting is to run the plotting tool where the data is. That’s actually often not the best way, in particular when the data is at a remote computing center. Even when the data is on your laptop but inside WSL2 or a docker container, running a tool like gnuplot there is problematic, because those environments are typically not designed to open graphical windows on your desktop. For WSL2 in particular, though, back when we started this class I linked to a YouTube video and Medium notes that describe how to set things up so that GUI apps work in WSL2. So that’s definitely one option: set things up accordingly, then do apt install gnuplot in WSL2 and use it right there.

  • Other ways typically involve copying the data (or creating the plots as image files and copying those image files). That’s always an option, but of course it can be a bit of a hassle, too. You can copy the data files out of WSL2 into Windows and then do the plotting there (there should be gnuplot for Windows, too, if you want to use that). In the case of WSL2, you can also take advantage of the fact that you can access the Linux filesystems from Windows and vice versa. That’s actually a pretty good option.

  • If you want to plot with gnuplot on WSL2 but graphical windows don’t work, you can make, e.g., png files directly by saying

> set term png
> set output "plot.png"
> plot ...