# OpenMP II: Scalability, Data Sharing

There's no new assignment repo for today; we'll keep working in the one(s) from last class.

## Making sure that OpenMP works

In the "in-class" repo from Tuesday, I added another test: `test_openmp`. It should show up as a pull request that you can merge.

As some of you may have found out, while the CMake detection of OpenMP usually
works, "usually" doesn't mean "always". The test in `test_openmp.cxx` makes
sure that the program really runs multi-threaded via OpenMP.

### Your turn

Merge the pull request (or do the equivalent work in your main assignment repo). Unless you've already done so, you'll still have to enable OpenMP in your `CMakeLists.txt` by uncommenting the respective lines. Since this test uses the library part of OpenMP directly, rather than just relying on the compiler to recognize the `#pragma omp` and do the right thing, it will explicitly lead to a build error if OpenMP is not properly enabled (which is preferable to silently not parallelizing).
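For reference, enabling OpenMP in CMake usually boils down to the standard `find_package` pattern below. This is a sketch of that pattern, not the literal contents of the assignment's `CMakeLists.txt` (the target name `test_openmp` is my assumption):

```cmake
# Ask CMake to locate the OpenMP support of the active compiler.
# REQUIRED makes configuration fail loudly if OpenMP cannot be found.
find_package(OpenMP REQUIRED)

# Linking against the imported target adds both the compile flag
# (e.g., -fopenmp) and the OpenMP runtime library to the target.
target_link_libraries(test_openmp PRIVATE OpenMP::OpenMP_CXX)
```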

Anyway, once you have the test built, it should run and print out something like this:

```sh
vscode ➜ /workspaces/openmp/build (main) $ cmake --build . && ./test_openmp
ninja: no work to do.
Hi, just starting
Hi, I'm thread 0 of 12
Hi, I'm thread 9 of 12
Hi, I'm thread 5 of 12
Hi, I'm thread 10 of 12
Hi, I'm thread 1 of 12
Hi, I'm thread 2 of 12
Hi, I'm thread 11 of 12
Hi, I'm thread 3 of 12
Hi, I'm thread 7 of 12
Hi, I'm thread 8 of 12
Hi, I'm thread 4 of 12
Hi, I'm thread 6 of 12
I'm about to be done.
```
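For illustration, here's a minimal sketch in the spirit of such a test (my reconstruction, not the actual `test_openmp.cxx`): it calls functions from the OpenMP runtime library, so without OpenMP properly enabled it wouldn't even build. The `#ifdef _OPENMP` guard is only here so the sketch compiles standalone; the real test presumably omits it, so that a misconfigured build fails outright.

```cpp
#include <cstdio>

#ifdef _OPENMP
#include <omp.h>
#endif

// Prints a greeting from every thread and returns the number of threads
// that actually ran the parallel region (0 if OpenMP is disabled).
int hello_threads() {
#ifdef _OPENMP
    int nthreads = 0;
    std::printf("Hi, just starting\n");
    #pragma omp parallel
    {
        // These calls need the OpenMP runtime library, not just
        // compiler support for the #pragma.
        std::printf("Hi, I'm thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
        #pragma omp single
        nthreads = omp_get_num_threads();
    }
    std::printf("I'm about to be done.\n");
    return nthreads;
#else
    return 0;  // no OpenMP: nothing ran in parallel
#endif
}
```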

And, we can also specify the number of threads to use via the `OMP_NUM_THREADS` environment variable, e.g., to use 4 threads:

```sh
vscode ➜ /workspaces/openmp/build (main) $ cmake --build . && OMP_NUM_THREADS=4 ./test_openmp
ninja: no work to do.
Hi, just starting
Hi, I'm thread 0 of 4
Hi, I'm thread 3 of 4
Hi, I'm thread 1 of 4
Hi, I'm thread 2 of 4
I'm about to be done.
```

## Parallel Scalability

As you get timing information from your code, you'll hopefully be able to see
that running in parallel speeds things up (that's pretty much the whole point of
parallelizing code). It is, however, quite possible not to see a speed-up, in
which case one may either give up on parallelization, or change the algorithms
or the problem (e.g., its size) so that one does see gains.

In general, we care about run time versus the number of cores; that's what's
called parallel scalability: as you scale up the number of cores you're using,
does the code get correspondingly faster?

What we've been measuring is called wallclock time, which expresses that you
look at the clock on the wall (or the watch on your wrist) when the part of
interest starts, look again when it's done, and thereby measure how many
seconds of actual time have passed. Of course, you'd have to look rather
quickly, so we instead use `WTime()` in the code to do that for us. This
quantity is called wallclock time to distinguish it from CPU time, which
measures how much processor time the code consumes in total. CPU time takes
into account how many CPUs (CPU cores, more specifically) you use, e.g., if
you run your code on 4 cores for 2 hours, you actually use 4 × 2 = 8 CPU hours
(core hours, to be exact).
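In OpenMP, the wallclock timer is `omp_get_wtime()`; I assume the `WTime()` used in the code is a thin wrapper around it. Here's a minimal sketch of timing a parallel loop and deriving the corresponding CPU-time estimate (the fallback branch is only there so the sketch also builds without OpenMP):

```cpp
#include <cstdio>

#ifdef _OPENMP
#include <omp.h>
#else
// Fallbacks so the sketch also builds without OpenMP enabled.
#include <chrono>
static double omp_get_wtime() {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}
static int omp_get_max_threads() { return 1; }
#endif

// Times a simple parallel reduction and returns the elapsed wallclock
// seconds; also reports the CPU-time estimate (cores x wallclock).
double timed_average(long n) {
    const double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += 1.0 / (1.0 + i);  // stand-in for the real work
    const double wall = omp_get_wtime() - t0;
    std::printf("wallclock: %g s, approx CPU time: %g core-seconds (sum=%g)\n",
                wall, omp_get_max_threads() * wall, sum);
    return wall;
}
```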

There are a number of different ways to plot scaling information.

Here's the data I measured on Marvin, the Cray supercomputer at UNH, which has unfortunately suffered hardware failures recently and is nearing the end of its life.

```
# num_threads time
1 0.67500000004656613
2 0.35199999995529652
3 0.26000000000931323
4 0.19599999999627471
5 0.16400000010617077
6 0.15999999991618097
7 0.14800000004470348
8 0.12199999997392297
9 0.12400000006891787
10 0.11100000003352761
11 0.11599999992176890
12 0.11199999996460974
13 0.11100000003352761
14 0.11100000003352761
15 0.10800000000745058
16 0.10800000000745058
17 0.10100000002421439
18 0.10299999988637865
19 0.10999999986961484
20 0.10000000009313226
```

Here's a fancy shell one-liner for collecting that data, just for fun:

```sh
for i in `seq 1 20`; do echo -n "$i "; OMP_NUM_THREADS=$i ./test_average_fortran | ( read a b c; echo $b ) ; done
```

I call the data in the table (n, t), i.e., n: number of threads and t:
wallclock time for doing the averaging.

I just plot t vs n using gnuplot (you may use Matlab or whatever your preferred
plotting tool is):

```
gnuplot> plot "scale_avg_marvin.txt" w lp, .675/x
```

![scale_avg_0](scale_avg_0.png)

Here, I also plotted the ideal or perfect scalability result: as the code runs
in .675 seconds on one core, I'd expect it to run n times faster, i.e., run in
.675 sec/n on n cores.

Perfect scalability means that the points in the curve lie on a hyperbola;
however, without drawing one to compare with, it's difficult to see whether
they really do. So one good way to plot this info is a log-log plot:

```
gnuplot> set log
gnuplot> plot "scale_avg_marvin.txt" w lp, .675/x
```

![scale_avg_1](scale_avg_1.png)

In a log-log plot, the perfect scaling t ~ n<sup>-1</sup> should show as a
straight line with slope -1.

One could also plot CPU time, i.e., the number of cores used multiplied by the
wallclock time. This curve should ideally be perfectly flat: using, say, 8× the
cores while consuming the same total amount of CPU time means getting the
answer 8× faster (in terms of wallclock time) while using the same
computational resources.

```
gnuplot> unset log
gnuplot> set yra [0:]
gnuplot> plot "scale_avg_marvin.txt" u 1:($1*$2) w lp, .675
```

![scale_avg_2](scale_avg_2.png)
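Two derived quantities make such comparisons easy to quote: the speedup S(n) = t(1)/t(n) and the parallel efficiency E(n) = S(n)/n. For the Marvin data above, for example, S(8) = 0.675/0.122 ≈ 5.5, i.e., E(8) ≈ 69%, which matches the upward bend of the CPU-time curve. As a small sketch (the numbers below are the measured values from the table):

```cpp
// Speedup of the n-thread run relative to the single-thread run.
double speedup(double t1, double tn) { return t1 / tn; }

// Parallel efficiency: fraction of the ideal n-fold speedup achieved.
double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }
```

E.g., `speedup(0.675, 0.122)` ≈ 5.53 and `efficiency(0.675, 0.122, 8)` ≈ 0.69.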


## Homework

- As usual, finish the "Your turn" exercises (in the main assignment repo from Tuesday, not the in-class repo) and add a report to the feedback PR.

## off-topic(ish): AI coding

- https://www.nytimes.com/2026/03/12/magazine/ai-coding-programming-jobs-claude-chatgpt.html