# Algorithmic Scalability, Prepare for Parallel Programming

You can continue working in Tuesday's assignment repo.

## Algorithmic Scalability

### Big-O Notation

Looking at scalability means building a very simple model that predicts the
runtime and other related properties of a given code (or section of code) as
parameters change. For example, we've looked at the performance of
matrix-vector multiplication, where the matrix is an $N \times N$ matrix, and
the vectors have $N$ elements. In this case, $N$ is the obvious choice for the
parameter of interest, and the question is: how do things change as I change
$N$?

"Runtime" is actually not necessarily all that easy to predict, as modern
computers are quite complex beasts, and the runtime will depend on the number of
calculations, the number of memory accesses, caches, instruction scheduling,
interference of other things your computer is doing at the same time, etc. In
general, two quantities that are of fundamental interest are

- number of floating point operations ("flops")
- number of memory accesses

For the matrix-vector product $y = Ax$, let's remember that

$$y_i = \sum_{j=0}^{N-1} A_{ij} x_j$$

So with a bit of counting, we can figure out that calculating a single $y_i$
takes $N$ multiplications and $N-1$ additions. We have to do this for every $i$
from $0$ to $N-1$, so we end up with a total number of floating point
operations

$$
\text{flops} = N(N + N-1) = 2N^2 - N
$$

For large $N$, e.g., $N = 1000$, we get
$2(1000^2) - 1000 = 2{,}000{,}000 - 1000 \approx 2{,}000{,}000$. As I just did,
it's useful to neglect lower powers of $N$, since for large $N$, the answer is
dominated by the highest power. In fact, the constant factor is also usually
not important, so what one ends up saying is

$$
\text{flops} = \mathcal{O}(N^2)
$$

The "big-O" notation has a precise mathematical definition, see e.g.
<https://en.wikipedia.org/wiki/Big_O_notation>. But what's most important for us is
how it gives an estimate of how the flops change as $N$ changes. E.g., if I
double my problem size $N$ from $1000$ to $2000$, flops change from $1000^2$ to
$(2\cdot 1000)^2 = 4(1000^2)$, i.e., the number of flops quadrupled, and that's
consistent with our observation that the runtime also roughly quadrupled when we
doubled $N$.

We can, and should, do the same for memory accesses. In calculating the
matrix-vector product, we read every element $A_{ij}$, so that's $N^2$ memory
reads. In addition, we read every element of $x$, that's another $N$ reads, and
we write every element of $y$, that's $N$ writes. So the total number of memory
accesses is

$$
\text{mem} = N^2 + 2N = \mathcal{O}(N^2)
$$

That is, in the matrix-vector multiplication problem, the number of floating
point operations and the number of memory accesses both scale like $N^2$, so we
don't really have to think of them separately. It is worth noting that what one
usually cares most about is the number of flops, since the floating point
operations represent the actual useful calculations that eventually give you
the answer you're looking for, while the memory accesses are more of a
necessary evil that you (well, the computer) also have to perform in order to
do those calculations. But memory accesses are necessary, and they do (in
general) take a noticeable amount of time, so one needs to take them into
account when considering performance.


#### Your turn

- Repeat the scaling analysis above for matrix-matrix multiplication (as I did
  implicitly above, you can assume square matrices).

We haven't actually talked about how the quantities above affect runtime. We'll
get into this more later, but to first order, every floating point operation
takes a certain amount of time (say, 0.1 ns), and every memory access also
takes a certain amount of time (maybe 1 ns). That's a gross oversimplification
of real life, but it means that the time spent is roughly proportional to the
number of flops and memory accesses, so the scaling above translates directly
to runtime.

### Tracing

- I'll show a little bit of tracing using [perfetto](https://perfetto.dev/), a very powerful tool for tracing and profiling code. It can be used to see how much time is spent in different functions, how the CPU is being used, etc. It's unfortunately rather complex to use and has recently given me a lot of trouble, but it can give you a lot of insight into what's going on in your code and your computer in general.


### Profiling

There are many tools that can be used to profile code. For C / C++ / Fortran, a basic tool is `gprof`, though it doesn't actually seem to work on my Mac. There's also `oprofile` on Linux, `shark` on the Mac, the PGI compiler comes with its own `pgprof`, and so on. Some come with graphical user interface options, too.

If they work, those tools are nice -- you don't have to change your code, things will, supposedly, just work. However, there are issues with portability (only work with a certain compiler / a certain operating system), and they give you a lot of information, but it's not necessarily easy to digest.

This is how gprof can be used on Linux:

```sh
# the "-pg" tells the compiler to turn on profiling
[kai@lolcat build]$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-pg -DCMAKE_EXE_LINKER_FLAGS=-pg -DCMAKE_SHARED_LINKER_FLAGS=-pg ..
[...]
# rebuild
[kai@lolcat build]$ make
# run the code
[kai@lolcat build]$ cxx/large_matrix_vector_mul_cxx
setup took 2.03656 sec
matrix_vector_mul() took 0.778702 sec
# that produced a gmon.out file, which gprof will read and analyze
[kai@lolcat linear_algebra]$ gprof cxx/large_matrix_vector_mul_cxx
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 82.45      2.49     2.49        1     2.49     2.49  xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag>::xtensor_container<xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > > >(xt::xexpression<xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > > > const&)
 17.55      3.02     0.53        1     0.53     0.53  dot(xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag> const&, xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 1ul, (xt::layout_type)1, xt::xtensor_expression_tag> const&)
  0.00      3.02     0.00        2     0.00     0.00  xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 1ul, (xt::layout_type)1, xt::xtensor_expression_tag>::~xtensor_container()
  0.00      3.02     0.00        1     0.00     0.00  xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > >::~xgenerator()
  0.00      3.02     0.00        1     0.00     0.00  xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag>::~xtensor_container()
  0.00      3.02     0.00        1     0.00     0.00  auto xt::eye<bool>(unsigned long, int)
```

While this looks pretty horrible, the important information is actually there, just hard to digest:

- 82.45% of the total time, that is, 2.49 seconds were spent in the `eye` function that initializes the matrix.
- 17.55% of the total time, that is, 0.53 seconds were spent in the `dot` function that we care about.

This is definitely one of those things that was much less complex with C:

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 72.50      2.04     2.04                             main
 27.86      2.82     0.78        1   782.80   782.80  matrix_vector_mul
  0.00      2.82     0.00        3     0.00     0.00  vector_create
  0.00      2.82     0.00        1     0.00     0.00  matrix_create
  0.00      2.82     0.00        1     0.00     0.00  vector_is_equal
[...]
```

This is an old profile, which is actually not trivial to interpret either, since at the time I had a separate `setup_test_matrix` function where the matrix setup happened, i.e., the slowest part of all, and it doesn't show up.

Basically all the remaining time other than `matrix_vector_mul()` itself appears to have been spent in `main()`, and no time in setting up the vectors and matrix, even though we know that the setup phase took 2.036 sec. So one lesson to be learned here is that you can't necessarily completely trust the profiling. That time is accounted for, but it's accounted for under `main()`, rather than under the individual `setup_test_vectors/matrix` functions.

The reason is that the compiler optimized the code to avoid actually calling those functions, instead moving the work directly into `main()`. This is called "inlining". If we turn off optimization, things match better:

```sh
[kai@lolcat linear_algebra]$ make clean
[...]
[kai@lolcat linear_algebra]$ make CFLAGS="-g -O0 -pg -Wall"
[...]
[kai@lolcat linear_algebra]$ ./large_matrix_vector_mul_cxx
setup took 3.80221 sec
matrix_vector_mul() took 4.39365 sec
checking result took 0.000118971 sec
[kai@lolcat linear_algebra]$ gprof large_matrix_vector_mul_cxx
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 53.98      4.40     4.40        1     4.40     4.40  matrix_vector_mul
 46.23      8.18     3.77        1     3.77     3.77  setup_test_matrix
  0.00      8.18     0.00        6     0.00     0.00  WTime
  0.00      8.18     0.00        3     0.00     0.00  vector_create
  0.00      8.18     0.00        1     0.00     0.00  matrix_create
  0.00      8.18     0.00        1     0.00     0.00  setup_test_vectors
  0.00      8.18     0.00        1     0.00     0.00  vector_is_equal
[...]
```

So now everything's actually represented correctly -- but then again, with optimization turned off, we're not really profiling what we want to be profiling.

## Prepare for Parallel Programming

It's finally time to get ready for parallel programming. To prepare, you get to
write sample calculation codes that we'll then parallelize later.

Do the following work in today's assignment repository. For a change, the
assignment repo is essentially empty, so you get to practice setting everything
up yourself from scratch, including cmake for building, and don't forget about
`.gitignore`.

My recommendation is to use C++ and xtensor for your work, though I leave the
choice of language up to you (out of C, C++, Fortran).

### An example: integration

As the first example, we will calculate the area under a curve, that is, the
integral from $0$ to $1$ of $f(x)$, where $f(x) = \sqrt{1 - x^2}$.

In order to do this, we will use the
[trapezoidal rule](http://en.wikipedia.org/wiki/Trapezoidal_rule), dividing the
area under the curve into little trapezoids and summing them up.

This means we will be calculating

$$
  \sum_{i=0}^{N-1} \frac{1}{2} (x_{i+1} - x_{i}) (f(x_{i}) + f(x_{i+1}))
$$

where $x_i = i/N$ and $N$ is the number of intervals we divided the range from 0
to 1 into.

#### Your turn

- Write a function that calculates the approximate integral above, given the
  number of intervals `N`.

  Note that for this program you do not need (and should not use) arrays of any
  kind.

### A second example: averaging

As a second example, let's assume we have a function $f$ discretized on
"nodes", that is, on $i \Delta x$, where $i = 0, 1, \ldots, N$ and
$\Delta x = 2\pi / N$:

$$f_i = f(i\Delta x) \qquad i = 0, 1, \ldots, N$$

We now want to calculate the approximate values of that function on cell
centers, that is

$$
F_i = f_{i+1/2} = f((i+1/2) \Delta x) \qquad i = 0, 1, \ldots, N-1
$$

#### Your turn

- Write a function that takes the $N + 1$ values of $f_i$ (in an xtensor / array),
  and calculates the $N$ values of $F_i$ by averaging:

$$F_i = \frac{1}{2} (f_i + f_{i+1})$$

The output of this function should be written into another vector / array, not
directly written to the screen or to a file. You do, however, want to write the
data into files eventually, but use a separate function for that.

- Think about how you can make a plot of your data (and ideally, do it, too).
