Algorithmic Scalability, Prepare for Parallel Programming#
You can continue working in Tuesday’s assignment repo.
Algorithmic Scalability#
Big-O Notation#
Looking at scalability is essentially about building a simple model that predicts the runtime (and other related properties) of a given code, or section of code, as parameters change. For example, we’ve looked at the performance of matrix-vector multiplication, where the matrix is an \(N \times N\) matrix and the vectors have \(N\) elements. In this case, \(N\) is an obvious choice for the parameter of interest, with the question being: how do things change as I change \(N\)?
“Runtime” is actually not necessarily all that easy to predict, as modern computers are quite complex beasts, and the runtime will depend on the number of calculations, the number of memory accesses, caches, instruction scheduling, interference of other things your computer is doing at the same time, etc. In general, two quantities that are of fundamental interest are
number of floating point operations (“flops”)
number of memory accesses
For the matrix-vector product \(y = Ax\), let’s remember that

\[
y_i = \sum_{j=0}^{N-1} A_{ij} x_j.
\]
So with a bit of counting, we can figure out that calculating a single \(y_i\) takes \(N\) multiplications and \(N-1\) additions. We have to do this for every \(i\) from \(0\) to \(N-1\), so we end up with a total number of floating point operations

\[
\text{flops} = N(N + N - 1) = 2N^2 - N.
\]
For large \(N\), e.g., \(N = 1000\), we get \(2(1000^2) - 1000 = 2,000,000 - 1000 \approx 2,000,000\). As I just did, it’s useful to neglect lower powers of \(N\), since for large \(N\) the answer is dominated by the highest power. In fact, the constant factor is also usually not really important, so what one ends up saying is

\[
\text{flops} = O(N^2).
\]
The “big-O” notation has a precise mathematical definition, see e.g. https://en.wikipedia.org/wiki/Big_O_notation. But what’s most important for us is how it gives an estimate of how the flops change as \(N\) changes. E.g., if I double my problem size \(N\) from \(1000\) to \(2000\), the flops change from \(1000^2\) to \((2\cdot 1000)^2 = 4(1000^2)\), i.e., the number of flops quadrupled, and that’s consistent with our observation that the runtime also roughly quadrupled when we doubled \(N\).
We can, and should, do the same for memory accesses. In calculating the matrix-vector product, we read every element \(A_{ij}\), so that’s \(N^2\) memory reads. In addition, we read every element of \(x\), that’s another \(N\) reads, and we write every element of \(y\), that’s \(N\) writes. So the total number of memory accesses is

\[
\text{memory accesses} = N^2 + 2N = O(N^2).
\]
That is, in the matrix-vector multiplication problem, the number of floating point operations and the number of memory accesses both scale like \(N^2\), so we don’t really have to think of them separately. It is worth noting that what one usually cares most about is the number of flops, since the floating point operations represent the actual useful calculations that eventually give you the answer you’re looking for, while the memory accesses are more of a necessary evil that you (well, the computer) also have to perform in order to do those calculations. But memory accesses are necessary, and they do (in general) take a noticeable amount of time, so one needs to take them into account when considering performance.
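To make the counting above concrete, here is a small sketch that performs the matrix-vector product while tallying flops as it goes. It uses plain `std::vector` rather than the xtensor code from class, and the `FlopCount` struct and `matrix_vector_mul` signature are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for counting floating point operations.
struct FlopCount {
    std::size_t flops = 0;
};

// y = A x for a row-major N x N matrix A, counting every flop along the way.
std::vector<double> matrix_vector_mul(const std::vector<double>& A,
                                      const std::vector<double>& x,
                                      FlopCount& count)
{
    std::size_t N = x.size();
    std::vector<double> y(N, 0.0);
    for (std::size_t i = 0; i < N; i++) {
        double sum = A[i * N] * x[0];       // 1 multiplication
        count.flops += 1;
        for (std::size_t j = 1; j < N; j++) {
            sum += A[i * N + j] * x[j];     // 1 multiplication + 1 addition
            count.flops += 2;
        }
        y[i] = sum;
    }
    return y;
}
```

Each row costs \(1 + 2(N-1) = 2N - 1\) flops, and there are \(N\) rows, so the counter ends up at exactly \(2N^2 - N\), matching the analysis above.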
Your turn#
Repeat the scaling analysis above for matrix-matrix multiplication (as I did implicitly above, you can assume square matrices).
We haven’t actually talked about how the quantities above affect run time. We’ll get into this more later, but to first order, every floating point operation takes a certain amount of time (say, 0.1 ns), and every memory access also takes a given amount of time (maybe 1 ns). That’s a gross oversimplification of real life, but it means that the time spent is going to be proportional to the number of flops / memory accesses, so the scaling above translates directly to run time.
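This first-order model can be written down in a few lines. The sketch below uses the illustrative 0.1 ns / 1 ns figures from the paragraph above (they are not measured values), and the function name is made up:

```cpp
// Toy runtime model: each flop costs t_flop seconds, each memory access
// costs t_mem seconds. Defaults are the illustrative numbers from the text.
double predict_runtime_sec(double n_flops, double n_mem,
                           double t_flop = 0.1e-9, double t_mem = 1.0e-9)
{
    return n_flops * t_flop + n_mem * t_mem;
}
```

For the \(N = 1000\) matrix-vector product, `predict_runtime_sec(2e6, 1e6)` gives about 1.2 ms, and doubling \(N\) quadruples both terms, consistent with the \(O(N^2)\) scaling above.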
Tracing#
I’ll show a little bit of tracing using the perfetto tool, which is a very powerful tool for tracing and profiling code. It can be used to see how much time is spent in different functions, how the CPU is being used, etc. It’s unfortunately rather complex to use and recently has given me a lot of trouble, but it can give you a lot of insight into what’s going on in your code and your computer in general.
Profiling#
There are many tools that can be used to profile code. For C / C++ / Fortran, a basic tool is gprof, though it doesn’t actually seem to work on my Mac. There’s also oprofile on Linux, shark on the Mac, the PGI compiler comes with its own pgprof, … Some come with graphical user interface options, too.
If they work, those tools are nice – you don’t have to change your code, things will, supposedly, just work. However, there are issues with portability (only work with a certain compiler / a certain operating system), and they give you a lot of information, but it’s not necessarily easy to digest.
This is how gprof can be used on Linux:
# the "-pg" tells the compiler to turn on profiling
[kai@lolcat build]$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-pg -DCMAKE_EXE_LINKER_FLAGS=-pg -DCMAKE_SHARED_LINKER_FLAGS=-pg ..
[...]
# rebuild
[kai@lolcat build]$ make
# run the code
[kai@lolcat build]$ cxx/large_matrix_vector_mul_cxx
setup took 2.03656 sec
matrix_vector_mul() took 0.778702 sec
# that produced a gmon.out file, which gprof will read and analyze
[kai@lolcat linear_algebra]$ gprof cxx/large_matrix_vector_mul_cxx
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
82.45 2.49 2.49 1 2.49 2.49 xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag>::xtensor_container<xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > > >(xt::xexpression<xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > > > const&)
17.55 3.02 0.53 1 0.53 0.53 dot(xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag> const&, xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 1ul, (xt::layout_type)1, xt::xtensor_expression_tag> const&)
0.00 3.02 0.00 2 0.00 0.00 xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 1ul, (xt::layout_type)1, xt::xtensor_expression_tag>::~xtensor_container()
0.00 3.02 0.00 1 0.00 0.00 xt::xgenerator<xt::detail::fn_impl<xt::detail::eye_fn<bool> >, bool, std::vector<unsigned long, std::allocator<unsigned long> > >::~xgenerator()
0.00 3.02 0.00 1 0.00 0.00 xt::xtensor_container<xt::uvector<double, std::allocator<double> >, 2ul, (xt::layout_type)1, xt::xtensor_expression_tag>::~xtensor_container()
0.00 3.02 0.00 1 0.00 0.00 auto xt::eye<bool>(unsigned long, int)
While this looks pretty horrible, the important information is actually there, just hard to digest:
82.45% of the total time, that is, 2.49 seconds, were spent in the eye function that initializes the matrix.
17.55% of the total time, that is, 0.53 seconds, were spent in the dot function that we care about.
This is definitely one of those things that were much less complex with C:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
72.50 2.04 2.04 main
27.86 2.82 0.78 1 782.80 782.80 matrix_vector_mul
0.00 2.82 0.00 3 0.00 0.00 vector_create
0.00 2.82 0.00 1 0.00 0.00 matrix_create
0.00 2.82 0.00 1 0.00 0.00 vector_is_equal
[...]
This is an old profile, which is actually not trivial to interpret either, since at the time I had a separate setup_test_matrix function where the matrix setup happened, i.e., the slowest part of all, and it doesn’t show up.
Basically all the remaining time other than matrix_vector_mul() itself appears to have been spent in main(), and no time in setting up the vectors and matrix, even though we know that the setup phase took 2.036 sec. So one lesson to be learned here is that you can’t necessarily completely trust the profiling. That time is accounted for, but it’s accounted for under main(), rather than under the individual setup_test_vectors/matrix functions.
The reason is that the compiler optimized the code to avoid actually calling those functions, and rather moved the work to be done directly in main(). This is called “inlining”. If we turn off optimization, things match better:
[kai@lolcat linear_algebra]$ make clean
[...]
[kai@lolcat linear_algebra]$ make CFLAGS="-g -O0 -pg -Wall"
[...]
[kai@lolcat linear_algebra]$ ./large_matrix_vector_mul_cxx
setup took 3.80221 sec
matrix_vector_mul() took 4.39365 sec
checking result took 0.000118971 sec
[kai@lolcat linear_algebra]$ gprof large_matrix_vector_mul_cxx
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
53.98 4.40 4.40 1 4.40 4.40 matrix_vector_mul
46.23 8.18 3.77 1 3.77 3.77 setup_test_matrix
0.00 8.18 0.00 6 0.00 0.00 WTime
0.00 8.18 0.00 3 0.00 0.00 vector_create
0.00 8.18 0.00 1 0.00 0.00 matrix_create
0.00 8.18 0.00 1 0.00 0.00 setup_test_vectors
0.00 8.18 0.00 1 0.00 0.00 vector_is_equal
[...]
So now everything’s actually represented correctly – but then again, with optimization turned off, we’re not really profiling what we want to be profiling.
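The inlining behavior described above can be sketched in a few lines. The function names here are hypothetical, not the ones from the assignment code:

```cpp
#include <cstddef>
#include <vector>

// A small function like this is a prime candidate for inlining.
static std::vector<double> setup_nodes(std::size_t n)
{
    return std::vector<double>(n, 1.0);
}

std::vector<double> run(std::size_t n)
{
    // At -O2 the compiler may paste the body of setup_nodes() directly in
    // here, so a profiler charges its time to run() (or main()), and
    // setup_nodes() never shows up in the flat profile.
    return setup_nodes(n);
}
```

With `-O0` the call actually happens and gprof can attribute the time correctly, which is exactly the difference between the two profiles above.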
Prepare for Parallel Programming#
It’s finally time to get ready for parallel programming. To prepare, you get to write sample calculation codes that we’ll then parallelize later.
Do the following work in today’s assignment repository. For a change, the
assignment repo is essentially empty, so you get to practice setting everything
up yourself from scratch, including cmake for building, and don’t forget about
.gitignore.
My recommendation is to use C++ and xtensor for your work, though I leave the choice of language up to you (out of C, C++, Fortran).
An example: integration#
As the first example, we will calculate the area under a curve, that is, the integral \(\int_0^1 f(x)\, dx\), where \(f(x) = \sqrt{1 - x^2}\).
In order to do this, we will use the trapezoidal rule, dividing the area under the curve into little trapezoids and summing them up.
This means we will be calculating

\[
\int_0^1 f(x)\, dx \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{f(x_i) + f(x_{i+1})}{2},
\]

where \(x_i = i/N\) and \(N\) is the number of intervals we divided the range from 0 to 1 into.
Your turn#
Write a function that calculates the approximate integral above, given the number of intervals N. Note that for this program you do not need (and should not use) arrays of any kind.
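One possible sketch of such a function (a hedged example, not the reference solution; the exact integral is \(\pi/4\), a quarter of the unit circle’s area):

```cpp
#include <cmath>

// Trapezoidal-rule approximation of the integral of sqrt(1 - x^2)
// over [0, 1], using N intervals. No arrays needed: just a running sum.
double integrate(int N)
{
    auto f = [](double x) { return std::sqrt(1.0 - x * x); };
    double dx = 1.0 / N;
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double x_i = i * dx;
        // area of one little trapezoid
        sum += 0.5 * (f(x_i) + f(x_i + dx)) * dx;
    }
    return sum;
}
```

As \(N\) grows, the result approaches \(\pi/4 \approx 0.7854\); the independent loop iterations are what will make this easy to parallelize later.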
A second example: averaging#
As a second example, let’s assume we have a function f discretized on “nodes”, that is, on \(i \Delta x\), where \(i = 0, 1, \ldots, N\) and \(\Delta x = 2\pi / N\):

\[
f_i = f(i \Delta x).
\]
We now want to calculate the approximate values of that function on cell centers, that is, at \(\left(i + \tfrac{1}{2}\right) \Delta x\) for \(i = 0, 1, \ldots, N - 1\).
Your turn#
Write a function that takes the \(N + 1\) values of \(f_i\) (in an xtensor / array), and calculates the \(N\) values of \(F_i\) by averaging:

\[
F_i = \frac{f_i + f_{i+1}}{2}, \quad i = 0, 1, \ldots, N - 1.
\]
The output of this function should be written into another vector / array, not directly written to the screen or to a file. You do, however, want to write the data into files eventually, but use a separate function for that.
Think about how you can make a plot of your data (and ideally, do it, too).
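A sketch of the averaging function, using `std::vector` to stay self-contained rather than the xtensor types recommended above (the function name is made up):

```cpp
#include <cstddef>
#include <vector>

// Average N + 1 node values down to N cell-center values:
// F_i = (f_i + f_{i+1}) / 2. The result goes into a new vector,
// separate from any output-to-file function.
std::vector<double> average_to_centers(const std::vector<double>& f)
{
    std::size_t N = f.size() - 1;        // f holds N + 1 node values
    std::vector<double> F(N);
    for (std::size_t i = 0; i < N; i++) {
        F[i] = 0.5 * (f[i] + f[i + 1]);  // cell-center average
    }
    return F;
}
```

Keeping the calculation separate from the file output, as the text suggests, also makes the averaging loop itself trivial to parallelize later.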