# Performance, Profiling, Data Layout

Today's assignment: <https://classroom.github.com/a/NOtBB1f3>

## Profiling

Profiling refers to the process of creating a profile of where your code
spends most of its time (or of related quantities, like memory use, cache
accesses, ...).

Some tools are non-intrusive, that is, you don't have to change your source
code to use them. However, for detailed measurements your code often needs to
be "instrumented", either manually by yourself or automatically by other tools.

My profiling example is based on the matrix-vector multiply in the latest iteration
of the linear algebra library, which is now based on xtensor, see your assignment repo.

Since the plan is to use xtensor going forward, the only thing I'm keeping in terms of my own implementation is the `dot()` functions. The `matrix` and `vector` types were really just aliases for the corresponding xtensor types, so in order to get us used to using xtensor natively, I removed these aliases in [f8fd4c4](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/f8fd4c4c77088679a6abf6f00effccec56632902) and [ffda1b7d](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/ffda1b7df278355a7c2ed38c3d6eaf70ff971aa6).


In the next commit
[e9c401b4](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/e9c401b42d027f2ca110a9a35ea58655325abab2),
I added a new test, called `matrix_vector_dot_big` to the existing tests. That's
arguably not the best way of doing things -- I want to have a bigger test so I
can look at performance, but it doesn't really test correctness. Nevertheless,
it's a quick n' dirty way of checking out performance, so that's where I
started:

```sh
vscode ➜ /workspaces/class-8/build (class13) $ ctest
Test project /workspaces/class-8/build
      Start  1: test_vector_dot
 1/10 Test  #1: test_vector_dot ..................   Passed    0.00 sec
      Start  2: test_vector_add
 2/10 Test  #2: test_vector_add ..................   Passed    0.00 sec
      Start  3: test_matrix_vector_mul
 3/10 Test  #3: test_matrix_vector_mul ...........   Passed    0.00 sec
      Start  4: test_matrix_matrix_mul
 4/10 Test  #4: test_matrix_matrix_mul ...........   Passed    0.03 sec
      Start  5: vector.add
 5/10 Test  #5: vector.add .......................   Passed    0.00 sec
      Start  6: matrix.equal
 6/10 Test  #6: matrix.equal .....................   Passed    0.00 sec
      Start  7: dot.vector_vector
 7/10 Test  #7: dot.vector_vector ................   Passed    0.00 sec
      Start  8: dot.matrix_vector
 8/10 Test  #8: dot.matrix_vector ................   Passed    0.00 sec
      Start  9: dot.matrix_matrix
 9/10 Test  #9: dot.matrix_matrix ................   Passed    0.00 sec
      Start 10: linalg.matrix_vector_dot_big
10/10 Test #10: linalg.matrix_vector_dot_big .....   Passed    0.56 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) =   0.62 sec
```

ctest shows me how long each test took to run, and I can see that my "big" test
actually took a noticeable amount of time, as opposed to the other tests, which
ran really quickly (though I'm sure they still took some finite time to run).

#### Your turn

- Go to the commit above (it should be in your assignment repo :crossed_fingers:,
  but it might have a different commit id), compile and run `ctest`. What timing
  did you get?

### Separate benchmark

As usual, there are lots of fancy ways of benchmarking / profiling /
analyzing performance. But let's start simple. As a first step, I figured I'd
better have a separate executable to look at the matrix-vector multiply
performance, so I did that in [b3977ef](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/b3977ef805ea4086966a084aa34b0f5553218892).

I can now run my benchmark executable `test_matrix_vector_big` separately:

```sh
vscode ➜ /workspaces/class-8/build (class13) $ make && cxx/test_matrix_vector_big 
[ 30%] Built target linalg
[ 39%] Built target test_vector_dot
[ 47%] Built target test_vector_add
[ 56%] Built target test_matrix_vector_mul
[ 65%] Built target test_matrix_matrix_mul
[ 73%] Built target gtest
[ 82%] Built target gtest_main
[ 91%] Built target test_linalg
[100%] Built target test_matrix_vector_big
```

Well, it seems to be working, in that I don't get an error, and I can notice a
short delay while it's running, but I don't have much of an idea how long it
really took.

### `time` command

One, if rather coarse, way to get some idea about performance is to use the
`time` command:

```sh
vscode ➜ /workspaces/class-8/build (class13) $ make && time cxx/test_matrix_vector_big 
[...]

real    0m0.935s
user    0m0.312s
sys     0m0.623s
```

[Note: Depending on your OS, the output might be somewhat different, e.g., on
macOS, but it should contain the same information.]

Most important is probably the `real` number: that's how much actual time it
took to run the program, beginning to end, also known as wallclock time. `user`
is how much time was spent in user code, i.e., the code you (and the xtensor
authors) wrote that was compiled into the executable. `sys` accounts for time
spent by the system, e.g., for allocating memory or doing I/O.

### Adding timing statements (manually)

That doesn't tell you where that time was spent, though, so we need more detail.
Probably the easiest way is to just add some timing statements to your code.
In this commit
[99e7a00](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/99e7a00cd7725d3c3cef72c0c7e627e234a97781),
I've provided a `double Wtime()` function in `linalg.h`, which returns
the current time in seconds (and fractions of a second).

I've used that to time the `dot()` function:

```sh
vscode ➜ /workspaces/class-8/build (class13) $ make && time cxx/test_matrix_vector_big 
[...]
Time: 0.0935109 seconds

real    0m1.284s
user    0m0.521s
sys     0m0.756s
```

#### Your turn

- Do the times add up? Does the "real" time give a good measurement of how long
  the `dot()` function took? Why / why not?
- Add timing code, to get a better picture of what part of the code takes
  how long. Do your measurements make sense?

### Compiler optimization

So far, you may or may not have told the compiler to optimize the code. The
compiler will try to optimize your code, that is, generate code that runs
faster, if you tell it to. If you were calling the compiler directly, that'd be,
e.g., `-O0` (no optimization) vs `-O2` (do a good job optimizing).

Since we're not calling the compiler ourselves but using cmake, cmake takes
care of telling the compiler how to optimize. In particular, in `Debug` mode,
code won't be optimized, but in `Release` mode it will be. If you haven't told
cmake anything, you get a default (which typically means no optimization). If
you're building directly through VS Code, i.e., VS Code is calling cmake for
you, you can use the command "CMake: Select Variant" to set the build type.

If you call cmake by hand, you can use (we've seen this before):

```sh
vscode ➜ /workspaces/iam851/build (class14) $ cmake -S .. -DCMAKE_BUILD_TYPE=Release
```

#### Your turn

- Use manual profiling to analyze the performance of the matrix-vector
  multiplication code in Debug vs Release mode.

- If you're interested in digging into more details, you can provide compiler flags directly

  ```sh
  cmake -DCMAKE_CXX_FLAGS="-O1" .
  ```

  but you have to be careful, because there are also
  `CMAKE_CXX_FLAGS_{RELEASE,DEBUG}`, whose contents will also be added.

  `make VERBOSE=1` will show you the actual compiler commands being used, so you can check that your flags are actually being used.

- Do these results match your expectations?

- For an extra challenge, figure out how to do bounds checking with xtensor and
  check what performance impact it has.

If one cares about performance enough to actually analyze it (as we do in this
class), one usually just cares about performance in the optimized case (Release
mode), so there isn't much point in analyzing performance in, say, Debug mode,
where it's poor in the first place. So, going forward, make sure you do your
timing measurements in optimized mode.

### Scaling with problem size

So far, we've used a fixed problem size (`const int n = 10000;`) that I've
provided. One really important aspect of performance is scalability, and we'll
learn more about that as we go. For now, let's think about and test how
performance changes with problem size.

#### Your turn

- Looking at what the code actually does, predict how the timing of the code
  will change as you, e.g., double or halve the parameter `n`. This is what's
  called algorithmic scalability.
- Now actually do some timing experiments, where you change `n` and track the
  corresponding time measurements. Does it match your predictions well?

Another worthwhile note: as at least the physicists in class have plenty of
experience with, measurements come with errors, and those errors are important
in analyzing what the measurements mean. We're not going to go deeply into
that, but it is something to be aware of. In particular, if your timing
measurements aren't reproducible, they're likely not accurate, and you may need
to think about doing something differently.

### Inlining code

Inlining code can help performance -- it means the compiler can compile a
function right where it's being used, avoiding some call overhead and possibly
being in a better position to make optimization decisions.

In the previous class, I moved the entire implementation of the linear algebra
library into the header file -- that is, I turned it into a header-only
library. Like many things, this has its pros and cons. A con is that it can
lead to longer compile times, since the code no longer gets compiled into a
library once, but rather gets compiled every time it's used. However, that can
also be a pro, because it allows the compiler to inline code, which can lead to
better performance. A header-only library also avoids the hassles of
pre-compiling and linking a regular library, though cmake usually takes care of
that for you, so that's not such a big deal. I shall also note that for
templated code (like the templated-by-type vector and matrix classes),
header-only is pretty much the only way to go, since the compiler needs to see
the actual code in order to instantiate it for the specific types being used.

Now, it's a good opportunity to point out that one has to be somewhat
careful -- in particular in benchmarks with inlined code, the compiler might
notice that you're calculating something and then never actually using the
result. In that case, the compiler can decide that it doesn't need to do the
calculation in the first place, and your timing will show that the calculation
became really fast -- but that may not be real, it may just reflect the fact
that doing nothing is really fast. So one has to make sure in benchmark code
that the compiler doesn't optimize away the thing you mean to benchmark.

### Data Layout and Performance

The layout of your data, and the order in which it is accessed can have a big
impact on performance.

In my last commit
[43001b](https://github.com/UNH-HPC-2026/unh-hpc-2026-class-13-performance-class-13/commit/43001b91ee5dacc1a2a2d9b9760d124096a65c3a),
I templated the matrix-vector dot() function so that it supports different data
layouts. But we first should learn more about data layouts --
[Wikipedia](https://en.wikipedia.org/wiki/Row-_and_column-major_order) is a
useful reference.

Our matrix is stored as a simple 2-d array, `xt::xtensor<double, 2>`. So we
don't actually know anything about what data layout it uses internally.
Generally, that's a good thing -- it's an implementation detail that doesn't
change the correctness of our code. But it can be important for performance, as
well as for interfacing between different languages, in particular C/C++ vs
Fortran.

Since xtensor provides this nice 2-d array abstraction, in theory it should be
possible to just change the data layout internally. How easy or hard that is
depends on the actual implementation. That implementation, in xtensor's case,
is plenty complicated, but fortunately, it actually has options to change the
data layout.

#### Your turn

- Figure out the data layout that the matrix `A` in `test_matrix_vector_big` is
  currently using. You could do this by reading the documentation (not
  necessarily a bad thing :smirk:), but really let's do it by actually printing the
  underlying data storage (which is 1-dimensional).

  - Hint: This is probably much easier to do for a 3x3 matrix rather than a
    10000x10000 one.

  - Print the matrix `A` so that you can make sure you know what its elements
    are. Note: xtensor supports printing natively in the standard C++ fashion
    after including `<xtensor/io/xio.hpp>`. But it may be more instructive to print
    it yourself element by element, anyway.

  - Print the underlying stored 1-d data. It can be accessed by `A.data()` and
    its length is `A.size()`.

- You can change the data layout in xtensor using
  `xt::xtensor<double, 2, xt::layout_type::row_major>` or
  `xt::xtensor<double, 2, xt::layout_type::column_major>`. How does using those
  types change the matrix elements when printed in 2-d? How about the underlying
  1-d storage?

  Do the terms "row-major" and "column-major" make sense given what you observe?

- What's the impact of the data layout on the performance of our matrix-vector
  multiplication? Does it match your expectation? Do you have an
  explanation? How would the situation look in Fortran (which uses column-major
  layout)?

- For more fun, you could try fixing the performance of the matrix-vector
  multiplication implementation while sticking with Fortran (column-major) order
  data layout.

## Homework

- As usual, finish the "Your turn" sections from above. For the first parts,
  you need to check out older commits to answer the corresponding questions.
  For the last set, work off the latest version and make commits as you go.

- Let's do it all again, with a case that's actually somewhat more interesting:
  matrix-matrix multiplication.

  - Think about what you would expect for scalability of the matrix-matrix multiplication.

  - Add an executable to measure performance of the matrix-matrix multiplication
    for some bigger matrices. Are the performance measurements consistent with
    your predictions?

  - (somewhat tricky) Try to predict the performance impact of different choices of
    data layout for the matrices `A` and `B`. Update your implementation such that
    both `A` and `B` can have arbitrary data layout. Do some experiments, and
    interpret them.

