# Performance, Profiling, Data Layout
Today’s assignment: https://classroom.github.com/a/NOtBB1f3
## Profiling

Profiling refers to the process of creating a profile of where your code spends most of its time (or of related quantities, like memory use, cache accesses, …).

Some tools are non-intrusive, that is, you don’t have to change your source code to use them. However, in many cases, for detailed measurements your code needs to be “instrumented”, either manually by yourself or automatically by other tools.
My profiling example is based on the matrix-vector multiply in the latest iteration of the linear algebra library, which is now based on xtensor – see your assignment repo.

Since the plan is to use xtensor going forward, the only part of my own implementation I’m keeping is the dot() functions. The matrix and vector types were really just aliases for the corresponding xtensor types, so in order to get us used to using xtensor natively, I removed these aliases in f8fd4c4 and ffda1b7d.
In the next commit, e9c401b4, I added a new test, called matrix_vector_dot_big, to the existing tests. That’s arguably not the best way of doing things – I want to have a bigger test so I can look at performance, but it doesn’t really test correctness. Nevertheless, it’s a quick ’n’ dirty way of checking out performance, so that’s where I started:
```
vscode ➜ /workspaces/class-8/build (class13) $ ctest
Test project /workspaces/class-8/build
      Start  1: test_vector_dot
 1/10 Test  #1: test_vector_dot ..................   Passed    0.00 sec
      Start  2: test_vector_add
 2/10 Test  #2: test_vector_add ..................   Passed    0.00 sec
      Start  3: test_matrix_vector_mul
 3/10 Test  #3: test_matrix_vector_mul ...........   Passed    0.00 sec
      Start  4: test_matrix_matrix_mul
 4/10 Test  #4: test_matrix_matrix_mul ...........   Passed    0.03 sec
      Start  5: vector.add
 5/10 Test  #5: vector.add .......................   Passed    0.00 sec
      Start  6: matrix.equal
 6/10 Test  #6: matrix.equal .....................   Passed    0.00 sec
      Start  7: dot.vector_vector
 7/10 Test  #7: dot.vector_vector ................   Passed    0.00 sec
      Start  8: dot.matrix_vector
 8/10 Test  #8: dot.matrix_vector ................   Passed    0.00 sec
      Start  9: dot.matrix_matrix
 9/10 Test  #9: dot.matrix_matrix ................   Passed    0.00 sec
      Start 10: linalg.matrix_vector_dot_big
10/10 Test #10: linalg.matrix_vector_dot_big .....   Passed    0.56 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) =   0.62 sec
```
`ctest` shows me how long each test took to run, and I can see that my “big” test actually took a noticeable amount of time, as opposed to the other tests, which ran really quickly (though I’m sure they still took a finite time to run).
### Your turn

Go to the commit above (it should be in your assignment repo :crossed_fingers:, though it might have a different commit id), compile, and run `ctest`. What timing did you get?
## Separate benchmark

As it usually goes, there are lots of fancy ways of benchmarking / profiling / analyzing performance. But let’s start simple. As a first step, I figured I’d better have a separate executable to look at the matrix-vector multiply performance, so I did that in b3977ef.

I can now run my benchmark executable `test_matrix_vector_big` separately:
```
vscode ➜ /workspaces/class-8/build (class13) $ make && cxx/test_matrix_vector_big
[ 30%] Built target linalg
[ 39%] Built target test_vector_dot
[ 47%] Built target test_vector_add
[ 56%] Built target test_matrix_vector_mul
[ 65%] Built target test_matrix_matrix_mul
[ 73%] Built target gtest
[ 82%] Built target gtest_main
[ 91%] Built target test_linalg
[100%] Built target test_matrix_vector_big
```
Well, it seems to be working, in that I don’t get an error, and I can notice a short delay while it’s running, but I don’t have much of an idea how long it really took.
## time command

One, if rather coarse, way to get some idea about performance is to use the `time` command:
```
vscode ➜ /workspaces/class-8/build (class13) $ make && time cxx/test_matrix_vector_big
[...]

real    0m0.935s
user    0m0.312s
sys     0m0.623s
```
[Note: Depending on your OS, the output might look somewhat different, e.g., on macOS, but it should contain the same information.]
Most important is probably the `real` number: that’s how much actual time it took to run the program, beginning to end, also known as wallclock time. `user` is how much time was spent in user code, i.e., the code you (and the xtensor authors) wrote that was compiled into the executable. `sys` accounts for time spent by the system, e.g., for allocating memory or doing I/O.
## Adding timing statements (manually)
That doesn’t tell you where the time was spent, though, so we need more detail. Probably the easiest way is to just add some timing statements to your code. In commit 99e7a00, I’ve provided a `double Wtime()` function in linalg.h, which returns the current time in seconds (and fractions of a second).
I’ve used that to time the dot() function:
```
vscode ➜ /workspaces/class-8/build (class13) $ make && time cxx/test_matrix_vector_big
[...]
Time: 0.0935109 seconds

real    0m1.284s
user    0m0.521s
sys     0m0.756s
```
### Your turn

- Do the times add up? Does the “real” time give a good measurement of how long the dot() function took? Why / why not?
- Add timing code to get a better picture of which parts of the code take how long. Do your measurements make sense?
## Compiler optimization

So far, you may or may not have told the compiler to optimize the code. The compiler will try to optimize your code, that is, generate code that runs faster, if you tell it to. If you were calling the compiler directly, that’d be, e.g., `-O0` (no optimization) vs `-O2` (do a good job optimizing).
Since we’re not calling the compiler ourselves but using cmake, cmake takes care of telling the compiler how to optimize. In particular, in Debug mode, code won’t be optimized, but in Release mode it will be. You may not have told it anything. If you’re building directly through VS Code, i.e., VS Code is calling cmake for you, you can use the command “CMake: Select Variant” to set the build type.
If you call cmake by hand, you can use (we’ve seen this before):
```
vscode ➜ /workspaces/iam851/build (class14) $ cmake -S .. -DCMAKE_BUILD_TYPE=Release
```
### Your turn
Use manual profiling to analyze the performance of the matrix-vector multiplication code in Debug vs Release mode.
- If you’re interested in digging into more details, you can provide compiler flags directly:

  ```
  cmake -DCMAKE_CXX_FLAGS="-O1" .
  ```

  but you have to be careful, because there is also `CMAKE_CXX_FLAGS_{Release,Debug}`, which will also be added. `make VERBOSE=1` will show you the actual compiler commands, so you can check that your flags are actually being used.
- Do these results match your expectations?
- For an extra challenge, figure out how to do bounds checking with xtensor and check what performance impact it has.
If one cares about performance enough to actually analyze it (as we do in this class), one usually really wants things to go fast in the first place. That is, one really only cares about performance in the optimized case (Release mode), so there isn’t much point in analyzing performance in, say, Debug mode, where it’s poor to begin with. So, going forward, make sure you do your timing measurements in optimized mode.
## Scaling with problem size

So far, we’ve used a fixed problem size (`const int n = 10000;`) that I’ve
provided. One really important aspect of performance is scalability, and we’ll
learn more about that as we go. For now, let’s think about and test how
performance changes with problem size.
### Your turn

- Looking at what the code actually does, predict how the timing of the code will change as you, e.g., double or halve the parameter `n`. This is what’s called algorithmic scalability.
- Now actually do some timing experiments, where you change `n` and track the corresponding time measurements. Do they match your predictions well?
Another worthwhile note: as at least the physicists in class should have plenty of experience with, measurements come with errors, and those errors matter when analyzing what the measurements mean. We’re not going to go deeply into that, but it is something to be aware of. In particular, if your timing measurements aren’t reproducible, they’re likely not accurate, and you may need to think about doing something different.
## Inlining code

Inlining code can help performance – it means the compiler can compile a function right where it’s being used, avoiding some overhead and possibly being in a better position to make optimization decisions.
In the previous class, I had moved all of the implementation of the linear algebra library into the header file – that is, I had turned it into a header-only library. As I laid out previously, like many things this has its pros and cons – a con is that it can lead to longer compile times, since the code now doesn’t get compiled into a library once, but rather gets compiled every time it’s used. However, that can also be a pro, because it allows the compiler to inline code, which can lead to better performance – rather than compiling a function generically once, it can just compile it right where it’s being used, and possibly make better optimization decisions. A header-only library certainly helps with the hassles of pre-compiling and linking a regular library, though cmake usually takes care of that for you, so that’s not such a big deal. I shall also note that for templated libraries (like the templated-by-type vector and matrix classes), header-only is pretty much the only way to go, since the compiler needs to see the actual code in order to compile it for the specific types that are being used.
Now, it’s a good opportunity to point out that one has to be somewhat careful – in particular in benchmarks with inlined code, the compiler might notice that you’re calculating something and then never actually using the result. If that’s the case, the compiler can decide that it doesn’t need to do that calculation in the first place, and your timing will show that the calculation became really fast, but that may not be real, it may just represent the fact that doing nothing can be really fast. So one has to be careful in benchmark code to make sure the compiler doesn’t optimize away the thing you mean to benchmark.
## Data Layout and Performance

The layout of your data, and the order in which it is accessed, can have a big impact on performance.

In my last commit 43001b, I templated the matrix-vector dot() function so that it supports different data layouts. But first we should learn more about data layouts – Wikipedia is a useful reference.
Our matrix is stored as a simple 2-d array, `xt::xtensor<double, 2>`. So we don’t actually know anything about what data layout it uses internally. Generally, that’s a good thing – it’s an implementation detail that doesn’t change the correctness of our code. But it can be important for performance, as well as for interfacing between different languages, in particular C/C++ vs Fortran.
Since xtensor provides this nice 2-d array abstraction, in theory it should be possible to just change the data layout internally. How easy or hard that is depends on the actual implementation. That implementation, in xtensor’s case, is plenty complicated, but fortunately, it actually has options to change the data layout.
### Your turn

- Figure out the data layout that the matrix `A` in `test_matrix_vector_big` is currently using. You could do this by reading the documentation (not necessarily a bad thing :smirk:), but let’s do it by actually printing the underlying data storage (which is 1-dimensional). Hint: This is probably much easier to do for a 3x3 matrix than for a 10000x10000 one.
- Print the matrix `A` so that you can make sure you know what its elements are. Note: xtensor supports printing natively in the standard C++ fashion after including `<xtensor/io/xio.hpp>`. But it may be more instructive to print it yourself, element by element, anyway.
- Print the underlying stored 1-d data. It can be accessed by `A.data()`, and its length is `A.size()`.
- You can change the data layout in xtensor using `xt::xtensor<double, 2, xt::layout_type::row_major>` or `xt::xtensor<double, 2, xt::layout_type::column_major>`. How does using those types change the matrix elements when printed in 2-d? How about the underlying 1-d storage?
- Do the terms “row-major” and “column-major” make sense given what you observe?
- What’s the impact of the data layout on the performance of our matrix-vector multiplication? Does it match your expectations? Do you have an explanation? How would the situation look in Fortran (which uses column-major layout)?
- For more fun, you could try fixing the performance of the matrix-vector multiplication implementation while sticking with the Fortran (column-major) data layout.
## Homework

As usual, finish the “Your turn” sections from above. For the first parts, you need to check out older commits to answer the corresponding questions. For the last set, work off the latest version and make commits as you go.
Let’s do it all again, with a case that’s actually somewhat more interesting: matrix-matrix multiplication.

- Think about what you would expect for the scalability of the matrix-matrix multiplication.
- Add an executable to measure the performance of the matrix-matrix multiplication for some bigger matrices. Are the performance measurements consistent with your predictions?
- (somewhat tricky) Try to predict the performance impact of different choices of data layout for the matrices `A` and `B`. Update your implementation such that both `A` and `B` can have arbitrary data layout. Do some experiments, and interpret them.