Linear algebra, bounds checking and debugging#

Today’s assignment

Bounds checking#

A very common error is to access arrays beyond their actual bounds. In C, if an array is declared as

 double arr[5];

the allowed indices to access it are 0…4, as there are 5 elements, and C/C++ start counting at 0. Accessing arr with an index less than 0 or greater than or equal to 5 is what the C language spec calls “undefined behavior”, which means it may sometimes work, other times crash the program, or, even more annoyingly, lead to subtle errors in a totally different part of your program later on. So you definitely want to avoid this. Fortran compilers, which are subject to the same problem, may offer a -fbounds-check or similar option, but C generally does not, since in C an array is in most places really just a pointer to its first (0th, really) element, and there’s no way the compiler can know how many elements follow. There are sophisticated tools which can be helpful in tracking down illegal accesses, though they’re not always available or easy to use (see the references at the bottom).
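To make the “just a pointer” point a bit more concrete, here is a small sketch (both function names are made up for illustration). In the scope where the array is declared, sizeof still knows the full size. Once the array is passed to a function, only a pointer arrives and the length information is gone.

```c
#include <stddef.h>

// Hypothetical helper: in the scope where the array is declared, the
// compiler still knows the full size, so the element count can be computed.
size_t elements_in_declaring_scope(void)
{
  double arr[5] = {0};
  return sizeof(arr) / sizeof(arr[0]); // 5 elements
}

// Hypothetical helper: when passed to a function, the array has decayed to a
// pointer to its first element. sizeof(arr) is now just the size of that
// pointer, and nothing tells the callee how many elements follow.
size_t size_seen_by_callee(double *arr)
{
  return sizeof(arr); // same as sizeof(double *)
}
```

This is why a function that only receives a double * cannot check an index on its own: the length has to be passed along separately.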

If you use your own vector or matrix struct, which actually stores not just the raw data, but also the length of the vector / size of the matrix, you can make use of this to implement your own bounds checking. This is demonstrated here, where bounds checking code is added, though not turned on. In general, the additional checks will take time, so they’re useful during development, but they probably should be turned off in the finished application when you want it to run as fast as possible.
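As a rough idea of what such a scheme can look like, here is a minimal sketch. The struct layout and the helper vector_index_ok() are assumptions made for illustration; the actual definitions of struct vector and VEC in the class repo may well differ. Only the names VEC and BOUNDS_CHECK are taken from the class code.

```c
#include <assert.h>

// Uncomment (or compile with -DBOUNDS_CHECK) to turn the checks on.
// #define BOUNDS_CHECK

// Assumed layout: the raw data plus the number of elements.
struct vector {
  double *data;
  int n;
};

// Hypothetical helper: is i a valid index for v, that is, 0 <= i < n?
static inline int vector_index_ok(const struct vector *v, int i)
{
  return i >= 0 && i < v->n;
}

#ifdef BOUNDS_CHECK
// Checked access: the comma operator runs the assert first and then yields
// the address of the element, so VEC() can still be read and assigned to.
#define VEC(v, i) (*(assert(vector_index_ok((v), (i))), &(v)->data[(i)]))
#else
// Unchecked access: plain indexing, no run-time cost.
#define VEC(v, i) ((v)->data[(i)])
#endif
```

Because all element accesses go through the one macro, switching the checks on or off is a one-line change and the rest of the code stays the same.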

Your turn#

My work will be in the c/ subdirectory. However, as usual you (or VS Code) should just make a top-level build/ directory and call cmake -S .. from there. The executables will end up in build/c/test_....

  • Sign up for the assignment repo

  • Clone onto your local machine

You can follow what I show in class by going back in history. For example, let’s go back to an older version of the code, main~7. What this means is, check out the version of the code on the main branch, but not the current one, rather go back 7 commits in history. Caveat: If you add additional commits in your repository, the numbering will change. [To avoid this ambiguity, commits have a unique id (hash), which gets you to a specific commit for sure. But if one wants to go back and forth, the branch~<n> notation is more convenient.]

➜  build git:(main) git checkout main~7
Note: switching to 'main~7'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e9d3401 class12: print out vector_add result

What the warning is telling you is that since you’re not on the tip of a branch anymore, bad things may happen if you make a commit there. So let’s just not do that ;)

If you’re using VS code, or another git GUI, there’s probably a way to check out an old version of the code from that GUI. E.g., in VS code, you can look at the history in the Version Control panel, and you can right-click on a given commit and select Checkout (Detached) from the context menu.

What happened in that commit is that we were presumably debugging something with the vector_add() function and we just added some printing:

commit e4ae3156527b662556df1539f6d5f1853e8a0a7a (HEAD)
Author: Kai Germaschewski <kai.germaschewski@gmail.com>
Date:   Tue Feb 24 14:38:53 2026 +0000

    class11: print out vector_add result
    
    Let's say we're debugging something...

diff --git a/c/test_vector_add.c b/c/test_vector_add.c
index 9b2b29e..b8ce889 100644
--- a/c/test_vector_add.c
+++ b/c/test_vector_add.c
@@ -2,6 +2,7 @@
 #include "linalg.h"
 
 #include <assert.h>
+#include <stdio.h>
 #include <stdlib.h>
 
 void test_vector_add(int n)
@@ -20,6 +21,12 @@ void test_vector_add(int n)
 
   vector_add(&x, &y, &z);
 
+  printf("{");
+  for (int i = 0; i < 3; i++) {
+    printf(" %g", VEC(&z, i));
+  }
+  printf(" }\n");
+
   assert(vector_equals(&z, &z_ref));
 
   vector_destruct(&x);

Your turn#

  • Run the test_vector_add. What output do you get? Is it what you want? (Remember, we’re trying to manually debug whether vector_add() gives the right result.)

That debug printing seems to work fine, except it just shows 3 elements. Now, say, you want to actually print all 4 elements of the vector in the 2nd test. main~6 makes the following change:

commit 60026088cb5ace32118bba47dbed2b7610aea62d (HEAD)
Author: Kai Germaschewski <kai.germaschewski@gmail.com>
Date:   Tue Feb 24 14:40:05 2026 +0000

    class11: print 4 elements to debug 2nd test
    
    Nothing's actually wrong in vector_add(), actually, but let's presume we're
    debugging anyway...

diff --git a/c/test_vector_add.c b/c/test_vector_add.c
index b8ce889..1f7fa1a 100644
--- a/c/test_vector_add.c
+++ b/c/test_vector_add.c
@@ -22,7 +22,7 @@ void test_vector_add(int n)
   vector_add(&x, &y, &z);
 
   printf("{");
-  for (int i = 0; i < 3; i++) {
+  for (int i = 0; i < 4; i++) {
     printf(" %g", VEC(&z, i));
   }
   printf(" }\n");

It may not be so obvious, but we just introduced a bug – that test is called once for 3-element vectors, and then again for 4-element vectors. If we always print 4 elements, the first call reads one element past the end of a 3-element vector, and that’s not so good.
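One way to avoid hard-coding the count altogether is to loop up to the length stored in the vector itself. The following is only a sketch: the struct layout, the simplified VEC macro, and the function vector_print() are assumptions for illustration, not the class library’s actual code.

```c
#include <stdio.h>

// Assumed layout: the raw data plus the number of elements.
struct vector {
  double *data;
  int n;
};

// Simplified, unchecked element access (hypothetical stand-in for VEC).
#define VEC(v, i) ((v)->data[(i)])

// Hypothetical helper: print all elements of v, however many there are,
// by using the length stored in the struct as the loop bound.
void vector_print(FILE *out, const struct vector *v)
{
  fprintf(out, "{");
  for (int i = 0; i < v->n; i++) {
    fprintf(out, " %g", VEC(v, i));
  }
  fprintf(out, " }\n");
}
```

Calling vector_print(stdout, &z) then prints 3 elements for the 3-element test and 4 elements for the 4-element test, without ever touching memory beyond the end of the vector.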

Your turn#

  • Build the code and run it. Does it work or crash, or what?

Turning on bounds checking#

Alright, so it doesn’t crash (probably), doesn’t fail, but doesn’t quite look right, either. Actually, these are about the worst kinds of bugs, since they lead to random problems (or none at all, seemingly), making them really hard to track down most of the time. So let’s turn on bounds checking.

Your turn#

  • Uncomment the #define BOUNDS_CHECK in linear_algebra.h. Build and run the test.

What went wrong? Well, the bounds checking should point you fairly closely to where to look for the bug.

After you’re done, let’s go back to the main branch. If you have made local changes, you likely need to undo them first, since git doesn’t want to switch branches / commits otherwise, as you’d lose those changes, and git tries to prevent you from unintentionally losing your work. However, today we are okay with losing these temporary changes. In VS Code’s version management menu, it lists the files which are changed, and it has a “revert” button next to the “stage” button. If you’re using another git GUI, it probably also has an option to revert local changes, or $ git checkout <source-dir> will do it on the command line.

To go back to your (current) branch, do git checkout main (or git switch main), or use your GUI / VS Code to do so.

More debugging#

Your branch will be the starting point for the main debugging task of this class.

Well, we first need some real bugs. And to make them not too easy to find, they’re in somewhat complex code, that is, the current version of the linear algebra library, which allows for variable dimensions of matrices and vectors.

For this class, I added a new feature – matrix-matrix multiplication. But we first have to take a step back again. Check out the version where I added it, which is 2 commits back in the main branch (for me, anyway):

vscode  /workspaces/class-8/build (6002608) $ git checkout main~2
Previous HEAD position was 6002608 class11: print 4 elements to debug 2nd test
HEAD is now at 7567948 class11: add matrix-matrix multiply and test

Your turn#

  • Run the tests. They should check out fine.

Tests#

  build git:(4c11f17) cmake --build . && ctest
[...]
Test project /workspaces/class-8/build
    Start 1: test_vector_dot
1/7 Test #1: test_vector_dot ..................   Passed    0.00 sec
    Start 2: test_vector_add
2/7 Test #2: test_vector_add ..................   Passed    0.00 sec
    Start 3: test_matrix_vector_mul
3/7 Test #3: test_matrix_vector_mul ...........   Passed    0.00 sec
    Start 4: test_matrix_matrix_mul
4/7 Test #4: test_matrix_matrix_mul ...........   Passed    0.00 sec
    Start 5: test_vector_dot_cxx
5/7 Test #5: test_vector_dot_cxx ..............   Passed    0.00 sec
    Start 6: test_vector_add_cxx
6/7 Test #6: test_vector_add_cxx ..............   Passed    0.00 sec
    Start 7: test_matrix_vector_mul_cxx
7/7 Test #7: test_matrix_vector_mul_cxx .......   Passed    0.00 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) =   0.02 sec

Excellent, everything looks good.

[Side note: Look at test_matrix_matrix_mul.c. What does it actually test? But never mind.]

However, after the next change, it turns out that the code now crashes.

  build git:(4c11f17) git checkout main~1
Previous HEAD position was 7567948 class11: add matrix-matrix multiply and test
HEAD is now at 2afb480 class11: make the mat-mat-mul test bigger
vscode  /workspaces/class-8/build (2afb480) $ cmake --build . && ctest
[...]
Test project /workspaces/class-8/build
    Start 1: test_vector_dot
1/7 Test #1: test_vector_dot ..................   Passed    0.00 sec
    Start 2: test_vector_add
2/7 Test #2: test_vector_add ..................   Passed    0.00 sec
    Start 3: test_matrix_vector_mul
3/7 Test #3: test_matrix_vector_mul ...........   Passed    0.00 sec
    Start 4: test_matrix_matrix_mul
4/7 Test #4: test_matrix_matrix_mul ...........***Exception: SegFault  0.00 sec
    Start 5: test_vector_dot_cxx
5/7 Test #5: test_vector_dot_cxx ..............   Passed    0.00 sec
    Start 6: test_vector_add_cxx
6/7 Test #6: test_vector_add_cxx ..............   Passed    0.00 sec
    Start 7: test_matrix_vector_mul_cxx
7/7 Test #7: test_matrix_vector_mul_cxx .......   Passed    0.00 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) =   0.02 sec

The following tests FAILED:
          4 - test_matrix_matrix_mul (SEGFAULT)
Errors while running CTest
Output from these tests are in: /workspaces/class-8/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

[I’m hoping it’ll crash for you as well. I tried, and it does crash on my Mac, and it does crash on Linux. But I know that in the past, it wasn’t that easy for me to provide a buggy code that would fail for everyone. If it doesn’t crash for you at all, don’t worry too much – it’s still broken :smirk:]

Narrowing down versions where things break#

Ideally, one should run the tests every time before checking in a new version, so this should never happen. In reality, however, it does happen quite a bit that things get broken, and it may take a while to notice (many bugs are more subtle, and even running a bunch of tests is far from a guarantee that everything is bug-free). In general, one good way to debug a newly found problem is to figure out at which version it broke, that is, find the version n such that things work fine in version n-1, but the problem occurs in version n.

Note: Git has a feature called git bisect that helps find the first bad commit more quickly. It might be useful to look into (in general, not for this class, since we already figured out where it breaks).

Inspecting what has changed in between often helps a lot in narrowing down what caused the bug. However, I made it not so easy here, since the change is that we’re testing a non-square matrix, which previously wasn’t tested at all, so there’s not really an old version which worked.

So here’s the situation: when running test_matrix_matrix_mul, it “segfaults”. Though google is often a successful strategy for how to fix a problem, googling “segmentation fault” is likely of little help here (other than maybe some consolation that one definitely isn’t alone with this kind of problem), since segmentation faults can be caused by all kinds of bugs.

Your turn#

  • Make sure that you can reproduce the problem yourself, and that it starts at one specific commit.

What can we do about the segfault?#

Staring at the code#

It can be quite effective to just read the code and look for a bug, in particular when one knows what to stare at, e.g., the changes between a working and a non-working version. If one has no idea where the problem is, however, there’s probably a lot of code to stare at, and it’s not so likely that’ll lead to success anytime soon.

printf()#

One of the most basic, but still often effective ways is "printf()" debugging, to figure out where exactly the problem happened. The idea is to add a bunch of print statements, and well, the ones that actually print something when you run the code clearly indicate that the code hasn’t yet crashed at that point. On the other hand, if a later print statement does not actually print anything when running the code, that’s a sign that the code didn’t get to it, since it crashes beforehand.

The HERE macro introduced in the last commit makes this job a bit easier, as it saves typing. It’s easy enough to figure out that the problem happens in matrix_matrix_mul(), but we don’t easily get much further using this method.
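For readers who don’t have that commit at hand, a macro of this kind might look roughly like the sketch below. This definition is an assumption and may differ from the HERE macro in the class repo; suspicious_function() is a made-up placeholder for the code being debugged.

```c
#include <stdio.h>

// Hypothetical definition: print the file, line, and function we are in.
// Printing to stderr matters here: stderr is typically unbuffered, so the
// message shows up even if the program crashes right afterwards, whereas
// buffered stdout output may be lost in a crash.
#define HERE fprintf(stderr, "HERE: %s:%d (%s)\n", __FILE__, __LINE__, __func__)

// Usage sketch: sprinkle HERE between the suspicious steps. The last message
// that appears tells you how far the program got before it died.
int suspicious_function(void)
{
  int steps = 0;
  HERE;
  steps++; // stands in for the first suspicious step
  HERE;
  steps++; // stands in for the second suspicious step
  HERE;
  return steps;
}
```

Since the macro fills in the location automatically, each added marker is a single word, and it is easy to search for and remove all of them once the bug is found.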

assert()#

The assert macro (see man assert), which we’ve encountered before, is another useful tool if used in the right places. For example, some bugs could happen if we call matrix-matrix multiplication with matrix dimensions that don’t match. A check for this is actually missing in the code right now, and something for you to add later. But it is not the culprit, unfortunately. The code calculates C = AB, where A is an m x k matrix, B is a k x n matrix, and C is an m x n matrix, so everything is fine.

In general, assert() has the following nice properties:

  • It checks the given condition at runtime, and will abort the code with a (somewhat) helpful message if the condition is violated.

  • It also serves as documentation for the reader of the code.

  • Finally, the checks can be easily removed by defining the macro NDEBUG, removing any performance impact (but also giving up the benefit of runtime checking). cmake automatically defines NDEBUG for all builds other than Debug builds.

As an example, in our factorial() function, we should have at least added a comment:

// This function needs to be called with n >= 0
int
factorial(int n)
{
  ...

However, we can replace this comment by actually enforcing the condition, too:

#include <assert.h>

// ...

int
factorial(int n)
{
  assert(n >= 0);
  // ...
}

This makes it clear to someone looking at the function that n has to be non-negative. And if someone who doesn’t bother looking calls the function with a negative number, the assert will catch the problem and terminate the program with a meaningful message.
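One consequence of being able to remove the checks with NDEBUG is worth keeping in mind: when NDEBUG is defined, the expression inside assert() is not evaluated at all, so it must not contain anything the program depends on. The sketch below shows this within a single file by including <assert.h> twice, which the C standard allows (the macro is redefined on each include according to the current state of NDEBUG). The function names are made up, and in a real project NDEBUG would normally come from the build flags rather than from a #define in the source.

```c
// Variant 1: NDEBUG is defined before <assert.h> is included, so assert()
// expands to nothing and its argument is never evaluated.
#define NDEBUG
#include <assert.h>

int evaluations_with_ndebug(void)
{
  int calls = 0;
  assert(++calls == 1); // removed entirely: calls is not incremented
  return calls;
}

// Variant 2: NDEBUG is undefined again, and re-including <assert.h>
// switches assert() back on.
#undef NDEBUG
#include <assert.h>

int evaluations_without_ndebug(void)
{
  int calls = 0;
  assert(++calls == 1); // evaluated: calls becomes 1 and the check passes
  return calls;
}
```

The first function returns 0 and the second returns 1. The safe habit is to keep assert() conditions free of side effects, so that Debug and Release builds behave the same apart from the checking.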

Core dumps (advanced)#

Core dumps save the state of the process to a file as it dies. This can be used to figure out what went wrong after the fact. This is also known as “post-mortem analysis”.

To enable them:

vscode  /workspaces/class-8/build (2afb480) $ c/test_matrix_matrix_mul 
Segmentation fault
vscode  /workspaces/class-8/build (2afb480) $ ulimit -c unlimited
vscode  /workspaces/class-8/build (2afb480) $ c/test_matrix_matrix_mul 
Segmentation fault (core dumped)

Note that it now says “core dumped”. The core can be found as core or core.<pid> in a system-specific location, e.g., it might be in /cores/ on macOS. Use gdb to analyze it:

vscode  /workspaces/class-8/build (2afb480) $ gdb c/test_matrix_matrix_mul core 
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
[...]
[New LWP 24891]
Core was generated by `c/test_matrix_matrix_mul'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000aaaab7100d3c in ?? ()
(gdb) bt
#0  0x0000aaaab7100d3c in ?? ()
#1  0x0000aaaab7100818 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Well, that’s not helpful at all. If you’re luckier (and compiling the code in Debug mode definitely helps), you might get something like this:

[kai@macbook linear_algebra]$ gdb test_matrix_matrix_mul /cores/core.144
[...]
(gdb) bt
#0  matrix_matrix_multiply (C=<value temporarily unavailable, due to optimizations>, A=0x1001000a0, B=0x1001000b0) at matrix.c:96
96  MAT(C, i, j) += MAT(A, i, k) * MAT(B, k, j);
#1  0x000000010000179f in WTime [inlined] () at /Users/kai/src/iam851/linear_algebra/linear_algebra.h:68
#2  0x000000010000179f in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at test_matrix_matrix_multiply.c:69

The debugger: gdb (advanced)#

It’s even more convenient to use gdb directly, and it can do a lot more than just analyze the stack trace after the code died. However, in many HPC environments, in particular when using MPI, it’s difficult to interactively run a debugger.

In general, if one wants to use a debugger, it’s a good idea to compile the code with the -g -O0 -Wall flags, where -g includes debugging information and -O0 turns off optimization, which otherwise tends to confuse both the debugger and, in turn, the person who’s trying to use the debugger.

In cmake, this can be done by running cmake -DCMAKE_BUILD_TYPE=Debug [...]. As I mentioned before, when one is done debugging and wants best performance, one should use cmake -DCMAKE_BUILD_TYPE=Release [...] instead.

Bounds checking#

Time to come full circle. You’ve learned how to use the bounds checking for my vectors and matrices at the beginning of the class. Let’s use it to find where the bug happens. It’ll still require some more thinking to find and fix the bug, but it’s a good start.

Your turn#

  • Find and fix the bug!

More debugging#

Your turn / homework#

  • Follow the “Your turn” steps above. Other than actually fixing the bug, there isn’t a need to commit anything, but you should keep track of what you’re doing / what’s happening and put notes in the Feedback pull request.

  • Add assert() statements to the matrix_matrix_mul() function that make sure that the matrix dimensions match as required by the underlying math.

  • Implement a matrix_equals() function and use it to complete the matrix-matrix multiplication test.

  • Add bounds checking to the C++ version of vector and matrix, and verify that it works as intended.

  • Add matrix-matrix multiplication to the C++ version of the library, and add a test for it.