Session 13 Summary

Overview of SuiteSparse GraphBLAS Structure
- Progress Review
- Source Code Organization
Element-wise Add vs Element-wise Union
- Set Union Semantics
- Use Cases and Advantages
Implementation Details
Zombie Entries
- Concept
- The Zombie Function
Debugging Infrastructure
- Debug Levels
- Assertion Macros
Code Generation and Factory Kernels
Template System
- Macro-based Templates
- Template Organization
Sparse Add Algorithm Structure
- Multi-phase Approach
- Work Division Strategy
Control Flow Through Element-wise Add
- Entry Points
- Accumulator-Mask Phase
Language and API Design Choices
- Why C Instead of C++
Technical Details and Optimizations
Recursive Nature of GraphBLAS
- Interconnected Operations
Naming Conventions and Code Navigation
- Strict Function Call Format
- Method Naming
Next Session Preview

Overview

Session 13 starts a new topic: element-wise "add". It computes C=A+B where the pattern of C is the set union of A and B, but where the operator can be any binary operator.

Overview of SuiteSparse GraphBLAS Structure

[00:00:00]

Progress Review

[00:00:25]

Tim began by reviewing what has been covered in previous sessions:

Apply operations
Basic matrix data structures
Transpose
Builder
Built-in objects
JIT (covered in an early session)
Reduce

The session focused on element-wise add operations, which Tim proposed as the next topic to explore.

Key Quote: "I thought we would look over what we've done so far. We have done apply, we've done the basic matrix data structures, we've done transpose."

Source Code Organization

[00:05:00]

Tim provided an overview of the GraphBLAS folder structure, highlighting major components:

Matrix multiply (mxm): 15,000 lines of code
Assign: A "beast" with 40 different kernels
Element-wise add: 4,000 lines of code
Select, serialize/deserialize, task parallelism
Compression wrappers for LZ4 and Zstandard

Key Quote: "Matrix multiply, that's 15,000 lines of code right there. Assign is a beast. There's 40 different kernels in assign."

Element-wise Add vs Element-wise Union

[00:09:00]

Set Union Semantics

[00:09:15]

Tim explained that element-wise add performs a set union of matrix patterns:

Element-wise Add:

If entry exists in both A and B: C(i,j) = op(A(i,j), B(i,j))
If entry exists only in A: C(i,j) = A(i,j)
If entry exists only in B: C(i,j) = B(i,j)

Element-wise Union:

If entry exists in both A and B: C(i,j) = op(A(i,j), B(i,j))
If entry exists only in A: C(i,j) = op(alpha, A(i,j))
If entry exists only in B: C(i,j) = op(B(i,j), beta)

Key Quote: "eWise_add should have been called simply Union in the spec. You're doing a set union of the pattern of both matrices. And then what you have is an operator that says what to do when you have 2 entries that overlap at the same location."

Use Cases and Advantages

[00:12:20]

Tim provided an example of why element-wise union is necessary:

For computing C = A - B (matrix subtraction):

With element-wise add: entries in B but not A would just be copied as B, which is incorrect
With element-wise union: using alpha=0 and beta=0, you get 0 - B = -B, which is correct

Key Quote: "If you want to compute the linear algebraic operation of A minus B, subtract 2 matrices, use union, pass in Alpha and Beta as 0 and use the minus operator. You're done, simple."

Another critical use case is user-defined types with comparison operators where you need placeholder values (like infinity) for missing entries.

Implementation Details

Format Agnostic Algorithm Design

[00:26:00]

Tim explained his strategy for handling different matrix storage formats (by-row vs by-column):

The "Agnostic Trick":

Align all matrices to the same format by manipulating transpose descriptors
If both inputs need transpose, toggle the output format instead
From that point forward, the code doesn't distinguish between row and column orientation

Key Quote: "I don't want to have lots of code to do both store by row and store by column transpose or not. I need to conform, I need to map all these into one or few algorithms."

Transpose Optimization

[00:37:00 - 00:47:00]

When explicit transposes are necessary, Tim performs several optimizations:

Transpose + Typecast: If transposing anyway, apply typecasting simultaneously
Pattern-only transpose: If operator doesn't use values, transpose only the sparsity pattern
Mask transpose: Transpose and typecast mask to boolean in one operation

Performance Consideration: User code should manage transposes when possible to avoid repeated work:

Key Quote: "If you see a transpose in the output, know there are costs to that. If you want me to transpose it, I'll transpose it, but I'll throw it away. If you want to keep the transpose, transpose the mask first please, and then call me and then reuse it later yourself."

Tim noted that LAGraph caches transposes in graph objects, and MKL Sparse takes a similar approach, but GraphBLAS itself doesn't cache transposes within matrix objects.

Index Binary Operators and the "Flip IJ"

[00:31:00]

For positional operators that depend on row/column indices:

When format changes require index reinterpretation, Tim uses "flip IJ"
For built-in operators, he can swap operator variants (e.g., first_i becomes first_j)
For user-defined operators, indices must be passed in reversed order

Key Quote: "If the operator's positional, I have to revise the operator essentially. I can do this in 2 ways. I can just wherever I call the operator, later on I'll just reverse the roles of I and J. Or it's a lot simpler - I have a first eye operator and you reverse it, let me just rename it first J."

Zombie Entries

[00:33:00]

Concept

[00:34:00]

Tim introduced the concept of "zombies" - entries marked for future deletion that remain in the data structure:

Live entry: row index >= 0
Zombie entry: tagged with negative index via GB_ZOMBIE() function
Special value: -1 represents "nothing there"

The Zombie Function

[00:34:50]

// Self-inverting function around -1
GB_ZOMBIE(0) = -2
GB_ZOMBIE(-2) = 0
GB_ZOMBIE(-1) = -1

Key Quote: "A zombie is an entry marked for future deletion. It's still in the land of the living, but it's dead. The way I mark it is I have a function - it's a self-inverting function."

Tim mentioned zombies appear in many algorithms but didn't go into full detail since set/get element operations haven't been covered yet.

NOTE: in later versions of GraphBLAS, the zombie function has changed. In v9, it was GB_ZOMBIE(i) = (-(i)-2). In v10, it is now GB_ZOMBIE(i) = ~(i), the ones complement of i. The new function is still its own inverse.

Debugging Infrastructure

[00:51:00]

Debug Levels

[00:51:40]

Tim explained his custom debugging system:

Enabling debug mode:

Global: Edit GB_dev.h to enable assertions across all GraphBLAS
Per-file: Define GB_DEBUG at the top of a specific file

Debug print levels:

GB_0: No output (production code)
GB_1: Print name, dimensions, number of entries only
GB_2: Print first 30 entries
GB_3: Print all entries tersely
GB_5: Print all entries in full detail

Key Quote: "They're sanity checks. They may, if I'm debugging, and if I have a bug here in this code, I might want to say, well, okay, print me out the matrix. Set the debug level to 2 for this print. That means print the first 30 entries."

The debug system uses GB_0, GB_1, etc. instead of plain numbers to make searching and toggling debug statements easier.

Assertion Macros

[00:51:00]

Assertions are NOT controlled by -DNDEBUG
Must be explicitly enabled to prevent accidental performance degradation
Include checks like assert_matrix_ok() that validate entire matrix structures

Code Generation and Factory Kernels

[00:56:00]

Factory vs JIT

[00:57:00]

Factory Kernels:

Pre-compiled at build time
Located in factory/ subdirectories
Fast but numerous (causing long compile times)
Cannot handle typecasting (combinatorial explosion)

JIT Kernels:

Compiled at runtime
Use include/ and template/ subdirectories
Can specialize on matrix sparsity formats
More flexible but require runtime compilation

Key Quote: "The factory kernels cannot typecast. There's too many different variants. So my pre-compiled kernels, called factory kernels, which are tons - this is why GraphBLAS takes so long to compile, and I'm trying to reduce the number of factory kernels I use."

Binary Operator Factory

[00:57:40]

The binop_factory is a massive switch case (1,000 lines) that dispatches to pre-compiled kernels for:

Min/max operators
Arithmetic operators (plus, times, minus, divide)
Boolean operators
Comparison operators
Floating-point operations (power, hypot, fmod, remainder)
Complex number construction
Bitwise operations

Disabling Factory Kernels

[01:09:00]

Users can control which factory kernels are compiled:

Edit configuration files to disable specific types (e.g., int16)
Reduces binary size (used by FalconDB)
Disabled kernels return "no value" and fallback to JIT

Key Quote: "Every factory kernel has a disable flag. So this entire file can be disabled. I don't disable these, they're basic, very simple functions. But I disable the bigger things."

Template System

[01:03:00]

Macro-based Templates

[01:03:05]

Tim's templates are code fragments, not complete functions:

Function headers are separate to allow stitching multiple templates together
Heavy use of macros for flexibility
Enables both factory and JIT to share the same template code

Key Quote: "This is not going to be a full function. It's going to be a code fragment that's going to sit inside of a function, because the function header can differ depending on how I want to use the template."

Template Organization

[00:57:35]

Three subdirectories in each operation folder:

factory/: Pre-compiled factory kernel wrappers (.c files)
include/: Header files for both factory and JIT (.h files)
template/: Core algorithm templates for both factory and JIT (.c files)

JIT limitation: Only uses include/ and template/, never factory/

Sparse Add Algorithm Structure

[01:18:00]

Multi-phase Approach

[01:19:00]

The sparse add algorithm is divided into phases:

Phase 0: Determine output sparsity structure

Find which vectors contain entries (for hypersparse)
Perform set union of vector patterns

Phase 1: Split work and count entries

Divide work across threads
Count the size of the output

Phase 2: Numerical computation

Populate output with actual computed values
Only this phase uses JIT or factory kernels

Key Quote: "Most of my sparse algorithms are multiple phases - the symbolic phase to count and figure out what I'm doing and how big, and then I divvy up the work and compute tasks, and I do the numerical work."

Work Division Strategy

[00:06:40 - 00:08:10]

Tim explained his approach to parallel task distribution:

Problem: Simply dividing by columns can create load imbalance if one column is very dense

Solution: Allow threads to split individual vectors

Thread 1 might take columns 1-4 plus half of column 5
Thread 2 takes the rest of column 5 plus columns 6-N
Works for both row and column storage
Critical for vectors and matrices with dense rows/columns

Key Quote: "What if you just have a vector? You've got to chop the vector into pieces. What if you have a matrix with a dense vector column in it or row stored by row? Well, you might have to put one thread on this whole chunk of this matrix, and you might have to put a bunch of threads on that one dense vector all by itself."

Control Flow Through Element-wise Add

[01:14:00]

Entry Points

[01:14:20]

The call hierarchy for element-wise add:

User API: GrB_eWiseAdd_* or GrB_eWiseUnion_* methods
Top-level dispatcher: Checks inputs, extracts descriptors
GB_ewise(): Handles both add and multiply, manages transposes and format alignment
Special case handlers: GB_ewise_fulla() for all-dense matrices
GB_add(): Main sparse add implementation with phase 0, 1, 2

Accumulator-Mask Phase

[01:16:00]

After computing the temporary result matrix T, the final step determines if additional work is needed:

Direct transplant (no additional work) when:

No accumulator present
No transpose of output needed
Either no mask OR mask was already applied
C is empty or C_replace is true

Otherwise: Call the mask/accumulator method which may recursively call add again

Key Quote: "Just about all of my GraphBLAS methods finish either with a transplant or a transplant conform."

Language and API Design Choices

[01:04:00]

Why C Instead of C++

[01:04:08]

Tim explained why GraphBLAS uses C with macros rather than C++:

Reasons for C:

Easier to bind to other languages (Python, PostgreSQL, etc.)
C++ templates only work at compile time
GraphBLAS needs runtime templating (via JIT)
GraphBLAS matrices can change type at runtime (spec requirement)
Must support user-defined types that can be freed and reallocated

Key Quote: "C++ can only do compile time templating. I can do runtime templating. The GraphBLAS matrix can change its type. You can free it and reallocate it with a new type - that's in the spec. So I have to be able to handle that."

Technical Details and Optimizations

Operator Value Independence

[00:41:30]

Some operators don't actually need the values from input matrices:

Index-only operators (depend only on i,j positions)
Positional operators like second (only use one input)

Optimization: When operator doesn't need values from a matrix:

Only transpose the pattern, not the values
Significant performance improvement
Only works with element-wise union (not add, because add passes raw values through)

Key Quote: "If the operator doesn't care what the value of its first input is, then I can just get the pattern of A. I don't care, I'll never read the values of A. Great, that's good to know."

Special Case: Full Matrix Add with Accumulator

[00:47:00]

The GB_ewise_fulla kernel handles a very specific case:

All three matrices (C, A, B) are completely dense
No mask or mask is not complemented
No transpose of output
No typecasting
No positional operators
Matrices are not iso-valued
Accumulator is the same operator as the element-wise operator

This allows for a highly optimized simple loop without any pattern checking.

ISO-valued Matrices

[00:55:00]

ISO-valued optimization:

Matrix where all entries have the same value
Store value once instead of for every entry
For operations producing ISO output: compute operator once, then only compute pattern
One kernel for all operators (no factory kernel needed)
Mentioned in context of matrix data structure session

Recursive Nature of GraphBLAS

[01:20:40]

Interconnected Operations

[01:20:50]

Tim and Michel discussed the highly interconnected nature of the codebase:

Element-wise add can call the mask phase
The mask phase can call element-wise add again (when accumulator is present)
Matrix operations recursively use other matrix operations
Creates a complex call graph that doesn't fit traditional tree structures

Key Quote (Tim): "It's all because everything talks to everything. I tried to use doxygen for a while for my code. I had this call return structure of my code, and it was a rat's nest."

Key Quote (Michel): "It's algebra, so they're all interconnected. You go up, down, left, right, you're gonna run into mirrors."

Naming Conventions and Code Navigation

[00:50:00]

Strict Function Call Format

[00:50:10]

Tim maintains a strict convention for function calls:

Function name followed by single space and opening parenthesis
No variation from this rule throughout the codebase
Enables easy searching: function_name ( finds all call sites

Key Quote: "I'm very, very consistent about how I call a function. Always I have the name of the function, I have a single space followed by a single parenthesis, and I never ever deviate from that rule."

Method Naming

[00:49:20]

GB_ewise_fulla: Full (dense) add with accumulator
Terse naming convention
'a' suffix indicates accumulator is present
Format indicators in name (full, sparse, bitmap, hypersparse)

Next Session Preview