- Overview
- JIT System Architecture
- Kernel Loading and Compilation
- Kernel Creation and Macrofication
- Binary Operators and Monoids
- Matrix Format Accessors
- Template Code and Compilation
- Next Steps
- Technical Details Summary
This session provides a deep dive into the JIT (Just-In-Time) compilation system in SuiteSparse GraphBLAS, focusing on how kernels are created, hashed, compiled, verified, and loaded. Dr. Davis walks through the complete macrofication process, showing how template code is specialized for specific operators and types.
Discussion: The session starts with an overview of the core JIT system components: kernel creation, hashing, compilation, loading, and verification.
Key Points:
- The JIT system handles kernel lifecycle from creation to execution
- Verification prevents using stale kernels when operator definitions change
- Example: If you define
add_gaussoperator and later change its definition without changing the name, the JIT detects this mismatch
Key Quote:
"What if you create an operator called
add_gauss, and you did this a week ago, and you change your code and you change the definition but you don't change the name? I think oh, I've gotadd_gausshere. Here it is, and it's the wrong string. Well, that's not good. I check for that." [00:00:23]
Discussion: GraphBLAS provides five different JIT control settings that give users fine-grained control over compilation and execution.
Settings:
- On - Full JIT: compile, load, and run
- Load - Load and run pre-compiled kernels, but don't compile new ones
- Run - Only run already-loaded kernels, no file access
- Pause - Don't use JIT at all, but keep loaded kernels in memory
- Off - Completely disable JIT and deallocate everything
Key Quote:
"If you have a set of operations which are not performance critical but they're doing some crazy type casting, they may compile a bunch of JIT kernels you don't need. Just turn, pause the JIT, or just say, JIT run. Don't load anything. Don't try to compile the thing. Just go with what you got." [00:20:40]
Discussion:
The jitifier_load function is the main entry point for obtaining a JIT kernel function pointer. Despite its name, it does more than just loading.
Process Flow:
- Check if kernel already exists in hash table (fast path)
- If not found, attempt to load from disk
- If not on disk or stale, compile from scratch
- Query the kernel to verify it matches expectations
- Add to hash table and return function pointer
Key Quote:
"This is the goal of this is to give me that function pointer to the jit kernel I want to call. I may have to construct that, may have to compile it. It might be able to quickly look up in the hash table." [00:14:25]
Timestamp: [00:24:00]
Discussion: The JIT system uses carefully managed global state to track compiled kernels, cache locations, and compiler settings.
Global Variables:
- Hash table of compiled kernels
- Cache directory path for compiled kernels and source files
- Compiler name and flags (configurable via
GrB_getandGrB_set) - JIT control state
Concurrency Handling:
- Critical sections protect hash table modifications between user threads
- File locking prevents conflicts between different applications compiling the same kernel (feature removed in GraphBLAS v10.1.1; see NOTE below)
- Run mode allows lock-free kernel lookups for maximum performance
Key Quote:
"I have to have them because I have to keep track of permanent track of all this stuff, the hash table, the settings like where do you keep all your cache? Where is the cache of all the compiled kernels and their source files." [00:12:50]
Timestamp: [00:23:00]
Discussion: To handle the case where two different applications might try to compile the same kernel simultaneously, the JIT uses file-based locking.
Implementation:
- Creates a lock file based on the kernel hash
- Hash collisions are acceptable - different kernels will just wait on each other
- If locking fails, the JIT is disabled (perhaps too pessimistic, as Tim notes)
Key Quote:
"I'm paranoid. What if you have 2 different applications that are both calling GraphBLAS, and they both want to compile the same jit kernel at the same time. That's just going to be a mess. I have a file locking mechanism." [00:40:08]
NOTE: File locking is removed in GraphBLAS 10.1.1. Not all file systems support it. If two applications want to use GraphBLAS at the same time, each can be given their own JIT cache. With internal critical sections, the cache is protected if a single application wants to compile multiple JIT kernels in parallel.
Discussion: Kernels are organized into families (reduce, mxm, apply, etc.), each with specific encoding requirements.
Kernel Families:
- Reduce family: reduction to scalar (7 hex digits in encoding)
- Mxm family: matrix-matrix multiply (16 hex digits, 62 bits)
- Apply family: element-wise operations (12 digits)
- Each family has different numbers of operators, types, and format requirements
Encoding Components:
- Sparsity format (2 bits)
- Zombies flag (1 bit)
- Data type (4 bits for 13 built-in types)
- Operator codes (5-8 bits depending on operator type)
- Monoid properties (identity, terminal values)
Timestamp: [00:35:15]
Discussion: Macrofication is the process of transforming template code into specialized C code by generating preprocessor macros that define types, operators, and operations.
Macrofy Methods:
macrofy_name- generates the kernel file namemacrofy_family- dispatches to family-specific macroficationmacrofy_reduce- specializes reduction kernelsmacrofy_typedefs- generates user-defined type definitionsmacrofy_monoid- generates monoid operator macrosmacrofy_binop- generates binary operator definitionsmacrofy_input- generates matrix accessor macros
Key Quote:
"The macrofying is the process of all these macrofy methods here. It's the process of creating the strings that go along or define a particular jit kernel, or the name of the file, or the name of the kernel, which is the same thing." [00:37:00]
Discussion: JIT kernel files follow a strict naming convention that encodes the kernel type and its specialization.
Naming Pattern:
GB_jit__<kernel_name>__<encoding>__<suffix>
Example:
GB_jit__reduce__03f31e1
Components:
GB_jit- namespace prefix (always)- Double underscores separate regions
reduce- kernel name03f31e1- 7 hex digits encoding (28 bits for reduce family)- Optional suffix for user-defined operators/types
Timestamp: [00:38:00]
Discussion: User-defined types must be defined twice in the generated code: once as actual C code and once as a string for verification.
Dual Definition Pattern:
// Actual typedef for compilation
typedef struct { double x, y; } MyCx;
// String version for verification
#define GB_MyCx_DEFN "typedef struct { double x, y; } MyCx"Purpose:
- The actual typedef is used during compilation
- The string version is embedded in the kernel for query/verification
- This allows checking if the type definition has changed
Key Quote:
"I have the C code, and I re-inject it as a string, and then I can call this query function to say, Cough up your strings, please. And so I can look at them to see if they're different than the ones I'm getting now as a mismatch. I delete this kernel, recompile it." [00:27:00]
Discussion: GraphBLAS has 139 different built-in binary operators, each with a unique encoding.
Operator Categories:
- First 31 codes are permissible for monoids (require only 5 bits)
- Full operator set requires 8 bits
- User-defined operators all encode to code 0, distinguished by suffix
Special Operators:
- Integer division (handles divide-by-zero safely)
- Complex operations (Microsoft compiler workarounds)
- Reverse operators (rdiv, rsub, etc.) for non-commutative operations
- Positional operators (uses row/column indices)
Timestamp: [01:16:00]
Discussion: Each operator can have up to three different string representations for different contexts and platforms.
Three String Types:
- Basic expression - How to compute
z = f(x, y) - Update string (U) - CPU-optimized update like
z += y - GPU string (G) - CUDA-optimized version
Example (min operator):
// CPU: uses if statement
if (x < y) z = x;
// CUDA: uses ternary (avoids branches)
z = (x < y) ? x : y;Key Quote:
"I decided this is faster on CUDA because CUDA doesn't like branches. CUDA doesn't like ifs, it doesn't mind ternaries as much as ifs. You'd think the compilers wouldn't care between those 2 right? But this is just a matter of tweaking the code and optimizing." [01:18:30]
Timestamp: [01:15:00]
Discussion: When matrices are transposed or reordered, non-commutative operators must have their operands flipped.
Implementation:
- Flipping is handled at the macro level, not by changing strings
- If
flip_xyis true, the macro becomesOP(y, x)instead ofOP(x, y) - All built-in non-commutative operators have reverse variants (div/rdiv, sub/rsub, etc.)
Key Quote:
"I may decide to say, Hey, wait, I don't have that. I'm going to do this instead. I may choose to multiply instead of a times b, I might choose to do b transpose times a transpose. What if the multiplicative operator is not commutative? Oh, okay, then I have to eventually reverse those operands." [01:12:00]
Timestamp: [01:12:25]
Discussion: Integer division requires special handling because C's undefined behavior for division by zero would abort the program.
GraphBLAS Integer Division Rules:
0 / 0 = 0(integer NaN equivalent)x / 0 = INT_MAX(positive infinity)x / 0 = INT_MIN(negative infinity)- Never aborts the program
Rationale:
- Matrix operations may encounter many zeros
- Aborting on division by zero is unacceptable in a library
- Follows MATLAB's approach to integer division
Key Quote:
"If you happen to divide by 0 integer division by 0, what do you get? Well, you don't get a Nan. You don't get infinity like in 32 Max and 64. Your program dies. You get an abort. Oh, my gosh, you gotta be kidding! That's the undefined behavior that gcc says it's undefined. So let's abort? I don't do that. I do the MATLAB thing." [01:22:00]
Timestamp: [01:21:30]
Discussion: Monoids require extensive macro definitions for identity values, terminal values, atomics, and SIMD operations.
Generated Macros:
GB_Z_TYPE- the monoid's data typeGB_ADD(z,x,y)- the binary operationGB_UPDATE(z,y)- update operation (z op= y)GB_IDENTITY- identity value constantGB_DECLARE_IDENTITY- declares and initializes identityGB_Z_ATOMIC- whether atomic updates are availableGB_Z_HAS_OMP_ATOMIC- OpenMP atomic supportGB_Z_HAS_CUDA_ATOMIC- CUDA atomic supportGB_Z_NBITS- size in bitsGB_PRAGMA_SIMD_REDUCTION- SIMD reduction pragma
Timestamp: [01:03:00]
Discussion: A shared header file provides default definitions for any macros not defined during macrofication.
Default Handling:
- If atomic support isn't explicitly defined, it defaults to false
- If no identity byte exists, memset optimization is disabled
- Terminal values default to "no terminal"
- This allows family-specific macrofication to omit irrelevant features
Timestamp: [01:37:00]
Discussion: Each matrix gets its own set of accessor macros that abstract away the storage format.
Accessor Categories:
- Position accessors (
GB_A_P(k)- start of column k) - Index accessors (
GB_A_H(p),GB_A_I(p)- get row index) - Value accessors (
GB_A_X(p)- get value at position p) - Format flags (
GB_A_IS_BITMAP,GB_A_IS_HYPER, etc.) - Property flags (
GB_A_HAS_ZOMBIES,GB_A_ISO)
Type Casting:
GB_A_TYPE- actual storage type of matrix AGB_A2_TYPE- type after casting for operator input- Handles both flipped and non-flipped operator cases
Discussion: Template files use the generated macros to create specialized kernel implementations.
Template Features:
- Compile-time selection between panel-based and panel-free algorithms
- Separate handling for zombies, bitmap format, and standard sparse
- Uses macros for all type-dependent and operator-dependent code
- Multiple algorithm variants selected by
#ifdirectives
Example Selection:
#if (GB_A_HAS_ZOMBIES || GB_A_IS_BITMAP || GB_PANEL_SIZE == 1)
// Panel-free reduction
#else
// Panel-based reduction (faster for most cases)
#endifTimestamp: [01:48:30]
Discussion: Panel-based reduction is significantly faster than scalar reduction due to better pipelining.
Panel Size Selection:
- Boolean operators: 8 elements
- Min/max: 16 elements
- Floating-point plus/times: depends on type (8-64 elements)
- Integer operations: type-dependent (8-256 elements)
Why It's Faster:
- Reduces to temporary array instead of single accumulator
- Eliminates data dependencies between operations
- Allows better CPU pipelining
- Final reduction across panel is amortized
Timestamp: [01:45:00]
Discussion: The same template code can generate either runtime JIT kernels or compile-time pre-JIT kernels through conditional compilation.
JIT Kernel (runtime):
#define GB_JIT_RUNTIME
// Function named: GB_jit_kernel
// Compiled at runtime, stored in cachePre-JIT Kernel (compile-time):
// No GB_JIT_RUNTIME defined
// Function named: GB_jit__reduce__03f31e1
// Compiled into GraphBLAS libraryDifferences:
- JIT kernels always named
GB_jit_kernelin their .so files - Pre-JIT kernels have unique names matching their specialization
- Both use the same template code with different macro definitions
Timestamp: [01:51:27]
Discussion: Every kernel includes a query function that returns its defining characteristics for verification.
Query Function Returns:
- Hash code for sanity checking
- GraphBLAS version number (e.g., 9.3.0)
- Type definition strings
- Operator definition strings
- Identity and terminal values
Verification Process:
- Load kernel from disk or hash table
- Call query function
- Compare returned values with expected values
- If mismatch, delete kernel and recompile
- If match, use the kernel
Timestamp: [01:55:00]
The session ends with plans for the next session to cover:
- How to compile a kernel (invoking the C compiler)
- Loading compiled kernels with dlopen
- Hash table management
- Kernel verification and validation
Key Quote:
"Next session: How do I compile a kernel? Session 6. Compiling a kernel and hashing it." [01:56:00]
- JIT source code:
Source/JITPackage/GB_jitifyer.c(2,700 lines) - Template kernels:
Source/JITPackage/GB_jit_kernel_*.c - Macro definitions: Various
GB_*_shared_definitions.hfiles - Pre-JIT kernels: Built into GraphBLAS library
- Reduce family: 28 bits (7 hex digits)
- MxM family: 62 bits (16 hex digits)
- Apply family: 48 bits (12 hex digits)
- Monoid operators: 5 bits (31 possible)
- All operators: 8 bits (139 built-in)
- Data types: 4 bits (13 built-in + user-defined)
- Hash table lookups are very fast in JIT_RUN mode (no locks)
- Compilation is slow but happens once per kernel specialization
- Panel sizes are platform-dependent (tuned for Intel)
- Atomic operations enable parallel reductions where supported
- SIMD pragmas provide additional vectorization hints
