- Overview
- JIT Control States and Error Handling
- Kernel Loading and Validation
- Critical Sections and Locking
- Macro System and Code Generation
- Sparsity Format Handling
- Typecasting Implementation
- Preface and Include System
- Compilation Process
- JIT Kernel Structure
- Algorithm Example: Apply Bind-First
- Source Code Organization
- User-Defined Types and Masks
- Operator Dependency Analysis
- Hash Table Implementation
- Encoding System
- Hash Code and Kernel Names
- Lessons and Design Philosophy
- Performance Considerations
- Future Work and Open Questions
- Key Insights
This is the third session on the JIT (Just-In-Time) compiler in SuiteSparse GraphBLAS. The session provides a deep dive into the JIT compilation infrastructure, focusing on how GraphBLAS generates, compiles, and manages JIT kernels at runtime.

- On - Can load and compile
- Load - Can load from disk but cannot compile
- Run - Can use kernels already in memory (dlopen) but cannot load from disk
- Pause - Keep kernels in memory but don't call them
- Off - Flush everything, dlclose all kernels
Key Quote: "If I fail to compile, I ratchet down to we can only load. If I fail to load, I ratchet down to run." [00:03:46]
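The ratchet-down policy can be sketched as a small state machine. This is an illustrative sketch only; the enum and function names below are mine, not the GraphBLAS identifiers:

```c
#include <assert.h>

// Illustrative sketch of the "ratchet down" policy: each failure moves the
// JIT to the next weaker state; it never moves back up automatically.
// (Names are hypothetical, not the actual GxB enums.)
typedef enum { JIT_OFF = 0, JIT_PAUSE, JIT_RUN, JIT_LOAD, JIT_ON } jit_state ;

static jit_state jit_on_compile_failure (jit_state s)
{
    // compile failed: we can still load pre-built kernels from disk
    return (s == JIT_ON) ? JIT_LOAD : s ;
}

static jit_state jit_on_load_failure (jit_state s)
{
    // load failed: fall back to kernels already dlopen'ed in memory
    return (s >= JIT_LOAD) ? JIT_RUN : s ;
}
```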

Issue: Need better user control over JIT failure behavior. Currently returns `GrB_NO_VALUE` internally, but may need new error codes like `GrB_JIT_FAIL` for development scenarios where users want to know compilation failed. [00:04:16]
Panic vs Failures: `GrB_PANIC` is too extreme - it means "shut off GraphBLAS entirely, something catastrophic happened." JIT failures need a less severe error code. [00:06:48]
UPDATE: The GxB_JIT_ERROR error code has been added to GraphBLAS 10.0.0.
- Construct the `.so` filename from the hash code and kernel name
- Attempt `dlopen()` on the library
- If successful, query the kernel to verify it's valid
- If stale or invalid, remove the file and recompile
Cache Organization: [00:24:05]
- Files organized into 256 buckets (8-bit number) to avoid huge single directories
- For test coverage: 22,000 JIT kernels generated
- Path format:
`~/.SuiteSparse/GrB9.3.0/lib/<bucket>/lib<kernel_name>.so`
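The bucket-and-filename scheme can be sketched in a few lines. This is illustrative only: taking the bucket from the low 8 bits of the hash is an assumption (the session only says the bucket is an 8-bit number), and `jit_cache_path` is a made-up helper:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Illustrative sketch: derive the 256-way cache bucket from the low 8 bits
// of the 64-bit hash (which bits GraphBLAS actually uses is an assumption),
// then build the .so path inside the per-bucket cache folder.
static void jit_cache_path (char *path, size_t len,
    const char *cache, uint64_t hash, const char *kernel_name)
{
    unsigned bucket = (unsigned) (hash & 0xFF) ;    // 256 buckets
    snprintf (path, len, "%s/lib/%02x/lib%s.so", cache, bucket, kernel_name) ;
}
```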

- `jit_query` - validates the kernel is the correct version
- `jit_kernel` - the actual kernel function
If query fails or kernel is stale, the file is removed and recompiled. [00:26:51]
- Critical sections protect against multiple threads within same process
- File locking protects against multiple processes accessing same kernel (see NOTE in Session 5)
Locking Strategy: [00:16:50]
- Only one thread per process can JIT at a time (critical section)
- Different processes can JIT independently
- File locks prevent two processes from compiling same kernel simultaneously
Current Behavior: If file lock fails, JIT is turned off rather than waiting. This is a "fix me" area that could be improved with configurable wait/no-wait options similar to PostgreSQL. [00:17:00]
UPDATE: File locking has been removed; see Session 5.

- Apply
- Assign
- Build
- E-wise (element-wise)
- Matrix multiply (MxM)
- Reduce to scalar
- Select
- User-defined operators/types

- C, A, B matrix types (type1, type2, type3)
- X, Y, Z semiring types (multiply inputs and output)
For MxM: Can have 6 different types due to typecasting between matrix types and semiring types.
Type Macrofication: [00:55:00]
Only unique types are output to avoid duplication. Uses string comparison to detect when types are identical and only generates typedef once with guards (#ifndef).
Binary Operators: Two paths:
- Built-in operators - simple inline definitions (e.g., `z = x ^ ~y` for bitwise XNOR)
- User-defined operators - function calls with full definitions guarded by `#ifndef`
Flipping operators: Handled in macros by swapping X,Y parameters rather than creating new operators. [01:11:55]
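Flipping in the macro layer amounts to swapping the arguments. A minimal sketch, with made-up macro names (the real GraphBLAS macros differ):

```c
// Illustrative sketch of operator flipping: rather than creating a new
// "flipped subtract" operator, the macro layer just swaps the x and y
// arguments.  (Macro names are hypothetical.)
#define GB_SUB(z,x,y)         z = (x) - (y)
#define GB_SUB_FLIPPED(z,x,y) GB_SUB (z, y, x)
```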

- Sparse: Look up indices in arrays
- Hyper-sparse: Additional indirection through hyperlist
- Bitmap: Check bitmap for presence
- Full: Computed positions (multiplication and modulo)
Example: Getting row index
- Sparse: `I = Ai[p]` (array lookup)
- Full: `I = p % nrows` (modulo computation)
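The per-format index lookup can be sketched as two macro variants. These macro names are hypothetical, not the actual GraphBLAS ones:

```c
#include <stdint.h>

// Illustrative sketch (hypothetical macro names): "get the row index of
// entry p" specializes per format.  Sparse matrices look it up in Ai;
// full matrices compute it, since entry p of an nrows-by-ncols full
// matrix held by column sits at row p % nrows.
#define GB_ROW_SPARSE(Ai,p,nrows) ((Ai) [p])
#define GB_ROW_FULL(Ai,p,nrows)   ((p) % (nrows))
```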
Bitmap Optimization: [02:06:02]
Single line `if (!GBb_B(p)) continue;` handles the bitmap case. For non-bitmap matrices, this macro evaluates to `if (!1) continue;`, which the compiler optimizes away entirely.
Problem 1 - Complex to Bool: [01:48:28] Windows doesn't properly handle casting from complex types to bool. GraphBLAS manually checks real and imaginary parts against zero.
Problem 2 - Float to Integer Overflow: [01:49:11] ANSI C leaves casting from out-of-range float to integer as undefined behavior. Different compilers/platforms return different garbage values.
Solution: Custom cast functions that return sensible values:
- NaN → 0 (zero being "not a number" historically)
- Negative value to unsigned → 0
- Overflow → INT_MAX or equivalent
Quote: "MATLAB doesn't do that. They define it. So if you cast a floating point infinity to int32, you get int32 max. That's a pretty decent answer. If you do that in ANSI C, you get 42 or something ridiculous." [01:50:13]

Functions like `GB_jit_cast_double_to_uint16_t` handle all edge cases with proper range checking before calling the standard C cast.
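The idea can be sketched for `double` to `int32_t`. This is not the actual GraphBLAS code, just a minimal illustration of the MATLAB-style saturation semantics described above:

```c
#include <stdint.h>
#include <math.h>

// Illustrative sketch (not the actual GraphBLAS implementation): a safe
// double -> int32_t cast.  NaN maps to 0, out-of-range values saturate,
// and only in-range values reach the plain C cast (whose behavior is
// otherwise undefined for out-of-range inputs).
static int32_t cast_double_to_int32 (double x)
{
    if (isnan (x)) return (0) ;                         // NaN -> 0
    if (x >= (double) INT32_MAX) return (INT32_MAX) ;   // saturate high
    if (x <= (double) INT32_MIN) return (INT32_MIN) ;   // saturate low
    return ((int32_t) x) ;                              // in range: plain cast
}
```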

#include directives and comments for JIT kernels via GrB_set():
- C preface for CPU/OpenMP kernels
- CUDA preface for GPU kernels (separate)
Example:

```c
// Kernel by gauss_demo.c
#include <math.h>
```

This allows user-defined operators to include necessary headers without GraphBLAS knowing about them in advance.
- Direct compiler call (Linux/Mac default)
  - Single `system()` call with full compile+link command
  - Faster than CMake
  - Uses shell-based command construction
- CMake (Windows default, optional elsewhere)
  - Creates a temporary CMakeLists.txt
  - Configures, builds, installs
  - Removes the build directory when done
  - More portable but slower
- NVCC (CUDA kernels)
  - Direct shell-out to nvcc
  - Single command for compile and link

Compiler selection:
- Default: uses the same compiler that built GraphBLAS (from `GB_config.h`)
- Customizable via `GrB_set()` to specify a different compiler
- C flags can be augmented (e.g., adding `-I` for custom include paths)
Platform Differences: [02:19:00]
- Linux/Mac: Can use direct compiler calls with standard flags
- Windows: Must use CMake (with MSVC) or CMake with MinGW
- Windows doesn't support direct shell-out compilation method
Flag: `GB_JIT_RUNTIME`
- Defined: kernel compiled at runtime by the JITifyer
- Not defined: pre-compiled kernel baked into `libgraphblas.so`

Pre-JIT kernels: [01:59:30]
- Copied into the `GraphBLAS/PreJIT` folder
- Compiled by CMake during the GraphBLAS build
- Each has a unique function name (not just `GB_jit_kernel`)
- Stored in tables: kernel names, query pointers, kernel pointers
- Loaded into the hash table at `GrB_init()` time
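The effect of the flag can be sketched with a name-selecting macro. This is an illustrative sketch only; the unique Pre-JIT name and the function body are hypothetical:

```c
// Illustrative sketch of the GB_JIT_RUNTIME switch: the same source
// compiles either as a runtime JIT kernel (the generic name is fine,
// since each .so is its own namespace) or as a Pre-JIT kernel baked
// into the library (which needs a globally unique symbol name).
#define GB_JIT_RUNTIME
#ifdef GB_JIT_RUNTIME
    #define GB_KERNEL_NAME GB_jit_kernel
#else
    #define GB_KERNEL_NAME GB_jit__ewise_demo   // hypothetical unique name
#endif

// placeholder body, just to show the selected entry-point name in use
static int GB_KERNEL_NAME (int x) { return x ; }
```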

```c
// No function signature here
{
    // Algorithm starts directly
    for (int64_t p = 0; p < bnz; p++)
    {
        GB_DECLARE_B(bij);
        GB_GET_B(bij, Bx, p);
        GB_EWISEOP(Cx, p, bij, i, j);
    }
}
```

The function signature is defined in a separate wrapper file that `#include`s the template.
```c
for (int64_t p = 0; p < bnz; p++)
{
    if (!GBb_B(p)) continue;        // Bitmap handling
    GB_DECLARE_B(bij);              // Type declaration
    GB_GET_B(bij, Bx, p);           // Load value (with typecast)
    GB_EWISEOP(Cx, p, bij, i, j);   // Apply operator
}
```

All complexity hidden in macros:
- `GBb_B(p)` - returns true for non-bitmap formats; checks the bitmap for the bitmap format
- `GB_DECLARE_B(bij)` - declares a variable of the correct type (possibly after typecast)
- `GB_GET_B(bij, Bx, p)` - loads the value, with typecasting if needed
- `GB_EWISEOP` - applies the binary operator (expands to a user function or built-in op)
Quote: "The algorithm is trivially simple. But everything in this algorithm goes through these macros." [02:05:15]
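One way to see this is to simulate the macro expansion for a specific case. The definitions below are stand-ins, not the real GraphBLAS macros: here B is full and `double`, no typecast is needed, and the operator is "multiply by 2", so every macro collapses to trivial code:

```c
#include <stdint.h>

// Illustrative simulation (not the real GraphBLAS macros) of how one
// template body serves different matrices.  For a full double matrix
// with a built-in operator, the macros collapse to code the compiler
// can fully optimize.
#define GBb_B(p)                 1            // full: every entry present
#define GB_DECLARE_B(bij)        double bij
#define GB_GET_B(bij,Bx,p)       bij = Bx [p] // no typecast needed
#define GB_EWISEOP(Cx,p,bij,i,j) Cx [p] = 2 * (bij)

static void demo_apply (double *Cx, const double *Bx, int64_t bnz)
{
    for (int64_t p = 0 ; p < bnz ; p++)
    {
        if (!GBb_B (p)) continue ;   // expands to if (!1), optimized away
        GB_DECLARE_B (bij) ;
        GB_GET_B (bij, Bx, p) ;
        GB_EWISEOP (Cx, p, bij, 0, 0) ;
    }
}
```

Swapping in a bitmap check, a typecast, or a user-defined operator changes only the macro definitions, never the loop.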

The JIT source tree is embedded in `libgraphblas.so` as a compressed binary using zstandard compression. On first use, the JITifyer decompresses it and writes it to the cache folder.
Cache Contents (~27,000 lines):
- 244 files total
- `GraphBLAS.h` header
- All include files
- All templates (CPU and CUDA mixed together)
- Template directory: ~14,000 lines of algorithms and kernels
Size Perspective: [02:31:00]
- GraphBLAS total: ~120,000 lines
- JIT kernels: ~1/8 of total codebase
- Much work in GraphBLAS doesn't need to be JITted
CPU Side (well organized):
- `Source/jitifyer/` - main JIT infrastructure
- `Source/jit_kernels/template/GB_jit_kernel_*.c` - 44 kernel wrappers (now 88 as of GraphBLAS 10.3)
CUDA Side (needs reorganization): [02:28:00]
- 8 JIT wrappers in CUDA folder
- 8 kernel templates mixed in template folder
- Needs reorganization to match CPU structure
- UPDATE: as of GraphBLAS 10.3, the CUDA folder has been reorganized
Current limitation: Cannot use valued mask with user-defined types. User-defined types cannot be typecast to bool (no typecast functions defined).
Structural masks work fine - only pattern matters, not values.
Potential enhancement: [01:32:40] Allow users to register a "cast to bool" function for user-defined types. This would enable valued masks with user types.
Quote: "It's a hole in the spec. It's not a bug. It's just by design. It's not a super nasty limitation." [01:33:03]
Ray's request: Add support for typecast from user-defined type to bool. [01:32:53]
Slippery slope concern: If we add bool typecast, why not all typecasts?
- User type → 12 built-in types = 12 functions
- User type → other user types = unlimited functions
- Management nightmare: "My brain explodes" [01:35:20]
Decision: Only add bool typecast if there's a clear use case. Don't venture into use cases we don't have. [01:35:56]
Problem: Index unary operators receive 4 parameters (value, row, column, theta) but may only use subset. GraphBLAS still does work to compute all 4.
Potential solution: Let users declare which parameters operator actually needs:
- `GrB_set(op, "I don't use the X value")`
- `GrB_set(op, "I don't use row index")`
Benefit: Skip unnecessary work (finding row/column indices, loading values).
Challenge: How to validate this? Can't trust user to be correct (would cause null dereferences). Compiler dependency analysis is complex.
Quote: "It'd be nice to know. I don't have a mechanism for this yet." [00:52:28]

Special hash values:
- `0` - built-in operator, no need to hash
- `UINT64_MAX` - not jittable (e.g., missing user-defined strings)
Hash collision handling: [02:25:26] If xxHash returns 0 or UINT64_MAX by chance (very rare), return magic number instead to preserve special meanings.
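The collision guard amounts to one comparison. A minimal sketch; the constant below is made up (the session only says an arbitrary value is substituted):

```c
#include <stdint.h>

// Illustrative sketch of the collision guard: 0 and UINT64_MAX carry
// special meanings, so if the hash function happens to produce either,
// substitute a fixed arbitrary value.  (This constant is hypothetical.)
#define GB_HASH_SUBSTITUTE 0x0123456789abcdefULL

static uint64_t jit_hash_finalize (uint64_t hash)
{
    return (hash == 0 || hash == UINT64_MAX) ? GB_HASH_SUBSTITUTE : hash ;
}
```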
Magic number: Arbitrary 64-bit value used to detect if matrix has been initialized/freed. [02:26:10]

Example - Element-wise (48 bits):
- Format of C, M, A, B: 8 bits (2 bits each for 4 matrices)
- Binary operator: 20 bits (5 hex digits)
  - Operator code: 8 bits
  - X, Y, Z types: 4 bits each
- Additional flags and parameters
Hexadecimal alignment: [00:41:07] Carefully designed so bit boundaries align with hex digit boundaries for easier debugging and parsing.
- Built-in types: 1-12 (int, float, double, etc.)
- User-defined: 0xE (14) = "user-defined, don't know which"
- 0 = void (type not used by operator)
Example: 0xEEE means all three types (X, Y, Z) are user-defined. Which specific user type? That's in the operator's string definition. [01:05:23]
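The nibble alignment is what makes an encoding like `0xEEE` readable straight off the hex string. A minimal sketch of packing the three type codes, with a made-up helper name:

```c
#include <stdint.h>

// Illustrative sketch of the nibble-aligned type encoding: X, Y, and Z
// type codes each occupy one hex digit (0xE = user-defined, 0 = void),
// so the packed value can be read directly from the hex kernel name.
static uint32_t encode_xyz (uint32_t xcode, uint32_t ycode, uint32_t zcode)
{
    return (xcode << 8) | (ycode << 4) | zcode ;
}
```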
Hash determines everything: If operator uses user-defined types, cannot typecast them, so operator name alone determines X, Y, Z types.
Kernel naming: [00:24:40]
`lib` + `GB_jit_<operation>_<encoding>` + `.so`
Example: `libGB_jit_ewise_bind1st_4000eee0efec1.so`
- Operation: `ewise_bind1st`
- Encoding: `4000eee0efec1` (hex representation of the bit encoding)
Pre-JIT naming: [01:59:40]
When compiled into GraphBLAS library, kernel functions must have full unique names (not just GB_jit_kernel) to avoid symbol conflicts.
- Factory kernels (built-in specializations)
- Generic kernels (fallback)
- JIT kernels (runtime generated)
All differ only in macro definitions and compilation flags.

Macro layering: `GB_EWISEOP` → `GB_BINOP` → `GB_ADD` → `add_gauss(x,y)`
- Each layer handles different concerns (operation, typecasting, flipping)
- Allows one template to handle all variations
- Can't use standard shell commands
- Forward slash vs backslash issues
- CMake required (slower but more portable)
- MinGW build system has special quirks

Example: Computing X^N via repeated squaring calls operator only log2(N) times. [01:22:00]
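The repeated-squaring idea can be shown on plain integer multiply; GraphBLAS applies the same structure with a user-defined operator in place of `*`. A minimal sketch:

```c
#include <stdint.h>

// Illustrative sketch of exponentiation by repeated squaring: computing
// x^n invokes the multiply operator only O(log2(n)) times instead of n-1.
static int64_t power_by_squaring (int64_t x, uint64_t n)
{
    int64_t result = 1 ;
    while (n > 0)
    {
        if (n & 1) result *= x ;    // odd bit: fold the current square in
        x *= x ;                    // square for the next bit
        n >>= 1 ;
    }
    return result ;
}
```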

- Direct compiler call: fast (one `system()` call)
- CMake: slower (5 `system()` calls minimum)
- Why 22,000 CMake projects in test coverage? Each JIT kernel gets its own CMakeLists.txt in the build folder (deleted after)
- Hash table insertion/lookup details [02:23:28]
- Pre-JIT initialization process [02:22:01]
- Source code extraction from zstandard compressed package [02:22:20]
- CUDA atomic operations enumification [02:26:54]
- Complete sanitization of user-provided paths [02:21:26]
- File locking should retry with timeout instead of immediately disabling JIT [00:17:14]
- Need error code for JIT failures (not panic, not no-value) [00:09:00]
- CUDA folder needs reorganization to match CPU structure [02:28:41]
- Windows could potentially use direct compiler calls with proper flag translation [02:20:00]
- User-defined type to bool typecasting [01:32:40]
- Operator dependency declarations (which parameters actually used) [00:52:28]
- Better control over JIT failure behavior [00:05:58]
- Defensive JIT: System automatically downgrades capabilities when compilation fails to avoid repeated failures
- Macro complexity hides algorithm simplicity: Most kernels are simple loops, but macro preprocessing makes them work for all type/format combinations
- Pre-hashing optimization: Avoid rehashing by caching hash values for all components
- Template versatility: Same code serves factory, generic, and JIT kernels through different macro definitions
- Platform portability is hard: Significant differences between Linux/Mac/Windows require different compilation strategies
- Type safety through macros: Void type used to make accessing unused values a compile error rather than runtime error
Session Conclusion: This session covered the complete JIT infrastructure from hash computation through code generation, compilation, and loading. The next session will focus more on the actual algorithms themselves. [02:09:00]