- Overview
- Transpose and Apply Integration
- Operator Structure Details
- Transpose Implementation Architecture
- Transpose Algorithms
- Bucket Transpose Details
- Iso-Valued Optimization
- Template and JIT System
- Shallow vs Deep Copies
- Accumulator Complexity
- Format Agnostic Design
- Testing and Verification
- Integration with Other Operations
- Performance Considerations
- Key Takeaways
This session focuses on the transpose operation in SuiteSparse GraphBLAS, including its fusion with the apply operation, its multiple algorithm implementations, and its extensive optimization strategies. The session continues the discussion of the transpose method from Session 10.
Next, starting with the Shallow vs Deep Copies topic, the discussion moves away from the transpose methods and into more details of the matrix data structure, the GraphBLAS accumulator step, and the testing and validation of GraphBLAS.
Finally, with the Integration with Other Operations topic and those that follow, the discussion returns to transpose.

- Complex conjugate transpose can be done as a single fused operation
- Typecasting during transpose avoids extra memory passes
- The descriptor can request transpose on any operation
- Many methods may need to implicitly transpose inputs when formats don't match
Key quote: "I wanted to be able to do a complex conjugate transpose fast. That's just an apply with a conjugate operator and a transpose descriptor, one line of code, one line call to GraphBLAS. I should be able to do that as a single fuse operation." [00:20:16]

- Four function pointers for different operator types
- OP code indicating which operator variant
- Z, X, Y, and theta types for inputs/outputs
- JIT name and hash for compilation
- Size tracking for all pointers (programming in a Rust-like manner)
Key insight: The unusual struct definition using typedef is necessary to work around C's generic keyword limitations. The compiler would treat normal typedefs as identical types, breaking the generic switch mechanism. [00:07:17]
- User-settable name: via GrB_NAME (a spec-defined feature)
- JIT name: required for code generation; these are the actual function names
Note: Tim requested something different from the spec committee - he needed function names for the JIT compiler, but got the user-settable name feature instead. [00:10:10]
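Putting these pieces together, a hedged sketch of such an operator struct might look as follows; the field names are illustrative, not the actual internal definition:

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative sketch only -- not the real internal operator struct.
typedef void (*unop_fn)   (void *z, const void *x) ;
typedef void (*binop_fn)  (void *z, const void *x, const void *y) ;
typedef void (*idxunop_fn) (void *z, const void *x,
    uint64_t i, uint64_t j, const void *theta) ;
typedef void (*idxbinop_fn) (void *z, const void *x, uint64_t ix, uint64_t jx,
    const void *y, uint64_t iy, uint64_t jy, const void *theta) ;

struct op_struct
{
    unop_fn     unop ;          // one function pointer per operator family
    binop_fn    binop ;
    idxunop_fn  idxunop ;
    idxbinop_fn idxbinop ;
    int         opcode ;        // which operator variant (internal, may change)
    void *ztype, *xtype, *ytype, *theta_type ;   // input/output type objects
    char  *user_name ;          // settable via GrB_NAME
    char  *jit_name ;           // actual function name, for JIT code generation
    uint64_t hash ;             // hash for the JIT kernel lookup
    size_t user_name_size, jit_name_size ;  // sizes tracked for every pointer
} ;
```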

- Unary operators: Start at 1 (identity is 2)
- Index unary: 52-71
- Binary operators: Start at 72
- Index binary: Placed in middle (only user-defined, no built-ins)
Important: These codes can change between versions - they're internal only. [00:14:07]

- GrB_transpose: the user-facing method
- GB_transpose: the internal worker, called from apply when the transpose descriptor is set, and from internal methods needing format conversion

Case 1: Double transpose
If the descriptor requests transpose on GrB_transpose, it's a double transpose (identity).
Case 2: Format mismatch
If the input is CSC and the output is CSR (or vice versa), the requested transpose becomes a simple copy.
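As an illustration of Case 1, requesting the transpose descriptor on GrB_transpose itself computes C = (A')' and therefore reduces to a copy:

```c
// Case 1: the descriptor transposes A before GrB_transpose transposes it
// again, so this is effectively C = A (a copy, not a transpose).
GrB_transpose (C, NULL, NULL, A, GrB_DESC_T0) ;
```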

- Sequential bucket sort
- Atomic parallel bucket sort
- Non-atomic parallel bucket sort
- Builder (sort-based)
- Bitmap transpose
- Full matrix transpose
- Column-to-row vector (hypersparse result)
- Row-to-column vector
Bucket vs Builder Decision:
bucket_work = (nnz + nrows + ncols) * fudge_factor
builder_work = nnz * log(nnz)
The fudge factor grows with problem size - builder becomes more attractive as matrices get larger. [01:12:13]
Key factors:
- Hypersparse matrices always use builder (can't allocate 2^60 buckets)
- Bucket sort is O(n + e) linear time but doesn't scale well in parallel
- Builder is O(e log e) but more parallel-friendly
- "These are highly tuned. For a different machine, this tuning might be off." [01:12:07]
Column to Row Vector - The Clever Trick: When transposing a sparse column vector to a row vector:
- Input: CSC column vector with indices I and values X
- Output: Hypersparse row vector
- The magic: Input row indices become output hyperlist (H array)
- Shallow copy, O(1) time!
- Only need to construct P array as 0,1,2,3,... to number of entries
Key quote: "The input row indices become the output hyperlist. Boom. No work. It's just a parlor trick. But it's really important. This is a super important thing." [01:04:26]
1. Sequential [01:31:51]
- Single workspace of size n
- Simple iteration over vectors
- Linear time O(n + e)
- Template code is simplest
2. Atomic Parallel [01:32:07]
- One shared workspace
- Multiple threads with atomic increments
- Each thread processes different columns
- Atomic capture for bucket positions
Ugly workaround: Microsoft Visual Studio only supports OpenMP 2.0, requiring ugly macros instead of clean #pragma omp atomic capture directives. [01:33:28]
3. Non-Atomic Parallel [01:34:14]
- One workspace per thread (expensive in memory)
- No atomic contention
- Two-pass algorithm: count, then cumsum, then reverse cumsum to position
- Doesn't scale beyond ~8-16 threads due to memory
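For reference, variant 1 reduces to the textbook sequential bucket transpose of a CSC matrix; the following is a sketch, not the actual template code:

```c
#include <stdint.h>

// Transpose an m-by-n CSC matrix (Ap, Ai, Ax) into its n-by-m CSC transpose
// (Cp, Ci, Cx).  workspace has size m.  Runs in linear O(m + n + nnz) time.
void transpose_sequential (int64_t m, int64_t n, int64_t nnz,
    const int64_t *Ap, const int64_t *Ai, const double *Ax,
    int64_t *Cp, int64_t *Ci, double *Cx, int64_t *workspace)
{
    // count the entries in each row of A (each row becomes a column of C)
    for (int64_t i = 0 ; i < m ; i++) workspace [i] = 0 ;
    for (int64_t p = 0 ; p < nnz ; p++) workspace [Ai [p]]++ ;
    // cumulative sum: workspace [i] = start of bucket i in Ci/Cx
    int64_t s = 0 ;
    for (int64_t i = 0 ; i < m ; i++)
    {
        Cp [i] = s ;
        int64_t c = workspace [i] ;
        workspace [i] = s ;
        s += c ;
    }
    Cp [m] = s ;
    // scatter: walk each column j of A, dropping entries into row buckets
    for (int64_t j = 0 ; j < n ; j++)
    {
        for (int64_t p = Ap [j] ; p < Ap [j+1] ; p++)
        {
            int64_t q = workspace [Ai [p]]++ ;
            Ci [q] = j ;
            Cx [q] = Ax [p] ;
        }
    }
}
```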

Cases where the result is iso-valued:
- The ONE operator: always returns 1
- A specific scalar S is returned
- Identity with typecast
- Built-in operator applied to an iso input
- Binary operator bind-first with a scalar
- Binary operator bind-second with a scalar
Benefit: Only the sparsity pattern needs to be computed in parallel. The single value is computed once. Massive speedup for structural operations.
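A hedged sketch of how that split might look (hypothetical names):

```c
// Sketch of the iso fast path: however many entries the matrix has, the
// operator runs exactly once on the single stored value.
typedef double (*unop) (double) ;

void apply_iso (double *Cx, double a_iso, unop op)
{
    Cx [0] = op (a_iso) ;    // the single output scalar, computed once
    // ... the sparsity pattern is then transposed in parallel; no per-entry
    // value computation is needed, since every entry shares Cx [0]
}
```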

- GB_transpose_sparse: handles sparse/hypersparse matrices (3 variants: sequential, atomic, non-atomic)
- GB_transpose_bitmap: bitmap matrices
- GB_transpose_full: full/dense matrices
Problem with bitmap/full: "These 2 methods I'm not happy with. But they work." They use div/mod for position calculation instead of proper tiling. [01:30:25]

- Operator is compiled into kernel
- No runtime function pointer overhead
- Macros like GB_DECLAREA, GB_GETA, and GB_EWISEOP hide type complexity
- The same template handles unary operators, typecasts, bind-first, and bind-second
Key quote: "It's templatized in compile time to death. And then finally, at the very end of the day, the final kernel is just one of those little simple 3 loops." [01:47:47]

- b_shallow: bitmap array
- x_shallow: values array
- i_shallow: row indices
- h_shallow: hyperlist
- p_shallow: column pointers
- Shallow components point to data owned by another matrix
- Internal temporary matrices can have shallow components
- Matrices returned to user applications never have shallow components
- When freeing a matrix, shallow components are not freed
Use case: Apply without operator can share sparsity pattern between input and output, only allocating new space for values. [00:40:05]
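A sketch of that use case with a hypothetical mini-struct mirroring the shallow flags listed above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical mini-struct, just for this sketch.
typedef struct
{
    int64_t *p ; bool p_shallow ;   // column pointers
    int64_t *i ; bool i_shallow ;   // row indices
    double  *x ; bool x_shallow ;   // values
    int64_t nnz ;
} matrix ;

// Apply with a value-only change: C shares A's pattern; only C->x is owned.
void shallow_apply_setup (matrix *C, const matrix *A)
{
    C->p = A->p ;  C->p_shallow = true ;   // borrowed from A, never freed by C
    C->i = A->i ;  C->i_shallow = true ;   // borrowed from A, never freed by C
    C->x = malloc ((size_t) A->nnz * sizeof (double)) ;
    C->x_shallow = false ;                 // the only component C owns
    C->nnz = A->nnz ;
}
```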

- If entry exists in A but not C: typecasts from A to C (bypasses operator!)
- If exists in both: must typecast A to operator input type, then operator runs
- Different entries can follow different typecast paths
Key quote: "Typecasting with the accumulator has very strange behavior in GraphBLAS. It's what the spec told me to do. So that's what I did." [00:29:41]
Why it can't be changed:
- Too much code depends on the current accumulator behavior
- Extensively fused into every operation for performance
- Assign operation has 40 different kernels all designed for accumulator behavior
- Changing it would be a nightmare
Alternative: Union operation avoids this - every entry goes through the operator with an identity value when only one input is present. [00:30:38]

- Treat everything as a pile of vectors
- Don't care if stored by row or column
- Descriptor transpose + CSR/CSC format = 4 cases collapsed to 2
- "I pretend they're by column internally"
Implementation detail: The transpose descriptor is XORed with format difference. If formats differ between input/output, the effective transpose operation flips. [00:25:13]
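A minimal sketch of that XOR rule (names hypothetical):

```c
#include <stdbool.h>

// Effective transpose = descriptor transpose XOR format mismatch.  This is
// how 4 cases (transpose? x CSR/CSC) collapse to 2.
bool effective_transpose (bool desc_transpose, bool A_is_csc, bool C_is_csc)
{
    bool formats_differ = (A_is_csc != C_is_csc) ;
    return (desc_transpose != formats_differ) ;   // logical XOR
}
```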

- Complex parallel version
- Simple sequential version (easy to understand)
In debug mode, runs both and asserts they match:
// Parallel work here
#ifndef NDEBUG
// Sequential version - must match
assert(parallel_result == sequential_result);
#endif
Note: "Debugging is terribly slow. You have to edit the code to turn the feature on." [01:10:26]


- User requests CSR format
- Matrix is actually stored as CSC
- Must transpose before unpacking (expensive!)
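For example, using SuiteSparse's GxB_Matrix_unpack_CSR (a sketch; error handling omitted):

```c
// Unpack A in CSR form.  If A is stored internally as CSC, GraphBLAS must
// transpose it first -- an expensive hidden step behind the unpack.
GrB_Index *Ap, *Aj, Ap_size, Aj_size, Ax_size ;
void *Ax ;
bool iso, jumbled ;
GxB_Matrix_unpack_CSR (A, &Ap, &Aj, &Ax, &Ap_size, &Aj_size, &Ax_size,
    &iso, &jumbled, NULL) ;
```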

Key quote: "Transpose is used so many other places... every descriptor could potentially invoke a deep or shallow transpose." [01:53:01]

Parallel bucket sort faces a trade-off:
- Needs atomics (slow), or
- Needs one set of buckets per thread (memory intensive)
Builder method scales better:
- Just parallel sort of tuples
- No bucket contention
- Higher memory bandwidth utilization

Memory efficiency techniques:
- Values array can often be shallow-copied
- Sparsity pattern reuse when only applying operators
- In-place transpose for format conversion
- Vector transpose via pointer manipulation
- Fusion is Critical: Transpose + apply fusion enables complex conjugate transpose and avoids extra passes
- Algorithm Diversity: 8+ different algorithms selected via sophisticated heuristics
- Shallow Copies: Extensive use of O(1) shallow operations for efficiency
- Iso-Valued: Special optimization for structural matrices saves enormous computation
- JIT Integration: Type-specific compilation eliminates runtime overhead
- Vector Special Cases: Brilliant hypersparse trick for vector transpose
- Accumulator Complexity: Strange but required by spec, can't be changed
- Testing Rigor: Debug mode runs both the parallel and sequential versions and asserts they match
Final thought: "How many algorithms do you have to do the transpose? I don't know. I lose track of counting... It's mix and match. How do you count them? I don't know." [01:48:19]