Session 12 Summary

Overview

This session focuses on the transpose operation in SuiteSparse:GraphBLAS, including its fusion with the apply operation, the multiple algorithm implementations, and extensive optimization strategies. It continues the discussion of the transpose method from Session 10.

Next, starting with the Shallow vs Deep Copies topic, the discussion moves away from the transpose methods to cover more details of the matrix data structure, the GraphBLAS accumulator step, and the testing/validation of GraphBLAS.

Finally, the discussion returns to transpose with the Integration with Other Operations topic and the sections that follow.


Transpose and Apply Integration

Why Transpose is Fused with Apply [00:00:02]

The transpose operation is tightly integrated with the apply operation because:

  • Complex conjugate transpose can be done as a single fused operation
  • Typecasting during transpose avoids extra memory passes
  • The descriptor can request transpose on any operation
  • Many methods may need to implicitly transpose inputs when formats don't match

Key quote: "I wanted to be able to do a complex conjugate transpose fast. That's just an apply with a conjugate operator and a transpose descriptor, one line of code, one line call to GraphBLAS. I should be able to do that as a single fuse operation." [00:20:16]

GB_Operator Structure [00:06:00]

All operator types (unary, binary, index unary, index binary) share the same internal struct:

  • Four function pointers for different operator types
  • OP code indicating which operator variant
  • Z, X, Y, and theta types for inputs/outputs
  • JIT name and hash for compilation
  • Size tracking for all pointers (programming in a Rust-like manner)

Key insight: The unusual struct definition using typedef is necessary to work around limitations of C11's _Generic keyword. The compiler treats ordinary typedefs of the same underlying type as identical, which would break the generic dispatch mechanism. [00:07:17]


Operator Structure Details

Multiple Names Per Object [00:10:00]

Every operator has two names:

  • Username: Settable via GrB_NAME (spec-defined feature)
  • JIT name: Required for code generation, actual function names

Note: Tim asked the spec committee for something different: he needed function names for the JIT compiler, but got the user-settable name feature instead. [00:10:10]

OP Code Organization [00:13:28]

Operators are organized by numeric ranges:

  • Unary operators: Start at 1 (identity is 2)
  • Index unary: 52-71
  • Binary operators: Start at 72
  • Index binary: Placed in middle (only user-defined, no built-ins)

Important: These codes can change between versions - they're internal only. [00:14:07]


Transpose Implementation Architecture

Multiple Entry Points [00:01:26]

The transpose functionality is accessed from several places:

  • GrB_transpose: User-facing method
  • GB_transpose: Internal worker
  • From apply when transpose descriptor is set
  • From internal methods needing format conversion

Cheap vs Expensive Transpose [00:26:00]

GraphBLAS can sometimes perform "cheap" (O(1)) transpose operations:

Case 1: Double transpose. If the descriptor requests a transpose on GrB_transpose, the two transposes cancel, so the operation is the identity.

Case 2: Format mismatch. If the input is CSC and the output is CSR (or vice versa), the requested transpose becomes a simple copy.


Transpose Algorithms

Overview of Methods [01:10:40]

Tim identifies approximately 8 different transpose algorithms:

  1. Sequential bucket sort
  2. Atomic parallel bucket sort
  3. Non-atomic parallel bucket sort
  4. Builder (sort-based)
  5. Bitmap transpose
  6. Full matrix transpose
  7. Column-to-row vector (hypersparse result)
  8. Row-to-column vector

Algorithm Selection Heuristics [01:11:58]

Bucket vs Builder Decision:

bucket_work = (nnz + nrows + ncols) * fudge_factor
builder_work = nnz * log(nnz)

The fudge factor grows with problem size - builder becomes more attractive as matrices get larger. [01:12:13]

Key factors:

  • Hypersparse matrices always use builder (can't allocate 2^60 buckets)
  • Bucket sort is O(n + e) linear time but doesn't scale well in parallel
  • Builder is O(e log e) but more parallel-friendly
  • "These are highly tuned. For a different machine, this tuning might be off." [01:12:07]

Vector Transpose Optimization [01:00:30]

Column to Row Vector (The Clever Trick): When transposing a sparse column vector to a row vector:

  • Input: CSC column vector with indices I and values X
  • Output: Hypersparse row vector
  • The magic: Input row indices become output hyperlist (H array)
  • Shallow copy, O(1) time!
  • Only the P array must be constructed, as 0, 1, 2, ..., up to the number of entries

Key quote: "The input row indices become the output hyperlist. Boom. No work. It's just a parlor trick. But it's really important. This is a super important thing." [01:04:26]


Bucket Transpose Details

Three Parallel Strategies [01:23:40]

1. Sequential [01:31:51]

  • Single workspace of size n
  • Simple iteration over vectors
  • Linear time O(n + e)
  • Template code is simplest

2. Atomic Parallel [01:32:07]

  • One shared workspace
  • Multiple threads with atomic increments
  • Each thread processes different columns
  • Atomic capture for bucket positions

Workaround: Microsoft Visual Studio supports only OpenMP 2.0, which forces ugly macros in place of the clean #pragma omp atomic capture directive. [01:33:28]

3. Non-Atomic Parallel [01:34:14]

  • One workspace per thread (expensive in memory)
  • No atomic contention
  • Two-pass algorithm: count, then cumsum, then reverse cumsum to position
  • Doesn't scale beyond ~8-16 threads due to memory

Iso-Valued Optimization

What is Iso-Valued? [00:50:23]

A matrix where all entries share one value: essentially a symbolic/pattern matrix with a single scalar.

Six Types of Iso Operations [00:52:24]

  1. Always returns 1 (one operator)
  2. Returns specific scalar S
  3. Identity with typecast
  4. Built-in operator applied to iso input
  5. Binary operator bind first with scalar
  6. Binary operator bind second with scalar

Benefit: Only the sparsity pattern needs to be computed in parallel. The single value is computed once. Massive speedup for structural operations.


Template and JIT System

Template Hierarchy [01:26:30]

Three core templates for different matrix formats:

  • GB_transpose_sparse: Handles sparse/hypersparse (3 variants: sequential, atomic, non-atomic)
  • GB_transpose_bitmap: Bitmap matrices
  • GB_transpose_full: Full/dense matrices

Problem with bitmap/full: "These 2 methods I'm not happy with. But they work." Uses div/mod for position calculation instead of proper tiling. [01:30:25]

JIT Code Generation [01:38:18]

Example: transpose with a bind-second operator

  • Operator is compiled into kernel
  • No runtime function pointer overhead
  • Macros like GB_DECLAREA, GB_GETA, GB_EWISEOP hide type complexity
  • Same template handles: unary operators, typecasts, bind-first, bind-second

Key quote: "It's templatized in compile time to death. And then finally, at the very end of the day, the final kernel is just one of those little simple 3 loops." [01:47:47]


Shallow vs Deep Copies

Shallow Component Flags [00:38:29]

Every matrix has boolean flags indicating which components are shallow:

  • b_shallow: Bitmap array
  • x_shallow: Values array
  • i_shallow: Row indices
  • h_shallow: Hyperlist
  • p_shallow: Column pointers

Rules for Shallow Copies [00:39:40]

  • Shallow components point to data owned by another matrix
  • Internal temporary matrices can have shallow components
  • Matrices returned to user applications never have shallow components
  • When freeing a matrix, shallow components are not freed

Use case: Apply without operator can share sparsity pattern between input and output, only allocating new space for values. [00:40:05]


Accumulator Complexity

Strange Typecasting Behavior [00:29:00]

00:29:00 The accumulator has unusual semantics:

  • If entry exists in A but not C: typecasts from A to C (bypasses operator!)
  • If exists in both: must typecast A to operator input type, then operator runs
  • Different entries can follow different typecast paths

Key quote: "Typecasting with the accumulator has very strange behavior in GraphBLAS. It's what the spec told me to do. So that's what I did." [00:29:41]

Why It Can't Change [00:30:56]

  • Too much code depends on current accumulator behavior
  • Extensively fused into every operation for performance
  • Assign operation has 40 different kernels all designed for accumulator behavior
  • Changing it would be a nightmare

Alternative: Union operation avoids this - every entry goes through the operator with an identity value when only one input is present. [00:30:38]


Format Agnostic Design

The Agnostic Trick [00:23:28]

Many internal methods are "format agnostic":

  • Treat everything as a pile of vectors
  • Don't care if stored by row or column
  • Descriptor transpose + CSR/CSC format = 4 cases collapsed to 2
  • "I pretend they're by column internally"

Implementation detail: The transpose descriptor is XORed with format difference. If formats differ between input/output, the effective transpose operation flips. [00:25:13]


Testing and Verification

Debug Mode Verification [01:10:03]

For parallel algorithms, Tim implements both:

  • Complex parallel version
  • Simple sequential version (easy to understand)

In debug mode, runs both and asserts they match:

// parallel algorithm computes parallel_result here
#ifndef NDEBUG
// debug builds also run the simple sequential version,
// producing sequential_result, then check the two agree
assert (parallel_result == sequential_result) ;
#endif

Note: "Debugging is terribly slow. You have to edit the code to turn the feature on." [01:10:26]

JIT Kernel Count [01:48:37]

The exhaustive test suite compiles approximately 4,000 JIT kernels, covering all type and operator combinations for transpose.


Integration with Other Operations

Pack/Unpack Complications [01:50:30]

When unpacking a matrix in a specific format:

  • User requests CSR format
  • Matrix is actually stored as CSC
  • Must transpose before unpacking (expensive!)

Matrix Multiply Usage [01:53:00]

Several places in matrix multiply need to transpose input matrices, based on descriptor settings and internal algorithm requirements.

Key quote: "Transpose is used so many other places... every descriptor could potentially invoke a deep or shallow transpose." [01:53:01]


Performance Considerations

Parallel Scalability [00:42:00]

Bucket sorts face parallelism challenges:

  • Need atomics (slow) OR
  • Each thread needs own buckets (memory intensive)

Builder method scales better:

  • Just parallel sort of tuples
  • No bucket contention
  • Higher memory bandwidth utilization

Memory Optimization [00:37:43]

Transpose leverages shallow copies extensively:

  • Values array can often be shallow-copied
  • Sparsity pattern reuse when only applying operators
  • In-place transpose for format conversion
  • Vector transpose via pointer manipulation

Key Takeaways

  1. Fusion is Critical: Transpose + apply fusion enables complex conjugate transpose and avoids extra passes
  2. Algorithm Diversity: 8+ different algorithms selected via sophisticated heuristics
  3. Shallow Copies: Extensive use of O(1) shallow operations for efficiency
  4. Iso-Valued: Special optimization for structural matrices saves enormous computation
  5. JIT Integration: Type-specific compilation eliminates runtime overhead
  6. Vector Special Cases: Brilliant hypersparse trick for vector transpose
  7. Accumulator Complexity: Strange but required by spec, can't be changed
  8. Testing Rigor: Debug mode runs both the parallel and sequential versions and asserts that their results match

Final thought: "How many algorithms do you have to do the transpose? I don't know. I lose track of counting... It's mix and match. How do you count them? I don't know." [01:48:19]