This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
The main/default branch of this repository is dev.
Sirius is a GPU-native SQL engine that integrates with DuckDB as an extension. It leverages NVIDIA CUDA-X libraries (cuDF, RMM) to accelerate SQL query execution on GPUs. Sirius intercepts DuckDB's physical plan execution and routes supported operations to GPU execution while gracefully falling back to DuckDB's CPU execution for unsupported cases.
Key Integration Points:
- DuckDB extension architecture: Sirius loads as a DuckDB extension (sirius.duckdb_extension)
- cuCascade: Third-party library for GPU memory management (tiered memory across GPU/host/disk)
- RAPIDS cuDF: GPU DataFrame library for data manipulation
- RMM: RAPIDS Memory Manager for GPU memory allocation
Using Pixi (Recommended):
pixi shell # Activate environment with all dependencies
source setup_sirius.sh # Set SIRIUS_HOME_PATH and LDFLAGS
Manual Setup:
source setup_sirius.sh
export LIBCUDF_ENV_PREFIX=/path/to/miniconda3/envs/libcudf-env # If using conda
# Full build (uses all cores by default)
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
# If build consumes too much memory, reduce parallelism
CMAKE_BUILD_PARALLEL_LEVEL=8 make
# After build errors, clean build directory
rm -rf build
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
Build outputs:
- Static extension: build/release/extension/sirius/sirius.duckdb_extension
- Loadable extension: build/release/extension/sirius/sirius_loadable.duckdb_extension
- Unit test binary: build/release/extension/sirius/test/cpp/sirius_unittest
cd duckdb-python
pip install .
cd ..
make test # Run all SQLLogicTests
make test_debug # Debug build tests
# Run specific test file
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
build/release/test/unittest --test-dir . test/sql/tpch-sirius.test
# Build and run all unit tests
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
build/release/extension/sirius/test/cpp/sirius_unittest
# Run tests with specific tag
build/release/extension/sirius/test/cpp/sirius_unittest "[cpu_cache]"
# Run specific test
build/release/extension/sirius/test/cpp/sirius_unittest "test_cpu_cache_basic_string_single_col"
Test logs are saved to: build/release/extension/sirius/test/cpp/log
Unit tests use the Catch2 framework. Test files are in test/cpp/, organized by component.
# Requires duckdb-python to be built
python3 test/tpch_performance/generate_test_data.py {SCALE_FACTOR}
python3 test/tpch_performance/performance_test.py {SCALE_FACTOR}
Sirius uses pre-commit hooks for code quality:
pre-commit run -a # Run all hooks on all files
pre-commit install # Install git hooks (runs on every commit)
Code style tools:
- C++/CUDA: clang-format (style defined in .clang-format)
- Python: black
- CMake: cmake-format
- Spell check: codespell (custom words in .codespell_words)
Configuration files:
- .clang-format: C++/CUDA formatting rules
- .clang-tidy: C++ linting rules
- .pre-commit-config.yaml: All pre-commit hooks
Sirius has two parallel execution modes, both coexisting in src/:
Legacy Sirius (gpu_processing):
- Uses namespace duckdb
- Entry point: CALL gpu_processing('SELECT ...')
- Physical plan generator: GPUPhysicalPlanGenerator (src/gpu_physical_plan_generator.cpp)
- Operators: GPUPhysicalOperator subclasses in src/operator/ (e.g., gpu_physical_hash_join.cpp)
- Plan builders: src/plan/ (e.g., gpu_plan_filter.cpp, gpu_plan_aggregate.cpp)
- Executor: src/gpu_executor.cpp
- Memory: requires gpu_buffer_init() before use; uses GPUBufferManager and GPUContext
New Sirius (gpu_execution):
- Uses namespace sirius
- Entry point: CALL gpu_execution('SELECT ...')
- Physical plan generator: sirius_physical_plan_generator (src/planner/sirius_physical_plan_generator.cpp)
- Operators: sirius_physical_operator subclasses in src/op/ (e.g., sirius_physical_hash_join.cpp)
- Plan builders: src/planner/ (e.g., sirius_plan_filter.cpp, sirius_plan_aggregate.cpp)
- Engine: src/sirius_engine.cpp, pipelines in src/pipeline/
- Interface: src/sirius_interface.cpp (uses sirius_interface class)
- Includes task-based execution: src/creator/, src/downgrade/, src/op/scan/
Shared code (used by both, in namespace duckdb):
- src/sirius_extension.cpp: Extension entry point, registers both gpu_processing and gpu_execution table functions
- src/expression_executor/: GPU expression evaluation
- src/config.cpp / src/include/config.hpp: Runtime configuration
- src/cuda/: CUDA kernels (cuDF wrappers, expression dispatch)
New development should target the new Sirius (namespace sirius / gpu_execution) code path.
Sirius implements a custom execution engine that processes DuckDB's physical plans:
- Thread Coordinator: Main thread receives logical plan from DuckDB, populates Pipeline Metadata Hash Map
- Task Creator: Creates Scan Tasks and Pipeline Tasks based on data availability in Data Repository
- Scan Executor: Uses DuckDB to scan data from storage, converts to GPU format, stores in Data Repository
- Pipeline Executor: GPU thread pool executing operators via cuDF, stores results in Data Repository
- Downgrade Executor: Moves data from GPU to CPU when GPU memory is constrained
Data Flow:
- Data Batch: Wrapper for pipeline input/output (cudf::table or spilling::allocation)
- Data Repository: Container for Data Batches, manages movement across memory tiers (GPU/CPU/disk via cuCascade)
- Pipeline Task: Operators chain + Data Batch to be executed on GPU
- Scan Task: DuckDB-based data scan that produces Data Batches
Execution:
- sirius_engine: Top-level orchestrator, owns pipelines and physical plan
- sirius_pipeline: Collection of operators that can be executed together
- sirius_meta_pipeline: Manages pipeline dependencies and scheduling
- GPU Thread Pool: Stream-per-thread model for parallel GPU execution
- Memory Reservation Manager: Prevents GPU OOM by enforcing memory limits
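The relationship between Data Batches, the Data Repository, Pipeline Tasks, and the Task Creator can be illustrated with a toy model. This is a single-threaded Python sketch for orientation only, not the Sirius implementation: every class and function name here is hypothetical, plain Python lists stand in for cudf tables, and the real engine runs these roles on separate threads and GPU streams.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Optional


@dataclass
class DataBatch:
    """Stand-in for a Data Batch (the real one wraps GPU or spilled data)."""
    rows: List[int]


@dataclass
class DataRepository:
    """Toy Data Repository: one queue of pending batches per pipeline id."""
    batches: Dict[int, Deque[DataBatch]] = field(default_factory=dict)

    def put(self, pipeline_id: int, batch: DataBatch) -> None:
        self.batches.setdefault(pipeline_id, deque()).append(batch)

    def take(self, pipeline_id: int) -> Optional[DataBatch]:
        queue = self.batches.get(pipeline_id)
        return queue.popleft() if queue else None


@dataclass
class PipelineTask:
    """Operator chain plus the Data Batch it should run on."""
    operators: List[Callable[[List[int]], List[int]]]
    batch: DataBatch

    def run(self) -> DataBatch:
        rows = self.batch.rows
        for op in self.operators:
            rows = op(rows)
        return DataBatch(rows)


def create_tasks(repo: DataRepository, pipeline_id: int, operators) -> List[PipelineTask]:
    """Toy Task Creator: emit one Pipeline Task per batch currently available."""
    tasks = []
    while (batch := repo.take(pipeline_id)) is not None:
        tasks.append(PipelineTask(operators, batch))
    return tasks


repo = DataRepository()
repo.put(0, DataBatch([1, 2, 3, 4]))  # pretend a Scan Task produced these
repo.put(0, DataBatch([5, 6]))

keep_even = lambda rows: [r for r in rows if r % 2 == 0]  # filter-like operator
times_ten = lambda rows: [r * 10 for r in rows]           # projection-like operator

for task in create_tasks(repo, 0, [keep_even, times_ten]):
    repo.put(1, task.run())  # results become input for the next pipeline
```

The point of the sketch is the hand-off: tasks are created only from batches that are already in the repository, and each task's output goes back into the repository for the downstream pipeline.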
Operators (src/include/operator/):
See Supported Features for the full list of implemented operators.
Core source code:
- src/include/: Header files organized by module
  - operator/: GPU physical operators (filter, join, aggregate, etc.)
  - pipeline/: Pipeline execution framework (tasks, executors, queues)
  - memory/: Memory management interfaces (integrates with cuCascade)
  - op/: Sirius-specific physical operator wrappers
  - planner/: Physical plan generation and optimization
  - data/: Data structures (columns, batches)
  - cudf/: cuDF integration utilities
  - expression_executor/: Expression evaluation on GPU
Important files:
- src/sirius_extension.cpp: Extension entry point, registers functions with DuckDB
- src/sirius_interface.cpp: API for gpu_buffer_init and gpu_processing
- src/gpu_executor.cpp: Main GPU execution coordinator
- src/gpu_buffer_manager.cpp: GPU memory allocation and caching
Third-party dependencies:
- cucascade/: GPU memory management library (built as subdirectory)
- duckdb/: DuckDB core (git submodule)
- third_party/: spdlog (logging), other dependencies via CMake
Build configuration:
- CMakeLists.txt: Main build configuration
- extension_config.cmake: Extension-specific DuckDB config
- third_party/*.cmake: External dependency fetching (spdlog, cucascade)
- pixi.toml: Pixi environment specification (CUDA versions, dependencies)
Sirius uses cuCascade for sophisticated GPU memory management:
- GPU Caching Region: Stores raw input data on GPU
- GPU Processing Region: Holds intermediate results (hash tables, join results)
- Pinned Host Memory: Fast CPU-GPU transfers
- Memory Reservations: Pre-allocation strategy to avoid OOM during execution
Initialization via gpu_buffer_init("1 GB", "2 GB", pinned_memory_size = "4 GB")
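The idea behind memory reservations (take a lease before allocating, refuse when the budget would be exceeded) can be sketched in a few lines. This is a conceptual Python illustration, not cuCascade's API: the class name, method names, and the 2 GiB budget are all assumptions made for the example.

```python
import threading


class ReservationManager:
    """Toy reservation manager: hands out leases against a fixed byte budget."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.reserved_bytes = 0
        self._lock = threading.Lock()

    def try_reserve(self, nbytes: int) -> bool:
        """Grant the lease only if it fits within the remaining budget."""
        with self._lock:
            if self.reserved_bytes + nbytes > self.limit_bytes:
                return False  # caller would have to wait, spill, or fall back
            self.reserved_bytes += nbytes
            return True

    def release(self, nbytes: int) -> None:
        """Return a lease to the budget once the work that needed it is done."""
        with self._lock:
            self.reserved_bytes = max(0, self.reserved_bytes - nbytes)


GIB = 1024 ** 3
region = ReservationManager(limit_bytes=2 * GIB)  # illustrative budget

granted_first = region.try_reserve(GIB + GIB // 2)  # 1.5 GiB fits
granted_second = region.try_reserve(GIB)            # 2.5 GiB total would not fit
region.release(GIB + GIB // 2)                      # first task finishes
granted_retry = region.try_reserve(GIB)             # now there is room
```

Reserving up front turns a mid-kernel out-of-memory failure into an ordinary scheduling decision, which is the property the bullet on memory reservations describes.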
Sirius uses spdlog for structured logging:
export SIRIUS_LOG_DIR=/path/to/logs # Default: ${CMAKE_BINARY_DIR}/log
export SIRIUS_LOG_LEVEL=debug # Levels: trace, debug, info, warn, error
Logs are essential for debugging GPU execution, memory allocation, and pipeline scheduling.
Sirius gracefully falls back to DuckDB CPU execution when:
- Data size exceeds GPU memory regions (caching or processing)
- Unsupported data types (nested types, some temporal types)
- Unsupported operators (window functions, ASOF JOIN, etc.)
- libcudf row count limitations (~2B rows due to int32_t row IDs)
The fallback mechanism is implemented in src/fallback.cpp and integrates with DuckDB's execution engine.
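The fallback conditions listed above amount to a short series of checks. The Python sketch below is only an illustration of that decision logic and does not mirror src/fallback.cpp: the function name, the operator labels, and the exact threshold comparison are assumptions. The one hard number is the int32_t maximum, 2^31 - 1, which is where the roughly 2B row limit comes from.

```python
from typing import Optional

INT32_MAX = 2**31 - 1  # 2,147,483,647: the largest value an int32_t can hold

# Illustrative labels only; these are not identifiers from the Sirius source.
UNSUPPORTED_OPERATORS = {"WINDOW", "ASOF_JOIN"}


def fallback_reason(row_count: int, operator: str, fits_in_gpu_regions: bool) -> Optional[str]:
    """Return why a query would fall back to CPU, or None if the GPU path applies."""
    if not fits_in_gpu_regions:
        return "data exceeds GPU caching or processing region"
    if operator in UNSUPPORTED_OPERATORS:
        return f"unsupported operator: {operator}"
    if row_count > INT32_MAX:
        return "row count exceeds int32_t row IDs"
    return None


reasons = [
    fallback_reason(1_000_000, "HASH_JOIN", True),      # GPU path
    fallback_reason(1_000_000, "WINDOW", True),         # unsupported operator
    fallback_reason(3_000_000_000, "HASH_JOIN", True),  # too many rows
]
```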
- Data types: INTEGER, BIGINT, FLOAT, DOUBLE, VARCHAR, DATE, TIMESTAMP, DECIMAL
- Operators: FILTER, PROJECTION, JOIN (Hash/Nested Loop/Delim), GROUP BY, ORDER BY, AGGREGATION, TOP-N, LIMIT, CTE, TABLE SCAN
- Join types: INNER, LEFT, RIGHT, OUTER (implemented via cudf::left_join, cudf::inner_join, etc.)
- GPU kernels (.cu files) are in src/cuda/ and subdirectories
- CPU-side logic (.cpp files) coordinates GPU execution
- Header files (.hpp) in src/include/ mirror source structure
- Each operator has both a DuckDB-facing interface (operator/) and cuDF implementation (cuda/operator/)
- Create header in src/include/operator/gpu_physical_<operator>.hpp
- Implement DuckDB integration in src/operator/gpu_physical_<operator>.cpp
- Add cuDF/CUDA implementation in src/cuda/operator/<operator>.cu
- Register in physical plan generator (src/gpu_physical_plan_generator.cpp)
- Add tests in test/cpp/operator/ and test/sql/
- Uses CUDA 13+ (specified in pixi.toml features)
- Requires C++20 and CUDA standard 20
- Separable compilation enabled for CUDA (CMAKE_CUDA_SEPARABLE_COMPILATION ON)
- GPU architectures: Turing through Blackwell (75, 80, 86, 90a, 100f, 120a, 120)
- Links against: cudf::cudf, rmm::rmm, libnuma, libconfig++, absl::any_invocable, spdlog, cuCascade
Build Issues:
If you see undefined reference errors related to GLIBCXX or CXXABI:
export LDFLAGS="-Wl,-rpath,$CONDA_PREFIX/lib -L$CONDA_PREFIX/lib $LDFLAGS"
rm -rf build
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
Memory Issues:
If build consumes too much RAM, reduce parallel jobs:
CMAKE_BUILD_PARALLEL_LEVEL=4 make
Test Datasets:
TPC-H and ClickBench datasets must be generated before running tests. See test_datasets/ and run setup_test_datasets.sh (automatically run in pixi activation).
This is a DuckDB extension project using the extension template. The build system integrates with DuckDB's extension infrastructure via extension-ci-tools.
Key files for extension integration:
- Makefile: Thin wrapper including extension-ci-tools/makefiles/duckdb_extension.Makefile
- extension_config.cmake: Specifies which extensions to load (sirius, json, tpcds, tpch, parquet, icu)
- src/sirius_extension.cpp: Extension registration (LoadInternal function)
Extension API Usage:
CLI:
LOAD 'build/release/extension/sirius/sirius.duckdb_extension';
CALL gpu_buffer_init('1 GB', '2 GB');
-- Legacy mode:
CALL gpu_processing('SELECT ...');
-- New mode (preferred):
CALL gpu_execution('SELECT ...');
Python:
import duckdb
con = duckdb.connect('db.duckdb', config={"allow_unsigned_extensions": "true"})
con.execute("LOAD '/path/to/sirius.duckdb_extension'")
con.execute("CALL gpu_buffer_init('1 GB', '2 GB')")
# Legacy mode:
con.execute("CALL gpu_processing('SELECT ...')").fetchall()
# New mode (preferred):
con.execute("CALL gpu_execution('SELECT ...')").fetchall()
- Cold runs are slow: First query loads data from storage and converts DuckDB format to GPU format
- Warm runs benefit from GPU caching: Subsequent queries use cached GPU data
- Best for: Interactive analytics, financial workloads, ETL jobs, large aggregations/joins
- Benchmark: ~8x speedup on TPC-H SF=100 vs CPU at equivalent hardware cost
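To see the cold/warm difference on a given workload, it helps to time the first run separately from the repeats. The helper below is a minimal sketch: time_runs is a hypothetical name, and the stand-in workload only keeps the example runnable without a GPU. With a real connection, the callable would wrap a query such as con.execute("CALL gpu_execution('SELECT ...')").fetchall().

```python
import time
from typing import Callable, List


def time_runs(run_query: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run the same query several times and return wall-clock seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload; replace the lambda with a call into Sirius to measure a query.
timings = time_runs(lambda: sum(range(1000)), repeats=3)

cold, warm = timings[0], timings[1:]  # first run is the cold one
print(f"cold: {cold:.6f}s, warm (best of {len(warm)}): {min(warm):.6f}s")
```

Reporting the first run on its own avoids averaging the one-time load and format conversion into the steady-state number.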
Key terminology used throughout the codebase (see docs/glossary.md for complete definitions):
- Pipeline: Chain of operators executed together as a unit
- Data Batch: Input/output wrapper for pipeline execution
- Data Repository: Central storage for Data Batches with tier management
- GPU Scheduling Thread: Stream-associated thread that pulls tasks from queue
- Memory Reservation: Lease on memory to prevent oversubscription
- Task Creator: Thread that polls completions and creates new tasks
- Thread Coordinator: Main thread orchestrating Sirius execution
Sirius includes Claude Code skills for performance analysis and dataset management. Invoke them via slash commands:
| Skill | Command | Description |
|---|---|---|
| Profile Analyzer | /profile-analyzer | Analyzes GPU performance from nsys profiles: kernel occupancy, memory bandwidth, operator attribution, and regression detection. |
| Dataset Manager | /dataset-manager | Manages TPC-H parquet datasets: generate at any scale factor, consolidate files, inspect layout, optimize row groups. |
| Optimization Advisor | /optimization-advisor | Maps GPU hotspots from nsys profiles to source functions, detects efficiency bottlenecks, sync overhead, and parallelism opportunities. |
| TPC-DS Benchmark | /tpcds-benchmark | Runs TPC-DS benchmarks on Legacy Sirius, Super Sirius, or DuckDB CPU baseline: generate data, execute queries, and compare results. |
Useful debugging tools:
- tools/parse_pipeline_log.py: Parses Sirius pipeline logs to show per-operator row counts for debugging incorrect query results.