diff --git a/Makefile b/Makefile
index b1f1110..3317417 100644
--- a/Makefile
+++ b/Makefile
@@ -71,7 +71,7 @@ test: local
 # Clean build artifacts
 clean:
 	@echo "=== Cleaning build artifacts ==="
-	@rm -rf build build-local bin lib zig-out .zig-cache
+	@rm -rf build build-local bin lib
 	@rm -rf CMakeCache.txt CMakeFiles/ cmake_install.cmake compile_commands.json
 	@echo "Clean complete"
 
diff --git a/README.md b/README.md
index d2898a6..d4891c3 100644
--- a/README.md
+++ b/README.md
@@ -1,26 +1,23 @@
 # parcagpu - CUPTI Profiler with USDT Probes
 
-CUDA profiling library that exposes kernel and graph execution events via USDT/DTRACE probes for eBPF/bpftrace monitoring.
+CUDA profiling library that exposes GPU activity via USDT/DTRACE probes for eBPF consumption. Captures kernel executions, PC sampling with stall reasons, and cubin module loading.
 
 ## Building
 
 ```bash
-make          # Build everything
-make test     # Build and run tests
-make clean    # Clean all build artifacts
+make local    # Build libparcagpucupti.so locally (CMake, RelWithDebInfo)
+make debug    # Build with full debug, no optimizations
+make clean    # Clean all build artifacts
 ```
 
-### Components Built
+Docker cross-compilation:
 
-1. **libparcagpucupti.so** (CMake + real CUPTI)
-   - Production library for CUDA injection
-   - Located in `cupti/build/`
-   - Links against real NVIDIA CUPTI
-
-2. **Test Infrastructure** (Zig)
-   - Mock CUPTI library for testing
-   - Test program that simulates CUDA events
-   - Located in `zig-out/`
+```bash
+make build-amd64   # Build .so for AMD64
+make build-arm64   # Build .so for ARM64
+make build-all     # Both architectures
+make docker-push   # Push multi-arch image to ghcr.io
+```
 
 ## Usage
 
@@ -28,70 +25,110 @@ make clean    # Clean all build artifacts
 
 ```bash
 export CUDA_INJECTION64_PATH=/path/to/libparcagpucupti.so
-# Run your CUDA application
 ./my_cuda_app
 ```
 
+### Environment Variables
+
+| Variable | Default | Description |
+|---|---|---|
+| `PARCAGPU_DEBUG` | off | Enable debug logging |
+| `PARCAGPU_RATE_LIMIT` | 100 | Token-bucket rate limit for callback probes (events/sec per thread) |
+| `PARCAGPU_SAMPLING_FACTOR` | 18 | PC sampling period; set to 0 to disable PC sampling |
+| `PARCAGPU_PC_SAMPLING_PROBABILITY` | 0.01 | Probability of sampling in each interval window (0-1) |
+| `PARCAGPU_PC_SAMPLING_INTERVAL` | 1.0 | PC sampling interval window in seconds |
+
 ### Monitoring with bpftrace
 
 ```bash
-# Terminal 1: Monitor probes
 sudo bpftrace parcagpu.bt
+```
 
-# Terminal 2: Run CUDA application
-./my_cuda_app
+### Monitoring with the BPF Activity Parser
+
+```bash
+make bpf-test
+sudo test/bpf/activity_parser -pid -lib -v
 ```
 
+The activity parser attaches to all USDT probes via eBPF, captures events through a ring buffer, and resolves PC samples to source lines using `llvm-dwarfdump`.
+
 ## Testing
 
 ```bash
-# Run test suite with bpftrace monitoring
-make test
+make test            # Basic mock CUPTI test (no GPU, no BPF)
+make test-pc-mock    # Mock PC sampling with BPF activity parser (no GPU, requires root)
+make test-pc-real    # Real PC sampling with GPU (requires root + GPU)
+make test-multi      # test_cupti_prof + BPF activity parser in parallel (requires root)
+```
 
-# Run test continuously (for extended monitoring)
-LD_LIBRARY_PATH=zig-out/lib zig-out/bin/test_cupti_prof cupti/build/libparcagpucupti.so --forever
+### BPF Prerequisites
+
+The BPF-based tests (`test-pc-mock`, `test-pc-real`, `test-multi`) require:
+
+- Root (sudo) for BPF
+- clang, libbpf-dev, bpftool
+- Go 1.21+
+
+Build just the BPF activity parser:
+
+```bash
+make generate    # Compile BPF objects via bpf2go
+make bpf-test    # generate + build the Go binary
 ```
 
-See [README_TEST.md](README_TEST.md) for detailed testing documentation.
+### Microbenchmarks
 
-## USDT Probes
+CUDA microbenchmarks for testing with real hardware:
 
-The library exposes two USDT probes:
+```bash
+make microbenchmarks    # Build all .cu files in microbenchmarks/
+make test-pc-real       # Run pc_sample_toy under parcagpu with BPF
+```
+
+## USDT Probes
 
-### parcagpu:kernel_executed
-- **arg0**: start timestamp (ns)
-- **arg1**: end timestamp (ns)
-- **arg2**: correlationId | (deviceId << 32)
-- **arg3**: streamId
-- **arg4**: kernel name (string pointer)
+Defined in `src/probes.d`, provider `parcagpu`:
 
-### parcagpu:graph_executed
-- **arg0**: start timestamp (ns)
-- **arg1**: end timestamp (ns)
-- **arg2**: correlationId | (deviceId << 32)
-- **arg3**: streamId
-- **arg4**: graphId
+| Probe | Arguments | Description |
+|---|---|---|
+| `cuda_correlation` | correlationId, cbid, name | API callback correlation |
+| `kernel_executed` | start, end, correlationId, deviceId, streamId, graphId, graphNodeId, name | Kernel execution timing |
+| `activity_batch` | ptrs, count | Batch of CUPTI activity records |
+| `pc_sample_batch` | records, count | Batch of PC sampling records |
+| `stall_reason_map` | names, count | Stall reason name table |
+| `cubin_loaded` | cubinCrc, cubin, cubinSize | Module load event |
+| `cubin_unloaded` | cubinCrc | Module unload event |
+| `error` | code, message, component | Profiler error event |
 
 ## Requirements
 
-- CUDA Toolkit (CUPTI libraries)
-- Zig (for building test infrastructure)
-- CMake (for building production library)
+- CUDA Toolkit (CUPTI headers/libraries)
+- CMake
+- dtrace (systemtap-sdt-dev)
 - bpftrace (for probe monitoring)
+- clang, libbpf-dev, bpftool, Go 1.21+ (for BPF tests)
 
 ## Directory Structure
 
 ```
 .
 ├── Makefile                 # Top-level build orchestration
-├── build.zig                # Zig build for test infrastructure
-├── cupti/
-│   ├── CMakeLists.txt       # CMake build for production library
-│   ├── cupti-prof.c         # Main profiler implementation
-│   └── build/               # CMake build output
+├── CMakeLists.txt           # CMake build for library and test infrastructure
+├── src/
+│   ├── cupti.cpp            # Main CUPTI profiler implementation
+│   ├── pc_sampling.cpp      # PC sampling support
+│   ├── probes.d             # USDT probe definitions
+│   └── ...
+├── ebpf/
+│   └── cupti_bpf.h          # Shared BPF struct definitions
 ├── test/
-│   ├── mock_cupti.c         # Mock CUPTI for testing
-│   └── test_cupti_prof.c    # Test program
-├── parcagpu.bt              # bpftrace monitoring script
-└── test.sh                  # Test runner
+│   ├── test_cupti_prof.c    # Mock CUPTI test harness
+│   ├── mock_cupti.c         # Mock CUPTI library
+│   ├── mock_cuda.c          # Mock CUDA driver library
+│   ├── test-pc-mock.sh      # Mock PC sampling end-to-end test
+│   ├── test-pc-real.sh      # Real GPU PC sampling end-to-end test
+│   └── bpf/                 # BPF activity parser (Go + eBPF)
+├── microbenchmarks/         # CUDA microbenchmarks (.cu)
+└── parcagpu.bt              # bpftrace monitoring script
 ```
diff --git a/README_TEST.md b/README_TEST.md
deleted file mode 100644
index f797556..0000000
--- a/README_TEST.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Testing libparcagpucupti.so
-
-This project includes comprehensive test infrastructure for the CUPTI profiler library.
-
-## Building
-
-```bash
-# Build everything (libparcagpucupti.so + test infrastructure)
-make
-
-# Or build and test in one step
-make test
-```
-
-This builds:
-- `build/lib/libparcagpucupti.so` - Production library (CMake)
-- `build/lib/libcupti.so` - Mock CUPTI for test infrastructure
-- `build/bin/test_cupti_prof` - Test program
-
-## Quick Start
-
-```bash
-# Build and run the test (generates ~500 events)
-make test
-
-# Or run test script directly (builds automatically)
-./test.sh
-```
-
-This will:
-1. Build the mock CUPTI library and test program
-2. Start bpftrace to monitor DTRACE probes
-3. Run the test with debug output enabled
-4. Generate ~500 CUDA events (kernels and graph launches) at 1000 events/second
-5. Show detailed output with timestamps for all CUPTI operations
-6. Display captured DTRACE probe results
-
-## Running Continuously
-
-For extended testing or continuous probe monitoring:
-```bash
-LD_LIBRARY_PATH=build/lib build/bin/test_cupti_prof build/lib/libparcagpucupti.so --forever
-```
-
-This runs indefinitely at 1000 events/second until interrupted (Ctrl-C).
-
-## Verifying DTRACE Probes
-
-To verify that USDT probes are firing with correct values:
-
-**Terminal 1** - Monitor with bpftrace:
-```bash
-sudo bpftrace parcagpu.bt
-```
-
-**Terminal 2** - Run the test:
-```bash
-./test.sh
-```
-
-Expected probe output:
-```
-[PID] Kernel executed:
-  start=1006000000, end=1006500000, duration=500000 ns
-  correlationId=6, deviceId=0, streamId=1
-  name=mock_cuda_kernel_name
-
-[PID] Graph executed:
-  start=1007000000, end=1007300000, duration=300000 ns
-  correlationId=7, deviceId=0, streamId=1, graphId=3
-
-=== Summary ===
-Graph executions: @graph_launches: count 117
-Kernel executions: @kernel_executions: count 117
-```
-
-## Test Details
-
-See [test/README.md](test/README.md) for complete documentation including:
-- Test architecture and components
-- Build system details
-- Manual test execution
-- Implementation notes
diff --git a/test/README.md b/test/README.md
deleted file mode 100644
index ff74c58..0000000
--- a/test/README.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# CUPTI Profiler Test Infrastructure
-
-This directory contains test infrastructure for `libparcagpucupti.so` using CMake as the build system.
-
-## Components
-
-- **test/mock_cupti.c**: Mock CUPTI library that provides stub implementations of all CUPTI APIs used by the profiler
-- **test/test_cupti_prof.c**: Test program that dynamically loads libparcagpucupti.so and simulates CUPTI callbacks
-- **CMakeLists.txt**: CMake build configuration (at project root)
-- **test.sh**: Test script (at project root)
-
-## Building
-
-From the project root:
-```bash
-make
-```
-
-This builds:
-1. `libcupti.so` - Mock CUPTI library with stub implementations
-2. `libparcagpucupti.so` - The profiler library linked against the mock CUPTI
-3. `test_cupti_prof` - Test executable that loads and exercises the profiler
-
-All outputs go to `build/lib/` and `build/bin/`.
-
-## Running
-
-Using the test script (recommended):
-```bash
-./test.sh
-```
-
-Using Make directly:
-```bash
-make test
-```
-
-Or manually:
-```bash
-make
-LD_LIBRARY_PATH=build/lib build/bin/test_cupti_prof build/lib/libparcagpucupti.so
-```
-
-### Running Continuously
-
-To run the test in continuous mode (useful for monitoring probes with bpftrace):
-```bash
-LD_LIBRARY_PATH=build/lib build/bin/test_cupti_prof build/lib/libparcagpucupti.so --forever
-```
-
-In this mode, the test will:
-- Generate events indefinitely at 1000 events/second
-- Print status every 100 iterations (~500 events)
-- Run until interrupted with Ctrl-C
-
-## Test Behavior
-
-The test program:
-1. Dynamically loads `libparcagpucupti.so`
-2. Calls `InitializeInjection()` to initialize the profiler
-3. Simulates ~1000 CUDA events per second by calling `runtimeApiCallback()` in a loop
-4. Alternates between kernel launches and graph launches
-5. Periodically calls `bufferCompleted()` with mock activity records containing:
-   - Kernel execution records (CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
-   - Graph execution records (CUPTI_ACTIVITY_KIND_GRAPH_TRACE)
-6. Each activity record triggers the DTRACE probes:
-   - `parcagpu:cuda_correlation` - Fired on API callback with correlationId
-   - `parcagpu:kernel_executed` - Fired with kernel timing and metadata
-   - `parcagpu:graph_executed` - Fired with graph timing and metadata
-
-## Rate Limiting
-
-The test generates exactly 1000 events per second (5 events every 5ms) to match the requirement. The loop runs for 100 iterations, generating ~500 total events.
-
-## Debugging
-
-The test script automatically enables `PARCAGPU_DEBUG=1` to show detailed debug output including:
-- CUPTI initialization steps
-- Activity buffer management with timestamps
-- All callback invocations
-- Cleanup operations
-
-To run without debug output:
-```bash
-LD_LIBRARY_PATH=build/lib build/bin/test_cupti_prof build/lib/libparcagpucupti.so
-```
-
-## Verifying DTRACE Probes
-
-To verify that the DTRACE/USDT probes are firing correctly, use the provided bpftrace script:
-
-**Terminal 1** - Run bpftrace to monitor probes:
-```bash
-sudo bpftrace parcagpu.bt
-```
-
-**Terminal 2** - Run the test:
-```bash
-./test.sh
-```
-
-You should see output like:
-```
-[PID] Kernel executed:
-  start=1006000000, end=1006500000, duration=500000 ns
-  correlationId=6, deviceId=0, streamId=1
-  name=mock_cuda_kernel_name
-
-[PID] Graph executed:
-  start=1007000000, end=1007300000, duration=300000 ns
-  correlationId=7, deviceId=0, streamId=1, graphId=3
-```
-
-The summary at the end will show total counts for kernel and graph executions.
-
-## Implementation Notes
-
-- The mock CUPTI library stores callback function pointers in global variables that are exported to the test program
-- The test program retrieves these callbacks after `InitializeInjection()` is called
-- Activity records are properly formatted and parsed by `cuptiActivityGetNextRecord`
-- The DTRACE probes (`parcagpu:kernel_executed` and `parcagpu:graph_executed`) fire with correct values:
-  - **deviceId**: 0
-  - **streamId**: 1
-  - **start/end**: Reasonable timestamp values with 500μs duration for kernels, 300μs for graphs
-  - **name**: "mock_cuda_kernel_name" for kernel events
-- The test generates approximately 500 events over the course of execution at 1000 events/second
-- **Cleanup is idempotent**: The `cleanup()` function in cupti-prof.c can safely be called multiple times
-- The test explicitly calls `cleanup()` before `dlclose()` and uses `_exit()` to avoid the atexit handler being called after the library is unloaded
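
Note on the `PARCAGPU_RATE_LIMIT` row added to README.md: the README describes it as a token-bucket rate limit in events/sec per thread but does not spell out the semantics. The sketch below illustrates the general token-bucket technique only; it is not taken from parcagpu's source. The class name `TokenBucket`, the helper `FakeClock`, and the choice of a burst capacity equal to one second of tokens are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Generic token bucket (illustrative; not parcagpu's implementation)."""

    def __init__(self, rate_per_sec, clock=time.monotonic):
        self.rate = float(rate_per_sec)      # tokens refilled per second
        self.capacity = float(rate_per_sec)  # ASSUMPTION: burst size = 1 second of tokens
        self.tokens = self.capacity          # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True and consume one token if an event may fire now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class FakeClock:
    """Hypothetical manual clock so the demo is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def demo():
    clock = FakeClock()
    bucket = TokenBucket(100, clock=clock)  # 100 mirrors the README default
    burst = sum(bucket.allow() for _ in range(150))   # attempts with no time passing
    clock.now += 0.5                                  # half a second elapses
    refill = sum(bucket.allow() for _ in range(150))  # attempts after the refill
    return burst, refill


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, a burst of 150 attempts passes only the first 100, and half a second later 50 more are admitted. A per-thread limit, as the README wording suggests, would keep one such bucket per thread; how parcagpu actually stores that state is not shown in this diff.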