|
1 | 1 | # parcagpu - CUPTI Profiler with USDT Probes |
2 | 2 |
|
3 | | -CUDA profiling library that exposes kernel and graph execution events via USDT/DTRACE probes for eBPF/bpftrace monitoring. |
| 3 | +CUDA profiling library that exposes GPU activity via USDT/DTRACE probes for eBPF consumption. It captures kernel executions, PC samples with stall reasons, and cubin module load/unload events. |
4 | 4 |
|
5 | 5 | ## Building |
6 | 6 |
|
7 | 7 | ```bash |
8 | | -make # Build everything |
9 | | -make test # Build and run tests |
10 | | -make clean # Clean all build artifacts |
| 8 | +make local # Build libparcagpucupti.so locally (CMake, RelWithDebInfo) |
| 9 | +make debug # Build with full debug info and no optimizations |
| 10 | +make clean # Clean all build artifacts |
11 | 11 | ``` |
12 | 12 |
|
13 | | -### Components Built |
| 13 | +Docker cross-compilation: |
14 | 14 |
|
15 | | -1. **libparcagpucupti.so** (CMake + real CUPTI) |
16 | | - - Production library for CUDA injection |
17 | | - - Located in `cupti/build/` |
18 | | - - Links against real NVIDIA CUPTI |
19 | | - |
20 | | -2. **Test Infrastructure** (Zig) |
21 | | - - Mock CUPTI library for testing |
22 | | - - Test program that simulates CUDA events |
23 | | - - Located in `zig-out/` |
| 15 | +```bash |
| 16 | +make build-amd64 # Build .so for AMD64 |
| 17 | +make build-arm64 # Build .so for ARM64 |
| 18 | +make build-all # Both architectures |
| 19 | +make docker-push # Push multi-arch image to ghcr.io |
| 20 | +``` |
24 | 21 |
|
25 | 22 | ## Usage |
26 | 23 |
|
27 | 24 | ### As CUDA Injection Library |
28 | 25 |
|
29 | 26 | ```bash |
30 | 27 | export CUDA_INJECTION64_PATH=/path/to/libparcagpucupti.so |
31 | | -# Run your CUDA application |
32 | 28 | ./my_cuda_app |
33 | 29 | ``` |
34 | 30 |
|
| 31 | +### Environment Variables |
| 32 | + |
| 33 | +| Variable | Default | Description | |
| 34 | +|---|---|---| |
| 35 | +| `PARCAGPU_DEBUG` | off | Enable debug logging | |
| 36 | +| `PARCAGPU_RATE_LIMIT` | 100 | Token-bucket rate limit for callback probes (events/sec per thread) | |
| 37 | +| `PARCAGPU_SAMPLING_FACTOR` | 18 | PC sampling period; set to 0 to disable PC sampling | |
| 38 | +| `PARCAGPU_PC_SAMPLING_PROBABILITY` | 0.01 | Probability of sampling in each interval window (0-1) | |
| 39 | +| `PARCAGPU_PC_SAMPLING_INTERVAL` | 1.0 | PC sampling interval window in seconds | |
| 40 | + |
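The token-bucket limit described for `PARCAGPU_RATE_LIMIT` can be illustrated with a minimal sketch. This is not the parcagpu implementation: the `token_bucket` type, the `bucket_init` and `bucket_allow` helpers, and the choice of burst capacity equal to the rate are all assumptions made for illustration. The idea is that each thread holds a bucket that refills at the configured rate, and a callback probe fires only when a whole token is available.

```c
#include <assert.h>

/* Hypothetical per-thread token bucket; names and layout are illustrative
 * and not taken from the parcagpu sources. */
typedef struct {
    double rate;      /* tokens added per second, e.g. PARCAGPU_RATE_LIMIT */
    double capacity;  /* maximum burst size (assumed equal to rate here) */
    double tokens;    /* tokens currently available */
    double last;      /* time of the last refill, in seconds */
} token_bucket;

/* Start with a full bucket so an initial burst of `rate` events passes. */
static void bucket_init(token_bucket *b, double rate, double now) {
    assert(rate > 0.0);
    b->rate = rate;
    b->capacity = rate;
    b->tokens = rate;
    b->last = now;
}

/* Returns 1 if an event may be emitted at time `now`, 0 if it should be
 * dropped. Refills in proportion to the elapsed time, capped at capacity. */
static int bucket_allow(token_bucket *b, double now) {
    double elapsed = now - b->last;
    if (elapsed > 0.0) {
        b->tokens += elapsed * b->rate;
        if (b->tokens > b->capacity) {
            b->tokens = b->capacity;
        }
        b->last = now;
    }
    if (b->tokens >= 1.0) {
        b->tokens -= 1.0;
        return 1;
    }
    return 0;
}
```

With a rate of 100, a burst of 250 calls at the same instant lets 100 events through and drops the rest; half a second later the bucket has refilled enough for 50 more. In a real profiler the `now` argument would come from a monotonic clock and the bucket would live in thread-local storage, both of which are left out of this sketch.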
35 | 41 | ### Monitoring with bpftrace |
36 | 42 |
|
37 | 43 | ```bash |
38 | | -# Terminal 1: Monitor probes |
39 | 44 | sudo bpftrace parcagpu.bt |
| 45 | +``` |
40 | 46 |
|
41 | | -# Terminal 2: Run CUDA application |
42 | | -./my_cuda_app |
| 47 | +### Monitoring with the BPF Activity Parser |
| 48 | + |
| 49 | +```bash |
| 50 | +make bpf-test |
| 51 | +sudo test/bpf/activity_parser -pid <PID> -lib <path/to/libparcagpucupti.so> -v |
43 | 52 | ``` |
44 | 53 |
|
| 54 | +The activity parser attaches to all USDT probes via eBPF, captures events through a ring buffer, and resolves PC samples to source lines using `llvm-dwarfdump`. |
| 55 | + |
45 | 56 | ## Testing |
46 | 57 |
|
47 | 58 | ```bash |
48 | | -# Run test suite with bpftrace monitoring |
49 | | -make test |
| 59 | +make test # Basic mock CUPTI test (no GPU, no BPF) |
| 60 | +make test-pc-mock # Mock PC sampling with BPF activity parser (no GPU, requires root) |
| 61 | +make test-pc-real # Real PC sampling with GPU (requires root + GPU) |
| 62 | +make test-multi # Run test_cupti_prof and the BPF activity parser in parallel (requires root) |
| 63 | +``` |
50 | 64 |
|
51 | | -# Run test continuously (for extended monitoring) |
52 | | -LD_LIBRARY_PATH=zig-out/lib zig-out/bin/test_cupti_prof cupti/build/libparcagpucupti.so --forever |
| 65 | +### BPF Prerequisites |
| 66 | + |
| 67 | +The BPF-based tests (`test-pc-mock`, `test-pc-real`, `test-multi`) require: |
| 68 | + |
| 69 | +- Root privileges (sudo) to load BPF programs |
| 70 | +- clang, libbpf-dev, bpftool |
| 71 | +- Go 1.21+ |
| 72 | + |
| 73 | +Build just the BPF activity parser: |
| 74 | + |
| 75 | +```bash |
| 76 | +make generate # Compile BPF objects via bpf2go |
| 77 | +make bpf-test # generate + build the Go binary |
53 | 78 | ``` |
54 | 79 |
|
55 | | -See [README_TEST.md](README_TEST.md) for detailed testing documentation. |
| 80 | +### Microbenchmarks |
56 | 81 |
|
57 | | -## USDT Probes |
| 82 | +CUDA microbenchmarks for testing on real GPU hardware: |
58 | 83 |
|
59 | | -The library exposes two USDT probes: |
| 84 | +```bash |
| 85 | +make microbenchmarks # Build all .cu files in microbenchmarks/ |
| 86 | +make test-pc-real # Run pc_sample_toy under parcagpu with BPF |
| 87 | +``` |
| 88 | + |
| 89 | +## USDT Probes |
60 | 90 |
|
61 | | -### parcagpu:kernel_executed |
62 | | -- **arg0**: start timestamp (ns) |
63 | | -- **arg1**: end timestamp (ns) |
64 | | -- **arg2**: correlationId | (deviceId << 32) |
65 | | -- **arg3**: streamId |
66 | | -- **arg4**: kernel name (string pointer) |
| 91 | +Defined in `src/probes.d`, provider `parcagpu`: |
67 | 92 |
|
68 | | -### parcagpu:graph_executed |
69 | | -- **arg0**: start timestamp (ns) |
70 | | -- **arg1**: end timestamp (ns) |
71 | | -- **arg2**: correlationId | (deviceId << 32) |
72 | | -- **arg3**: streamId |
73 | | -- **arg4**: graphId |
| 93 | +| Probe | Arguments | Description | |
| 94 | +|---|---|---| |
| 95 | +| `cuda_correlation` | correlationId, cbid, name | API callback correlation | |
| 96 | +| `kernel_executed` | start, end, correlationId, deviceId, streamId, graphId, graphNodeId, name | Kernel execution timing | |
| 97 | +| `activity_batch` | ptrs, count | Batch of CUPTI activity records | |
| 98 | +| `pc_sample_batch` | records, count | Batch of PC sampling records | |
| 99 | +| `stall_reason_map` | names, count | Stall reason name table | |
| 100 | +| `cubin_loaded` | cubinCrc, cubin, cubinSize | Module load event | |
| 101 | +| `cubin_unloaded` | cubinCrc | Module unload event | |
| 102 | +| `error` | code, message, component | Profiler error event | |
74 | 103 |
|
75 | 104 | ## Requirements |
76 | 105 |
|
77 | | -- CUDA Toolkit (CUPTI libraries) |
78 | | -- Zig (for building test infrastructure) |
79 | | -- CMake (for building production library) |
| 106 | +- CUDA Toolkit (CUPTI headers/libraries) |
| 107 | +- CMake |
| 108 | +- dtrace (systemtap-sdt-dev) |
80 | 109 | - bpftrace (for probe monitoring) |
| 110 | +- clang, libbpf-dev, bpftool, Go 1.21+ (for BPF tests) |
81 | 111 |
|
82 | 112 | ## Directory Structure |
83 | 113 |
|
84 | 114 | ``` |
85 | 115 | . |
86 | 116 | ├── Makefile # Top-level build orchestration |
87 | | -├── build.zig # Zig build for test infrastructure |
88 | | -├── cupti/ |
89 | | -│ ├── CMakeLists.txt # CMake build for production library |
90 | | -│ ├── cupti-prof.c # Main profiler implementation |
91 | | -│ └── build/ # CMake build output |
| 117 | +├── CMakeLists.txt # CMake build for library and test infrastructure |
| 118 | +├── src/ |
| 119 | +│ ├── cupti.cpp # Main CUPTI profiler implementation |
| 120 | +│ ├── pc_sampling.cpp # PC sampling support |
| 121 | +│ ├── probes.d # USDT probe definitions |
| 122 | +│ └── ... |
| 123 | +├── ebpf/ |
| 124 | +│ └── cupti_bpf.h # Shared BPF struct definitions |
92 | 125 | ├── test/ |
93 | | -│ ├── mock_cupti.c # Mock CUPTI for testing |
94 | | -│ └── test_cupti_prof.c # Test program |
95 | | -├── parcagpu.bt # bpftrace monitoring script |
96 | | -└── test.sh # Test runner |
| 126 | +│ ├── test_cupti_prof.c # Mock CUPTI test harness |
| 127 | +│ ├── mock_cupti.c # Mock CUPTI library |
| 128 | +│ ├── mock_cuda.c # Mock CUDA driver library |
| 129 | +│ ├── test-pc-mock.sh # Mock PC sampling end-to-end test |
| 130 | +│ ├── test-pc-real.sh # Real GPU PC sampling end-to-end test |
| 131 | +│ └── bpf/ # BPF activity parser (Go + eBPF) |
| 132 | +├── microbenchmarks/ # CUDA microbenchmarks (.cu) |
| 133 | +└── parcagpu.bt # bpftrace monitoring script |
97 | 134 | ``` |