Merged
2 changes: 1 addition & 1 deletion Makefile
@@ -71,7 +71,7 @@ test: local
# Clean build artifacts
clean:
@echo "=== Cleaning build artifacts ==="
@rm -rf build build-local bin lib zig-out .zig-cache
@rm -rf build build-local bin lib
@rm -rf CMakeCache.txt CMakeFiles/ cmake_install.cmake compile_commands.json
@echo "Clean complete"

135 changes: 86 additions & 49 deletions README.md
@@ -1,97 +1,134 @@
# parcagpu - CUPTI Profiler with USDT Probes

CUDA profiling library that exposes kernel and graph execution events via USDT/DTRACE probes for eBPF/bpftrace monitoring.
CUDA profiling library that exposes GPU activity via USDT/DTRACE probes for eBPF consumption. Captures kernel executions, PC sampling with stall reasons, and cubin module loading.

## Building

```bash
make # Build everything
make test # Build and run tests
make clean # Clean all build artifacts
make local # Build libparcagpucupti.so locally (CMake, RelWithDebInfo)
make debug # Build with full debug, no optimizations
make clean # Clean all build artifacts
```

### Components Built
Docker cross-compilation:

1. **libparcagpucupti.so** (CMake + real CUPTI)
- Production library for CUDA injection
- Located in `cupti/build/`
- Links against real NVIDIA CUPTI

2. **Test Infrastructure** (Zig)
- Mock CUPTI library for testing
- Test program that simulates CUDA events
- Located in `zig-out/`
```bash
make build-amd64 # Build .so for AMD64
make build-arm64 # Build .so for ARM64
make build-all # Both architectures
make docker-push # Push multi-arch image to ghcr.io
```

## Usage

### As CUDA Injection Library

```bash
export CUDA_INJECTION64_PATH=/path/to/libparcagpucupti.so
# Run your CUDA application
./my_cuda_app
```

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PARCAGPU_DEBUG` | off | Enable debug logging |
| `PARCAGPU_RATE_LIMIT` | 100 | Token-bucket rate limit for callback probes (events/sec per thread) |
| `PARCAGPU_SAMPLING_FACTOR` | 18 | PC sampling period; set to 0 to disable PC sampling |
| `PARCAGPU_PC_SAMPLING_PROBABILITY` | 0.01 | Probability of sampling in each interval window (0-1) |
| `PARCAGPU_PC_SAMPLING_INTERVAL` | 1.0 | PC sampling interval window in seconds |
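
To make the `PARCAGPU_RATE_LIMIT` description concrete, here is a minimal sketch of a token-bucket limiter in C. It is not taken from the parcagpu sources: the `token_bucket` struct, its field names, and both functions are hypothetical, and the burst capacity equal to the per-second rate is an assumption made for illustration. Per-thread use would need one bucket per thread (for example, thread-local storage).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket; names are illustrative, not parcagpu's. */
typedef struct {
    double tokens;       /* tokens currently available */
    double capacity;     /* maximum burst size (assumed equal to the rate) */
    double rate_per_sec; /* refill rate, e.g. the PARCAGPU_RATE_LIMIT value */
    uint64_t last_ns;    /* timestamp of the last refill, in nanoseconds */
} token_bucket;

/* Start with a full bucket so an initial burst is allowed. */
static void token_bucket_init(token_bucket *tb, double rate_per_sec,
                              uint64_t now_ns) {
    tb->tokens = rate_per_sec;
    tb->capacity = rate_per_sec;
    tb->rate_per_sec = rate_per_sec;
    tb->last_ns = now_ns;
}

/* Returns true and consumes one token if an event may fire at now_ns. */
static bool token_bucket_allow(token_bucket *tb, uint64_t now_ns) {
    if (now_ns > tb->last_ns) {
        /* Refill in proportion to elapsed time, capped at capacity. */
        double elapsed_sec = (double)(now_ns - tb->last_ns) / 1e9;
        tb->tokens += elapsed_sec * tb->rate_per_sec;
        if (tb->tokens > tb->capacity) {
            tb->tokens = tb->capacity;
        }
        tb->last_ns = now_ns;
    }
    if (tb->tokens >= 1.0) {
        tb->tokens -= 1.0;
        return true;
    }
    return false; /* bucket empty: drop or skip this event */
}
```

With a rate of 100, a thread that emits events in a tight burst would get at most 100 through before further events are dropped until the bucket refills.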

### Monitoring with bpftrace

```bash
# Terminal 1: Monitor probes
sudo bpftrace parcagpu.bt
```

# Terminal 2: Run CUDA application
./my_cuda_app
### Monitoring with the BPF Activity Parser

```bash
make bpf-test
sudo test/bpf/activity_parser -pid <PID> -lib <path/to/libparcagpucupti.so> -v
```

The activity parser attaches to all USDT probes via eBPF, captures events through a ring buffer, and resolves PC samples to source lines using `llvm-dwarfdump`.

## Testing

```bash
# Run test suite with bpftrace monitoring
make test
make test # Basic mock CUPTI test (no GPU, no BPF)
make test-pc-mock # Mock PC sampling with BPF activity parser (no GPU, requires root)
make test-pc-real # Real PC sampling with GPU (requires root + GPU)
make test-multi # test_cupti_prof + BPF activity parser in parallel (requires root)
```

# Run test continuously (for extended monitoring)
LD_LIBRARY_PATH=zig-out/lib zig-out/bin/test_cupti_prof cupti/build/libparcagpucupti.so --forever
### BPF Prerequisites

The BPF-based tests (`test-pc-mock`, `test-pc-real`, `test-multi`) require:

- Root (sudo) for BPF
- clang, libbpf-dev, bpftool
- Go 1.21+

Build just the BPF activity parser:

```bash
make generate # Compile BPF objects via bpf2go
make bpf-test # generate + build the Go binary
```

See [README_TEST.md](README_TEST.md) for detailed testing documentation.
### Microbenchmarks

## USDT Probes
CUDA microbenchmarks for testing with real hardware:

The library exposes two USDT probes:
```bash
make microbenchmarks # Build all .cu files in microbenchmarks/
make test-pc-real # Run pc_sample_toy under parcagpu with BPF
```

## USDT Probes

### parcagpu:kernel_executed
- **arg0**: start timestamp (ns)
- **arg1**: end timestamp (ns)
- **arg2**: correlationId | (deviceId << 32)
- **arg3**: streamId
- **arg4**: kernel name (string pointer)
Defined in `src/probes.d`, provider `parcagpu`:

### parcagpu:graph_executed
- **arg0**: start timestamp (ns)
- **arg1**: end timestamp (ns)
- **arg2**: correlationId | (deviceId << 32)
- **arg3**: streamId
- **arg4**: graphId
| Probe | Arguments | Description |
|---|---|---|
| `cuda_correlation` | correlationId, cbid, name | API callback correlation |
| `kernel_executed` | start, end, correlationId, deviceId, streamId, graphId, graphNodeId, name | Kernel execution timing |
| `activity_batch` | ptrs, count | Batch of CUPTI activity records |
| `pc_sample_batch` | records, count | Batch of PC sampling records |
| `stall_reason_map` | names, count | Stall reason name table |
| `cubin_loaded` | cubinCrc, cubin, cubinSize | Module load event |
| `cubin_unloaded` | cubinCrc | Module unload event |
| `error` | code, message, component | Profiler error event |
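
For consumers migrating from the previous probe layout, where `arg2` carried `correlationId | (deviceId << 32)`, the following sketch shows how that packed value splits into the two fields that `kernel_executed` now passes separately. The helper names are hypothetical, and the sketch assumes both IDs fit in 32 bits, as the old packing implies.

```c
#include <stdint.h>

/* Hypothetical helpers for the old packed arg2 layout:
 * low 32 bits = correlationId, high 32 bits = deviceId. */
static inline uint64_t pack_ids(uint32_t correlation_id, uint32_t device_id) {
    return (uint64_t)correlation_id | ((uint64_t)device_id << 32);
}

/* Extract the low 32 bits (correlationId). */
static inline uint32_t unpack_correlation_id(uint64_t arg2) {
    return (uint32_t)(arg2 & 0xFFFFFFFFu);
}

/* Extract the high 32 bits (deviceId). */
static inline uint32_t unpack_device_id(uint64_t arg2) {
    return (uint32_t)(arg2 >> 32);
}
```

The cast to `uint64_t` before the shift matters: shifting a 32-bit value by 32 is undefined behavior in C.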

## Requirements

- CUDA Toolkit (CUPTI libraries)
- Zig (for building test infrastructure)
- CMake (for building production library)
- CUDA Toolkit (CUPTI headers/libraries)
- CMake
- dtrace (systemtap-sdt-dev)
- bpftrace (for probe monitoring)
- clang, libbpf-dev, bpftool, Go 1.21+ (for BPF tests)

## Directory Structure

```
.
├── Makefile # Top-level build orchestration
├── build.zig # Zig build for test infrastructure
├── cupti/
│ ├── CMakeLists.txt # CMake build for production library
│ ├── cupti-prof.c # Main profiler implementation
│ └── build/ # CMake build output
├── CMakeLists.txt # CMake build for library and test infrastructure
├── src/
│ ├── cupti.cpp # Main CUPTI profiler implementation
│ ├── pc_sampling.cpp # PC sampling support
│ ├── probes.d # USDT probe definitions
│ └── ...
├── ebpf/
│ └── cupti_bpf.h # Shared BPF struct definitions
├── test/
│ ├── mock_cupti.c # Mock CUPTI for testing
│ └── test_cupti_prof.c # Test program
├── parcagpu.bt # bpftrace monitoring script
└── test.sh # Test runner
│ ├── test_cupti_prof.c # Mock CUPTI test harness
│ ├── mock_cupti.c # Mock CUPTI library
│ ├── mock_cuda.c # Mock CUDA driver library
│ ├── test-pc-mock.sh # Mock PC sampling end-to-end test
│ ├── test-pc-real.sh # Real GPU PC sampling end-to-end test
│ └── bpf/ # BPF activity parser (Go + eBPF)
├── microbenchmarks/ # CUDA microbenchmarks (.cu)
└── parcagpu.bt # bpftrace monitoring script
```
83 changes: 0 additions & 83 deletions README_TEST.md

This file was deleted.
