# .ci — CI Images and Pipeline

```
.ci/
├── config.yaml        # Unified config (images, jobs, agent definitions)
├── utils.py           # Shared utilities (load_config, normalize_config, get_git_commit)
├── agent.py           # Runner Agent (scheduler, webhooks, remote dispatch)
├── build.py           # Image builder
├── run.py             # CI pipeline runner (Docker layer)
├── ci_resource.py     # GPU/memory detection and allocation
├── github_status.py   # GitHub Commit Status reporting
├── images/
│   ├── nvidia/Dockerfile
│   ├── iluvatar/Dockerfile
│   ├── metax/Dockerfile
│   ├── moore/Dockerfile
│   ├── cambricon/Dockerfile
│   └── ascend/Dockerfile
└── tests/             # Unit tests
    ├── conftest.py
    ├── test_agent.py
    ├── test_build.py
    ├── test_run.py
    ├── test_resource.py
    ├── test_github_status.py
    └── test_utils.py
```

**Prerequisites**: Docker, Python 3.10+, `pip install pyyaml`

---

## Configuration `config.yaml`

The config uses a **platform-centric** top-level structure: each platform defines its image, platform-level defaults, and a job list. At load time, jobs are flattened into `{platform}_{job}` names (e.g., `nvidia_gpu`).

```yaml
repo:
  url: https://github.com/InfiniTensor/InfiniOps.git
  branch: master

github:
  status_context_prefix: "ci/infiniops"

agents:                 # Remote agent URLs (used by the CLI for cross-machine dispatch)
  nvidia:
    url: http://nvidia-host:8080
  iluvatar:
    url: http://iluvatar-host:8080

platforms:
  nvidia:
    image:              # Image definition
      dockerfile: .ci/images/nvidia/
      build_args:
        BASE_IMAGE: nvcr.io/nvidia/pytorch:24.10-py3
    setup: pip install .[dev] --no-build-isolation
    jobs:
      gpu:              # Flattened as nvidia_gpu
        resources:
          ngpus: 1      # Scheduler auto-picks this many free GPUs
          memory: 32GB
          shm_size: 16g
          timeout: 3600
        stages:
          - name: test
            run: pytest tests/ -n 8 -v --tb=short --junitxml=/workspace/results/test-results.xml

  iluvatar:
    image:
      dockerfile: .ci/images/iluvatar/
      build_args:
        BASE_IMAGE: corex:qs_pj20250825
        APT_MIRROR: http://archive.ubuntu.com/ubuntu
        PIP_INDEX_URL: https://pypi.org/simple
    docker_args:        # Platform-level docker args, inherited by all jobs
      - "--privileged"
      - "--cap-add=ALL"
      - "--pid=host"
      - "--ipc=host"
    volumes:
      - /dev:/dev
      - /lib/firmware:/lib/firmware
      - /usr/src:/usr/src
      - /lib/modules:/lib/modules
    setup: pip install .[dev] --no-build-isolation
    jobs:
      gpu:              # Flattened as iluvatar_gpu
        resources:
          gpu_ids: "0"
          gpu_style: none   # CoreX: passthrough via --privileged + /dev mount
          memory: 32GB
          shm_size: 16g
          timeout: 3600
        stages:
          - name: test
            run: pytest tests/ -n 8 -v --tb=short --junitxml=/workspace/results/test-results.xml
```

### Config hierarchy

| Level | Field | Description |
|---|---|---|
| **Platform** | `image` | Image definition (dockerfile, build_args) |
| | `image_tag` | Default image tag (defaults to `latest`) |
| | `docker_args` | Extra `docker run` args (e.g., `--privileged`) |
| | `volumes` | Extra volume mounts |
| | `setup` | In-container setup command |
| | `env` | Env vars injected into the container |
| **Job** | `resources.ngpus` | Number of GPUs; the scheduler auto-picks free ones (NVIDIA only) |
| | `resources.gpu_ids` | Static GPU device IDs (e.g., `"0"`, `"0,2"`) |
| | `resources.gpu_style` | GPU passthrough style: `nvidia` (default), `none`, or `mlu` |
| | `resources.memory` | Container memory limit |
| | `resources.shm_size` | Shared memory size |
| | `resources.timeout` | Max run time in seconds |
| | `stages` | List of execution stages |
| | Any platform field | Jobs can override any platform-level default |
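The flatten-and-inherit behavior can be illustrated with a minimal sketch. This is an assumption about how `normalize_config` in `utils.py` might work internally — `flatten_jobs` is an illustrative name, and the real merge may be deeper than the shallow dict merge shown here:

```python
# Hypothetical sketch of platform-centric config flattening: each job under
# platforms.<p>.jobs.<j> becomes a '<p>_<j>' entry that inherits the
# platform-level defaults, with job-level keys winning on conflict.

def flatten_jobs(config: dict) -> dict:
    flat = {}
    for platform, pconf in config.get("platforms", {}).items():
        # Everything except the jobs list is a platform-level default.
        defaults = {k: v for k, v in pconf.items() if k != "jobs"}
        for job_name, jconf in pconf.get("jobs", {}).items():
            merged = {**defaults, **jconf}  # job-level keys override
            flat[f"{platform}_{job_name}"] = merged
    return flat

config = {
    "platforms": {
        "nvidia": {
            "setup": "pip install .[dev]",
            "jobs": {"gpu": {"resources": {"ngpus": 1}}},
        }
    }
}
jobs = flatten_jobs(config)
print(sorted(jobs))                  # ['nvidia_gpu']
print(jobs["nvidia_gpu"]["setup"])   # pip install .[dev]
```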

---

## Image builder `build.py`

| Flag | Description |
|---|---|
| `--platform nvidia\|iluvatar\|metax\|moore\|ascend\|all` | Target platform (default: `all`) |
| `--commit` | Use a specific commit ref as the image tag (default: HEAD) |
| `--force` | Skip Dockerfile change detection |
| `--dry-run` | Print commands without executing |

```bash
# Build with change detection (skips if the Dockerfile is unchanged)
python .ci/build.py --platform nvidia

# Force-build the Iluvatar image
python .ci/build.py --platform iluvatar --force

# Force-build all platforms
python .ci/build.py --force
```

Build artifacts are stored as local Docker image tags: `infiniops-ci/<platform>:<commit-hash>` and `:latest`.
Proxy and `no_proxy` env vars are forwarded from the host to `docker build` automatically.
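The tag naming above is stated by this README; the change-detection mechanics are not, so the following is only a plausible sketch. Both helper names (`dockerfile_changed`, `image_tag`) are illustrative, not necessarily what `build.py` defines:

```python
# Hypothetical sketch: skip a rebuild when the platform's Dockerfile
# directory is untouched between the last built commit and HEAD.
import subprocess

def dockerfile_changed(platform: str, since: str, until: str = "HEAD") -> bool:
    """True if .ci/images/<platform>/ changed between two git refs."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{since}..{until}",
         "--", f".ci/images/{platform}/"],
        capture_output=True, text=True, check=True,
    )
    return bool(out.stdout.strip())

def image_tag(platform: str, commit: str) -> str:
    """Local tag format documented above: infiniops-ci/<platform>:<tag>."""
    return f"infiniops-ci/{platform}:{commit}"

print(image_tag("nvidia", "a1b2c3d"))  # infiniops-ci/nvidia:a1b2c3d
```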

> `--push` is reserved for future use; it requires a `registry` section in `config.yaml`.

---

## Pipeline runner `run.py`

The platform is auto-detected (via `nvidia-smi`/`ixsmi`/`mx-smi`/`mthreads-gmi`/`cnmon` on PATH), so no manual specification is needed.

| Flag | Description |
|---|---|
| `--config` | Config file path (default: `.ci/config.yaml`) |
| `--job` | Job name, short (`gpu`) or full (`nvidia_gpu`); defaults to all jobs for the current platform |
| `--branch` | Override the clone branch (default: config `repo.branch`) |
| `--stage` | Run only the specified stage |
| `--image-tag` | Override the image tag |
| `--gpu-id` | Override GPU device IDs (nvidia via `--gpus`, others via `CUDA_VISIBLE_DEVICES`) |
| `--test` | Override the pytest test path (e.g., `tests/test_gemm.py::test_gemm`) |
| `--results-dir` | Host directory mounted at `/workspace/results` inside the container |
| `--local` | Mount the current directory (read-only) instead of cloning from git |
| `--dry-run` | Print the docker command without executing |

```bash
# Simplest usage: auto-detect the platform, run all jobs, use the config's default branch
python .ci/run.py

# Short job name
python .ci/run.py --job gpu

# Full job name (backward compatible)
python .ci/run.py --job nvidia_gpu

# Run only the test stage, in preview mode
python .ci/run.py --job gpu --stage test --dry-run

# Test local uncommitted changes without pushing
python .ci/run.py --local
```

Container execution flow: `git clone` → `checkout` → `setup` → stages.
With `--local`, the current directory is mounted read-only at `/workspace/repo` and copied to a writable temp directory inside the container before setup runs, so host files are never modified.
Proxy vars are forwarded from the host. Test results are written to `--results-dir`. Each run uses a clean environment (no host pip cache is mounted).

---

## Platform differences

| Platform | GPU passthrough | `gpu_style` | Base image | Detection tool |
|---|---|---|---|---|
| NVIDIA | `--gpus` (NVIDIA Container Toolkit) | `nvidia` (default) | `nvcr.io/nvidia/pytorch:24.10-py3` | `nvidia-smi` |
| Iluvatar | `--privileged` + `/dev` mount | `none` | `corex:qs_pj20250825` | `ixsmi` |
| MetaX | `--privileged` | `none` | `maca-pytorch:3.2.1.4-...` | `mx-smi` |
| Moore | `--privileged` | `none` | `vllm_musa:20251112_hygon` | `mthreads-gmi` |
| Cambricon | `--privileged` | `mlu` | `cambricon/pytorch:v1.25.3` | `cnmon` |
| Ascend | TODO | — | `ascend-pytorch:24.0.0` | — |

`gpu_style` selects the Docker device injection mechanism: `nvidia` uses `--gpus`, `none` uses `CUDA_VISIBLE_DEVICES` (or skips injection for Moore), and `mlu` uses `MLU_VISIBLE_DEVICES`.
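That mapping can be sketched as a small helper. This is an assumption about how `run.py` might build the arguments (`gpu_args` is an illustrative name), and it ignores the Moore special case where injection may be skipped entirely:

```python
# Hypothetical sketch of gpu_style -> docker run argument mapping.

def gpu_args(gpu_style: str, gpu_ids: str) -> list[str]:
    """Map a job's gpu_style/gpu_ids to docker run arguments."""
    if gpu_style == "nvidia":
        # NVIDIA Container Toolkit injects the devices.
        return ["--gpus", f"device={gpu_ids}"]
    if gpu_style == "mlu":
        # Cambricon MLUs select devices via an env var.
        return ["-e", f"MLU_VISIBLE_DEVICES={gpu_ids}"]
    # 'none': device nodes arrive via --privileged + /dev mounts;
    # only restrict visibility through the CUDA-compatible env var.
    return ["-e", f"CUDA_VISIBLE_DEVICES={gpu_ids}"]

print(gpu_args("nvidia", "0,2"))  # ['--gpus', 'device=0,2']
print(gpu_args("none", "0"))      # ['-e', 'CUDA_VISIBLE_DEVICES=0']
```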

---

## Runner Agent `agent.py`

The Runner Agent supports manual CLI dispatch, GitHub webhook triggers, resource-aware dynamic scheduling, and cross-machine remote dispatch.

### CLI manual execution

```bash
# Run all jobs (dispatched to remote agents, using the config's default branch)
python .ci/agent.py run

# Specify a branch
python .ci/agent.py run --branch feat/xxx

# Run a specific job
python .ci/agent.py run --job nvidia_gpu

# Filter by platform
python .ci/agent.py run --platform nvidia

# Preview mode
python .ci/agent.py run --dry-run
```

| Flag | Description |
|---|---|
| `--branch` | Branch to test (default: config `repo.branch`) |
| `--job` | Specific job name |
| `--platform` | Filter jobs by platform |
| `--commit` | Override the commit SHA used for GitHub status reporting |
| `--image-tag` | Override the image tag |
| `--dry-run` | Preview mode |

### Webhook server

Deploy one Agent instance per platform machine (the platform is auto-detected). On each machine:

```bash
python .ci/agent.py serve --port 8080
```

Additional `serve` flags:

| Flag | Description |
|---|---|
| `--port` | Listen port (default: 8080) |
| `--host` | Listen address (default: `0.0.0.0`) |
| `--webhook-secret` | GitHub webhook signing secret (or the `WEBHOOK_SECRET` env var) |
| `--api-token` | Bearer auth token for `/api/run` (or the `AGENT_API_TOKEN` env var) |
| `--results-dir` | Results directory (default: `ci-results`) |
| `--utilization-threshold` | GPU idle threshold percentage (default: 10) |

| Endpoint | Method | Description |
|---|---|---|
| `/webhook` | POST | GitHub webhook (push/pull_request) |
| `/api/run` | POST | Remote job trigger |
| `/api/job/{id}` | GET | Query job status |
| `/health` | GET | Health check |
| `/status` | GET | Queue + resource status |

The webhook supports `X-Hub-Signature-256` signature verification via `--webhook-secret` or the `WEBHOOK_SECRET` env var.
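GitHub defines `X-Hub-Signature-256` as the HMAC-SHA256 of the raw request body keyed with the webhook secret, compared in constant time. A minimal verification sketch (the function name is illustrative; the Agent's actual handler may differ):

```python
# Verify a GitHub X-Hub-Signature-256 header against the raw request body.
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    expected = "sha256=" + hmac.new(
        secret.encode(), body, hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information via timing.
    return hmac.compare_digest(expected, signature_header)

body = b'{"ref": "refs/heads/master"}'
sig = "sha256=" + hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(verify_signature("s3cret", body, sig))  # True
print(verify_signature("wrong", body, sig))   # False
```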

### Remote agent configuration

Configure agent URLs in `config.yaml`; the CLI automatically dispatches jobs to the corresponding remote agents:

```yaml
agents:
  nvidia:
    url: http://<nvidia-ip>:8080
  iluvatar:
    url: http://<iluvatar-ip>:8080
  metax:
    url: http://<metax-ip>:8080
  moore:
    url: http://<moore-ip>:8080
```

### Resource scheduling

The Agent auto-detects GPU utilization and system memory to determine parallelism dynamically:
- A GPU is available when its utilization is below the threshold (default 10%) and it has not been allocated by the Agent
- When resources are insufficient, jobs are queued automatically; completed jobs release their resources and trigger scheduling of queued tasks
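The availability rule above can be sketched as a filter. The data shapes and the function name are illustrative assumptions, not `ci_resource.py`'s actual interface:

```python
# Hypothetical sketch: a GPU is schedulable when its utilization is below
# the threshold AND the Agent has not already allocated it to a running job.

def free_gpus(utilization: dict[int, float],
              allocated: set[int],
              threshold: float = 10.0) -> list[int]:
    return [gpu for gpu, util in sorted(utilization.items())
            if util < threshold and gpu not in allocated]

util = {0: 2.0, 1: 85.0, 2: 0.0, 3: 4.0}
print(free_gpus(util, allocated={3}))  # [0, 2]
```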

### GitHub Status

Set the `GITHUB_TOKEN` env var and the Agent will report commit statuses automatically:
- `pending`: job started
- `success` / `failure`: job completed

Status context format: `ci/infiniops/{job_name}`
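For reference, GitHub's Commit Status REST endpoint is `POST /repos/{owner}/{repo}/statuses/{sha}`, and a payload matching the context format above might be built like this (a sketch; `status_payload` is an illustrative name, not necessarily what `github_status.py` defines):

```python
# Build a GitHub Commit Status payload with the ci/infiniops/{job_name}
# context format documented above.

def status_payload(state: str, job_name: str,
                   prefix: str = "ci/infiniops",
                   description: str = "") -> dict:
    # GitHub accepts exactly these four states.
    assert state in {"pending", "success", "failure", "error"}
    return {
        "state": state,
        "context": f"{prefix}/{job_name}",
        "description": description,
    }

print(status_payload("pending", "nvidia_gpu")["context"])
# ci/infiniops/nvidia_gpu
```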

---

## Multi-machine deployment guide

### Per-platform setup

Each machine needs Docker installed, the platform runtime, and the base CI image built.

| Platform | Runtime check | Base image | Build command |
|---|---|---|---|
| NVIDIA | `nvidia-smi` (+ [Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)) | `nvcr.io/nvidia/pytorch:24.10-py3` (public) | `python .ci/build.py --platform nvidia` |
| Iluvatar | `ixsmi` | `corex:qs_pj20250825` (import in advance) | `python .ci/build.py --platform iluvatar` |
| MetaX | `mx-smi` | `maca-pytorch:3.2.1.4-...` (import in advance) | `python .ci/build.py --platform metax` |
| Moore | `mthreads-gmi` | `vllm_musa:20251112_hygon` (import in advance) | `python .ci/build.py --platform moore` |

### Start Agent services

On each machine (the platform is auto-detected):

```bash
python .ci/agent.py serve --port 8080
```

### Configure remote agent URLs

On the trigger machine, add the `agents` section to `config.yaml` (see [Remote agent configuration](#remote-agent-configuration) above for the format).

### Trigger cross-platform tests

```bash
# Run all platform jobs at once (using the config's default branch)
python .ci/agent.py run

# Preview mode (no actual execution)
python .ci/agent.py run --dry-run

# Run only a specific platform
python .ci/agent.py run --platform nvidia
```

### Optional configuration

#### GitHub Status reporting

Set the env var on all machines so each reports its own platform's test status:

```bash
export GITHUB_TOKEN=ghp_xxxxxxxxxxxx
```

#### API Token authentication

When agents are exposed on untrusted networks, enable token auth:

```bash
python .ci/agent.py serve --port 8080 --api-token <secret>
# Or: export AGENT_API_TOKEN=<secret>
```

#### GitHub Webhook auto-trigger

In the GitHub repo → Settings → Webhooks, add a webhook for each machine:

| Field | Value |
|---|---|
| Payload URL | `http://<machine-ip>:8080/webhook` |
| Content type | `application/json` |
| Secret | Must match `--webhook-secret` |
| Events | `push` and `pull_request` |

```bash
python .ci/agent.py serve --port 8080 --webhook-secret <github-secret>
# Or: export WEBHOOK_SECRET=<github-secret>
```

### Verification checklist

```bash
# 1. Dry-run each machine individually
for platform in nvidia iluvatar metax moore; do
  python .ci/agent.py run --platform $platform --dry-run
done

# 2. Health and resource checks
for ip in <nvidia-ip> <iluvatar-ip> <metax-ip> <moore-ip>; do
  curl http://$ip:8080/health
  curl http://$ip:8080/status
done

# 3. Cross-platform test
python .ci/agent.py run --branch master
```