Commit 8fd8a3e

Add HBM microbenchmark guide for tpu7x-8. (#69)
1 parent 0cf3898 commit 8fd8a3e

3 files changed

Lines changed: 86 additions & 0 deletions

Ironwood/guides/hbm/hbm.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# HBM Microbenchmarks on tpu7x-2x2x1

This guide provides instructions for running High Bandwidth Memory (HBM) microbenchmarks on tpu7x-2x2x1 Google Kubernetes Engine (GKE) clusters. It covers creating a node pool, running the benchmarks, and viewing the expected output.

## Create Node Pools

Follow the [Setup section](../../Ironwood_Microbenchmarks_readme.md#setup) to create a GKE cluster with one 2x2x1 node pool.

## Run HBM Microbenchmarks

To run the HBM microbenchmarks, apply the following Kubernetes configuration:
```bash
kubectl apply -f tpu7x-2x2x1-hbm-microbenchmark.yaml
```

To view the HBM microbenchmark logs, use `kubectl logs`:
```bash
kubectl logs tpu7x-2x2x1-hbm-microbenchmark
```

Once the benchmark completes, you should see logs similar to the example below:

```bash
Tensor size: 8192.0 MB, time taken (median): 5.3523 ms, bandwidth (median): 3209.812 GB/s

Writing metrics to JSONL file: ../microbenchmarks/hbm/metrics_report.jsonl
Metrics written to CSV at ../microbenchmarks/hbm/t_single_device_hbm_copy_[A-Z0-9]+.tsv.
```
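
The bandwidth in the summary line can be cross-checked against the tensor size and median time printed next to it. The Python sketch below parses the sample line and recomputes the figure. It rests on two assumptions inferred from the sample numbers rather than taken from the benchmark source: that "MB" means 2^20 bytes, and that the copy is counted as moving each byte twice (one read plus one write). The names `LOG_PATTERN` and `parse_hbm_log_line` are hypothetical.

```python
import re

# Pattern for the summary line shown in the sample log above.
LOG_PATTERN = re.compile(
    r"Tensor size: (?P<size_mb>[\d.]+) MB, "
    r"time taken \(median\): (?P<time_ms>[\d.]+) ms, "
    r"bandwidth \(median\): (?P<bw_gbps>[\d.]+) GB/s"
)


def parse_hbm_log_line(line):
    """Return (size_mb, time_ms, bandwidth_gbps), or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return tuple(float(match[name]) for name in ("size_mb", "time_ms", "bw_gbps"))


sample = (
    "Tensor size: 8192.0 MB, time taken (median): 5.3523 ms, "
    "bandwidth (median): 3209.812 GB/s"
)
size_mb, time_ms, reported_gbps = parse_hbm_log_line(sample)

# Assumption: 1 MB = 2**20 bytes, and a copy moves each byte twice (read + write).
bytes_moved = 2 * size_mb * 2**20
derived_gbps = bytes_moved / (time_ms * 1e-3) / 1e9

print(f"reported: {reported_gbps:.3f} GB/s, derived: {derived_gbps:.3f} GB/s")
```

Under these assumptions the derived value agrees with the reported one to within the rounding of the printed time. A noticeably larger gap in your own run would suggest that one of the assumptions does not hold for that configuration.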

To retrieve the complete results, including the trace and TSV output files, keep the pod running after the benchmark completes. To do this, add a `sleep` command (for example, `sleep infinity`) to the end of the script in the `tpu7x-2x2x1-hbm-microbenchmark.yaml` file. You can then use `kubectl cp` to copy the output from the pod:

```bash
kubectl cp tpu7x-2x2x1-hbm-microbenchmark:/microbenchmarks/hbm hbm
```
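
Once the directory has been copied, the JSONL report can be inspected locally. The following is a minimal Python sketch that assumes only that `metrics_report.jsonl` holds one JSON value per line. It makes no assumption about which fields the benchmark writes, and the helper name `load_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Local path produced by the `kubectl cp` command above.
report = Path("hbm") / "metrics_report.jsonl"
if report.exists():
    for record in load_jsonl(report):
        # Print whichever keys the benchmark wrote; none are assumed here.
        print(sorted(record) if isinstance(record, dict) else record)
```

Listing the keys first is a cautious way to learn the report's layout before writing code that depends on particular metric names.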

## Expected Bandwidth for Different Matrix Sizes

| Matrix Size (Bytes) | Bandwidth (GB/s/core) | Bandwidth (GB/s/chip) |
|---------------------|-----------------------|-----------------------|
| 2097152             | 1379.335021           | 2758.670041           |
| 4194304             | 2249.746091           | 4499.492181           |
| 8388608             | 2246.129937           | 4492.259875           |
| 16777216            | 2757.308985           | 5514.61797            |
| 33554432            | 3009.83593            | 6019.67186            |
| 67108864            | 3097.217778           | 6194.435556           |
| 134217728           | 3176.50274            | 6353.005481           |
| 268435456           | 3167.144485           | 6334.288969           |
| 536870912           | 3199.020504           | 6398.041009           |
| 1073741824          | 3198.414211           | 6396.828421           |
| 2147483648          | 3203.486119           | 6406.972238           |
| 4294967296          | 3197.879607           | 6395.759214           |
| 8589934592          | 3210.480912           | 6420.961823           |
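
The table can be summarized with a few lines of Python. The sketch below copies the rows as they appear above, converts each size to MiB, and checks the ratio between the per-chip and per-core columns. That ratio comes out at 2 for every row, which would be consistent with two cores per chip, although this is read off the table and not stated in the guide. The 99% threshold used to pick a "saturation" size is an arbitrary choice for illustration.

```python
# (matrix size in bytes, GB/s per core, GB/s per chip), copied from the table above.
ROWS = [
    (2097152, 1379.335021, 2758.670041),
    (4194304, 2249.746091, 4499.492181),
    (8388608, 2246.129937, 4492.259875),
    (16777216, 2757.308985, 5514.61797),
    (33554432, 3009.83593, 6019.67186),
    (67108864, 3097.217778, 6194.435556),
    (134217728, 3176.50274, 6353.005481),
    (268435456, 3167.144485, 6334.288969),
    (536870912, 3199.020504, 6398.041009),
    (1073741824, 3198.414211, 6396.828421),
    (2147483648, 3203.486119, 6406.972238),
    (4294967296, 3197.879607, 6395.759214),
    (8589934592, 3210.480912, 6420.961823),
]

for size_bytes, per_core, per_chip in ROWS:
    size_mib = size_bytes / 2**20
    ratio = per_chip / per_core
    print(f"{size_mib:6.0f} MiB  {per_core:9.3f} GB/s/core  chip/core = {ratio:.6f}")

# Best per-core figure in the table, and the smallest size within 1% of it.
peak_core = max(per_core for _, per_core, _ in ROWS)
saturation_bytes = min(
    size for size, per_core, _ in ROWS if per_core >= 0.99 * peak_core
)
print(f"within 1% of peak from {saturation_bytes // 2**20} MiB upward")
```

With this threshold, the per-core bandwidth first comes within 1% of its best value at 512 MiB. Smaller tensors, especially those under 16 MiB, fall well short of that figure.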
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
apiVersion: v1
kind: Pod
metadata:
  name: tpu7x-2x2x1-hbm-microbenchmark
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu7x
    cloud.google.com/gke-tpu-topology: 2x2x1
  containers:
  - name: tpu-job
    image: python:3.12
    ports:
    - containerPort: 8431
    securityContext:
      privileged: false
    command:
    - bash
    - -c
    - |
      set -ex

      git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
      cd accelerator-microbenchmarks
      pip install -r requirements.txt

      python Ironwood/src/run_benchmark.py --config="Ironwood/configs/hbm/hbm.yaml"

    resources:
      requests:
        google.com/tpu: 4
      limits:
        google.com/tpu: 4
