Commit 7f698e3

Merge pull request #377 from AI-Hypercomputer:mldiag
PiperOrigin-RevId: 900805196
2 parents a3747b1 + 1d35d2f commit 7f698e3

39 files changed

Lines changed: 351 additions & 55 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -773,3 +773,5 @@ This script will automatically format your code with `pyink` and help you identi
 
 The full suite of end-to-end tests is in `tests` and `src/maxdiffusion/tests`. We run them on a nightly cadence.
 
+## Profiling
+To learn how to enable ML Diagnostics and XProf profiling for your runs, please see our [ML Diagnostics Guide](docs/profiling.md).

docs/profiling.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# ML Diagnostics and Profiling
+
+MaxDiffusion supports automated profiling and performance tracking via [Google Cloud ML Diagnostics](https://docs.cloud.google.com/tpu/docs/ml-diagnostics/sdk).
+
+## 1. Manual Installation
+To keep the core MaxDiffusion repository lightweight and ensure it runs without extra dependencies for users who don't need profiling, the ML Diagnostics packages are **not** installed by default.
+
+To use this feature, you must manually install the required package in your environment:
+```bash
+pip install google-cloud-mldiagnostics
+```
+
+## 2. Configuration Settings
+To enable ML Diagnostics for your training or generation jobs, you need to update your configuration. You can either add these settings directly to your `.yml` config file or pass them as command-line arguments:
+
+```yaml
+# ML Diagnostics settings
+enable_ml_diagnostics: True
+profiler_gcs_path: "gs://<your-bucket-name>/profiler/ml_diagnostics"
+enable_ondemand_xprof: True
+```
+
+## 3. GCS Bucket Permissions (Troubleshooting)
+The GCS bucket you provide in `profiler_gcs_path` **must** have the correct IAM permissions to allow the Hypercompute Cluster service account to write data.
+
+If permissions are not configured correctly, your job will fail with an error similar to this:
+> `message: 'service-32478767326@gcp-sa-hypercomputecluster.iam.gserviceaccount.com does not have storage.buckets.get access to the GCS bucket <your-bucket>: permission denied'`
+
+**Fix:** Ensure you grant the required Storage roles (e.g., `Storage Object Admin`) to the service account mentioned in your error message for your specific GCS bucket.
+
+## 4. Viewing Your Runs
+Once your job is running with diagnostics enabled, you can monitor the profiles, execution times, and metrics in the Cluster Director console here:
+
+🔗 **https://pantheon.corp.google.com/cluster-director/diagnostics**
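
Reviewer note: the guide above asks users to install a package by hand and to supply a bucket path, so a small pre-flight check could catch the two most likely mistakes before a job starts. The sketch below is not part of this commit. The helper name `check_ml_diagnostics_config` is hypothetical, and the module name `google_cloud_mldiagnostics` is an assumption about what the pip package installs; only the three setting names come from the diff.

```python
import importlib.util

# Assumption: the pip package `google-cloud-mldiagnostics` installs a top-level
# module with this name. Adjust if your environment differs.
ASSUMED_MODULE = "google_cloud_mldiagnostics"


def check_ml_diagnostics_config(config, module_name=ASSUMED_MODULE):
  """Hypothetical helper: returns a list of problems, empty if none were found."""
  problems = []
  if not config.get("enable_ml_diagnostics", False):
    return problems  # diagnostics are disabled, nothing to validate

  # The guide says the package is not installed by default.
  if importlib.util.find_spec(module_name) is None:
    problems.append(
        f"module '{module_name}' not found; try: pip install google-cloud-mldiagnostics"
    )

  # The guide's example path uses the gs:// scheme with a placeholder bucket.
  path = config.get("profiler_gcs_path", "")
  if not path.startswith("gs://") or len(path) <= len("gs://"):
    problems.append("profiler_gcs_path should look like gs://<bucket>/<prefix>")
  elif "<" in path or ">" in path:
    problems.append("profiler_gcs_path still contains a <placeholder>")
  return problems
```

Calling it with the settings from section 2 left unedited would report the placeholder bucket name. It does not check IAM permissions, which still surface only at run time as described in section 3.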

src/maxdiffusion/configs/base14.yml

Lines changed: 5 additions & 0 deletions
@@ -247,3 +247,8 @@ quantization: ''
 quantization_local_shard_count: -1
 use_qwix_quantization: False
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base21.yml

Lines changed: 6 additions & 1 deletion
@@ -247,4 +247,9 @@ quantization: ''
 # Shard the range finding operation for quantization. By default this is set to number of slices.
 quantization_local_shard_count: -1
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
-use_qwix_quantization: False
+use_qwix_quantization: False
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_2_base.yml

Lines changed: 5 additions & 0 deletions
@@ -263,3 +263,8 @@ quantization: ''
 quantization_local_shard_count: -1
 use_qwix_quantization: False
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_flux_dev.yml

Lines changed: 4 additions & 0 deletions
@@ -306,3 +306,7 @@ quantization_local_shard_count: -1
 use_qwix_quantization: False
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
 
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_flux_dev_multi_res.yml

Lines changed: 4 additions & 0 deletions
@@ -291,3 +291,7 @@ quantization_local_shard_count: -1
 use_qwix_quantization: False
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
 
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_flux_schnell.yml

Lines changed: 6 additions & 1 deletion
@@ -300,4 +300,9 @@ quantization_local_shard_count: -1
 use_qwix_quantization: False
 compile_topology_num_slices: -1 # Number of target slices, set to a positive integer.
 
-save_final_checkpoint: False
+save_final_checkpoint: False
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_wan_14b.yml

Lines changed: 6 additions & 1 deletion
@@ -395,4 +395,9 @@ eval_data_dir: ""
 enable_generate_video_for_eval: False # This will increase the used TPU memory.
 eval_max_number_of_samples_in_bucket: 60 # The number of samples per bucket for evaluation. This is calculated by num_eval_samples / len(timesteps_list).
 
-enable_ssim: False
+enable_ssim: False
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False

src/maxdiffusion/configs/base_wan_1_3b.yml

Lines changed: 5 additions & 0 deletions
@@ -337,3 +337,8 @@ enable_generate_video_for_eval: False # This will increase the used TPU memory.
 eval_max_number_of_samples_in_bucket: 60 # The number of samples per bucket for evaluation. This is calculated by num_eval_samples / len(timesteps_list).
 
 enable_ssim: False
+
+# ML Diagnostics settings
+enable_ml_diagnostics: False
+profiler_gcs_path: ""
+enable_ondemand_xprof: False
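
Reviewer note: the same three-line block is appended to every base config, so a quick consistency check might help confirm that none of the files was missed or given a different default. The sketch below is not part of this commit. Both function names are hypothetical, and the parser is a deliberately naive reader of top-level `key: value` lines rather than a real YAML parser; the key names and defaults are the ones shown in the hunks above.

```python
# Defaults as they appear in the config hunks of this commit.
EXPECTED_DEFAULTS = {
    "enable_ml_diagnostics": "False",
    "profiler_gcs_path": '""',
    "enable_ondemand_xprof": "False",
}


def read_top_level_keys(yaml_text):
  """Naive reader: maps top-level keys to their raw value strings."""
  values = {}
  for line in yaml_text.splitlines():
    if not line or line[0] in " \t#":
      continue  # skip blank lines, comments, and nested entries
    key, sep, rest = line.partition(":")
    if sep:
      # Drop a trailing " # comment" if present, then surrounding whitespace.
      values[key.strip()] = rest.split(" #", 1)[0].strip()
  return values


def missing_or_changed_defaults(yaml_text, expected=EXPECTED_DEFAULTS):
  """Returns {key: found_value_or_None} for keys that differ from the defaults."""
  values = read_top_level_keys(yaml_text)
  return {k: values.get(k) for k, v in expected.items() if values.get(k) != v}
```

One possible use is to read each file under `src/maxdiffusion/configs/` and print any non-empty result, which would flag a config that lacks the block or ships with diagnostics switched on.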
