MaxDiffusion supports automated profiling and performance tracking via Google Cloud ML Diagnostics.
To keep the core MaxDiffusion repository lightweight and ensure it runs without extra dependencies for users who don't need profiling, the ML Diagnostics package is not installed by default.
To use this feature, you must manually install the required package in your environment:
pip install google-cloud-mldiagnostics
To enable ML Diagnostics for your training or generation jobs, you need to update your configuration. You can either add these directly to your .yml config file or pass them as command-line arguments:
# ML Diagnostics settings
enable_ml_diagnostics: True
profiler_gcs_path: "gs://<your-bucket-name>/profiler/ml_diagnostics"
enable_ondemand_xprof: True
The GCS bucket you provide in profiler_gcs_path must have the correct IAM permissions to allow the Hypercompute Cluster service account to write data.
If permissions are not configured correctly, your job will fail with an error similar to this:
message: 'service-32478767326@gcp-sa-hypercomputecluster.iam.gserviceaccount.com does not have storage.buckets.get access to the GCS bucket <your-bucket>: permission denied'
Fix: Ensure you grant the required Storage roles (e.g., Storage Object Admin) to the service account mentioned in your error message for your specific GCS bucket.
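As a rough illustration of the fix above, the following Python sketch pulls the service account and bucket name out of an error message shaped like the one quoted earlier and prints a matching `gcloud storage buckets add-iam-policy-binding` command. The helper name `build_grant_command`, the regular expression, and the example bucket `my-bucket` are hypothetical and not part of MaxDiffusion. The default role simply mirrors the Storage Object Admin example in the text; if your error names a permission that this role does not include, you may need a different role.

```python
import re
import shlex

# Assumed error shape, based on the sample message quoted above:
# "<account> does not have <permission> access to the GCS bucket <bucket>: ..."
ERROR_PATTERN = re.compile(
    r"(?P<account>[\w.\-]+@[\w.\-]+\.iam\.gserviceaccount\.com)"
    r" does not have (?P<permission>[\w.]+) access"
    r" to the GCS bucket (?P<bucket>[^\s:]+)"
)


def build_grant_command(error_message, role="roles/storage.objectAdmin"):
    """Return a gcloud command string that grants `role` on the bucket
    to the service account named in `error_message` (hypothetical helper)."""
    match = ERROR_PATTERN.search(error_message)
    if match is None:
        raise ValueError("unrecognized error message format")
    args = [
        "gcloud", "storage", "buckets", "add-iam-policy-binding",
        f"gs://{match['bucket']}",
        f"--member=serviceAccount:{match['account']}",
        f"--role={role}",
    ]
    # Quote each argument so the result can be pasted into a shell safely.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    sample = (
        "message: 'service-32478767326@gcp-sa-hypercomputecluster"
        ".iam.gserviceaccount.com does not have storage.buckets.get "
        "access to the GCS bucket my-bucket: permission denied'"
    )
    print(build_grant_command(sample))
```

The script only builds the command string and does not call `gcloud` itself, so you can review the account, bucket, and role before running it.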
Once your job is running with diagnostics enabled, you can monitor the profiles, execution times, and metrics in the Cluster Director console here:
🔗 https://pantheon.corp.google.com/cluster-director/diagnostics