Troubleshooting

Invalid machine type for CPUs.

XPK creates a regional GKE cluster. If you see an error like

Invalid machine type e2-standard-32 in zone $ZONE_NAME

select a CPU machine type that exists in all zones in the region.

# Find CPU Types supported in zones.
gcloud compute machine-types list --zones=$ZONE_LIST
# Adjust default cpu machine type.
xpk cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
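One way to pick a machine type that works in every zone is to intersect the per-zone listings. The sketch below assumes a POSIX-style shell with `local` support (bash or dash) and zone names passed as arguments. The helper name `common_machine_types` is hypothetical and not part of XPK. It relies only on the `gcloud compute machine-types list` command shown above, plus the standard `--format="value(name)"` option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print machine types that exist in every given zone.
# Usage: common_machine_types ZONE [ZONE ...]
common_machine_types() {
  local total=$#
  local zone
  for zone in "$@"; do
    # One machine type name per line; sort -u guards against
    # duplicate names within a single zone.
    gcloud compute machine-types list --zones="$zone" --format="value(name)" | sort -u
  done | sort | uniq -c | awk -v n="$total" '$1 == n { print $2 }'
  # A name is kept only if it was seen once per zone, i.e. in all zones.
}
```

For example, `common_machine_types us-central1-a us-central1-b us-central1-c` would print the candidates, and any one of them could then be passed to `--default-pool-cpu-machine-type`. The zone names here are placeholders.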

Workload creation fails

If workload creation fails with the error below, some XPK cluster configuration might be missing.

[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'

To mitigate this error, re-run your xpk cluster create ... command to refresh the cluster configuration.
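To confirm the diagnosis before re-running cluster creation, you can check whether the API server lists a `workloads` resource type at all. This is a minimal sketch, assuming `kubectl` is already configured for the XPK cluster. The function name `has_workloads_resource` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the API server lists a "workloads" resource.
has_workloads_resource() {
  # `kubectl api-resources -o name` prints one resource per line, either
  # "<plural>" for the core group or "<plural>.<api-group>" otherwise.
  kubectl api-resources -o name | grep -Eq '^workloads(\.|$)'
}
```

Calling `has_workloads_resource || echo "workloads resource missing"` reports the missing case, in which re-running `xpk cluster create ...` is the suggested fix.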

Permission Issues: requires one of ["permission_name"] permission(s).

  1. Determine the role needed based on the permission error:

    # For example: `requires one of ["container.*"] permission(s)`
    # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
  2. Add the role to the user in your project.

    Go to iam-admin or use gcloud cli:

    PROJECT_ID=my-project-id
    CURRENT_GKE_USER=$(gcloud config get account)
    ROLE=roles/container.admin  # container.admin is the role needed for Kubernetes Engine Admin
    gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
  3. Check that the permissions are correct for the user.

    Go to iam-admin or use gcloud cli:

    PROJECT_ID=my-project-id
    CURRENT_GKE_USER=$(gcloud config get account)
    gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
  4. Confirm you have logged in locally with the correct user.

    gcloud auth login

Roles needed based on permission errors:

  • requires one of ["container.*"] permission(s)

    Add Kubernetes Engine Admin to your user.

  • ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)

    Add Monitoring Viewer to your user.
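The mapping above can be sketched as a small lookup that turns an error message into the role to grant. The function name `role_for_error` is hypothetical. `roles/container.admin` comes from the steps above, and `roles/monitoring.viewer` is assumed here to be the role ID behind Monitoring Viewer. Only the two listed errors are covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a permission error message to an IAM role ID.
# Prints the role and returns 0, or returns 1 if the error is not recognized.
role_for_error() {
  case "$1" in
    *'requires one of ["container.'*)
      echo "roles/container.admin" ;;    # Kubernetes Engine Admin
    *'gcloud.monitoring.dashboards.list'*)
      echo "roles/monitoring.viewer" ;;  # Monitoring Viewer (assumed role ID)
    *)
      return 1 ;;                        # unknown error: no suggestion
  esac
}
```

The result can feed the binding command from step 2, for example `ROLE=$(role_for_error "$ERROR_TEXT")`, where `ERROR_TEXT` is a placeholder for the message you received.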

Reservation Troubleshooting:

How to determine your reservation and its size / utilization:

PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the TPU machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
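To turn the `describe` output into a quick utilization figure, the two counts can be extracted and subtracted. This is only a sketch. It assumes the reservation reports `specificReservation.count` and `specificReservation.inUseCount`, and reservations that expose their capacity under other fields will need a different `--format` expression. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a reservation from its total and in-use counts.
reservation_summary() {
  local total="$1" in_use="$2"
  echo "in_use=${in_use} total=${total} free=$(( total - in_use ))"
}

# Hypothetical wrapper: read both counts from `describe` and summarize them.
# Assumes specificReservation.count / inUseCount are populated.
reservation_usage() {
  local reservation="$1" project="$2" zone="$3" counts
  counts=$(gcloud beta compute reservations describe "$reservation" \
    --project="$project" --zone="$zone" \
    --format="value(specificReservation.count,specificReservation.inUseCount)")
  # Split the whitespace-separated output into positional parameters.
  set -- $counts
  reservation_summary "${1:-0}" "${2:-0}"
}
```

With the variables defined above, `reservation_usage "$RESERVATION" "$PROJECT_ID" "$ZONE"` prints a single line with the in-use, total, and free counts.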

403 error on workload create when using --base-docker-image flag

You need permission to push to the registry from your local machine. Try running gcloud auth configure-docker.

Kubernetes API exception - 404 error

If this kind of error appears after updating your xpk version, you may need to re-run the xpk cluster create command to update the resource definitions.

TPU Workload Debugging

Verbose Logging

If you are having trouble with your workload, try setting the --enable-debug-logs flag when you schedule it. This gives you more detailed logs to help pinpoint the issue. For example:

xpk workload create \
--cluster xpk-test --workload xpk-test-workload \
--command="echo hello world" --enable-debug-logs

See the libtpu logging and TensorFlow logging documentation for more information about the flags that are enabled to produce these logs.

Collect Stack Traces

The cloud-tpu-diagnostics PyPI package can be used to generate stack traces for workloads running in GKE. It dumps the Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also periodically collects stack traces to help you debug situations where the program is unresponsive. To enable periodic stack trace collection, make the following changes in the Docker image that runs in the Kubernetes main container.

# main.py

from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,
)
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config,
)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config,
)

with diagnostic.diagnose(diagnostic_config):
    main_method()  # this is the main method to run

This configuration will start collecting stack traces inside the /tmp/debugging directory on each Kubernetes Pod.

Explore Stack Traces

To explore the stack traces collected in the temporary directory of a Kubernetes Pod, run the following command to configure a sidecar container that reads the traces from the /tmp/debugging directory.

xpk workload create \
 --cluster xpk-test --workload xpk-test-workload \
 --command "python3 main.py" --tpu-type=v5litepod-16 \
 --deploy-stacktrace-sidecar

Get information about jobs, queues and resources.

To list available resources and queues, use the xpk info command. It shows localqueues and clusterqueues and lets you check for available resources.

To see queues with usage and workload info use:

xpk info --cluster my-cluster

You can specify which kind of queue (clusterqueue or localqueue) you want to see with the --clusterqueue or --localqueue flag.

xpk info --cluster my-cluster --localqueue