XPK will create a regional GKE cluster. If you see issues like
Invalid machine type e2-standard-32 in zone $ZONE_NAME
Please select a CPU type that exists in all zones in the region.
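One way to find such a type is to intersect the per-zone listings. The sketch below is a minimal helper, not part of XPK. It assumes the input text comes from `gcloud compute machine-types list --zones=$ZONE_LIST --format="value(name,zone)"`, which prints one name and zone pair per line. The function name `types_in_all_zones` is hypothetical.

```python
from collections import defaultdict


def types_in_all_zones(listing: str) -> list[str]:
    """Return machine types that appear in every zone found in the listing.

    `listing` is assumed to hold one "<name><whitespace><zone>" pair per line.
    """
    zones_by_type = defaultdict(set)
    all_zones = set()
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, zone = parts
        zones_by_type[name].add(zone)
        all_zones.add(zone)
    # Keep only the types whose zone set equals the full set of zones seen.
    return sorted(t for t, zones in zones_by_type.items() if zones == all_zones)
```

Any name returned by the helper is a candidate value for `--default-pool-cpu-machine-type`, provided the listing covered every zone in the region.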
# Find CPU Types supported in zones.
gcloud compute machine-types list --zones=$ZONE_LIST
# Adjust default cpu machine type.
xpk cluster create --default-pool-cpu-machine-type=CPU_TYPE ...
Some XPK cluster configuration might be missing if workload creation fails with the error below.
[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'
Mitigate this error by re-running your xpk cluster create ... command, to refresh the cluster configurations.
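Before re-running cluster creation, it can help to confirm that the resource type is really absent from the cluster. The sketch below assumes `kubectl` is installed and pointed at the affected cluster. It relies on `kubectl api-resources -o name`, which prints each resource as `<resource>` or `<resource>.<api-group>`. The helper names `has_resource_type` and `cluster_has_workloads` are hypothetical.

```python
import subprocess


def has_resource_type(api_resources: str, resource: str = "workloads") -> bool:
    """Check the output of `kubectl api-resources -o name` for a resource type.

    Each line is expected to be "<resource>" or "<resource>.<api-group>".
    """
    for line in api_resources.splitlines():
        name = line.strip()
        if name == resource or name.startswith(resource + "."):
            return True
    return False


def cluster_has_workloads() -> bool:
    """Query the current kubectl context; requires access to the cluster."""
    result = subprocess.run(
        ["kubectl", "api-resources", "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    return has_resource_type(result.stdout)
```

If `cluster_has_workloads()` returns `False`, the resource definitions are missing and re-running `xpk cluster create ...` is the mitigation described above.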
- Determine the role needed based on the permission error:
  # For example: `requires one of ["container.*"] permission(s)`
  # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
- Add the role to the user in your project.
  Go to iam-admin or use the gcloud CLI:
  PROJECT_ID=my-project-id
  CURRENT_GKE_USER=$(gcloud config get account)
  ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin
  gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE
- Check that the permissions are correct for the user.
  Go to iam-admin or use the gcloud CLI:
  PROJECT_ID=my-project-id
  CURRENT_GKE_USER=$(gcloud config get account)
  gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members"
- Confirm you have logged in locally with the correct user.
  gcloud auth login
- `requires one of ["container.*"] permission(s)`
  Add Kubernetes Engine Admin to your user.
- `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)`
  Add Monitoring Viewer to your user.
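The permission check in the steps above can also be scripted. The following sketch assumes the policy was exported with `gcloud projects get-iam-policy $PROJECT_ID --format=json`. `roles_for_member` is a hypothetical helper. It only sees roles bound directly to the member at the project level, not roles inherited through groups, folders, or the organization.

```python
import json


def roles_for_member(policy_json: str, member: str) -> list[str]:
    """List roles bound directly to `member` in an IAM policy JSON document.

    `member` uses the IAM member syntax, for example "user:me@example.com".
    """
    policy = json.loads(policy_json)
    roles = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return sorted(roles)
```

For example, checking that `"roles/container.admin"` is in `roles_for_member(policy_text, "user:me@example.com")` confirms the Kubernetes Engine Admin binding. A broader role on the same project may also grant the required permissions, so an empty result for one role is not conclusive on its own.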
PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the tpu machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
You need authority to push to the registry from your local machine. Try running gcloud auth configure-docker.
If an error of this kind appears after updating the xpk version, you may need to rerun the cluster create command to update the resource definitions.
If you are having trouble with your workload, try setting the --enable-debug-logs flag when you schedule it. This gives you more detailed logs to help pinpoint the issue. For example:
xpk workload create \
--cluster --workload xpk-test-workload \
--command="echo hello world" --enable-debug-logs
Please check libtpu logging and TensorFlow logging for more information about the flags that are enabled to get the logs.
The cloud-tpu-diagnostics PyPI package can be used to generate stack traces for workloads running in GKE. It dumps the Python traces when a fault such as a segmentation fault, floating-point exception, or illegal operation exception occurs in the program. It also periodically collects stack traces to help you debug situations where the program is unresponsive. To enable periodic stack trace collection, make the following changes in the Docker image that runs in the Kubernetes main container.
# main.py
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True)
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config)
with diagnostic.diagnose(diagnostic_config):
    main_method()  # this is the main method to run
This configuration will start collecting stack traces inside the /tmp/debugging directory on each Kubernetes Pod.
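To check from inside a Pod that traces are actually being written, a small helper along these lines may be useful. It is only a sketch: the file names and layout under /tmp/debugging are not specified here, so the hypothetical function `list_trace_files` simply lists whatever regular files exist in that directory, newest first.

```python
from pathlib import Path


def list_trace_files(directory: str = "/tmp/debugging") -> list[Path]:
    """Return regular files under `directory`, most recently modified first.

    Returns an empty list if the directory does not exist yet.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

An empty result after the workload has been running for a while suggests that stack trace collection is not enabled in the image.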
To explore the stack traces collected in the temporary directory on a Kubernetes Pod, run the following command to configure a sidecar container that reads the traces from the /tmp/debugging directory.
xpk workload create \
--workload xpk-test-workload --command "python3 main.py" --cluster \
xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar
To list available resources and queues, use the xpk info command. It shows localqueues and clusterqueues and lets you check for available resources.
To see queues with usage and workload info use:
xpk info --cluster my-cluster
You can specify which kind of resource (clusterqueue or localqueue) you want to see using the --clusterqueue or --localqueue flags.
xpk info --cluster my-cluster --localqueue