[Core] Move server-only env vars under SKYPILOT_SERVER_ prefix#9664
[Core] Move server-only env vars under SKYPILOT_SERVER_ prefix#9664cg505 wants to merge 7 commits into
Conversation
`request_body_env_vars()` (sky/server/requests/payloads.py) scrapes every `SKYPILOT_*` env var that does not start with `SKYPILOT_SERVER_*` and forwards them in request bodies. Several env vars set on the API server pod via the Helm downward API or `extraEnvs` accidentally used the plain `SKYPILOT_` prefix, which meant they got scraped into request bodies, persisted (under postgres-backed deployments), and later re-applied via `os.environ.update()` when a daemon dequeued its request — poisoning the server's environ with values from a previous deployment generation. This change renames the affected env vars to use the `SKYPILOT_SERVER_` prefix in the Helm chart and at read sites, with `getenv_server_with_legacy` keeping the bare names working for one deprecation window. Variables renamed: APISERVER_UUID, POD_CPU_CORE_LIMIT, POD_MEMORY_BYTES_LIMIT, POD_MEMORY_GB_LIMIT, RELEASE_NAME, INGRESS_HOST, INGRESS_BASIC_AUTH_ENABLED, INITIAL_BASIC_AUTH, DISABLE_BASIC_AUTH_MIDDLEWARE, AUTH_OAUTH2_PROXY_*, AUTH_USER_HEADER, GRACE_PERIOD_SECONDS, ROLLING_UPDATE_ENABLED, API_SERVER_STORAGE_ENABLED, IN_CLUSTER_NAMESPACE, ALL_KUBERNETES_CONTEXTS_INCLUDES_IN_CLUSTER. Also: * Adds `enableServiceLinks: false` to the api-server pod spec to stop K8s from injecting `<SVC>_SERVICE_*` env vars under the SKYPILOT_ prefix when SkyPilot Services exist in the same namespace. * Adds a defensive deny-list filter in `request_body_env_vars()` for K8s service-link injection patterns (`SKYPILOT_AGENT_*_SERVICE_*`, `*_PORT_<n>_TCP*`). * Updates `_SENSITIVE_ENV_VARS` to include the new `SKYPILOT_SERVER_INITIAL_BASIC_AUTH` name. * Updates the dashboard's Next.js build-time env to read both names. * Updates Helm chart golden tests to the new names. Tested: * New unit tests cover the deny-list, legacy fallback, and "both names set" paths. * All 134 helm-unittest tests pass. * Reproduction script confirms 0 server-only env vars leak through `request_body_env_vars()` when a new deployment uses the new names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AGENTS.md Explains when to use each prefix, the rationale behind the request_body_env_vars() filter, and the legacy-name fallback helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements a naming convention to distinguish between client-relevant (SKYPILOT_) and server-only (SKYPILOT_SERVER_) environment variables, preventing internal server configurations from leaking into client requests. It introduces a getenv_server_with_legacy helper to maintain backward compatibility during the transition. Additionally, the PR disables Kubernetes service-link environment variable injection and adds filtering logic to exclude these variables from request bodies. The review feedback identifies a redundant string fallback in the server health check logic that should be simplified for clarity.
| ingress_basic_auth_enabled=(constants.getenv_server_with_legacy( | ||
| constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | ||
| constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | ||
| default='false') or 'false').lower() == 'true', |
There was a problem hiding this comment.
The or 'false' here is redundant. Since getenv_server_with_legacy is called with default='false', it is guaranteed to return a string and will never be None. You can simplify this expression for better clarity.
| ingress_basic_auth_enabled=(constants.getenv_server_with_legacy( | |
| constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | |
| constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | |
| default='false') or 'false').lower() == 'true', | |
| ingress_basic_auth_enabled=(constants.getenv_server_with_legacy( | |
| constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | |
| constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR, | |
| default='false').lower() == 'true'), |
|
Closing — wrong ticket; will reopen on the correct one. -Claude |
|
Reopening — wrong session earlier; this is in fact the right ticket. -Claude |
- yapf 3.9 wanted slightly different wraps in constants.py and the new legacy-fallback test file - pylint flagged a 81-char line in constants.py (now shortened) and unused imports in sky/server/auth/oauth2_proxy.py (removed) - dashboard prettier wanted the SKYPILOT_RELEASE_NAME fallback on a new line; reformatted next.config.mjs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…name-server-only-env-vars-to-skypilot_server_-with
- server.py: drop redundant 'or false' after getenv_server_with_legacy(..., default='false') — overload guarantees a non-None return when default is a string (Gemini) - payloads.py: tighten K8s service-link regex by removing the bare PORT alternative. It would have falsely matched legitimate SKYPILOT_*_PORT env vars like SKYPILOT_RAY_PORT and SKYPILOT_RAY_DASHBOARD_PORT (set by host_network_probe). The numbered PORT_<n>_TCP* variants cover the same K8s service-link data, so dropping bare _PORT loses no real signal (Devin) - tests: restructure K8s service-link assertion to be explicit about which names are filtered; add new test asserting SKYPILOT_RAY_PORT and SKYPILOT_RAY_DASHBOARD_PORT pass through Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address @cg505's review feedback: the previous single-emit approach would break if a new chart was deployed against an old image (the old image only reads the legacy SKYPILOT_* name; it wouldn't find the renamed SKYPILOT_SERVER_* one). Dual-emit covers all four chart/image skew combinations: - new chart + new image: reader prefers SERVER_*, legacy emit ignored - new chart + old image: old reader finds the legacy name (the case cg505 flagged) - old chart + new image: reader falls back to legacy name with a debug hint - old chart + old image: unchanged Drop the legacy emits in a future release once the deprecation window closes; the reader fallback can be removed at the same time. Helm goldens updated: env-count baseline rises from 9 to 15 (each of the dual-emitted server-only vars contributes 2 entries instead of 1). 134/134 helm-unittest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Good catch — addressed in 2d729d1. The Helm chart now emits BOTH names for each renamed var: - name: SKYPILOT_SERVER_APISERVER_UUID
valueFrom: { fieldRef: { fieldPath: metadata.uid } }
- name: SKYPILOT_APISERVER_UUID # DEPRECATED — remove in a future release.
valueFrom: { fieldRef: { fieldPath: metadata.uid } }Skew matrix:
We can drop the legacy emit (and the reader fallback) after one or two release cycles. -Claude |
Summary
request_body_env_vars()(insky/server/requests/payloads.py) forwards everySKYPILOT_*env var that does not start withSKYPILOT_SERVER_*into request bodies. Several env vars set on the API server pod via the Helm downward API orextraEnvsaccidentally used the plainSKYPILOT_prefix — they were getting scraped into requests, persisted (under postgres-backed deployments), and later re-applied viaos.environ.update()when a daemon dequeued its request, poisoning the server's environ with values from previous deployment generations.SKYPILOT_SERVER_prefix in the Helm chart and at every read site.constants.getenv_server_with_legacy(new, legacy)keeps the old bare names working for one deprecation window and logs a debug hint if a legacy name is still in use.enableServiceLinks: falseto the api-server pod spec, plus a defensive deny-list filter inrequest_body_env_vars()for K8s service-link injection patterns underSKYPILOT_AGENT_*.Variables renamed
APISERVER_UUID,POD_CPU_CORE_LIMIT,POD_MEMORY_BYTES_LIMIT,POD_MEMORY_GB_LIMIT,RELEASE_NAME,INGRESS_HOST,INGRESS_BASIC_AUTH_ENABLED,INITIAL_BASIC_AUTH,DISABLE_BASIC_AUTH_MIDDLEWARE,AUTH_OAUTH2_PROXY_ENABLED,AUTH_OAUTH2_PROXY_BASE_URL,AUTH_USER_HEADER,GRACE_PERIOD_SECONDS,ROLLING_UPDATE_ENABLED,API_SERVER_STORAGE_ENABLED,IN_CLUSTER_NAMESPACE,ALL_KUBERNETES_CONTEXTS_INCLUDES_IN_CLUSTER— each gets a newSKYPILOT_SERVER_*name plus a legacy bare-name fallback at the read site.SKY_API_SERVER_METRICS_ENABLEDuses aSKY_prefix and is already excluded by the filter, so it's left alone (just documented).SKYPILOT_DB_CONNECTION_URIis already explicitly popped inrequest_body_env_vars(), so it's left alone too.Backwards compatibility
The Helm chart emits BOTH the new
SKYPILOT_SERVER_*name and the legacy bareSKYPILOT_*name for each renamed var. Combined with the reader's legacy fallback, this covers all four chart/image skew combinations during a rolling upgrade:SKYPILOT_SERVER_*; legacy emit is ignored.request_body.env_vars: afteros.environ.update()applies the OLD names to the server, readers still find them via fallback. The SERVER no longer sets the OLD names itself, so the leak source dries up. Stale rows decay as their requests finish.API request schema is unchanged — no
API_VERSIONbump.Test plan
tests/unit_tests/test_sky/server/requests/test_payloads.pycovers the prefix filter and the SKYPILOT_AGENT_* deny-list (including the case where legitimateSKYPILOT_AGENT_ID-style vars are not filtered).tests/unit_tests/test_sky/skylet/test_constants_legacy_env_var.pycovers the legacy fallback helper (new-only, legacy-only, both, neither, bool variant).test_common_utils.py(new + legacy env var names),test_kubernetes.py::TestKubernetesAllIncludesInCluster(cleanup for both env var names).charts/skypilot/tests/deployment_test.yaml,oauth2_test.yaml.helm unittest charts/skypilot→ 134/134 passing.helm template skypilot-test charts/skypilotconfirmsenableServiceLinks: falseis rendered and env vars use the new prefix.request_body_env_vars(); after, 0.Test commands
🤖 Generated with Claude Code
Known unrelated CI failure
Python Tests - Unit Testsfails ontest_api_login_user_hash_needs_authandtest_api_login_user_hash_needs_auth_bothwithTypeError: '>=' not supported between instances of 'Mock' and 'int'. The failure traces tosky/client/sdk.py:2927(remote_api_version >= 30), which was added in #8590without mocking
versions.get_remote_api_version()in those tests. This PRdoesn't touch either file; the failure reproduces on master.