Skip to content

[Core] Move server-only env vars under SKYPILOT_SERVER_ prefix#9664

Open
cg505 wants to merge 7 commits into
masterfrom
cooperc/sky-5503-rename-server-only-env-vars-to-skypilot_server_-with
Open

[Core] Move server-only env vars under SKYPILOT_SERVER_ prefix#9664
cg505 wants to merge 7 commits into
masterfrom
cooperc/sky-5503-rename-server-only-env-vars-to-skypilot_server_-with

Conversation

@cg505
Copy link
Copy Markdown
Collaborator

@cg505 cg505 commented May 19, 2026

Summary

  • request_body_env_vars() (in sky/server/requests/payloads.py) forwards every SKYPILOT_* env var that does not start with SKYPILOT_SERVER_* into request bodies. Several env vars set on the API server pod via the Helm downward API or extraEnvs accidentally used the plain SKYPILOT_ prefix — they were getting scraped into requests, persisted (under postgres-backed deployments), and later re-applied via os.environ.update() when a daemon dequeued its request, poisoning the server's environ with values from previous deployment generations.
  • This PR renames the affected env vars to use the SKYPILOT_SERVER_ prefix in the Helm chart and at every read site. constants.getenv_server_with_legacy(new, legacy) keeps the old bare names working for one deprecation window and logs a debug hint if a legacy name is still in use.
  • Adds enableServiceLinks: false to the api-server pod spec, plus a defensive deny-list filter in request_body_env_vars() for K8s service-link injection patterns under SKYPILOT_AGENT_*.

Variables renamed

APISERVER_UUID, POD_CPU_CORE_LIMIT, POD_MEMORY_BYTES_LIMIT, POD_MEMORY_GB_LIMIT, RELEASE_NAME, INGRESS_HOST, INGRESS_BASIC_AUTH_ENABLED, INITIAL_BASIC_AUTH, DISABLE_BASIC_AUTH_MIDDLEWARE, AUTH_OAUTH2_PROXY_ENABLED, AUTH_OAUTH2_PROXY_BASE_URL, AUTH_USER_HEADER, GRACE_PERIOD_SECONDS, ROLLING_UPDATE_ENABLED, API_SERVER_STORAGE_ENABLED, IN_CLUSTER_NAMESPACE, ALL_KUBERNETES_CONTEXTS_INCLUDES_IN_CLUSTER — each gets a new SKYPILOT_SERVER_* name plus a legacy bare-name fallback at the read site.

SKY_API_SERVER_METRICS_ENABLED uses a SKY_ prefix and is already excluded by the filter, so it's left alone (just documented).

SKYPILOT_DB_CONNECTION_URI is already explicitly popped in request_body_env_vars(), so it's left alone too.

Backwards compatibility

The Helm chart emits BOTH the new SKYPILOT_SERVER_* name and the legacy bare SKYPILOT_* name for each renamed var. Combined with the reader's legacy fallback, this covers all four chart/image skew combinations during a rolling upgrade:

  • New server + new chart: reader prefers SKYPILOT_SERVER_*; legacy emit is ignored.
  • New server + old chart: reader falls back to the legacy name; logs a debug hint.
  • Old server + new chart: old reader still finds the legacy name (this is the case @cg505 flagged — dual-emit makes this safe).
  • Old server + old chart: unchanged.
  • Rolling upgrade with stale daemon rows that still have OLD names in request_body.env_vars: after os.environ.update() applies the OLD names to the server, readers still find them via fallback. The SERVER no longer sets the OLD names itself, so the leak source dries up. Stale rows decay as their requests finish.

API request schema is unchanged — no API_VERSION bump.

Test plan

  • Unit tests added: tests/unit_tests/test_sky/server/requests/test_payloads.py covers the prefix filter and the SKYPILOT_AGENT_* deny-list (including the case where legitimate SKYPILOT_AGENT_ID-style vars are not filtered).
  • Unit tests added: tests/unit_tests/test_sky/skylet/test_constants_legacy_env_var.py covers the legacy fallback helper (new-only, legacy-only, both, neither, bool variant).
  • Existing tests updated: test_common_utils.py (new + legacy env var names), test_kubernetes.py::TestKubernetesAllIncludesInCluster (cleanup for both env var names).
  • Helm chart goldens updated: charts/skypilot/tests/deployment_test.yaml, oauth2_test.yaml.
  • helm unittest charts/skypilot → 134/134 passing.
  • helm template skypilot-test charts/skypilot confirms enableServiceLinks: false is rendered and env vars use the new prefix.
  • Smoke: spun up a reproduction script that mirrors the original leak — before the fix 18 server-only vars leaked through request_body_env_vars(); after, 0.

Test commands

# Unit tests for changed modules
pytest tests/unit_tests/test_sky/server/requests/test_payloads.py \
       tests/unit_tests/test_sky/skylet/test_constants_legacy_env_var.py

# Helm chart goldens
helm unittest charts/skypilot

🤖 Generated with Claude Code

Known unrelated CI failure

Python Tests - Unit Tests fails on test_api_login_user_hash_needs_auth and
test_api_login_user_hash_needs_auth_both with TypeError: '>=' not supported between instances of 'Mock' and 'int'. The failure traces to
sky/client/sdk.py:2927 (remote_api_version >= 30), which was added in #8590
without mocking versions.get_remote_api_version() in those tests. This PR
doesn't touch either file; the failure reproduces on master.

cg505 and others added 2 commits May 19, 2026 22:38
`request_body_env_vars()` (sky/server/requests/payloads.py) scrapes every
`SKYPILOT_*` env var that does not start with `SKYPILOT_SERVER_*` and forwards
them in request bodies. Several env vars set on the API server pod via the
Helm downward API or `extraEnvs` accidentally used the plain `SKYPILOT_`
prefix, which meant they got scraped into request bodies, persisted (under
postgres-backed deployments), and later re-applied via `os.environ.update()`
when a daemon dequeued its request — poisoning the server's environ with
values from a previous deployment generation.

This change renames the affected env vars to use the `SKYPILOT_SERVER_`
prefix in the Helm chart and at read sites, with `getenv_server_with_legacy`
keeping the bare names working for one deprecation window. Variables
renamed: APISERVER_UUID, POD_CPU_CORE_LIMIT, POD_MEMORY_BYTES_LIMIT,
POD_MEMORY_GB_LIMIT, RELEASE_NAME, INGRESS_HOST, INGRESS_BASIC_AUTH_ENABLED,
INITIAL_BASIC_AUTH, DISABLE_BASIC_AUTH_MIDDLEWARE, AUTH_OAUTH2_PROXY_*,
AUTH_USER_HEADER, GRACE_PERIOD_SECONDS, ROLLING_UPDATE_ENABLED,
API_SERVER_STORAGE_ENABLED, IN_CLUSTER_NAMESPACE,
ALL_KUBERNETES_CONTEXTS_INCLUDES_IN_CLUSTER.

Also:
* Adds `enableServiceLinks: false` to the api-server pod spec to stop K8s
  from injecting `<SVC>_SERVICE_*` env vars under the SKYPILOT_ prefix
  when SkyPilot Services exist in the same namespace.
* Adds a defensive deny-list filter in `request_body_env_vars()` for
  K8s service-link injection patterns (`SKYPILOT_AGENT_*_SERVICE_*`,
  `*_PORT_<n>_TCP*`).
* Updates `_SENSITIVE_ENV_VARS` to include the new
  `SKYPILOT_SERVER_INITIAL_BASIC_AUTH` name.
* Updates the dashboard's Next.js build-time env to read both names.
* Updates Helm chart golden tests to the new names.

Tested:
* New unit tests cover the deny-list, legacy fallback, and "both names set"
  paths.
* All 134 helm-unittest tests pass.
* Reproduction script confirms 0 server-only env vars leak through
  `request_body_env_vars()` when a new deployment uses the new names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AGENTS.md

Explains when to use each prefix, the rationale behind the
request_body_env_vars() filter, and the legacy-name fallback helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a naming convention to distinguish between client-relevant (SKYPILOT_) and server-only (SKYPILOT_SERVER_) environment variables, preventing internal server configurations from leaking into client requests. It introduces a getenv_server_with_legacy helper to maintain backward compatibility during the transition. Additionally, the PR disables Kubernetes service-link environment variable injection and adds filtering logic to exclude these variables from request bodies. The review feedback identifies a redundant string fallback in the server health check logic that should be simplified for clarity.

Comment thread sky/server/server.py Outdated
Comment on lines +2731 to +2734
ingress_basic_auth_enabled=(constants.getenv_server_with_legacy(
constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
default='false') or 'false').lower() == 'true',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The or 'false' here is redundant. Since getenv_server_with_legacy is called with default='false', it is guaranteed to return a string and will never be None. You can simplify this expression for better clarity.

Suggested change
ingress_basic_auth_enabled=(constants.getenv_server_with_legacy(
constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
default='false') or 'false').lower() == 'true',
ingress_basic_auth_enabled=(constants.getenv_server_with_legacy(
constants.SKYPILOT_SERVER_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
constants.LEGACY_SKYPILOT_INGRESS_BASIC_AUTH_ENABLED_ENV_VAR,
default='false').lower() == 'true'),

@cg505
Copy link
Copy Markdown
Collaborator Author

cg505 commented May 19, 2026

Closing — wrong ticket; will reopen on the correct one. -Claude

@cg505 cg505 closed this May 19, 2026
@cg505 cg505 deleted the cooperc/sky-5503-rename-server-only-env-vars-to-skypilot_server_-with branch May 19, 2026 22:53
@cg505
Copy link
Copy Markdown
Collaborator Author

cg505 commented May 19, 2026

Reopening — wrong session earlier; this is in fact the right ticket. -Claude

@cg505 cg505 restored the cooperc/sky-5503-rename-server-only-env-vars-to-skypilot_server_-with branch May 19, 2026 22:56
@cg505 cg505 reopened this May 19, 2026
- yapf 3.9 wanted slightly different wraps in constants.py and the new
  legacy-fallback test file
- pylint flagged a 81-char line in constants.py (now shortened) and
  unused imports in sky/server/auth/oauth2_proxy.py (removed)
- dashboard prettier wanted the SKYPILOT_RELEASE_NAME fallback on a new
  line; reformatted next.config.mjs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cg505 cg505 marked this pull request as ready for review May 19, 2026 23:06
@cg505 cg505 requested review from Michaelvll and SeungjinYang May 19, 2026 23:06
@cg505
Copy link
Copy Markdown
Collaborator Author

cg505 commented May 19, 2026

We need to be careful with the upgrade process here to make sure we never use a chart that has this commit with an API server that does not yet have this commit. cc @aylei @zpoint

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

cg505 and others added 2 commits May 19, 2026 23:11
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…name-server-only-env-vars-to-skypilot_server_-with
devin-ai-integration[bot]

This comment was marked as resolved.

cg505 and others added 2 commits May 20, 2026 00:11
- server.py: drop redundant 'or false' after getenv_server_with_legacy(...,
  default='false') — overload guarantees a non-None return when default is
  a string (Gemini)
- payloads.py: tighten K8s service-link regex by removing the bare PORT
  alternative. It would have falsely matched legitimate SKYPILOT_*_PORT
  env vars like SKYPILOT_RAY_PORT and SKYPILOT_RAY_DASHBOARD_PORT
  (set by host_network_probe). The numbered PORT_<n>_TCP* variants cover
  the same K8s service-link data, so dropping bare _PORT loses no real
  signal (Devin)
- tests: restructure K8s service-link assertion to be explicit about which
  names are filtered; add new test asserting SKYPILOT_RAY_PORT and
  SKYPILOT_RAY_DASHBOARD_PORT pass through

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address @cg505's review feedback: the previous single-emit approach
would break if a new chart was deployed against an old image (the old
image only reads the legacy SKYPILOT_* name; it wouldn't find the
renamed SKYPILOT_SERVER_* one).

Dual-emit covers all four chart/image skew combinations:
- new chart + new image: reader prefers SERVER_*, legacy emit ignored
- new chart + old image: old reader finds the legacy name (the case
  cg505 flagged)
- old chart + new image: reader falls back to legacy name with a debug
  hint
- old chart + old image: unchanged

Drop the legacy emits in a future release once the deprecation window
closes; the reader fallback can be removed at the same time.

Helm goldens updated: env-count baseline rises from 9 to 15 (each of
the dual-emitted server-only vars contributes 2 entries instead of 1).
134/134 helm-unittest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cg505
Copy link
Copy Markdown
Collaborator Author

cg505 commented May 20, 2026

We need to be careful with the upgrade process here to make sure we never use a chart that has this commit with an API server that does not yet have this commit. cc @aylei @zpoint

Good catch — addressed in 2d729d1. The Helm chart now emits BOTH names for each renamed var:

- name: SKYPILOT_SERVER_APISERVER_UUID
  valueFrom: { fieldRef: { fieldPath: metadata.uid } }
- name: SKYPILOT_APISERVER_UUID  # DEPRECATED — remove in a future release.
  valueFrom: { fieldRef: { fieldPath: metadata.uid } }

Skew matrix:

chart image works?
new new ✅ reader prefers SERVER_*, legacy emit ignored
new old ✅ old reader still finds the legacy name (your case)
old new ✅ reader falls back to the legacy name (debug-log hint)
old old ✅ unchanged

We can drop the legacy emit (and the reader fallback) after one or two release cycles. -Claude

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant