Skip to content

dex: run 2 replicas without sticky service#3376

Merged
google-oss-prow[bot] merged 16 commits into
kubeflow:masterfrom
danish9039:dex-replica-2-sticky-service
May 8, 2026
Merged

dex: run 2 replicas without sticky service#3376
google-oss-prow[bot] merged 16 commits into
kubeflow:masterfrom
danish9039:dex-replica-2-sticky-service

Conversation

@danish9039
Copy link
Copy Markdown
Member

@danish9039 danish9039 commented Mar 3, 2026

Summary

  • change Dex base deployment from replicas: 1 to replicas: 2
  • keep the Dex Service non-sticky and clarify that Kubernetes-backed storage shares Dex authentication state across replicas
  • update tests/dex_login_test.py to validate the authentication flow for the two-replica setup

Test changes

  • run a parallel authentication burst
  • verify authentication traffic reaches at least two running and Ready Dex pods
  • verify Dex authcode objects are garbage collected after the burst
  • harden pod readiness checks, authcode counting, timeout handling, and naming for clarity

Scope

This PR stays focused on Dex and oauth2-proxy authentication. It does not add a separate authorization test.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Copilot AI review requested due to automatic review settings March 3, 2026 08:34
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Welcome to the Kubeflow Manifests Repository

Thanks for opening your first PR. Your contribution means a lot to the Kubeflow community.

Before making more PRs:
Please ensure your PR follows our Contributing Guide.
Please also be aware that many components are synchronizes from upstream via the scripts in /scripts.
So in some cases you have to fix the problem in the upstream repositories first, but you can use a PR against kubeflow/manifests to test the platform integration.

Community Resources:

Thanks again for helping to improve Kubeflow.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Kubeflow Dex base manifests to run Dex with two replicas and attempts to keep the multi-step login flow stable by enabling sticky routing at the Kubernetes Service layer.

Changes:

  • Increase Dex Deployment replicas from 1 to 2
  • Add sessionAffinity: ClientIP to the Dex Service

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
common/dex/base/deployment.yaml Scales Dex to 2 replicas to improve availability.
common/dex/base/service.yaml Enables Service-level ClientIP session affinity to reduce cross-pod auth-flow breakage.

Comment thread common/dex/base/deployment.yaml
Comment thread common/dex/base/service.yaml Outdated
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@google-oss-prow google-oss-prow Bot added size/S and removed size/XS labels Mar 3, 2026
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Comment thread common/dex/base/service.yaml Outdated
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
@juliusvonkohout
Copy link
Copy Markdown
Member

juliusvonkohout commented Mar 4, 2026

Thank you @danish9039

Let us please be precise as explained in the copilot-instructions.md. This is about authentication and not authorization.
Also check the naming convention in general in this file.

Are you sure that the storage backend supports it well https://github.com/danish9039/manifests/blob/d1b9474e935ef96d6786722b6c2d4c077e50674b/common/dex/base/config-map.yaml#L11 ? And please extend the Dex test to verify that multiple parallel sessions to different replicas and garbage collection works.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@google-oss-prow google-oss-prow Bot added size/XL and removed size/S labels Mar 7, 2026
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039
Copy link
Copy Markdown
Member Author

danish9039 commented Mar 7, 2026

Thank you @danish9039

Let us please be precise as explained in the copilot-instructions.md. This is about authentication and not authorization. Also check the naming convention in general in this file.

Are you sure that the storage backend supports it well https://github.com/danish9039/manifests/blob/d1b9474e935ef96d6786722b6c2d4c077e50674b/common/dex/base/config-map.yaml#L11 ? And please extend the Dex test to verify that multiple parallel sessions to different replicas and garbage collection works.

@juliusvonkohout Thank you for the feedback, sorry for the delay!
I updated tests/dex_login_test.py to keep the scope strictly on authentication, not authorization. The test now treats a completed OAuth callback plus oauth2_proxy session cookie as authentication success even if the final landing page is rejected later by RBAC, and I cleaned up the naming/comments to stay explicit about that distinction.

I also extended the test to validate the storage.type=kubernetes setup from common/dex/base/config-map.yaml by running parallel login sessions against the 2-replica Dex deployment, checking from Dex pod logs that both replicas actually handled authentication traffic, and inspecting authcodes.dex.coreos.com before and after the wait window. The kubectl calls are needed because the HTTP client alone cannot prove replica distribution or backend state behavior: pod discovery confirms the multi-replica setup, pod logs show where the authentication requests landed, and authcode counts let us observe the Kubernetes-backed auth state directly.

For garbage collection, the intent is to make sureauthcodeobjects created as part of the login flow do not persist unexpectedly after the exchange completes. With storage.type=kubernetes, Dex stores that transient auth state in Kubernetes resources, so checking the authcode count before the parallel auth login burst, after the burst, and again after GC_WAIT_SECONDS gives us a direct cluster-side signal that cleanup is still working and that the parallel flow is not leaving stale auth state behind.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039 danish9039 changed the title dex: use sticky service with two replicas dex: run 2 replicas without sticky service Mar 9, 2026
@danish9039
Copy link
Copy Markdown
Member Author

@juliusvonkohout can we proceed with this ?

Comment thread common/dex/base/service.yaml Outdated
Comment thread common/dex/base/service.yaml Outdated
Comment thread tests/dex_login_test.py Outdated
@juliusvonkohout
Copy link
Copy Markdown
Member

"The test now treats a completed OAuth callback plus oauth2_proxy session cookie as authentication success even if the final landing page is rejected later by RBAC" We must also test the authorization and RBAC later. Please do not remove it.

@danish9039
Copy link
Copy Markdown
Member Author

"The test now treats a completed OAuth callback plus oauth2_proxy session cookie as authentication success even if the final landing page is rejected later by RBAC" We must also test the authorization and RBAC later. Please do not remove it.

@juliusvonkohout Just to clarify, this PR does not remove any existing RBAC / authorization coverage from the Dex workflow; that coverage was not part of this test path before.
I removed the later sidecar authorization test and kept this PR focused on Dex authentication, while the existing authorization checks remain in the component integration tests.

Copy link
Copy Markdown
Contributor

@abdullahpathan22 abdullahpathan22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @danish9039 a few technical changes are needed to be addresed :

  1. Revert Base Replica Tuning (common/dex/base/deployment.yaml): Scaling replicas to 2 forces an HA footprint on all baseline deployments, which strains local KinD environments. Multi-replica scaling belongs in overlays, not in base.

  2. Simplify dex_login_test.py: Strip out the kubectl subprocesses, pod log parsing, and concurrent GC asserts. This script must remain a lightweight, black-box HTTP E2E test. White-box validation of Dex internals tightly couples our CI to upstream behaviors ("less code is better").

  3. Restore 200 OK RBAC Validation: Accepting 403 Forbidden bypasses the Istio/Envoy E2E authorization check. If RBAC propagation delays cause flakiness, remove the 403 short-circuit and implement a standard HTTP retry/polling block until the endpoint yields 200 OK routing. This implicitly tests RBAC without needing external KFP testing scripts.

@juliusvonkohout
Copy link
Copy Markdown
Member

juliusvonkohout commented Apr 7, 2026

Hello @danish9039 a few technical changes are needed to be addresed :

1. **Revert Base Replica Tuning (`common/dex/base/deployment.yaml`)**: Scaling `replicas` to `2` forces an HA footprint on all baseline deployments, which strains local KinD environments. Multi-replica scaling belongs in overlays, not in `base`.

Not really if the requests are very low or empty. So I want to have it by default

@abdullahpathan22
Copy link
Copy Markdown
Contributor

Hello @juliusvonkohout Understood your valid point.

@danish9039
Copy link
Copy Markdown
Member Author

@abdullahpathan22 Replying inline to the three points from your review:

  1. I am keeping replicas: 2 in common/dex/base/deployment.yaml, because the point of this PR is to validate Dex with two replicas by default, and Julius already confirmed that direction.

  2. The cluster-side checks in tests/dex_login_test.py are intentional, because they are what let us verify that Dex authentication reached more than one replica and that authcode garbage collection still works.

  3. I am keeping this script scoped to Dex and oauth2-proxy authentication, because requiring a final 200 OK would turn it into an authorization check of the downstream endpoint rather than an authentication check.

Comment thread tests/dex_login_test.py Outdated
# try to get the session cookies
# NOTE: this will raise an exception if something goes wrong
session_cookies = dex_session_manager.get_session_cookies() No newline at end of file
except requests.RequestException as exc:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use proper variable names and check your code before commiting, especially if it is LLM generated. That ia a rule from the copilot-instructions.md.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep this in mind

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread tests/dex_login_test.py Outdated
Comment thread tests/dex_login_test.py Outdated
Comment thread tests/dex_login_test.py Outdated
Comment thread tests/dex_login_test.py Outdated
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039
Copy link
Copy Markdown
Member Author

@juliusvonkohout Addressed the comments

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tests/dex_login_test.py Outdated
# Dex authcode GC window: authcodes must be deleted after token exchange completes.
GC_WAIT_SECONDS = 90
REQUEST_TIMEOUT_SECONDS = 15
PARALLEL_TEST_TIMEOUT_SECONDS = PARALLEL_SESSIONS * REQUEST_TIMEOUT_SECONDS
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PARALLEL_TEST_TIMEOUT_SECONDS = PARALLEL_SESSIONS * REQUEST_TIMEOUT_SECONDS assumes each authentication session is bounded by a single request timeout, but each session performs multiple sequential HTTP requests (GET endpoint, oauth2 start, Dex login GET/POST, optional approval, optional 403 recovery, etc.). This makes the batch-wide timeout too tight under slow CI conditions and can trigger the as_completed timeout even though every individual request respects REQUEST_TIMEOUT_SECONDS. Consider sizing this based on the maximum expected end-to-end requests per session (plus buffer), or removing the global timeout entirely.

Suggested change
PARALLEL_TEST_TIMEOUT_SECONDS = PARALLEL_SESSIONS * REQUEST_TIMEOUT_SECONDS
# One end-to-end authentication session can perform multiple sequential HTTP requests:
# endpoint GET, OAuth2 start, Dex login GET and POST, optional approval, and optional retry or recovery.
MAXIMUM_SEQUENTIAL_HTTP_REQUESTS_PER_SESSION = 6
PARALLEL_TEST_TIMEOUT_BUFFER_SECONDS = 30
PARALLEL_TEST_TIMEOUT_SECONDS = (
PARALLEL_SESSIONS * REQUEST_TIMEOUT_SECONDS * MAXIMUM_SEQUENTIAL_HTTP_REQUESTS_PER_SESSION
+ PARALLEL_TEST_TIMEOUT_BUFFER_SECONDS
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted this to a worst-case single-session timeout plus buffer, since the sessions run concurrently and the batch timeout should track the slowest session rather than sum across all sessions.

Comment thread tests/dex_login_test.py Outdated
Comment on lines +423 to +428
for future in concurrent.futures.as_completed(
futures, timeout=PARALLEL_TEST_TIMEOUT_SECONDS
):
authentication_result = future.result()
if not authentication_result.succeeded:
authentication_failures.append(authentication_result)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concurrent.futures.as_completed(..., timeout=PARALLEL_TEST_TIMEOUT_SECONDS) will raise TimeoutError if not all futures finish by the deadline, and that exception is currently unhandled (it will fail the test without reporting which sessions/pods were still running). Catch TimeoutError, cancel any pending futures, and raise a clearer RuntimeError (including how many sessions completed vs pending) or remove the batch-wide timeout and rely on the per-request timeouts instead.

Suggested change
for future in concurrent.futures.as_completed(
futures, timeout=PARALLEL_TEST_TIMEOUT_SECONDS
):
authentication_result = future.result()
if not authentication_result.succeeded:
authentication_failures.append(authentication_result)
completed_futures = set()
try:
for future in concurrent.futures.as_completed(
futures, timeout=PARALLEL_TEST_TIMEOUT_SECONDS
):
completed_futures.add(future)
authentication_result = future.result()
if not authentication_result.succeeded:
authentication_failures.append(authentication_result)
except concurrent.futures.TimeoutError as timeout_error:
pending_futures = [
future for future in futures if future not in completed_futures
]
for future in pending_futures:
future.cancel()
raise RuntimeError(
"Parallel authentication sessions exceeded the batch timeout of "
f"{PARALLEL_TEST_TIMEOUT_SECONDS} seconds: "
f"completed={len(completed_futures)} "
f"pending={len(pending_futures)}"
) from timeout_error

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explicit TimeoutError handling here so the test now cancels pending futures and reports completed versus pending sessions instead of failing with a raw timeout.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
@danish9039
Copy link
Copy Markdown
Member Author

@juliusvonkohout I have addressed the comments, I think we can merge this now

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tests/dex_login_test.py Outdated
Comment on lines +332 to +340
Count the number of Dex authcode CRD objects currently in the cluster.
Dex creates one authcode object per login; the GC process deletes them after
the token exchange completes. Returns 0 if no instances exist.
"""
command_arguments = [
"kubectl",
"--request-timeout", KUBECTL_REQUEST_TIMEOUT,
"get", DEX_AUTHCODE_RESOURCE,
"-A", "-o", "json",
Copy link
Copy Markdown
Member Author

@danish9039 danish9039 May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scoped the authcode query to the auth namespace so this check only counts resources created by the Dex instance under test.

Comment thread tests/dex_login_test.py
Comment on lines +22 to +25
PARALLEL_SESSIONS = 8
# Dex authcode GC window: authcodes must be deleted after token exchange completes.
GC_WAIT_SECONDS = 90
REQUEST_TIMEOUT_SECONDS = 15
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this to GARBAGE_COLLECTION_WAIT_SECONDS so the test configuration stays explicit without relying on the GC abbreviation.

Signed-off-by: danish9039 <danishsiddiqui040@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@juliusvonkohout
Copy link
Copy Markdown
Member

THank you
/lgtm
/approve

@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 0a0f720 into kubeflow:master May 8, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants