docs(design): streaming timeouts (BackendTrafficPolicy) for NebariApp#122

Closed
viniciusdc wants to merge 1 commit into main from docs/streaming-timeouts-design

@viniciusdc

Collaborator

Summary

Design document (no code in this PR — implementation will follow as a stacked PR). Proposes an opt-in routing.streaming: true flag on RoutingConfig that makes the operator emit an Envoy Gateway BackendTrafficPolicy covering the NebariApp's HTTPRoutes, with canned timeout values that match common SSE / long-poll / gRPC streaming workloads:

spec:
  timeout:
    http:
      requestTimeout: "0s"          # disable Envoy's 15s default
      connectionIdleTimeout: 300s   # cap idle connections at 5m

One policy, one or two targetRefs entries (main HTTPRoute + public HTTPRoute when publicRoutes is set), owner-referenced to the NebariApp for GC.
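A minimal sketch of how the one-or-two targetRefs list could be assembled, assuming the operator already knows the names of its generated HTTPRoutes. The `TargetRef` struct is a local stand-in for the Gateway API policy target reference type, and `streamingTargetRefs` plus the route names `myapp` and `myapp-public` are hypothetical, not taken from the operator:

```go
package main

import "fmt"

// TargetRef is a local stand-in for a Gateway API policy target reference,
// so this sketch compiles without external modules.
type TargetRef struct {
	Group string
	Kind  string
	Name  string
}

// streamingTargetRefs returns the HTTPRoutes a single streaming policy would
// cover. mainRoute and publicRoute are hypothetical names supplied by the
// caller; an empty publicRoute stands for "publicRoutes is not set".
func streamingTargetRefs(mainRoute, publicRoute string) []TargetRef {
	refs := []TargetRef{
		{Group: "gateway.networking.k8s.io", Kind: "HTTPRoute", Name: mainRoute},
	}
	if publicRoute != "" {
		refs = append(refs, TargetRef{
			Group: "gateway.networking.k8s.io",
			Kind:  "HTTPRoute",
			Name:  publicRoute,
		})
	}
	return refs
}

func main() {
	// Print the references a policy would carry when both routes exist.
	for _, r := range streamingTargetRefs("myapp", "myapp-public") {
		fmt.Printf("%s/%s %s\n", r.Group, r.Kind, r.Name)
	}
}
```

Because the operator computes the names itself, a rename of the generated HTTPRoute would change the policy's targetRefs in the same reconcile rather than leaving a stale reference behind.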

Why now

Envoy Gateway's 15s default requestTimeout cuts off any HTTP request that stays open longer than that, which breaks long-lived connections. The downstream PR openteams-ai/nebari.openteams.ai#12 is hand-rolling two separate BackendTrafficPolicy resources, each pointing its targetRefs at the operator-generated HTTPRoute by name. That contract is fragile: if the operator renames its HTTPRoute, the policy silently stops matching. It also forces pack authors to learn the Envoy gateway.envoyproxy.io/v1alpha1 schema, which they shouldn't need to do.

Relationship to #120

Independent design. The companion design multi-backend-routes.md (per-route port overrides + ServiceReference.Namespace removal) is in PR #120, and the two can land in either order. The only intersection: this design's policy targets the HTTPRoutes the other one produces. There, the operator emits one rule per route, and the same policy covers all rules.

What the doc covers

  • Why a boolean intent rather than Envoy-typed timeout knobs (contract independence).
  • Why target both main and public HTTPRoutes by default.
  • File-by-file operator impact: new streaming.go reconciler, RBAC bump for backendtrafficpolicies, scheme registration in the main controller.
  • Failure modes:
    • Envoy Gateway CRD not installed → StreamingReady=False (CRDMissing), rest of reconcile continues.
    • Hand-rolled policy already in the namespace without our owner ref → operator refuses to take ownership, reports ForeignPolicyExists.
  • Migration recipe for the downstream PR that motivated this.
  • Open questions: naming (streaming vs longLived), default idle timeout value, whether StreamingReady should block Ready (lean: no, graceful degradation).
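The two failure modes listed above can be sketched as a small decision function, assuming streaming is already enabled on the NebariApp. The reasons `CRDMissing` and `ForeignPolicyExists` come from the description; the `StreamingStatus` type, the `evaluateStreaming` function, and the success reason `PolicyApplied` are hypothetical names used only for illustration:

```go
package main

import "fmt"

// StreamingStatus is a hypothetical, simplified view of the StreamingReady
// condition: whether it is true, and the reason reported alongside it.
type StreamingStatus struct {
	Ready  bool
	Reason string
}

// evaluateStreaming decides the condition for an app with streaming enabled.
// It returns a status rather than an error so that the caller can record the
// condition and carry on with the rest of the reconcile.
func evaluateStreaming(crdInstalled, policyExists, ownedByApp bool) StreamingStatus {
	if !crdInstalled {
		// Envoy Gateway CRD is absent: nothing can be created.
		return StreamingStatus{Ready: false, Reason: "CRDMissing"}
	}
	if policyExists && !ownedByApp {
		// A hand-rolled policy is present without our owner reference:
		// refuse to take ownership.
		return StreamingStatus{Ready: false, Reason: "ForeignPolicyExists"}
	}
	// "PolicyApplied" is an assumed reason string, not one named in the PR.
	return StreamingStatus{Ready: true, Reason: "PolicyApplied"}
}

func main() {
	fmt.Printf("%+v\n", evaluateStreaming(false, false, false))
	fmt.Printf("%+v\n", evaluateStreaming(true, true, false))
	fmt.Printf("%+v\n", evaluateStreaming(true, true, true))
}
```

Whether a false StreamingReady should also hold back the overall Ready condition is left open here, matching the open question in the list.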

Test plan

  • Reviewers confirm routing.streaming: bool shape (vs. a routing.streamingTimeouts: { ... } struct)
  • Reviewers confirm canned values (requestTimeout: 0s, connectionIdleTimeout: 300s)
  • Reviewers confirm targeting both main and public HTTPRoutes (vs. main only)
  • Reviewers confirm the ForeignPolicyExists handling (don't take ownership of foreign resources)
  • Reviewers weigh in on open questions

Follow-up

  • Stacked implementation PR will land the reconciler, the new field, the RBAC, and tests.

…ebariApp

Adds a design doc proposing an opt-in routing.streaming flag on
RoutingConfig that makes the operator emit an Envoy Gateway
BackendTrafficPolicy disabling the default 15s HTTP request timeout
and setting a 5-minute connection idle timeout. The policy targets
all HTTPRoutes the operator owns for the NebariApp (main plus public
when present), is owner-referenced for GC, and uses canned timeout
values rather than exposing Envoy-typed knobs on the CRD.

The companion proposal in docs/design/multi-backend-routes.md
(per-route port overrides + ServiceReference.Namespace removal) is
independent of this one.
@viniciusdc

Collaborator Author

Closing — need to think more about the shape (boolean vs struct, canned values, foreign-policy handling). Branch docs/streaming-timeouts-design left in place.

@viniciusdc closed this May 20, 2026