The data layer work (PRs #60, #64, #75) introduces in-memory state into the IPP: runtime metrics (in-flight requests, token counts) accumulated across requests and stored in-process. This is a significant architectural shift from the current stateless request/response transformation pipeline, and I think it deserves an explicit discussion before we go further.
What changes
Today the IPP is stateless — each request is processed independently, no memory between requests. With the in-memory data layer, each IPP instance accumulates runtime metrics that scorers and filters consume for model selection decisions.
Consequences of making IPP stateful
- Replica state synchronization: With multiple IPP replicas behind Envoy, each replica accumulates its own local metrics. Without a synchronization mechanism, replicas will have inconsistent views of model state. This means we need to either build cross-replica state synchronization (which adds complexity, latency, and a consistency model to design), or accept that each replica makes decisions based on partial data. Either way, turning IPP stateful creates a new responsibility that needs to be owned: keeping distributed state consistent across replicas.
- High availability: Stateless pods are trivially replaceable. Stateful pods are not. When a pod restarts, all accumulated metrics are lost; scorers have zero data and make blind decisions until metrics rebuild. During rolling updates, every pod goes through this cold-start period. There's no handoff mechanism between old and new pods.
- Replica count affects correctness: With 1 replica, the data is complete. With 10 replicas, each sees ~10% of traffic. Metrics like "in-flight requests per model" become proportionally less accurate as you scale out. The more you scale (which is when you need intelligent routing most), the worse the data quality gets.
- No consistency guarantees: Two concurrent requests hitting different replicas can make contradictory routing decisions based on their local view. Replica 1 thinks model A is underloaded and sends traffic there. Replica 2 thinks the same thing. Both send to model A, which is actually overloaded; neither knew about the other's decision.
- Operational burden: Stateful services need monitoring for state drift, cold-start behavior, per-replica metric accuracy. Operators need to understand that IPP routing quality degrades with replica count. This is a new operational concern that didn't exist before.
- Scope creep: Today it's in-flight requests. Next it will be latency histograms, token usage, cost accumulators, rate limit state. The IPP gradually becomes a metrics aggregation system, which is not its original purpose.
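
To make the partial-visibility and consistency points concrete, here is a minimal sketch. It is not taken from the PRs: the counter values, the model names, and the `least_loaded` helper are all made up for illustration, and the real scorers may weigh more than in-flight counts. It only shows how a "pick the least-loaded model" decision can differ between a replica's local view and the global view that no single replica has.

```python
from collections import Counter

def least_loaded(view, models):
    """Return the model with the fewest in-flight requests in the given view."""
    return min(models, key=lambda m: view[m])

MODELS = ["model-a", "model-b"]

# Hypothetical per-replica in-flight counters. Each replica only counts
# the requests that happened to be routed through it.
replica_views = [
    Counter({"model-a": 1, "model-b": 4}),
    Counter({"model-a": 2, "model-b": 3}),
    Counter({"model-a": 9, "model-b": 1}),
]

# The global view is the sum of all local views.
global_view = sum(replica_views, Counter())

local_choices = [least_loaded(v, MODELS) for v in replica_views]
global_choice = least_loaded(global_view, MODELS)

print("global in-flight:", dict(global_view))
print("local choices:  ", local_choices)
print("global choice:  ", global_choice)
```

With these numbers, two of the three replicas route to `model-a` because it looks idle locally, while the summed counters show `model-a` carrying more in-flight requests than `model-b`.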
Alternatives to consider
- External shared store (Redis, memcached): All replicas read from and write to a shared store. This gives a consistent view across replicas and survives restarts. It adds a dependency but solves the fundamental partial-visibility problem.
- Prometheus-based: Scorers query Prometheus at selection time for metrics aggregated across all replicas. This leverages infrastructure already present in most K8s deployments.
- Sidecar pattern: A dedicated metrics sidecar collects and aggregates metrics, and the IPP queries it locally. The IPP stays stateless.
- Hybrid: Some lightweight per-instance state (e.g., rate limit headers from the last response for a specific model) combined with external aggregation for cross-replica metrics like total in-flight requests.
- Accept single-replica constraint: If IPP is always deployed as a single replica (like the upstream EPP), in-memory state works. But this should be an explicit, documented constraint, not an implicit assumption.
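
As a rough illustration of the external shared store option, the sketch below keeps the counters outside the replicas. Everything in it is hypothetical: `SharedInFlightStore` is an in-process stand-in for a real store (its `incr`/`decr` could plausibly map onto Redis `INCR`/`DECR` on a per-model key), and `Replica` is not an existing IPP type. The point is only that replicas hold no metrics of their own and still see each other's in-flight requests.

```python
import threading
from collections import Counter
from contextlib import contextmanager

class SharedInFlightStore:
    """In-process stand-in for an external store shared by all replicas."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def incr(self, model):
        with self._lock:
            self._counts[model] += 1

    def decr(self, model):
        with self._lock:
            self._counts[model] -= 1

    def get(self, model):
        with self._lock:
            return self._counts[model]

class Replica:
    """Hypothetical IPP replica that keeps no metrics of its own."""

    def __init__(self, store):
        self.store = store

    @contextmanager
    def track(self, model):
        # Count the request as in flight for as long as the block runs.
        self.store.incr(model)
        try:
            yield
        finally:
            self.store.decr(model)

    def pick(self, models):
        # Least in-flight requests according to the shared view.
        return min(models, key=self.store.get)

store = SharedInFlightStore()
r1, r2 = Replica(store), Replica(store)

with r1.track("model-a"):
    # r2 sees r1's in-flight request even though it never handled it.
    seen_by_r2 = store.get("model-a")
    choice = r2.pick(["model-a", "model-b"])
after = store.get("model-a")
```

The trade-off is the one noted above: every selection now involves a call to the store, so its latency and availability become part of the request path.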
Questions
- Has the decision to make the IPP stateful been explicitly discussed and agreed on?
- What is the expected replica count for IPP in production? Does the data layer design assume single replica?
- How do we handle multi-replica scenarios where each instance has partial visibility?
- Should the model selector's data needs drive the IPP's architecture, or should metrics collection be an external concern?
Not blocking the current work — just want to make sure this architectural direction has been intentionally chosen rather than emerging incrementally through implementation PRs.