[Feat]: add RBAC-to-router security policy integration#1714
Conversation
✅ Deploy Preview for vllm-semantic-router ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
👥 vLLM Semantic Team NotificationThe following members have been identified for the changed files in this PR and have been automatically assigned: 📁
|
✅ Supply Chain Security Report — All Clear
Scanned at |
ac2cf93 to
a7cedd1
Compare
990f0cd to
5ef3847
Compare
There was a problem hiding this comment.
Pull request overview
Adds an RBAC-to-router “Security Policy” integration so dashboard admins can manage role-based model access and rate limits via a dedicated API/UI, and hardens the proxy boundary by stripping internal headers at Envoy.
Changes:
- Introduces dashboard Security Policy backend + endpoints to generate/apply router config fragments (role_bindings, decisions, ratelimit).
- Adds a new dashboard
/securitypage and navigation entry, plus new RBAC permissions (feedback.submit,replay.read,security.manage) with middleware routing updates. - Updates Envoy templates (local + CLI) to remove internal/spoofable headers, and wires new E2E coverage mapping + a new E2E testcase.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/agent/e2e-profile-map.yaml | Maps new security policy E2E + handler paths into the appropriate E2E coverage profiles. |
| src/vllm-sr/cli/templates/envoy.template.yaml | Adds request_headers_to_remove to strip internal headers in the production Envoy template. |
| src/semantic-router/pkg/extproc/router_replay_api.go | Adds a stricter path gate so replay API handling only triggers on replay paths. |
| src/semantic-router/pkg/extproc/processor_req_header.go | Minor formatting adjustment around request header handling flow. |
| src/semantic-router/pkg/apiserver/route_feedback.go | Comment punctuation fix for feedback handler. |
| e2e/testcases/security_policy_apply.go | New E2E testcase validating security policy apply/preview behavior. |
| e2e/pkg/testmatrix/testcases.go | Adds the new E2E testcase to the dashboard contract group. |
| docs/security-hardening.md | New documentation for RBAC integration + Envoy header stripping and operational checklist. |
| deploy/local/envoy.yaml | Adds request_headers_to_remove to the local Envoy config for defense-in-depth. |
| dashboard/frontend/src/pages/SecurityPolicyPage.tsx | New UI page for managing role mappings, rate tiers, and fragment preview/save. |
| dashboard/frontend/src/components/LayoutNavSupport.ts | Adds “Security Policy” to the Manager menu. |
| dashboard/frontend/src/App.tsx | Registers /security route to render the new page. |
| dashboard/backend/router/core_routes.go | Registers security policy API endpoints under /api/security/*. |
| dashboard/backend/handlers/security_policy_test.go | Adds unit tests for fragment generation, validation, and handlers. |
| dashboard/backend/handlers/security_policy_edge_test.go | Adds edge-case tests for canonical YAML conversion and merge behavior. |
| dashboard/backend/handlers/security_policy_canonical_test.go | Adds canonicalization/merge tests for generated fragments into router config. |
| dashboard/backend/handlers/security_policy.go | Implements security policy types, fragment generation, validation, apply, and preview handler. |
| dashboard/backend/handlers/deploy.go | Extends deploy merging to incorporate global.services.ratelimit fragments. |
| dashboard/backend/handlers/canonical_transport.go | Adds YAML fragment transport structs for global.services.ratelimit. |
| dashboard/backend/auth/schema_test.go | New tests ensuring new permissions exist and are assigned correctly. |
| dashboard/backend/auth/schema.go | Adds new permission constants and updates default role permission sets. |
| dashboard/backend/auth/middleware_test.go | Adds permission routing testcases for /api/security/* endpoints. |
| dashboard/backend/auth/middleware.go | Routes /api/security/* endpoints to the correct required permission. |
| dashboard/README.md | Documents the Security Policy page and endpoints + RBAC permission additions. |
| .github/workflows/ci-changes.yml | Updates CI change-detection paths to include new security policy code/tests. |
| baseConfig.Routing = mergeCanonicalRouting(baseConfig.Routing, fragmentConfig.Routing) | ||
| mergeFragmentGlobal(&baseConfig, fragmentConfig.Global) | ||
| mergedYAML, err := marshalYAMLBytes(baseConfig) |
There was a problem hiding this comment.
mergeCanonicalRouting / mergeCanonicalSignals are currently driven by len(patch.*) > 0 checks, so an explicitly-empty fragment cannot clear routing.decisions or routing.signals.role_bindings. With the new security policy auto-apply (which may intentionally set these lists to empty to remove access), this can leave stale RBAC decisions/bindings in place. Consider using nil-vs-empty semantics (e.g., check patch.Decisions != nil / patch.RoleBindings != nil) and ensure fragment generation emits explicit empty lists when the intent is replace/clear.
| http.Error(w, "Method not allowed", http.StatusMethodNotAllowed) | ||
| } | ||
| }) | ||
| mux.HandleFunc("/api/security/policy/preview", handlers.HandlePreviewSecurityFragment) |
There was a problem hiding this comment.
/api/security/policy/preview is registered without an HTTP method gate, so GET/PUT/etc. will be routed to HandlePreviewSecurityFragment. Because the auth middleware grants config.read to GETs under /api/security/, this effectively weakens the intended permission/method contract for preview. Consider wrapping this handler in a method switch (POST-only, 405 otherwise) like the main /api/security/policy route.
| mux.HandleFunc("/api/security/policy/preview", handlers.HandlePreviewSecurityFragment) | |
| mux.HandleFunc("/api/security/policy/preview", func(w http.ResponseWriter, r *http.Request) { | |
| switch r.Method { | |
| case http.MethodPost: | |
| handlers.HandlePreviewSecurityFragment(w, r) | |
| default: | |
| http.Error(w, "Method not allowed", http.StatusMethodNotAllowed) | |
| } | |
| }) |
| if resp.StatusCode == http.StatusBadGateway || resp.StatusCode == http.StatusNotFound { | ||
| if env.verbose { | ||
| fmt.Printf("[SecurityPolicy] Endpoint not available (status %d); stock dashboard image — skipping\n", resp.StatusCode) | ||
| } | ||
| return false, true, nil | ||
| } |
There was a problem hiding this comment.
This E2E test treats both 404 and 502 as "endpoint not available" and skips. A 502 can also indicate a real regression (proxy misconfig, upstream crash, auth failure) and skipping would hide it. Consider only skipping on 404 (or a more specific error signal) and failing the test on 502 to avoid masking outages.
| t.Parallel() | ||
|
|
||
| // Reset global store so parallel tests that write to it don't interfere. |
There was a problem hiding this comment.
These tests are marked t.Parallel() but they share a package-level storedPolicy state (mutated via saveSecurityPolicy). Even with the mutex, parallel execution can interleave updates/resets and make assertions flaky (e.g., a different test overwriting the stored policy between request + assert). Consider removing t.Parallel() from tests that touch the global policy store, or refactoring the store into a per-test instance that can be isolated.
| t.Parallel() | |
| // Reset global store so parallel tests that write to it don't interfere. | |
| // Reset global store before exercising the handler; this test must not run | |
| // in parallel with other tests that read or write the shared policy state. |
| // HandlePreviewSecurityFragment generates a router config fragment from the | ||
| // given security policy without applying it. | ||
| func HandlePreviewSecurityFragment(w http.ResponseWriter, r *http.Request) { | ||
| var policy SecurityPolicyConfig | ||
| if err := json.NewDecoder(r.Body).Decode(&policy); err != nil { | ||
| writeJSON(w, http.StatusBadRequest, map[string]string{ | ||
| "error": "invalid request body: " + err.Error(), | ||
| }) | ||
| return | ||
| } | ||
|
|
||
| fragment := GenerateRouterFragment(&policy) | ||
| writeJSON(w, http.StatusOK, fragment) | ||
| } |
There was a problem hiding this comment.
HandlePreviewSecurityFragment decodes the request body and always returns 200 with a fragment, but it never enforces the expected HTTP method nor runs validateSecurityPolicy. This makes it impossible for the UI/API to get validation errors from preview (and allows unexpected methods like GET to succeed). Consider rejecting non-POST methods with 405 and running the same validation used by HandleUpdateSecurityPolicy, returning 400 with the validation error.
| switch r.Method { | ||
| case http.MethodGet: | ||
| handlers.HandleGetSecurityPolicy(w, r) | ||
| case http.MethodPut: |
There was a problem hiding this comment.
registerSecurityPolicyRoutes does not enforce cfg.ReadonlyMode for write operations. In read-only dashboard mode, PUT /api/security/policy can still write/merge config.yaml via HandleUpdateSecurityPolicy, bypassing the intended write lockout used by other config/deploy endpoints. Consider returning 403 for PUT/POST when cfg.ReadonlyMode is enabled (and ideally aligning this with the same readonly enforcement strategy used elsewhere).
| case http.MethodPut: | |
| case http.MethodPut: | |
| if cfg.ReadonlyMode { | |
| http.Error(w, "Dashboard is in read-only mode", http.StatusForbidden) | |
| return | |
| } |
| kind := strings.ToLower(s.Kind) | ||
| if kind != "user" && kind != "group" { | ||
| return fmt.Errorf("role_mappings[%d].subjects[%d]: kind must be User or Group", i, j) | ||
| } | ||
| if s.Name == "" { |
There was a problem hiding this comment.
validateSecurityPolicy only checks s.Name == "" and lowercases s.Kind without trimming whitespace. This allows whitespace-only subject names (or kinds like " User ") to pass/ fail inconsistently and can lead to router-side authz classifier errors or unexpected non-matches. Consider strings.TrimSpace for both kind and name, and reject names that are empty after trimming.
| kind := strings.ToLower(s.Kind) | |
| if kind != "user" && kind != "group" { | |
| return fmt.Errorf("role_mappings[%d].subjects[%d]: kind must be User or Group", i, j) | |
| } | |
| if s.Name == "" { | |
| kind := strings.ToLower(strings.TrimSpace(s.Kind)) | |
| if kind != "user" && kind != "group" { | |
| return fmt.Errorf("role_mappings[%d].subjects[%d]: kind must be User or Group", i, j) | |
| } | |
| name := strings.TrimSpace(s.Name) | |
| if name == "" { |
| // applySecurityFragment merges the generated fragment into config.yaml and | ||
| // triggers a runtime hot-reload. Returns true if the apply succeeded. | ||
| func applySecurityFragment(fragment *GeneratedRouterFragment) bool { | ||
| if securityPolicyConfigPath == "" { | ||
| return false | ||
| } | ||
|
|
There was a problem hiding this comment.
applySecurityFragment writes to the shared router config file without taking the existing deploy/config write lock (deployMu). This can race with /api/router/config/* deploy/write handlers and lead to lost updates or partially applied configs. Consider acquiring the same lock (or otherwise serializing writes) and returning a conflict response if another deploy/apply is in progress.
| existingData, err := os.ReadFile(securityPolicyConfigPath) | ||
| if err != nil { | ||
| existingData = nil | ||
| } | ||
|
|
||
| merged, err := mergeDeployPayload(existingData, DeployRequest{YAML: string(yamlBytes)}) | ||
| if err != nil { | ||
| log.Printf("[SecurityPolicy] failed to merge fragment into config: %v", err) | ||
| return false | ||
| } |
There was a problem hiding this comment.
If os.ReadFile(securityPolicyConfigPath) fails, the code silently sets existingData=nil and proceeds. With mergeDeployPayload, an empty base causes the fragment YAML to be written as the entire config, which can clobber a real config on transient read errors (permissions, partial mount, etc.). Consider treating read errors (except non-existence when explicitly intended) as a hard failure and validating the merged YAML (e.g., routerconfig.ParseYAMLBytes) before writing.
| func toCanonicalYAML(fragment *GeneratedRouterFragment) ([]byte, error) { | ||
| doc := routingFragmentDocument{} | ||
|
|
||
| for _, rb := range fragment.RoleBindings { | ||
| subjects := make([]routerconfig.Subject, len(rb.Subjects)) | ||
| for i, s := range rb.Subjects { | ||
| subjects[i] = routerconfig.Subject{Kind: s.Kind, Name: s.Name} | ||
| } | ||
| doc.Routing.Signals.RoleBindings = append(doc.Routing.Signals.RoleBindings, | ||
| routerconfig.RoleBinding{Subjects: subjects, Role: rb.Role}) | ||
| } | ||
|
|
||
| for _, d := range fragment.Decisions { | ||
| modelRefs := make([]routerconfig.ModelRef, len(d.ModelRefs)) | ||
| for i, mr := range d.ModelRefs { | ||
| modelRefs[i] = routerconfig.ModelRef{Model: mr.Model} | ||
| } | ||
| doc.Routing.Decisions = append(doc.Routing.Decisions, routerconfig.Decision{ | ||
| Name: d.Name, | ||
| Priority: d.Priority, | ||
| Rules: routerconfig.RuleNode{Type: d.Rules.Type, Name: d.Rules.Name}, | ||
| ModelRefs: modelRefs, | ||
| }) | ||
| } | ||
|
|
||
| if len(fragment.RateLimit.Providers) > 0 { | ||
| rlCfg := &routerconfig.RateLimitConfig{} | ||
| for _, p := range fragment.RateLimit.Providers { | ||
| provider := routerconfig.RateLimitProviderConfig{Type: p.Type} | ||
| for _, r := range p.Rules { | ||
| provider.Rules = append(provider.Rules, routerconfig.RateLimitRule{ | ||
| Name: r.Name, | ||
| Match: routerconfig.RateLimitMatch{User: r.Match.User, Group: r.Match.Group}, | ||
| RequestsPerUnit: r.RPU, | ||
| TokensPerUnit: r.TPU, | ||
| Unit: r.Unit, | ||
| }) | ||
| } | ||
| rlCfg.Providers = append(rlCfg.Providers, provider) | ||
| } | ||
| doc.Global = &globalFragment{ | ||
| Services: &globalServicesFragment{RateLimit: rlCfg}, | ||
| } | ||
| } |
There was a problem hiding this comment.
toCanonicalYAML only emits global.services.ratelimit when len(fragment.RateLimit.Providers) > 0, and role_bindings/decisions are omitted when empty due to omitempty. That means saving a policy with no role mappings / rate tiers cannot clear previously-applied routing.decisions, routing.signals.role_bindings, or global.services.ratelimit, despite the docs describing replace semantics. Consider emitting explicit empty slices / an explicit empty ratelimit block (or adding a replace/clear flag) so merges can remove prior policy state.
|
@abdallahsamabd can you address the copilot comments? thanks |
3c835fd to
e54fb28
Compare
…-project#1518) Add dashboard Security Policy API and UI that translates RBAC role/group mappings into router config fragments (role_bindings, decisions, ratelimit rules), enabling admins to manage model access and rate limits per role without hand-editing router YAML. - Add security policy backend: GenerateRouterFragment(), validation, GET/PUT/POST preview API endpoints - Add SecurityPolicyPage UI for role-to-model and rate-limit tier management - Add dashboard RBAC permissions: feedback.submit, replay.read, security.manage - Add Envoy template request_headers_to_remove for defense-in-depth - Add 28 test cases covering fragment generation, validation, HTTP handlers, permissions, and middleware routing - Add security-hardening.md documentation Signed-off-by: Abdallah Samara <abdallahsamabd@gmail.com>
e54fb28 to
9a2094f
Compare

Add dashboard Security Policy API and UI that translates RBAC role/group mappings into router config fragments (role_bindings, decisions, ratelimit rules), enabling admins to manage model access and rate limits per role without hand-editing router YAML.
Closes #1518
Purpose
What does this PR change?
This PR bridges the gap between the dashboard's RBAC system and the router's signal-driven routing plane. Before this change, an admin who wanted to restrict model access by user role (e.g., "premium users get GPT-4, free users get GPT-3.5") had to manually edit router YAML config files. Now they can do it from the dashboard UI.
Specifically, this PR adds:
Security Policy API (
/api/security/policy) — a backend that accepts role-to-model mappings and rate-limit tiers, validates them, and generates router-compatible config fragments containingrole_bindings,decisions(withauthzsignal rules), andratelimit.providers[].rules.Security Policy UI (
/securitypage) — a React page where admins can define role mappings (which groups/users get which models) and rate-limit tiers (RPM/TPM per group), preview the generated router config fragment as JSON, and save the policy.Dashboard RBAC permissions — three new fine-grained permissions (
feedback.submit,replay.read,security.manage) added to the permission system.security.manageis admin-only and required to modify security policies.feedback.submitandreplay.readare granted to all roles.Envoy header stripping — production-default
request_headers_to_removedirectives added to both the local Envoy config and the CLI Envoy template. This stripsx-vsr-looper-*andx-authz-*headers from all client requests at the proxy layer, preventing clients from spoofing internal headers regardless of router-level validation.Documentation —
docs/security-hardening.mdcovering the RBAC integration architecture, API endpoints, validation rules, permission matrix, an end-to-end example, and a production deployment checklist.Why is this change needed?
Issue feature: close Themis security gaps and integrate RBAC with auth/rate-limit signals #1518 identifies a fundamental disconnect: the dashboard manages users and roles, and the router makes model-routing decisions based on authz signals, but there is no mechanism to translate dashboard RBAC state into router config. Operators must manually maintain role_bindings and rate-limit rules in YAML, which is error-prone and doesn't scale for multi-tenant deployments. This PR closes that gap with a structured API and UI that generates validated, router-compatible config fragments.
The Envoy header stripping is needed as defense-in-depth: even if the router's own header validation has a gap, the proxy layer ensures internal-only headers never reach the router from external clients.
Which module(s) does this affect?
Dashboard(backend handlers, auth schema, auth middleware, route registration, frontend UI),Router(Envoy template),DocsTest Plan
Unit tests (28 cases across 3 files):
Security policy handler tests (19 cases):
cd dashboard/backend && go test ./handlers/ -run "TestGenerateRouterFragment|TestValidateSecurityPolicy|TestHandleGetSecurityPolicy|TestHandleUpdateSecurityPolicy|TestHandlePreviewSecurityFragment" -v -count=1
RBAC schema tests (6 cases):
cd dashboard/backend && go test ./auth/ -run "TestNewPermissions|TestAdminRole|TestWriteRole|TestReadRole|TestAllRoles|TestDefaultRolePermissions" -v -count=1
Middleware permission routing tests (3 new cases):
cd dashboard/backend && go test ./auth/ -run "TestRequiredPermission" -v -count=1