feat(llm-access): keyword-based session moderation gate#63
Add a pre-upstream keyword moderation module for the Kiro and Codex gateways. When request content matches a configured keyword, the session is banned in memory and this plus all subsequent requests are blocked; the full request body and (redacted) headers are captured once for admin review, and a reviewer can unban a session.

Design highlights:
- Phrase matching via Aho-Corasick over normalized text (lowercased, whitespace-collapsed); ASCII keywords require word boundaries while CJK phrases match freely. Only user-visible message content is scanned (system + messages[].content / instructions), not JSON structure noise.
- Keywords import from plain-text (one phrase per line) or JSON.
- Hot path never reads Postgres: the compiled automaton plus banned / allowlisted session-key sets live in process memory, refreshed on startup and a periodic interval. Already-banned sessions are rejected without a scan or a write; a new ban persists exactly once (JSONB body + headers) via a spawned task.
- Admin API + Yew review console: manage keywords and review captured bans (inspect payload, keep or lift the ban).

Storage: new AdminModerationStore trait, empty stub, Postgres impl, and migration 0036 (llm_moderation_keywords, llm_moderation_banned_sessions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request introduces a keyword moderation gate for the LLM gateway, allowing administrators to block requests containing banned keywords and review flagged sessions. It adds an admin moderation page in the frontend, backend API endpoints, database tables for keywords and banned sessions, and an in-memory ModerationGate that filters requests on the hot path. Feedback on the implementation suggests several optimizations and safety improvements: using safe string slicing to prevent panics on non-UTF-8 boundaries, removing redundant lowercase conversions on header names, optimizing digest formatting and key allocations to reduce string allocations, and adding a composite index on (banned_at_ms DESC, id DESC) to improve pagination query performance.
fn match_context_snippet(text: &str, start: usize, end: usize) -> String {
    let snippet_start = {
        let mut cursor = start;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text[..cursor].char_indices().next_back() {
                Some((index, _)) => cursor = index,
                None => break,
            }
        }
        cursor
    };
    let snippet_end = {
        let mut cursor = end;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text[cursor..].chars().next() {
                Some(ch) => cursor += ch.len_utf8(),
                None => break,
            }
        }
        cursor
    };
    let mut snippet = String::new();
    if snippet_start > 0 {
        snippet.push('…');
    }
    snippet.push_str(&text[snippet_start..snippet_end]);
    if snippet_end < text.len() {
        snippet.push('…');
    }
    snippet
}
Using direct string slicing (e.g., text[..cursor]) can cause panics if the indices are not valid UTF-8 character boundaries. To safely slice a string slice in Rust without panicking, use .get(..index) instead of direct slicing.
fn match_context_snippet(text: &str, start: usize, end: usize) -> String {
    let snippet_start = {
        let mut cursor = start;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text.get(..cursor).and_then(|s| s.char_indices().next_back()) {
                Some((index, _)) => cursor = index,
                None => break,
            }
        }
        cursor
    };
    let snippet_end = {
        let mut cursor = end;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text.get(cursor..).and_then(|s| s.chars().next()) {
                Some(ch) => cursor += ch.len_utf8(),
                None => break,
            }
        }
        cursor
    };
    let mut snippet = String::new();
    if snippet_start > 0 {
        snippet.push('…');
    }
    if let Some(segment) = text.get(snippet_start..snippet_end) {
        snippet.push_str(segment);
    }
    if snippet_end < text.len() {
        snippet.push('…');
    }
    snippet
}

References
- To safely slice a string slice in Rust without panicking on non-UTF-8 character boundaries while preserving byte semantics, use `.get(..index)` instead of direct slicing `[..index]` or converting to character iterators (e.g., `chars().take()`).
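The difference between the two slicing styles can be seen in a small standalone sketch. It is not the PR's code: `context_snippet` and its `radius` parameter are hypothetical names, and returning `Option` on a bad offset is an assumed design choice used here only to make the non-panicking behaviour of `str::get` visible.

```rust
/// Hypothetical helper (not from the PR): returns the byte range `start..end`
/// widened by up to `radius` characters on each side, or `None` when an
/// offset is out of range or falls inside a multi-byte character.
fn context_snippet(text: &str, start: usize, end: usize, radius: usize) -> Option<String> {
    // `str::get` yields `None` instead of panicking on a bad boundary.
    let before = text.get(..start)?;
    let after = text.get(end..)?;

    // Walk back at most `radius` characters from `start`.
    let snippet_start = before
        .char_indices()
        .rev()
        .take(radius)
        .last()
        .map_or(start, |(index, _)| index);

    // Walk forward at most `radius` characters from `end`.
    let forward: usize = after.chars().take(radius).map(char::len_utf8).sum();
    let snippet_end = end + forward;

    text.get(snippet_start..snippet_end).map(str::to_owned)
}
```

With direct slicing, an offset that lands inside a multi-byte character would panic; here the caller receives `None` and can decide how to degrade.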
for name in headers.keys() {
    let key = name.as_str().to_ascii_lowercase();
In the http crate, HeaderName::as_str() is already guaranteed to be lowercase. Calling .to_ascii_lowercase() on it is redundant and causes unnecessary allocations.
  for name in headers.keys() {
-     let key = name.as_str().to_ascii_lowercase();
+     let key = name.as_str().to_string();
References
- Avoid calling `.to_lowercase()` on strings or constants that are already known to be lowercase, as it causes redundant allocations. Query sets or maps directly using the borrowed lowercase string.
pub(crate) fn derived_moderation_session_key(provider: &str, key_id: &str, body: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(body);
    let digest = hasher.finalize();
    let mut preview = String::with_capacity(16);
    for byte in digest.iter().take(8) {
        preview.push_str(&format!("{byte:02x}"));
    }
    format!("{provider}:{key_id}:content:{preview}")
}
Formatting each byte of the digest in a loop using format! performs 8 separate string allocations. We can optimize this by converting the first 8 bytes of the digest to a u64 and formatting it once, so the preview is built with a single allocation (the final format! of the session key still allocates its own result).
  pub(crate) fn derived_moderation_session_key(provider: &str, key_id: &str, body: &[u8]) -> String {
      let mut hasher = Sha256::new();
      hasher.update(body);
      let digest = hasher.finalize();
-     let mut preview = String::with_capacity(16);
-     for byte in digest.iter().take(8) {
-         preview.push_str(&format!("{byte:02x}"));
-     }
+     let mut bytes = [0u8; 8];
+     bytes.copy_from_slice(&digest[..8]);
+     let val = u64::from_be_bytes(bytes);
+     let preview = format!("{val:016x}");
      format!("{provider}:{key_id}:content:{preview}")
  }
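The two ways of building the 16-character hex preview can be compared without the hashing crate. The sketch below is only illustrative: both function names are hypothetical, a plain byte slice stands in for the SHA-256 digest, and the per-byte variant uses `write!` into the pre-sized `String` as one possible alternative that avoids a temporary `String` per byte.

```rust
use std::fmt::Write;

/// Hypothetical per-byte variant: `write!` appends into the existing buffer,
/// so no temporary `String` is created for each byte.
fn hex_preview_per_byte(digest: &[u8]) -> String {
    let mut out = String::with_capacity(16);
    for byte in digest.iter().take(8) {
        // Writing into a `String` cannot fail, so the result is ignored.
        let _ = write!(out, "{byte:02x}");
    }
    out
}

/// Hypothetical single-format variant: packs the first 8 bytes into a
/// big-endian `u64` and formats once. Returns `None` for short input.
fn hex_preview_u64(digest: &[u8]) -> Option<String> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(digest.get(..8)?);
    Some(format!("{:016x}", u64::from_be_bytes(bytes)))
}
```

Big-endian packing matters here: it keeps the hex digits in the same order as the per-byte loop, so both variants produce identical previews for any digest of at least 8 bytes.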
    fn state_ban(&mut self, session_key: &str) -> bool {
        self.allowed.remove(session_key);
        self.banned.insert(session_key.to_string())
    }
}
Calling session_key.to_string() on every call to state_ban causes an unnecessary allocation if the session is already present in self.banned. Checking self.banned.contains first avoids this allocation.
impl ModerationGateState {
    fn state_ban(&mut self, session_key: &str) -> bool {
        if self.banned.contains(session_key) {
            return false;
        }
        self.allowed.remove(session_key);
        self.banned.insert(session_key.to_string())
    }
}

References
- Avoid allocating keys (e.g., calling `.to_string()`) on every iteration of a loop when querying a map, especially on performance-critical hot paths or while holding a lock. Instead, query the map using a borrowed key (e.g., `get_mut(key.as_ref())`) and only allocate a new key when inserting a new entry for the first time. This reduces allocations from O(N) to O(distinct keys) and minimizes lock contention.
CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_status_banned_at
    ON llm_moderation_banned_sessions(status, banned_at_ms DESC);

CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_key_id
    ON llm_moderation_banned_sessions(key_id, banned_at_ms DESC);
The query in list_moderation_banned_sessions without a status filter orders by banned_at_ms DESC, id DESC. Without an index on (banned_at_ms DESC, id DESC), this query will require a full table scan followed by an explicit sort as the table grows. Adding a composite index on these fields will significantly improve pagination performance.
CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_status_banned_at
    ON llm_moderation_banned_sessions(status, banned_at_ms DESC);
CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_banned_at
    ON llm_moderation_banned_sessions(banned_at_ms DESC, id DESC);
CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_key_id
    ON llm_moderation_banned_sessions(key_id, banned_at_ms DESC);

Exercise the moderation store against a real Postgres (Neon CI branch): keyword bulk import with within-batch and cross-call ON CONFLICT dedup, NULLIF note coercion, delete/RETURNING, banned-session capture with JSONB body+headers, session_key conflict dedup, status-filtered pagination, review/unban, and the runtime snapshot contract. Gated on TEST_POSTGRES_URL and skipped when unset, matching the existing integration tests. Adds the two moderation tables to the reset_test_db TRUNCATE list for isolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keywords and request text may contain punctuation and arbitrary spacing. Normalize both sides through a shared tokenizer before the Aho-Corasick phrase match: lowercase, split into terms (alphanumeric runs for space-delimited scripts, one term per ideographic character), drop punctuation/whitespace as separators, and rejoin with a single space. This makes matching insensitive to punctuation/spacing (`build a bomb` matches `Build, a bomb!`) and, because ideographs tokenize per character, defeats separator-injection evasion (`习.近.平` still matches `习近平`). Term-boundary alignment on the canonical form keeps `bomb` from firing inside `bomber`. The Halfwidth & Fullwidth Forms block is excluded from the ideographic set so fullwidth punctuation stays a separator. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
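The tokenizer described above can be approximated in a few lines. This is a rough sketch under stated assumptions, not the PR's implementation: `is_ideographic`, `canonicalize`, and `contains_phrase` are hypothetical names, the ideograph test covers only the CJK Unified Ideographs block and Extension A, and a plain substring search over space-padded canonical forms stands in for the Aho-Corasick automaton and its term-boundary check.

```rust
/// Assumed ideograph test: CJK Unified Ideographs plus Extension A only.
fn is_ideographic(ch: char) -> bool {
    matches!(ch, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Lowercase, split into terms, drop separators, rejoin with single spaces.
fn canonicalize(input: &str) -> String {
    let mut terms: Vec<String> = Vec::new();
    let mut run = String::new();
    for ch in input.chars().flat_map(char::to_lowercase) {
        if is_ideographic(ch) {
            // Each ideograph is its own term; flush any pending run first.
            if !run.is_empty() {
                terms.push(std::mem::take(&mut run));
            }
            terms.push(ch.to_string());
        } else if ch.is_alphanumeric() {
            run.push(ch);
        } else if !run.is_empty() {
            // Punctuation and whitespace only separate terms.
            terms.push(std::mem::take(&mut run));
        }
    }
    if !run.is_empty() {
        terms.push(run);
    }
    terms.join(" ")
}

/// Naive stand-in for the automaton: padding both sides with spaces makes a
/// substring hit line up with whole terms.
fn contains_phrase(text: &str, keyword: &str) -> bool {
    let needle = canonicalize(keyword);
    if needle.is_empty() {
        return false;
    }
    format!(" {} ", canonicalize(text)).contains(&format!(" {} ", needle))
}
```

Because each ideograph becomes its own term, a separator injected between ideographs disappears in the canonical form, while `bomber` stays a single term that the padded `bomb` needle cannot match.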
Apply verified findings from a full-feature review pass:

Hot path
- Add ModerationGate::is_active() and a shared enforce_moderation() helper so a dormant gate (no keywords, no bans) does zero work: no SHA-256 session-key derivation, no text extraction, no scan.
- Collapse the triplicated ban-record + precheck logic across the kiro, codex, and direct-anthropic hooks into enforce_moderation(), removing the duplicated key derivation and MessagesRequest extraction.

Capture fidelity
- Store request_body_json/request_headers_json as TEXT, not JSONB, so the captured wire bytes are preserved verbatim for review instead of being reparsed/reordered; moderation_body_text() drops the extra JSON parse.
- Add the missing reviewed_at_ms >= 0 CHECK to migration 0036.

Matching
- Fold fullwidth ASCII (B→b, fullwidth punctuation→separator) in the tokenizer so fullwidth-form evasion is caught.
- Drop the unused ModerationMatch.pattern_index field.

Admin & UI
- Cap keyword imports (count + per-keyword length).
- Banned-sessions review console: pagination (prev/next), a close control on the capture panel, clear the stale panel after a review, clear the error banner on a successful load, and add table header scope semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add diagram-driven English documentation:
- llm-access-core moderation: the tokenize → canonical-form → Aho-Corasick → term-boundary pipeline, worked English and CJK examples, and the Aho-Corasick complexity rationale (single O(n) scan over all keywords).
- llm-access moderation gate: the memory-vs-Postgres caching contract and the per-request enforce_moderation() decision flow (dormant → session key → precheck → scan → ban), as ASCII diagrams.
- Pointer comments at the three dispatch hook sites and the migration.

Comments only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
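The decision flow named above (dormant → session key → precheck → scan → ban) might look roughly like the following in-memory sketch. Everything here is an assumption for illustration: the `Gate` and `Decision` types, the lazy `derive_key` closure, the allowlist short-circuit, and the lowercase substring scan that stands in for the real matcher. Persistence of a new ban is omitted entirely.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Blocked,
    NewlyBanned,
}

/// Hypothetical in-memory gate state; the real gate holds a compiled automaton.
#[derive(Default)]
struct Gate {
    keywords: Vec<String>,
    banned: HashSet<String>,
    allowlisted: HashSet<String>,
}

impl Gate {
    /// A gate with no keywords and no bans has nothing to enforce.
    fn is_active(&self) -> bool {
        !self.keywords.is_empty() || !self.banned.is_empty()
    }

    /// `derive_key` is only invoked once the gate is known to be active.
    fn enforce(&mut self, derive_key: impl FnOnce() -> String, text: &str) -> Decision {
        if !self.is_active() {
            return Decision::Allow; // dormant: no key derivation, no scan
        }
        let session_key = derive_key();
        if self.banned.contains(&session_key) {
            return Decision::Blocked; // precheck: already banned, no scan
        }
        if self.allowlisted.contains(&session_key) {
            return Decision::Allow; // assumed: reviewed sessions skip the scan
        }
        let lowered = text.to_lowercase();
        if self.keywords.iter().any(|k| lowered.contains(k.as_str())) {
            self.banned.insert(session_key);
            return Decision::NewlyBanned;
        }
        Decision::Allow
    }
}
```

Passing the key derivation as a closure is one way to make the "dormant gate does zero work" property explicit: the hash is never computed unless a later step needs it.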
…name
Replace the CJK moderation example `习近平` with the neutral, on-theme
`违禁词` ("banned word") in the module docs, the is_ideographic_char note,
and the tokenizer test. Same 3-ideograph shape, so the diagrams and the
separator-evasion demonstration are unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Classify moderation keywords under an 11-category risk taxonomy aligned to the OpenAI usage policy (csam, sexual, weapons, extremism, drugs, criminal, fraud, cyber, piracy, self_harm, jailbreak). A keyword may carry several categories; a ban record captures the categories of the keyword that fired.

- Data model: new llm_moderation_categories table; category_codes on keywords and matched_categories on banned sessions (JSONB arrays, GIN indexed). Migration 0037 seeds the 11 categories and the 642 classified blocklist keywords (canonicalized through the real tokenizer). The classification was generated by range+override mapping and adversarially audited (only 1/642 corrected).
- Matcher: ModerationMatcher carries per-keyword categories; a hit returns them, and the gate records them on the ban.
- Store trait + Postgres: category list/add/delete (delete refuses while a keyword still references the code), keyword import with categories.
- Admin API: /moderation/categories list/add/delete; keyword import accepts a validated batch-level category set.
- Frontend: a Categories tab (manage the taxonomy), category multi-select on import, and severity-colored category badges on the keyword list, the banned-session list, and the capture detail.

The client-facing rejection stays generic (no keyword/category leak); the admin console shows the full keyword + categories.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review of the hit-scoped unban surfaced correctness bugs; fixes:

1. [blocker] Suppression bypass: the resume loop advanced by match_start (find_after, exclusive on start), which discards EVERY hit at that offset, including a DISTINCT unsuppressed keyword sharing the start (e.g. `bomb` and `bomb making`, or `习`/`习近`). Unbanning the shorter one silently masked the longer one. Replace the position-based resume scan with ModerationMatcher::find_accepted: one overlapping scan from 0 that skips only suppressed hit_keys, so co-located keywords are each evaluated. Drops the unsafe resume-cursor scan-skip (it could also miss a longer keyword starting before a cursor); per-hit content-scoping is preserved via the prefix hash folded into hit_key. Regression test added.
2. [major] Drop the partial UNIQUE index on (session_key) WHERE status='banned': it enforced one active ban per session, contradicting the multi-hit model and 500ing when re-banning a reviewed hit. Per-hit uniqueness is already covered by hit_key UNIQUE.
3. [minor] record_moderation_banned_session now uses ON CONFLICT (hit_key) DO NOTHING so a distinct new hit in an already-banned session is captured rather than silently swallowed by the dropped index's conflict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
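The identity-based skipping in fix 1 can be illustrated with a deliberately naive scanner. This is a sketch, not the PR's `ModerationMatcher::find_accepted`: the `Hit` type and the simplified `hit_key` (keyword plus offsets, without the session or the preceding-content hash) are assumptions, and a nested `starts_with` loop replaces the overlapping Aho-Corasick scan.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
struct Hit<'a> {
    keyword: &'a str,
    start: usize,
    end: usize,
}

/// Simplified, assumed hit identity: keyword plus byte offsets only.
fn hit_key(hit: &Hit) -> String {
    format!("{}:{}:{}", hit.keyword, hit.start, hit.end)
}

/// Scans every start offset against every keyword and returns the first hit
/// whose key is not suppressed. Suppression is per hit, never per position.
fn find_accepted<'a>(
    text: &str,
    keywords: &[&'a str],
    suppressed: &HashSet<String>,
) -> Option<Hit<'a>> {
    for (start, _) in text.char_indices() {
        for &keyword in keywords {
            // `start` comes from `char_indices`, so slicing here is safe.
            if !keyword.is_empty() && text[start..].starts_with(keyword) {
                let hit = Hit { keyword, start, end: start + keyword.len() };
                if !suppressed.contains(&hit_key(&hit)) {
                    return Some(hit);
                }
            }
        }
    }
    None
}
```

With `bomb` suppressed at offset 0, the scan still evaluates `bomb making` at the same offset, which is exactly the case a position-based resume cursor would skip.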
Document, in code, why moderation suppression must skip by hit identity rather than by scan position:
- A new module-doc section "Hit-scoped unban" spells out what hit_key is (session + keyword + offsets + preceding-content hash), why the content prefix makes suppression fail-closed on any content change, and a WRONG-vs-RIGHT diagram showing how position-based skipping lets a distinct keyword sharing a suppressed hit's start offset (bomb / bomb making) slip through; this is the bypass fixed by find_accepted.
- find_accepted's rustdoc explains it exists precisely to avoid that bypass.

Comments only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
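Overview
Adds a keyword moderation module that runs before upstream dispatch for the Kiro / Codex gateways. When the request body matches a configured keyword, the corresponding session is banned, and this request and all subsequent requests are blocked. On a match, the full request body plus the redacted request headers are recorded once for review, and a reviewer can unban a session that was banned by mistake.

Requirements mapping
- `AdminModerationStore` + migration `0036`; matching is hooked in before upstream on the three dispatch paths: kiro, codex, and direct Anthropic.
- `llm_moderation_banned_sessions` (body/headers stored as JSONB, with the matched keyword and a context snippet).
- `/admin/llm-gateway/moderation`: two tabs, keyword management and ban review.
- Both `txt` (one word/phrase per line) and `json` (an array, `{"keywords":[...]}`, or an array of objects) are supported.

Matching engine
- … `ass` does not match `class`), and CJK phrases match without requiring boundaries.
- … `system` + `messages[].content`; OpenAI/Codex `instructions` + `content[].text`); structural noise such as JSON field names, model ids, and tool schemas is not included in matching.

Caching / performance (key design)
- … `ON CONFLICT DO NOTHING` + in-memory dedup), and the record is persisted asynchronously via `tokio::spawn`, so it does not block the response.

Storage
- `AdminModerationStore` trait + `empty.rs` stub + Postgres implementation.
- `0036_keyword_moderation.sql`: `llm_moderation_keywords`, `llm_moderation_banned_sessions` (JSONB body/headers, with status and review indexes).

Admin API
- `/admin/llm-gateway/moderation/*`: keyword list / bulk import (txt/json) / delete; paginated banned-session list, detail (including the full request body/headers), unban or keep the ban. Reuses the existing `ensure_admin_access` authorization.

Tests and checks
- `llm-access-core` matching-engine unit tests: 11 (normalization, phrase tolerance, ASCII word boundaries, CJK, txt/json parsing, body text extraction), all passing.
- `llm-access` gate module unit tests: 7 (session key, redacted headers, body JSON wrapping, kiro/json body text extraction, disabled gate), all passing.
- `cargo clippy` reports zero warnings for the full `llm-access` stack and `static-flow-frontend` (wasm32).
- `rustfmt` was applied only to the changed files.

Deployment surface
Changes are concentrated in `llm-access*`; per repository convention, the production release target is the `llm-access` service on AWS.

🤖 Generated with Claude Code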