feat(spec): SPEC-WFAIPRO-001 — WorkflowAI Pro Technical Specification #39
Implementation-ready XML specification for WorkflowAI Pro enterprise workflow
optimisation platform. Tri-model AI architecture: GNN document routing,
collaborative filtering bottleneck prediction, active learning UI adaptation.
Document: SPEC-WFAIPRO-001 v1.0.0 (52,955 bytes, 1,323 lines)
Format: XML with CDATA-wrapped Markdown content
6 Required Sections — All Present:
1. Executive Summary — tri-model architecture overview, key differentiators table
2. System Architecture — syntax-valid Mermaid.js C4 Container diagram (13 containers,
3 external systems, 27 relationships)
3. AI Components — HeteroGAT GNN (18M params, <200ms P99), NCF with temporal
attention (72h lookahead, >91% precision), pool-based AL with BatchBALD
(200 labels/day, MC Dropout T=20)
4. Implementation Specs — Deep dive on 3 entities:
- Document Router: OpenAPI 3.0 (3 endpoints), PostgreSQL schema (4 tables,
RLS multi-tenancy, hash partitioning), Kafka (4 topics, exactly-once)
- Approval Predictor: OpenAPI 3.0 (2 endpoints), MongoDB schema (2 collections
with JSON Schema validation), Redis feature store (4 key patterns, TTL policy),
Kafka (3 topics, 5-tier retry backoff)
- Adaptive UI Engine: OpenAPI 3.0 (3 endpoints), MongoDB schema (2 collections:
al_pool, al_experiments), Kafka (4 topics including model.retrained)
5. Performance, Security & Compliance — exactly 3 bullet points:
SLAs, GDPR/SOC 2, RBAC (6 roles, 23 permissions, OPA enforcement)
6. 18-Month Roadmap & Risks — exactly 8 bullet points:
Q1-Q6 milestones + 2 risk mitigations (model drift, Kafka backpressure)
Validation: XML well-formed (Python ET parse), all section/content checks pass.
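The validation step can be approximated as follows. This is a minimal sketch, assuming the spec wraps Markdown in CDATA inside `<section>` elements; the element and attribute names (`specification`, `section`, `id`) and the section ids are hypothetical, since the PR description does not show the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the spec; the real element names are not shown in the PR.
SAMPLE = """<specification id="SPEC-WFAIPRO-001" version="1.0.0">
  <section id="executive-summary"><![CDATA[# Executive Summary
Markdown may contain <angle brackets> & ampersands inside CDATA.]]></section>
  <section id="system-architecture"><![CDATA[# System Architecture]]></section>
</specification>"""


def check_spec(xml_text, required_ids):
    """Return required section ids that are missing or empty.

    Raises xml.etree.ElementTree.ParseError if the document is not well-formed.
    """
    root = ET.fromstring(xml_text)
    present = {s.get("id"): (s.text or "").strip() for s in root.iter("section")}
    return [sid for sid in required_ids if not present.get(sid)]


missing = check_spec(
    SAMPLE, ["executive-summary", "system-architecture", "ai-components"]
)
print(missing)
```

A well-formedness check alone does not confirm that all six sections exist, which is why the sketch also looks for required ids.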
Reviewer's Guide
Adds a new implementation-ready XML technical specification for WorkflowAI Pro, defining a tri-model AI workflow optimization platform with detailed architecture, AI models, APIs, data schemas, and messaging topologies for the Document Router, Approval Predictor, and Adaptive UI Engine services.
Sequence diagram for document routing and bottleneck prediction
sequenceDiagram
actor User
participant API_Gateway
participant Document_Router_Service
participant Redis_Feature_Store
participant GNN_Inference_Engine
participant PostgreSQL_DB
participant Kafka_Broker
participant Approval_Predictor_Service
User->>API_Gateway: POST /api/v2/documents/route
API_Gateway->>Document_Router_Service: routeDocument(document_id, tenant_id, content_hash, doc_type, metadata)
Document_Router_Service->>Redis_Feature_Store: GET entity_embeddings(document_id, tenant_id)
Redis_Feature_Store-->>Document_Router_Service: embeddings, graph_features
Document_Router_Service->>GNN_Inference_Engine: gRPC infer_routing_paths(embeddings, graph_context)
GNN_Inference_Engine-->>Document_Router_Service: top_paths, confidences
Document_Router_Service->>PostgreSQL_DB: INSERT documents, routing_decisions, routing_paths
Document_Router_Service->>Kafka_Broker: Produce doc.routed(document_id, routing_id, selected_path)
Document_Router_Service->>Kafka_Broker: Produce doc.routing.escalated(if confidence < 0.75)
Document_Router_Service-->>API_Gateway: 200 RoutingDecision or 202 EscalationResponse
API_Gateway-->>User: Routing decision response
Kafka_Broker-->>Approval_Predictor_Service: Consume approval.requested(document_id, approval_chain)
Approval_Predictor_Service->>Redis_Feature_Store: GET user_stage_embeddings, temporal_features
Redis_Feature_Store-->>Approval_Predictor_Service: feature_vectors
Approval_Predictor_Service->>Approval_Predictor_Service: NCF_inference_for_chain
Approval_Predictor_Service->>Kafka_Broker: Produce approval.predicted(document_id, stage_risks)
Approval_Predictor_Service->>PostgreSQL_DB: UPDATE chain_risk_metadata(optional)
Kafka_Broker-->>Document_Router_Service: approval.predicted(re_routing_triggers)
Document_Router_Service->>PostgreSQL_DB: UPDATE routing_decisions_with_suggested_re_routes
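The branching in the routing diagram can be sketched as below. This is one reading of the diagram, not the spec's implementation: it assumes `doc.routed` is always produced and `doc.routing.escalated` is produced in addition when the best path's confidence falls below 0.75. The `RoutingOutcome` type and `decide` function are hypothetical names.

```python
from dataclasses import dataclass, field

ESCALATION_THRESHOLD = 0.75  # from the diagram: escalate if confidence < 0.75


@dataclass
class RoutingOutcome:
    http_status: int
    events: list = field(default_factory=list)  # (topic, payload) pairs to produce


def decide(document_id, top_paths):
    """Pick the most confident path and decide which events to emit.

    top_paths: list of (path_id, confidence) tuples returned by the GNN.
    """
    if not top_paths:
        raise ValueError("GNN returned no candidate paths")
    path_id, confidence = max(top_paths, key=lambda p: p[1])
    payload = {"document_id": document_id, "selected_path": path_id}
    outcome = RoutingOutcome(http_status=200)
    outcome.events.append(("doc.routed", payload))
    if confidence < ESCALATION_THRESHOLD:
        # Low confidence: answer 202 and hand the decision to a human reviewer.
        outcome.http_status = 202
        outcome.events.append(
            ("doc.routing.escalated", {**payload, "confidence": confidence})
        )
    return outcome


confident = decide("d-1", [("p-0", 0.91), ("p-1", 0.40)])
uncertain = decide("d-2", [("p-0", 0.62)])
print(confident.http_status, uncertain.http_status)
```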
Sequence diagram for adaptive UI layout resolution and active learning loop
sequenceDiagram
actor User
participant API_Gateway
participant Adaptive_UI_Engine
participant Active_Learning_Service
participant Kafka_Broker
participant MongoDB_AL_Collections
User->>API_Gateway: POST /api/v2/ui/layout(context)
API_Gateway->>Adaptive_UI_Engine: getAdaptiveLayout(user_id, tenant_id, role_id, task_type, accessibility_flags)
Adaptive_UI_Engine->>Active_Learning_Service: Resolve_layout(context_vector)
Active_Learning_Service->>MongoDB_AL_Collections: INSERT al_pool_sample(context, predicted_layout_id)
Active_Learning_Service-->>Adaptive_UI_Engine: LayoutConfig(layout_id, components, theme_overrides)
Adaptive_UI_Engine-->>API_Gateway: 200 LayoutConfig
API_Gateway-->>User: Rendered adaptive UI
User->>API_Gateway: POST /api/v2/ui/feedback(session_id, layout_id, feedback)
API_Gateway->>Adaptive_UI_Engine: submitUIFeedback(payload)
Adaptive_UI_Engine->>Kafka_Broker: Produce ui.feedback(session_id, layout_id, metrics)
Kafka_Broker-->>Active_Learning_Service: Consume ui.feedback
Active_Learning_Service->>MongoDB_AL_Collections: UPDATE al_pool_sample_with_implicit_label
Active_Learning_Service->>Active_Learning_Service: Periodic_MC_Dropout_uncertainty_estimation
Active_Learning_Service->>MongoDB_AL_Collections: FIND top_entropy_diverse_samples
Active_Learning_Service->>Kafka_Broker: Produce al.query(sample_id, context)
Kafka_Broker-->>Active_Learning_Service: Consume al.label.acquired(sample_id, assigned_layout_id)
Active_Learning_Service->>MongoDB_AL_Collections: UPDATE annotation_for_sample
Active_Learning_Service->>Active_Learning_Service: Trigger_model_retrain_when_labels_threshold_reached
Active_Learning_Service->>Kafka_Broker: Produce model.retrained(model_type=al_layout, version)
Kafka_Broker-->>Adaptive_UI_Engine: model.retrained(al_layout, version)
Adaptive_UI_Engine->>Adaptive_UI_Engine: Hot_reload_layout_model
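The uncertainty-estimation and sample-selection steps in this loop can be illustrated with a small sketch. It uses plain maximum predictive entropy over T stochastic forward passes, which is a simpler stand-in for the BatchBALD acquisition named in the PR description and does not account for batch diversity. The pool contents and function names are made up for illustration.

```python
import math


def predictive_entropy(passes):
    """Entropy of the mean class distribution over T dropout-enabled passes.

    passes: list of T probability vectors, one per stochastic forward pass.
    """
    t = len(passes)
    n_classes = len(passes[0])
    mean = [sum(p[c] for p in passes) / t for c in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def select_for_labeling(pool, budget):
    """Return the `budget` sample ids with the highest predictive entropy."""
    ranked = sorted(pool, key=lambda sid: predictive_entropy(pool[sid]), reverse=True)
    return ranked[:budget]


# T = 20 passes per sample, matching the MC Dropout setting in the description.
pool = {
    "s-confident": [[0.98, 0.01, 0.01]] * 20,
    "s-uncertain": [[0.34, 0.33, 0.33]] * 20,
    "s-disagree": [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]] * 10,
}
selected = select_for_labeling(pool, budget=2)
print(selected)
```

Samples whose passes disagree with each other score higher than a consistently confident sample, which is the behaviour the periodic uncertainty step relies on.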
Entity relationship diagram for core WorkflowAI Pro data schemas
erDiagram
DOCUMENTS {
uuid id
uuid tenant_id
char64 content_hash
varchar32 doc_type
varchar16 urgency
text_array compliance_flags
jsonb metadata
timestamptz created_at
timestamptz updated_at
}
ROUTING_DECISIONS {
uuid id
uuid document_id
uuid tenant_id
varchar16 decision
numeric4_3 confidence
uuid selected_path_id
varchar32 model_version
numeric8_2 inference_latency_ms
timestamptz created_at
}
ROUTING_PATHS {
uuid id
uuid routing_decision_id
smallint path_rank
numeric6_2 total_predicted_duration_h
numeric4_3 path_confidence
jsonb stages
}
ROUTING_AUDIT_LOG {
bigint id
uuid document_id
uuid tenant_id
varchar64 stage_id
uuid approver_id
varchar16 action
timestamptz occurred_at
jsonb metadata
}
PREDICTION_LOGS {
string prediction_id
string tenant_id
string document_id
string model_version
date created_at
double inference_latency_ms
array stages
string overall_chain_risk
bool feedback_received
}
CF_MODEL_ARTIFACTS {
string model_id
string version
date created_at
string status
object hyperparams
object metrics
string artifact_path
string training_data_snapshot
}
AL_POOL {
string sample_id
string tenant_id
object context
string predicted_layout_id
double prediction_entropy
double mc_dropout_variance
string status
date created_at
date selected_at
date annotated_at
object annotation
}
AL_EXPERIMENTS {
string experiment_id
string incumbent_version
string challenger_version
string status
date created_at
date concluded_at
double traffic_split
object metrics
string decision
}
DOCUMENTS ||--o{ ROUTING_DECISIONS : has
DOCUMENTS ||--o{ ROUTING_AUDIT_LOG : has
ROUTING_DECISIONS ||--o{ ROUTING_PATHS : includes
DOCUMENTS ||--o{ PREDICTION_LOGS : has
CF_MODEL_ARTIFACTS ||--o{ PREDICTION_LOGS : generates
AL_EXPERIMENTS ||--o{ AL_POOL : evaluates
CF_MODEL_ARTIFACTS ||--o{ AL_EXPERIMENTS : compared_in
📝 Walkthrough
A new XML technical specification document for the WorkflowAI Pro platform has been added, detailing a C4 architecture with three core services (Document Router, Approval Predictor, Adaptive UI Engine), AI component architectures, end-to-end data flows over Kafka/Redis/PostgreSQL/MongoDB, OpenAPI endpoint contracts, schema definitions, and operational requirements.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
❌ Deploy Preview for onefinestarstuff failed.
Hey - I've found 2 issues, and left some high level feedback:
- Kafka topic names and semantics are described in multiple places (e.g., high-level bullets vs detailed YAML sections); consider standardizing the exact topic names and DLQ naming across the entire spec to avoid ambiguity during implementation.
- There are many timestamp and duration fields across APIs and schemas (PostgreSQL, MongoDB, OpenAPI); explicitly stating a global convention (e.g., all timestamps in ISO 8601 UTC, all durations in hours/ms) near the top of the spec would reduce the risk of subtle cross-service inconsistencies.
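A global convention of the kind suggested here could look like the sketch below. It assumes ISO 8601 UTC with millisecond precision and a `Z` suffix on the wire, and integer milliseconds for durations; these choices and the helper names are illustrative, not taken from the spec.

```python
from datetime import datetime, timedelta, timezone


def to_wire(ts):
    """Serialize an aware datetime as ISO 8601 UTC; reject naive values."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError("naive datetime: attach a timezone before serializing")
    utc = ts.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def from_wire(text):
    """Parse the wire format back into an aware UTC datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00")).astimezone(timezone.utc)


def hours_to_ms(hours):
    """Convert a duration in hours (e.g. predicted_duration_h) to milliseconds."""
    return int(round(hours * 3_600_000))


local = datetime(2025, 1, 15, 9, 30, tzinfo=timezone(timedelta(hours=2)))
wire = to_wire(local)
print(wire, hours_to_ms(1.5))
```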
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Kafka topic names and semantics are described in multiple places (e.g., high-level bullets vs detailed YAML sections); consider standardizing the exact topic names and DLQ naming across the entire spec to avoid ambiguity during implementation.
- There are many timestamp and duration fields across APIs and schemas (PostgreSQL, MongoDB, OpenAPI); explicitly stating a global convention (e.g., all timestamps in ISO 8601 UTC, all durations in hours/ms) near the top of the spec would reduce the risk of subtle cross-service inconsistencies.
## Individual Comments
### Comment 1
<location path="docs/specifications/workflow-ai-pro.xml" line_range="475-484" />
<code_context>
+CREATE TABLE routing_audit_log (
</code_context>
<issue_to_address>
**🚨 issue (security):** Apply consistent tenant isolation to `routing_audit_log` (and potentially `routing_paths`) to align with the multi-tenant security goals.
`routing_audit_log` includes `tenant_id` but is not protected by RLS, unlike `documents` and `routing_decisions`. To maintain strict tenant isolation and long-term audit data safety, enable RLS here and add a `tenant_isolation` policy. Also consider adding `tenant_id` and RLS to `routing_paths` so it remains tenant-scoped even if accessed without joining through `routing_decision_id`.
</issue_to_address>
### Comment 2
<location path="docs/specifications/workflow-ai-pro.xml" line_range="464-471" />
<code_context>
+CREATE INDEX idx_routing_decisions_tenant_created
+ ON routing_decisions (tenant_id, created_at DESC);
+
+CREATE TABLE routing_paths (
+ id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+ routing_decision_id UUID NOT NULL
+ REFERENCES routing_decisions(id) ON DELETE CASCADE,
+ path_rank SMALLINT NOT NULL, -- 0=selected, 1-2=alternatives
+ total_predicted_duration_h NUMERIC(6,2),
+ path_confidence NUMERIC(4,3),
+ stages JSONB NOT NULL
+ -- array of {stage_id, approver_id, predicted_duration_h, bottleneck_prob}
+);
</code_context>
<issue_to_address>
**suggestion (performance):** Add an index on `(routing_decision_id, path_rank)` in `routing_paths` to support common query patterns efficiently.
Given how `RoutingDecision.selected_path` and `alternative_paths` will be used, queries will often filter/order by `routing_decision_id` and `path_rank` (e.g., rank 0 plus a few alternatives). With only a PK on `id`, these will devolve into table scans as data grows. Please add a non-unique index on `(routing_decision_id, path_rank)` to keep lookups efficient, especially under multi-tenant load.
</issue_to_address>
CREATE TABLE routing_audit_log (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    document_id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    stage_id VARCHAR(64) NOT NULL,
    approver_id UUID,
    action VARCHAR(16) NOT NULL,
    -- approved | rejected | delegated | escalated
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    metadata JSONB DEFAULT '{}'
🚨 issue (security): Apply consistent tenant isolation to routing_audit_log (and potentially routing_paths) to align with the multi-tenant security goals.
routing_audit_log includes tenant_id but is not protected by RLS, unlike documents and routing_decisions. To maintain strict tenant isolation and long-term audit data safety, enable RLS here and add a tenant_isolation policy. Also consider adding tenant_id and RLS to routing_paths so it remains tenant-scoped even if accessed without joining through routing_decision_id.
CREATE TABLE routing_paths (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    routing_decision_id UUID NOT NULL
        REFERENCES routing_decisions(id) ON DELETE CASCADE,
    path_rank SMALLINT NOT NULL, -- 0=selected, 1-2=alternatives
    total_predicted_duration_h NUMERIC(6,2),
    path_confidence NUMERIC(4,3),
    stages JSONB NOT NULL
suggestion (performance): Add an index on (routing_decision_id, path_rank) in routing_paths to support common query patterns efficiently.
Given how RoutingDecision.selected_path and alternative_paths will be used, queries will often filter/order by routing_decision_id and path_rank (e.g., rank 0 plus a few alternatives). With only a PK on id, these will devolve into table scans as data grows. Please add a non-unique index on (routing_decision_id, path_rank) to keep lookups efficient, especially under multi-tenant load.
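The effect of such a composite index can be demonstrated locally. The sketch below uses SQLite from the Python standard library rather than PostgreSQL, so the planner output differs from what `EXPLAIN` would print in production; the index name `idx_routing_paths_decision_rank` and the simplified column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE routing_paths (
        id TEXT PRIMARY KEY,
        routing_decision_id TEXT NOT NULL,
        path_rank INTEGER NOT NULL,
        stages TEXT NOT NULL
    );
    -- Hypothetical name; non-unique, as suggested in the review comment.
    CREATE INDEX idx_routing_paths_decision_rank
        ON routing_paths (routing_decision_id, path_rank);
    """
)


def plan_for(sql, params=()):
    """Return the planner's description of how a query would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan = plan_for(
    "SELECT id, stages FROM routing_paths "
    "WHERE routing_decision_id = ? ORDER BY path_rank",
    ("rd-1",),
)
print(plan)
```

Because `path_rank` is the second index column, the same index serves both the equality filter and the ordering, so no separate sort step should appear in the plan.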
Actionable comments posted: 5
🧹 Nitpick comments (8)
docs/specifications/workflow-ai-pro.xml (8)
550-550: Long max_poll_interval_ms may cause consumer group instability.
The Document Router consumer configuration sets max_poll_interval_ms: 300000 (5 minutes) to accommodate GNN inference latency. However, this very long interval increases the risk of:
- Delayed detection of consumer failures
- Extended partition ownership during hung/slow consumers
- Consumer group rebalancing delays
Recommendation:
- Verify that GNN inference P99 latency target (<200ms) is achieved in production
- If inference occasionally exceeds 5 minutes, consider processing messages asynchronously (immediately commit offset after queuing message for background processing)
- Implement consumer heartbeat monitoring to detect processing delays earlier than the 5-minute timeout
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` at line 550, The consumer configuration sets max_poll_interval_ms: 300000 which is too long and can cause consumer-group instability; update the consumer behavior by either lowering max_poll_interval_ms to a safer value (e.g., closer to expected GNN P99 <200ms) or change processing to be asynchronous: immediately commit offsets after enqueueing messages for background GNN inference and implement heartbeat/monitoring to detect slow consumers earlier; locate and modify the max_poll_interval_ms setting in workflow-ai-pro.xml and ensure any consumer loop (consumer poll/commit logic and background worker/queue) is changed to enqueue work and commit promptly while adding heartbeat/monitoring hooks.
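One way to bound per-message processing time, as an alternative to a long poll interval, is a deadline around the inference call. The sketch below is an illustration only: `fake_infer` stands in for the gRPC call to the GNN engine, the in-memory `dead_letters` list stands in for a DLQ topic, and the deadlines are shortened so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def process_with_deadline(executor, infer, message, deadline_s, dead_letters):
    """Run inference with a deadline; divert the message if it is exceeded."""
    future = executor.submit(infer, message)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # The worker thread is not killed here; a real service would also need
        # a client-side timeout or cancellation on the inference call itself.
        dead_letters.append(message)
        return None


def fake_infer(message):
    """Stand-in for the GNN inference call; sleeps for the configured cost."""
    time.sleep(message["cost_s"])
    return {"document_id": message["document_id"], "selected_path": "p-0"}


dead_letters = []
with ThreadPoolExecutor(max_workers=2) as pool:
    fast = process_with_deadline(
        pool, fake_infer, {"document_id": "d-1", "cost_s": 0.0}, 0.5, dead_letters
    )
    slow = process_with_deadline(
        pool, fake_infer, {"document_id": "d-2", "cost_s": 0.3}, 0.05, dead_letters
    )
print(fast, slow, len(dead_letters))
```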
735-781: Consider adding TTL indexes to prevent unbounded collection growth.
The prediction_logs collection stores every prediction but lacks a TTL (Time-To-Live) index. Without automatic expiration, this collection will grow indefinitely, potentially impacting performance and storage costs.
Consider adding TTL indexes for time-series data:
    // Auto-delete prediction logs older than 180 days
    db.prediction_logs.createIndex(
      { created_at: 1 },
      { expireAfterSeconds: 15552000 } // 180 days
    );
    // Similar for AL pool: remove samples in "expired" status
    // once created_at is more than 30 days old
    db.al_pool.createIndex(
      { created_at: 1 },
      {
        expireAfterSeconds: 2592000, // 30 days
        partialFilterExpression: { status: "expired" }
      }
    );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 735 - 781, The schema for the prediction_logs collection lacks TTL indexes so data will grow unbounded; add a TTL index on prediction_logs.created_at to expire old prediction documents (e.g., 180 days) by creating an index with expireAfterSeconds, and also add a TTL index on al_pool.created_at with a partialFilterExpression for status: "expired" (e.g., 30 days) to auto-remove expired AL samples; update the migration/schema diff to include db.prediction_logs.createIndex({ created_at: 1 }, { expireAfterSeconds: <seconds> }) and db.al_pool.createIndex({ created_at: 1 }, { expireAfterSeconds: <seconds>, partialFilterExpression: { status: "expired" } }) so retention is enforced.
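The two retention constants can be derived rather than hard-coded, which makes the intended windows explicit. This is a small sketch; `ttl_index` is a hypothetical helper that only builds the key list and option dictionary, which could then be passed to a driver call such as PyMongo's `create_index` as keyword options.

```python
from datetime import timedelta


def ttl_index(field, days, partial_filter=None):
    """Build (keys, options) for a TTL index that expires documents after `days`."""
    options = {"expireAfterSeconds": int(timedelta(days=days).total_seconds())}
    if partial_filter is not None:
        options["partialFilterExpression"] = partial_filter
    return [(field, 1)], options


prediction_logs_ttl = ttl_index("created_at", days=180)
al_pool_ttl = ttl_index("created_at", days=30, partial_filter={"status": "expired"})
print(prediction_logs_ttl)
print(al_pool_ttl)
```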
558-560: Clarify transactional_id implementation pattern.
Line 560 specifies transactional_id: "doc-router-tx-{instance_id}" for exactly-once semantics, but doesn't explain how {instance_id} should be generated or managed. Each producer instance must have a unique transactional ID that persists across restarts.
Document the implementation approach:
- How is instance_id generated? (e.g., pod name, UUID, consumer group member ID)
- Is it stable across pod restarts?
- How to handle transactional ID exhaustion/cleanup?
- Recovery procedure when a transactional producer fails mid-transaction
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 558 - 560, Clarify how the transactional_id pattern "doc-router-tx-{instance_id}" must be implemented: specify generation strategies for instance_id (e.g., use Kubernetes pod name for stability, or a cluster-assigned persistent UUID stored in a volume/secret), state whether the chosen approach is stable across restarts, describe lifecycle/cleanup to avoid transactional ID exhaustion (e.g., reusing stable IDs, TTL/policy for ephemeral IDs, admin tooling to remove retired IDs), and document recovery steps for a producer that failed mid-transaction (how to detect in-doubt transactions, force abort or resume via broker/admin APIs, and recommended monitoring/alerting). Include references to transactional_id and instance_id so implementers know where to apply each guidance.
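One possible answer to the generation question is to derive `instance_id` from a StatefulSet ordinal, since StatefulSet pods keep the name `<name>-<ordinal>` across restarts. The sketch below assumes that deployment model, which the spec does not confirm; the hostnames are examples.

```python
import re


def transactional_id(hostname, prefix="doc-router-tx"):
    """Derive a restart-stable transactional id from a StatefulSet-style hostname.

    Rejects hostnames without a numeric ordinal suffix (for example Deployment
    pods with random suffixes), because those ids would change on every restart.
    """
    match = re.fullmatch(r".+-(\d+)", hostname)
    if match is None:
        raise ValueError(f"hostname {hostname!r} has no stable ordinal suffix")
    return f"{prefix}-{match.group(1)}"


print(transactional_id("doc-router-3"))
```

Reusing a small, fixed set of ids this way also sidesteps the cleanup question, because a restarted producer takes over its previous id instead of creating a new one.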
834-846: Optimize embedding storage format for Redis.
Storing 64-dimensional embeddings as JSON strings (e.g., "[0.12,-0.34,...,0.56]") in Redis hash fields is inefficient:
- Parsing overhead when reading embeddings
- Increased memory footprint compared to binary formats
- Slower serialization/deserialization
Consider alternatives:
- Use Redis vector data type (Redis Stack with RediSearch) for native vector storage and similarity search
- Store as binary-encoded float arrays using MessagePack or Protocol Buffers
- Use HSET with separate numeric fields if individual dimensions need independent access
Example: Binary encoding with MessagePack
    import msgpack
    import numpy as np

    # Encode embedding (the "..." stands for the elided values)
    embedding = np.array([0.12, -0.34, ..., 0.56], dtype=np.float32)
    packed = msgpack.packb(embedding.tolist(), use_bin_type=True)
    redis.hset(f"feat:{tenant_id}:user_emb:{user_id}", "embedding", packed)

    # Decode embedding
    packed = redis.hget(f"feat:{tenant_id}:user_emb:{user_id}", "embedding")
    embedding = np.array(msgpack.unpackb(packed, raw=False), dtype=np.float32)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 834 - 846, The current HSET usage storing the 64-dim embedding as a JSON string under the "embedding" field (key pattern feat:{tenant_id}:user_emb:{user_id}) is inefficient; update the write/read flows to store the embedding in a binary/vector-native format—either switch to Redis Vector/RediSearch native vectors for similarity use, or encode the float32 array with MessagePack/Protobuf before HSET and decode on read—so modify the code that calls HSET for "embedding" to pack the float32 array and the corresponding reader to unpack it (or replace HSET with the Redis vector API), and keep other hash fields (department_id, last_updated, etc.) unchanged.
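For comparison, a fixed-width float32 encoding needs no third-party packages. The sketch below uses the standard-library `struct` module instead of MessagePack, so each 64-dimensional vector occupies exactly 256 bytes. The sample values are synthetic and chosen to be exactly representable in float32; no Redis client is involved, only the byte payload that would be written to the hash field.

```python
import json
import struct

DIM = 64
_FMT = f"<{DIM}f"  # little-endian, 64 x float32


def encode(vector):
    """Pack a 64-dimensional vector into 256 bytes of float32."""
    if len(vector) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(vector)}")
    return struct.pack(_FMT, *vector)


def decode(blob):
    """Unpack 256 bytes back into a list of 64 floats."""
    return list(struct.unpack(_FMT, blob))


# Synthetic values (multiples of 1/8) so the float32 round trip is exact.
embedding = [((i % 7) - 3) / 8 for i in range(DIM)]
blob = encode(embedding)
print(len(blob), len(json.dumps(embedding)))
```

Arbitrary values lose precision when narrowed to float32, so a round trip is only approximately equal in general.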
831-875: Add Redis Cluster hash tags to key patterns for optimal performance.
The feature store keys lack Redis Cluster hash tags, which can lead to related keys being distributed across different cluster nodes, requiring cross-node multi-key operations.
For example, fetching all features for a user (embedding + temporal features) might require multiple cross-node requests.
Add hash tags to ensure related keys reside on the same hash slot:
    -# Key: feat:{tenant_id}:user_emb:{user_id}
    +# Key: feat:{tenant_id}:user_emb:{user_id} -> use {tenant_id} or {user_id} as hash tag
    -HSET feat:t-abc123:user_emb:u-789def
    +HSET feat:{t-abc123}:user_emb:u-789def
    -HSET feat:t-abc123:temporal:u-789def
    +HSET feat:{t-abc123}:temporal:u-789def
This ensures all keys with the same {tenant_id} hash to the same Redis Cluster node, enabling efficient MGET/pipeline operations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 831 - 875, The keys must include Redis Cluster hash tags around the tenant identifier so related keys land on the same hash slot; update all key patterns (e.g., feat:{tenant_id}:user_emb:{user_id}, feat:{tenant_id}:stage_emb:{stage_id}, feat:{tenant_id}:cf_score:{user_id}:{stage_id}, feat:{tenant_id}:temporal:{user_id}) to wrap only the tenant_id in braces (e.g., feat:{<tenant_id>}:user_emb:<user_id>) in every HSET/SET example and any related comments so pipelines/MGETs operate on a single node.
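The slot assignment behind this suggestion can be checked offline. The sketch below reimplements the Redis Cluster key-to-slot rule (CRC16 of the key, or of the first non-empty `{...}` section if present, modulo 16384) using `binascii.crc_hqx`, which computes the same CRC-CCITT variant. The key names follow the examples in the comment; `key_slot` is a name chosen here.

```python
import binascii


def key_slot(key):
    """Compute the Redis Cluster hash slot for a key, honouring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


tagged = [
    "feat:{t-abc123}:user_emb:u-789def",
    "feat:{t-abc123}:temporal:u-789def",
]
print([key_slot(k) for k in tagged])
```

Tagging by tenant places every key of that tenant in a single slot, so a very large tenant may concentrate load on one node; tagging by user is the finer-grained alternative the comment also mentions.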
1311-1320: Consider documenting additional risks and dependencies.
The roadmap and risk section covers model drift and Kafka backpressure, which are well-mitigated. However, an 18-month platform development with three complex AI systems may benefit from addressing additional risk categories:
Potential additional risks:
- Team skill gaps: GNN, NCF, and Active Learning require specialized ML expertise. Mitigation: training plan, consultants, or hiring timeline
- Data quality: AI models depend on high-quality training data. Mitigation: data validation pipeline, labeling quality checks
- Cold start: New tenants without historical data. Mitigation: covered partially for NCF (line 148) but not for GNN routing
- Regulatory changes: GDPR/SOC 2 requirements may evolve. Mitigation: quarterly compliance review
- Infrastructure costs: ML infrastructure (GPUs, Redis cluster, Kafka) can be expensive. Mitigation: cost monitoring and optimization plan
- Dependency on external systems: SharePoint, S3, IdP availability. Mitigation: graceful degradation, caching strategies
Since this is marked CONFIDENTIAL for CTO/VP Engineering, including a more comprehensive risk register would strengthen the business case and resource planning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 1311 - 1320, Add a comprehensive additional-risk subsection alongside the existing "Risk -- Model Drift / Data Distribution Shift" and "Risk -- Kafka Partition Skew / Backpressure Cascade" entries that enumerates and mitigations for: Team skill gaps (training plan, consultant/hiring timeline), Data quality (validation pipelines, labeling QA), Cold-start for GNN routing (seeded priors, transfer learning, rule-based fallback), Regulatory changes (quarterly compliance reviews, legal monitoring), Infrastructure costs (cost monitoring, GPU/instance rightsizing, spot/commitment strategies), and External system dependencies (graceful degradation, caching, SLA-based failover). Place this new "Risk -- Additional: Team/Data/ColdStart/Compliance/Cost/Dependencies" block in the 18-month roadmap/risk section and ensure each bullet pairs a clear mitigation with a measurable trigger or owner to match the existing style and tone.
205-207: Add pattern constraint for SHA-256 content hash.
The content_hash field is described as "SHA-256 hash of document content" but lacks a pattern constraint. SHA-256 hashes are exactly 64 hexadecimal characters.
Add validation:
    content_hash:
      type: string
      pattern: '^[a-f0-9]{64}$'
      description: SHA-256 hash of document content
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` around lines 205 - 207, The content_hash schema currently lacks a pattern constraint; update the content_hash field definition in the schema (the content_hash property) to include a pattern that enforces exactly 64 lowercase hexadecimal characters (use the regex ^[a-f0-9]{64}$), keep type: string and the existing description, so the field validates as a SHA-256 hex digest.
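The same constraint can be applied on the client side before a request is sent. This is a minimal sketch; the helper names are illustrative, and it assumes clients submit the lowercase hex digest, which is what the proposed pattern accepts.

```python
import hashlib
import re

# Same character class and length as the proposed OpenAPI pattern.
SHA256_HEX = re.compile(r"[a-f0-9]{64}")


def content_hash(data):
    """Return the lowercase hex SHA-256 digest of the document bytes."""
    return hashlib.sha256(data).hexdigest()


def is_valid_content_hash(value):
    """True only for exactly 64 lowercase hex characters."""
    return SHA256_HEX.fullmatch(value) is not None


digest = content_hash(b"example document body")
print(digest, is_valid_content_hash(digest))
```

Using `fullmatch` avoids the Python-specific case where `$` also matches just before a trailing newline.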
450-450: Document the precision choice for the PostgreSQL confidence column.
The confidence column is defined as NUMERIC(4,3), that is, 4 significant digits in total with 3 after the decimal point, so the type alone accepts values from -9.999 to 9.999. The CHECK constraint then restricts the column to the 0 to 1 range:
    confidence NUMERIC(4,3) NOT NULL CHECK (confidence BETWEEN 0 AND 1),
NUMERIC(3,3) would be tighter but tops out at 0.999 and could not store 1.000, so NUMERIC(4,3) with the existing CHECK is correct. Consider adding a short comment documenting why this precision was chosen.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/specifications/workflow-ai-pro.xml` at line 450, The confidence column's precision is ambiguous: keep the type as NUMERIC(4,3) to allow 1.000 (since NUMERIC(3,3) maxes at 0.999) and retain the existing CHECK (confidence BETWEEN 0 AND 1); update the schema line for the confidence column to include a short inline comment explaining why NUMERIC(4,3) was chosen (to permit 1.000) and ensure the CHECK constraint on confidence remains in place.
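The range argument can be verified with a few lines of arithmetic. The sketch below computes the largest magnitude a NUMERIC(precision, scale) column can hold; `numeric_max` is a helper written for this illustration.

```python
from decimal import Decimal


def numeric_max(precision, scale):
    """Largest absolute value storable in NUMERIC(precision, scale)."""
    return (Decimal(10) ** precision - 1).scaleb(-scale)


full_confidence = Decimal("1.000")
print(numeric_max(4, 3), numeric_max(3, 3))
print(full_confidence <= numeric_max(4, 3), full_confidence <= numeric_max(3, 3))
```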
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/specifications/workflow-ai-pro.xml`:
- Around line 534-538: The DLQ topic configuration doc.routing.dlq currently
uses retention_ms: -1 which risks unbounded storage growth; change retention_ms
to a large but finite value (e.g., 7776000000 for 90 days) instead of -1, keep
or confirm cleanup_policy: compact as needed, and add operational controls:
create automated DLQ monitoring/alerting for topic depth and storage, and add a
runbook for inspecting/reprocessing messages (document procedures and thresholds
alongside the doc.routing.dlq configuration).
- Around line 492-498: Add documentation/comments explaining that the RLS
policies tenant_isolation_documents and tenant_isolation_routing rely on
current_setting('app.current_tenant')::UUID and that the application must set
this via "SET LOCAL app.current_tenant = '<tenant_uuid>'" at the start of each
transaction (or via middleware that runs per-transaction); note tradeoffs of
per-transaction vs per-connection when using connection poolers, describe error
handling if the setting is missing (e.g., detect and abort the transaction with
a clear error or raise a custom NOTICE/ERROR), and include an example
implementation pattern for middleware/connection wrapper that reads the
authenticated tenant ID and issues the SET LOCAL before any DB statements for
tables documents and routing_decisions.
- Line 1305: The doc has conflicting latency expectations: the global Kafka
consumer SLA "<5s end-to-end event processing latency" conflicts with Document
Router's consumer config max_poll_interval_ms: 300000 and the GNN P99 <200ms
target; update the spec to (1) state which percentile the "<5s" SLA refers to
(P50/P95/P99), (2) define concrete behavior when GNN inference exceeds 5s for
the Document Router Kafka consumer (e.g., enforce an inference timeout of 5s in
the Document Router's inference handler, emit the message to DLQ or mark for
human review and increment a metric/alert), and (3) reconcile the config by
either reducing max_poll_interval_ms to match the chosen SLA (e.g., 5000ms if
you require consumer poll intervals to support a 5s E2E SLA) or relaxing the SLA
to accept longer tail latency; reference Document Router, max_poll_interval_ms,
the GNN inference path and the "<5s end-to-end event processing latency" SLA
when making the change.
- Around line 455-457: The FK constraint fk_doc currently uses ON DELETE CASCADE
which will remove routing decisions, paths, and audit logs when a documents row
is deleted; change the constraint to use ON DELETE RESTRICT or ON DELETE SET
NULL and implement a soft-delete pattern on the documents table (e.g., add a
deleted_at flag and update application queries in routing logic to filter out
soft-deleted documents) and update any code that deletes documents to set the
soft-delete flag instead of issuing a hard DELETE; also add a migration to alter
the fk_doc constraint and handle existing NULLability if you choose SET NULL.
- Around line 176-416: The OpenAPI spec declares endpoints (operationIds:
routeDocument, getRoutingStatus, getGraphHealth and the Approval Predictor/UI
endpoints) and schemas (RoutingDecision, RoutingPath, EscalationResponse,
RoutingStatus) that are not implemented in backend/server.js; either implement
matching Express handlers for POST /api/v2/documents/route, GET
/api/v2/documents/:document_id/routing-status, GET /api/v2/routing/graph/health,
POST /api/v2/predictions/bottlenecks, GET
/api/v2/predictions/approver-load/:approver_id, POST /api/v2/ui/layout, POST
/api/v2/ui/feedback, GET /api/v2/ui/al/status in backend/server.js (hooking into
your business logic and returning the declared response shapes) and add
request/response JSON Schema validation middleware for the schemas
RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus (or reuse your
existing validation utilities), or alternatively prune/update the OpenAPI
document to exactly match the two existing handlers (GET /api/wheel/stages and
POST /api/wheel/progress) and remove the unused schema/type declarations so the
spec and implementation stay synchronized.
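For the RLS item above, a per-transaction wrapper might look like the following sketch. It assumes a DB-API style connection in autocommit mode with `%s` parameter placeholders, and it uses PostgreSQL's `set_config(name, value, is_local)` function because a plain `SET LOCAL` statement does not accept bind parameters. `FakeConnection` exists only so the example runs without a database; `tenant_transaction` is a hypothetical name.

```python
import uuid
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id):
    """Open a transaction with app.current_tenant set for the RLS policies."""
    tenant = str(uuid.UUID(str(tenant_id)))  # fail fast on a missing or bad id
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Third argument true makes the setting transaction-local (like SET LOCAL).
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant,))
        yield cur
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise


class FakeCursor:
    """Records statements instead of talking to PostgreSQL."""

    def __init__(self, log):
        self.log = log

    def execute(self, sql, params=()):
        self.log.append((sql, tuple(params)))


class FakeConnection:
    def __init__(self):
        self.log = []

    def cursor(self):
        return FakeCursor(self.log)


conn = FakeConnection()
tenant = "3f2b8c1e-4d5a-4e6f-8a7b-9c0d1e2f3a4b"
with tenant_transaction(conn, tenant) as cur:
    cur.execute("SELECT id FROM documents")
print([sql for sql, _ in conn.log])
```

Scoping the setting to the transaction keeps it from leaking between requests that share a pooled connection, which is the trade-off the comment asks the spec to document.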
---
Nitpick comments:
In `@docs/specifications/workflow-ai-pro.xml`:
- Line 550: The consumer configuration sets max_poll_interval_ms: 300000 which
is too long and can cause consumer-group instability; update the consumer
behavior by either lowering max_poll_interval_ms to a safer value (e.g., closer
to expected GNN P99 <200ms) or change processing to be asynchronous: immediately
commit offsets after enqueueing messages for background GNN inference and
implement heartbeat/monitoring to detect slow consumers earlier; locate and
modify the max_poll_interval_ms setting in workflow-ai-pro.xml and ensure any
consumer loop (consumer poll/commit logic and background worker/queue) is
changed to enqueue work and commit promptly while adding heartbeat/monitoring
hooks.
- Around line 735-781: The schema for the prediction_logs collection lacks TTL
indexes so data will grow unbounded; add a TTL index on
prediction_logs.created_at to expire old prediction documents (e.g., 180 days)
by creating an index with expireAfterSeconds, and also add a TTL index on
al_pool.created_at with a partialFilterExpression for status: "expired" (e.g.,
30 days) to auto-remove expired AL samples; update the migration/schema diff to
include db.prediction_logs.createIndex({ created_at: 1 }, { expireAfterSeconds:
<seconds> }) and db.al_pool.createIndex({ created_at: 1 }, { expireAfterSeconds:
<seconds>, partialFilterExpression: { status: "expired" } }) so retention is
enforced.
- Around line 558-560: Clarify how the transactional_id pattern
"doc-router-tx-{instance_id}" must be implemented: specify generation strategies
for instance_id (e.g., use Kubernetes pod name for stability, or a
cluster-assigned persistent UUID stored in a volume/secret), state whether the
chosen approach is stable across restarts, describe lifecycle/cleanup to avoid
transactional ID exhaustion (e.g., reusing stable IDs, TTL/policy for ephemeral
IDs, admin tooling to remove retired IDs), and document recovery steps for a
producer that failed mid-transaction (how to detect in-doubt transactions, force
abort or resume via broker/admin APIs, and recommended monitoring/alerting).
Include references to transactional_id and instance_id so implementers know
where to apply each guidance.
- Around line 834-846: The current HSET usage storing the 64-dim embedding as a
JSON string under the "embedding" field (key pattern
feat:{tenant_id}:user_emb:{user_id}) is inefficient; update the write/read flows
to store the embedding in a binary/vector-native format—either switch to Redis
Vector/RediSearch native vectors for similarity use, or encode the float32 array
with MessagePack/Protobuf before HSET and decode on read—so modify the code that
calls HSET for "embedding" to pack the float32 array and the corresponding
reader to unpack it (or replace HSET with the Redis vector API), and keep other
hash fields (department_id, last_updated, etc.) unchanged.
- Around line 831-875: The keys must include Redis Cluster hash tags around the
tenant identifier so related keys land on the same hash slot; update all key
patterns (e.g., feat:{tenant_id}:user_emb:{user_id},
feat:{tenant_id}:stage_emb:{stage_id},
feat:{tenant_id}:cf_score:{user_id}:{stage_id},
feat:{tenant_id}:temporal:{user_id}) to wrap only the tenant_id in braces (e.g.,
feat:{<tenant_id>}:user_emb:<user_id>) in every HSET/SET example and any related
comments so pipelines/MGETs operate on a single node.
- Around line 1311-1320: Add a comprehensive additional-risk subsection
alongside the existing "Risk -- Model Drift / Data Distribution Shift" and "Risk
-- Kafka Partition Skew / Backpressure Cascade" entries that enumerates risks and
mitigations for: Team skill gaps (training plan, consultant/hiring timeline),
Data quality (validation pipelines, labeling QA), Cold-start for GNN routing
(seeded priors, transfer learning, rule-based fallback), Regulatory changes
(quarterly compliance reviews, legal monitoring), Infrastructure costs (cost
monitoring, GPU/instance rightsizing, spot/commitment strategies), and External
system dependencies (graceful degradation, caching, SLA-based failover). Place
this new "Risk -- Additional: Team/Data/ColdStart/Compliance/Cost/Dependencies"
block in the 18-month roadmap/risk section and ensure each bullet pairs a clear
mitigation with a measurable trigger or owner to match the existing style and
tone.
- Around line 205-207: The content_hash schema currently lacks a pattern
constraint; update the content_hash field definition in the schema (the
content_hash property) to include a pattern that enforces exactly 64 lowercase
hexadecimal characters (use the regex ^[a-f0-9]{64}$), keep type: string and the
existing description, so the field validates as a SHA-256 hex digest.
- Line 450: The confidence column's precision is ambiguous: keep the type as
NUMERIC(4,3) to allow 1.000 (since NUMERIC(3,3) maxes at 0.999) and retain the
existing CHECK (confidence BETWEEN 0 AND 1); update the schema line for the
confidence column to include a short inline comment explaining why NUMERIC(4,3)
was chosen (to permit 1.000) and ensure the CHECK constraint on confidence
remains in place.
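For the Redis Cluster hash-tag comment in the list above (lines 831-875 of the spec), a minimal sketch of what the corrected key patterns could look like. `featKey` and `hashTag` are hypothetical helpers, not part of the spec, and the tenant and user ids are placeholders. The second function mirrors the documented Redis Cluster rule: only the text between the first `{` and the first following `}` is hashed, provided it is non-empty.

```javascript
// Hypothetical key builder: the literal braces around the tenant id form a
// Redis Cluster hash tag, so every key for one tenant maps to the same slot.
function featKey(tenantId, kind, ...ids) {
  return [`feat:{${tenantId}}`, kind, ...ids].join(":");
}

// Extracts the hash tag the way Redis Cluster does: the text between the
// first "{" and the first "}" after it, if that text is non-empty.
// Otherwise the whole key is hashed.
function hashTag(key) {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

const keys = [
  featKey("t-42", "user_emb", "u-1"),
  featKey("t-42", "stage_emb", "s-9"),
  featKey("t-42", "cf_score", "u-1", "s-9"),
  featKey("t-42", "temporal", "u-1"),
];

console.log(keys[2]); // feat:{t-42}:cf_score:u-1:s-9
console.log(new Set(keys.map(hashTag)).size); // 1 (all keys share one tag)
```

Because all four keys share the tag `t-42`, a pipeline or MGET over one tenant's features should stay on a single node.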
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5d128e43-fbcb-4f96-8e6e-1bd1b46c468f
📒 Files selected for processing (1)
docs/specifications/workflow-ai-pro.xml
```yaml
openapi: 3.0.3
info:
  title: Document Router Service API
  version: 2.1.0
  description: Intelligent document routing powered by heterogeneous GNN inference.

paths:
  /api/v2/documents/route:
    post:
      operationId: routeDocument
      summary: Submit a document for AI-powered routing
      tags: [routing]
      security:
        - BearerAuth: []
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [document_id, tenant_id, content_hash, doc_type]
              properties:
                document_id:
                  type: string
                  format: uuid
                tenant_id:
                  type: string
                  format: uuid
                content_hash:
                  type: string
                  description: SHA-256 hash of document content
                doc_type:
                  type: string
                  enum: [contract, invoice, policy, legal_brief, hr_form, engineering_spec, compliance_report]
                urgency:
                  type: string
                  enum: [critical, high, standard, low]
                  default: standard
                compliance_flags:
                  type: array
                  items:
                    type: string
                    enum: [gdpr, sox, hipaa, pci_dss, itar]
                metadata:
                  type: object
                  additionalProperties: true
      responses:
        '200':
          description: Routing decision computed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/RoutingDecision'
        '202':
          description: Low-confidence routing; escalated to human review
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/EscalationResponse'
        '422':
          description: Unprocessable document features
        '429':
          description: Rate limit exceeded

  /api/v2/documents/{document_id}/routing-status:
    get:
      operationId: getRoutingStatus
      summary: Retrieve current routing state and audit trail
      tags: [routing]
      security:
        - BearerAuth: []
      parameters:
        - name: document_id
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        '200':
          description: Routing status with full path trace
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/RoutingStatus'
        '404':
          description: Document not found

  /api/v2/routing/graph/health:
    get:
      operationId: getGraphHealth
      summary: GNN model and graph index health check
      tags: [operations]
      security:
        - BearerAuth: []
      responses:
        '200':
          description: Graph and model health metrics
          content:
            application/json:
              schema:
                type: object
                properties:
                  model_version:
                    type: string
                  graph_node_count:
                    type: integer
                  graph_edge_count:
                    type: integer
                  avg_inference_latency_ms:
                    type: number
                  p99_inference_latency_ms:
                    type: number
                  last_retrain_timestamp:
                    type: string
                    format: date-time
                  feature_store_status:
                    type: string
                    enum: [healthy, degraded, unavailable]

components:
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

  schemas:
    RoutingDecision:
      type: object
      properties:
        document_id:
          type: string
          format: uuid
        routing_id:
          type: string
          format: uuid
        decision:
          type: string
          enum: [auto_routed, human_review]
        confidence:
          type: number
          minimum: 0
          maximum: 1
        selected_path:
          $ref: '#/components/schemas/RoutingPath'
        alternative_paths:
          type: array
          maxItems: 2
          items:
            $ref: '#/components/schemas/RoutingPath'
        model_version:
          type: string
        inference_latency_ms:
          type: number
        timestamp:
          type: string
          format: date-time

    RoutingPath:
      type: object
      properties:
        path_id:
          type: string
          format: uuid
        stages:
          type: array
          items:
            type: object
            properties:
              stage_id:
                type: string
              stage_name:
                type: string
              assigned_approver_id:
                type: string
                format: uuid
              predicted_duration_hours:
                type: number
              bottleneck_probability:
                type: number
        total_predicted_duration_hours:
          type: number
        path_confidence:
          type: number

    EscalationResponse:
      type: object
      properties:
        document_id:
          type: string
          format: uuid
        escalation_id:
          type: string
          format: uuid
        reason:
          type: string
        top_candidate_paths:
          type: array
          items:
            $ref: '#/components/schemas/RoutingPath'
        escalated_to:
          type: string
          format: uuid

    RoutingStatus:
      type: object
      properties:
        document_id:
          type: string
          format: uuid
        current_stage:
          type: string
        overall_status:
          type: string
          enum: [in_progress, completed, rejected, escalated, stalled]
        path_trace:
          type: array
          items:
            type: object
            properties:
              stage_id:
                type: string
              approver_id:
                type: string
                format: uuid
              entered_at:
                type: string
                format: date-time
              completed_at:
                type: string
                format: date-time
                nullable: true
              action:
                type: string
                enum: [approved, rejected, delegated, pending]
        sla_status:
          type: string
          enum: [on_track, at_risk, breached]
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify which API endpoints are actually implemented in the backend
echo "=== Searching for declared endpoints in backend codebase ==="
# Search for Document Router endpoints
echo -e "\n--- Document Router endpoints ---"
rg -n "'/api/v2/documents/route'" --type js
rg -n "'/api/v2/routing/graph/health'" --type js
# Search for Approval Predictor endpoints
echo -e "\n--- Approval Predictor endpoints ---"
rg -n "'/api/v2/predictions/bottlenecks'" --type js
rg -n "'/api/v2/predictions/approver-load'" --type js
# Search for Adaptive UI endpoints
echo -e "\n--- Adaptive UI Engine endpoints ---"
rg -n "'/api/v2/ui/layout'" --type js
rg -n "'/api/v2/ui/feedback'" --type js
rg -n "'/api/v2/ui/al/status'" --type js
# Show what endpoints actually exist
echo -e "\n=== Actual endpoints found in backend ==="
rg -n "app\.(get|post|put|delete|patch)\(" backend/server.js -A 1
Repository: OneFineStarstuff/OneFineStarstuff.github.io
Length of output: 495
Specification-implementation mismatch: declared endpoints do not exist in the backend.
The specification declares 8 OpenAPI endpoints across three services, but the backend (backend/server.js) only implements 2 endpoints:
- GET /api/wheel/stages
- POST /api/wheel/progress
The specified but unimplemented endpoints are:
Document Router Service:
- POST /api/v2/documents/route
- GET /api/v2/documents/{document_id}/routing-status
- GET /api/v2/routing/graph/health
Approval Predictor Service:
- POST /api/v2/predictions/bottlenecks
- GET /api/v2/predictions/approver-load/{approver_id}
Adaptive UI Engine:
- POST /api/v2/ui/layout
- POST /api/v2/ui/feedback
- GET /api/v2/ui/al/status
Additionally, no validation schemas exist for the request/response types declared in this specification (RoutingDecision, RoutingPath, EscalationResponse, etc.).
Either implement all declared endpoints and add corresponding validation middleware, or update the specification to match the actual implementation. Leaving this unresolved will block integration and create confusion about system capabilities.
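One way to keep this from regressing is a small drift check run in CI. The sketch below is illustrative only: `toExpressPath` and `findUnimplemented` are hypothetical helpers, and the two route lists are copied by hand from this comment rather than parsed from the spec or from backend/server.js.

```javascript
// Converts OpenAPI-style path templates ("/x/{id}") to Express style ("/x/:id").
function toExpressPath(openApiPath) {
  return openApiPath.replace(/\{([^}]+)\}/g, ":$1");
}

// Returns the declared "METHOD path" entries with no implemented counterpart.
function findUnimplemented(declared, implemented) {
  const have = new Set(
    implemented.map((r) => `${r.method.toUpperCase()} ${r.path}`)
  );
  return declared
    .map((r) => `${r.method.toUpperCase()} ${toExpressPath(r.path)}`)
    .filter((key) => !have.has(key));
}

// Routes declared in the specification (from the list above).
const declared = [
  { method: "post", path: "/api/v2/documents/route" },
  { method: "get", path: "/api/v2/documents/{document_id}/routing-status" },
  { method: "get", path: "/api/v2/routing/graph/health" },
  { method: "post", path: "/api/v2/predictions/bottlenecks" },
  { method: "get", path: "/api/v2/predictions/approver-load/{approver_id}" },
  { method: "post", path: "/api/v2/ui/layout" },
  { method: "post", path: "/api/v2/ui/feedback" },
  { method: "get", path: "/api/v2/ui/al/status" },
];

// Routes the analysis found in backend/server.js.
const implemented = [
  { method: "get", path: "/api/wheel/stages" },
  { method: "post", path: "/api/wheel/progress" },
];

const missing = findUnimplemented(declared, implemented);
console.log(missing.length); // 8
```

A real check would load the declared routes from the OpenAPI document and the implemented ones from the Express router; both loaders are left out here because their shape depends on the project.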
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specifications/workflow-ai-pro.xml` around lines 176 - 416, The OpenAPI
spec declares endpoints (operationIds: routeDocument, getRoutingStatus,
getGraphHealth and the Approval Predictor/UI endpoints) and schemas
(RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus) that are not
implemented in backend/server.js; either implement matching Express handlers for
POST /api/v2/documents/route, GET /api/v2/documents/:document_id/routing-status,
GET /api/v2/routing/graph/health, POST /api/v2/predictions/bottlenecks, GET
/api/v2/predictions/approver-load/:approver_id, POST /api/v2/ui/layout, POST
/api/v2/ui/feedback, GET /api/v2/ui/al/status in backend/server.js (hooking into
your business logic and returning the declared response shapes) and add
request/response JSON Schema validation middleware for the schemas
RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus (or reuse your
existing validation utilities), or alternatively prune/update the OpenAPI
document to exactly match the two existing handlers (GET /api/wheel/stages and
POST /api/wheel/progress) and remove the unused schema/type declarations so the
spec and implementation stay synchronized.
CONSTRAINT fk_doc FOREIGN KEY (document_id)
    REFERENCES documents(id) ON DELETE CASCADE
);
Consider the impact of CASCADE deletion on routing decisions.
The foreign key constraint uses ON DELETE CASCADE, meaning if a document is deleted, all associated routing decisions, paths, and audit logs will be automatically deleted. This could result in loss of historical routing data that may be valuable for model training or compliance audit trails.
Consider using ON DELETE RESTRICT or ON DELETE SET NULL with a soft-delete pattern for documents to preserve routing history.
🛡️ Alternative approach: soft delete pattern
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
content_hash CHAR(64) NOT NULL,
doc_type VARCHAR(32) NOT NULL,
urgency VARCHAR(16) NOT NULL DEFAULT 'standard',
compliance_flags TEXT[] DEFAULT '{}',
metadata JSONB DEFAULT '{}',
+ deleted_at TIMESTAMPTZ NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY HASH (tenant_id);
Then change the foreign key:
CONSTRAINT fk_doc FOREIGN KEY (document_id)
- REFERENCES documents(id) ON DELETE CASCADE
+ REFERENCES documents(id) ON DELETE RESTRICT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specifications/workflow-ai-pro.xml` around lines 455 - 457, The FK
constraint fk_doc currently uses ON DELETE CASCADE which will remove routing
decisions, paths, and audit logs when a documents row is deleted; change the
constraint to use ON DELETE RESTRICT or ON DELETE SET NULL and implement a
soft-delete pattern on the documents table (e.g., add a deleted_at flag and
update application queries in routing logic to filter out soft-deleted
documents) and update any code that deletes documents to set the soft-delete
flag instead of issuing a hard DELETE; also add a migration to alter the fk_doc
constraint and handle existing NULLability if you choose SET NULL.
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation_documents ON documents
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

ALTER TABLE routing_decisions ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation_routing ON routing_decisions
    USING (tenant_id = current_setting('app.current_tenant')::UUID);
Document the required session variable for Row-Level Security policies.
The RLS policies rely on current_setting('app.current_tenant')::UUID, but the specification doesn't document how this session variable should be set. The application must execute SET LOCAL app.current_tenant = '<tenant_uuid>' at the beginning of each transaction for RLS to work correctly.
Without proper documentation, developers may encounter authorization failures or inadvertently expose data across tenant boundaries.
Add a comment in the SQL schema or a dedicated section explaining:
- How to set the session variable (e.g., via middleware, connection pooler)
- When it must be set (per transaction vs per connection)
- Error handling when the variable is not set
- Example implementation pattern
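As a starting point for the example implementation pattern, here is a minimal sketch of a per-transaction wrapper. It assumes `client` is any object with an async `query(text, params)` method (the interface node-postgres clients expose); `withTenant` and `UUID_RE` are hypothetical names. It uses PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which scopes the setting to the current transaction like `SET LOCAL` but accepts a bind parameter.

```javascript
// Loose UUID shape check, so a bad tenant id fails before any SQL runs.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runs `work` inside a transaction with app.current_tenant set for RLS.
async function withTenant(client, tenantId, work) {
  // Fail fast instead of letting the policy's ::UUID cast fail mid-query.
  if (typeof tenantId !== "string" || !UUID_RE.test(tenantId)) {
    throw new Error("tenant id missing or not a UUID; refusing to query");
  }
  await client.query("BEGIN");
  try {
    // Transaction-local: the value is discarded at COMMIT or ROLLBACK,
    // so it cannot leak to the next user of a pooled connection.
    await client.query("SELECT set_config('app.current_tenant', $1, true)", [
      tenantId,
    ]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

A request middleware would call `withTenant(client, tenantIdFromJwt, (c) => c.query(...))` for every statement touching `documents` or `routing_decisions`; where the tenant id comes from is an assumption left to the authentication layer.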
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specifications/workflow-ai-pro.xml` around lines 492 - 498, Add
documentation/comments explaining that the RLS policies
tenant_isolation_documents and tenant_isolation_routing rely on
current_setting('app.current_tenant')::UUID and that the application must set
this via "SET LOCAL app.current_tenant = '<tenant_uuid>'" at the start of each
transaction (or via middleware that runs per-transaction); note tradeoffs of
per-transaction vs per-connection when using connection poolers, describe error
handling if the setting is missing (e.g., detect and abort the transaction with
a clear error or raise a custom NOTICE/ERROR), and include an example
implementation pattern for middleware/connection wrapper that reads the
authenticated tenant ID and issues the SET LOCAL before any DB statements for
tables documents and routing_decisions.
doc.routing.dlq:
  partitions: 6
  replication_factor: 3
  retention_ms: -1  # infinite retention for DLQ
  cleanup_policy: compact
Risk: Infinite retention on DLQ topic can cause unbounded storage growth.
The doc.routing.dlq topic uses retention_ms: -1 (infinite retention) combined with cleanup_policy: compact. While this ensures failed messages are never lost, it can lead to unbounded storage growth if DLQ messages are not actively monitored and remediated.
Consider:
- Setting a large but finite retention (e.g., 90 days: 7776000000ms)
- Implementing automated DLQ monitoring with alerting when depth exceeds thresholds
- Creating a runbook for DLQ message inspection and reprocessing
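The first two suggestions can be sketched as follows. Both helpers and the threshold are hypothetical; in practice the per-partition offsets would come from a Kafka admin client, which is not shown here.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Converts a retention period in days to the millisecond value used by retention_ms.
function retentionMs(days) {
  return days * MS_PER_DAY;
}

// Sums per-partition depth (end offset minus beginning offset) and flags
// the topic when the total exceeds the alert threshold.
function dlqDepthAlert(partitions, threshold) {
  const depth = partitions.reduce(
    (sum, p) => sum + (p.endOffset - p.beginningOffset),
    0
  );
  return { depth, alert: depth > threshold };
}

console.log(retentionMs(90)); // 7776000000

// Illustrative offsets for two of the six DLQ partitions.
const status = dlqDepthAlert(
  [
    { beginningOffset: 0, endOffset: 40 },
    { beginningOffset: 10, endOffset: 75 },
  ],
  100
);
console.log(status); // { depth: 105, alert: true }
```

On a compacted topic the offset difference is only an upper bound on the number of retained records, so the threshold should be treated as a coarse signal rather than an exact count.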
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specifications/workflow-ai-pro.xml` around lines 534 - 538, The DLQ
topic configuration doc.routing.dlq currently uses retention_ms: -1 which risks
unbounded storage growth; change retention_ms to a large but finite value (e.g.,
7776000000 for 90 days) instead of -1, keep or confirm cleanup_policy: compact
as needed, and add operational controls: create automated DLQ
monitoring/alerting for topic depth and storage, and add a runbook for
inspecting/reprocessing messages (document procedures and thresholds alongside
the doc.routing.dlq configuration).

# 5. Performance, Security and Compliance

- **SLAs:** Document routing P99 <200ms; bottleneck prediction P95 <500ms; UI layout resolution P95 <100ms; system-wide availability 99.95% (43.8 min/month max downtime); all Kafka consumers maintain <5s end-to-end event processing latency; model retraining completes within 4-hour nightly batch window.
Conflicting latency requirements: Kafka consumer SLA vs GNN inference timeout.
The performance requirements state "all Kafka consumers maintain <5s end-to-end event processing latency," but the Document Router Kafka consumer configuration (line 550) sets max_poll_interval_ms: 300000 (5 minutes) to accommodate GNN inference.
This creates a conflict:
- If GNN inference takes >5 seconds, the <5s end-to-end latency SLA is breached
- The 5-minute timeout suggests inference could take much longer than 5 seconds
- The stated GNN P99 latency target of <200ms should normally satisfy the 5s SLA, but the large timeout indicates potential outliers
Clarify:
- Is the <5s latency requirement for P50, P95, or P99?
- How should the system behave when GNN inference exceeds 5 seconds? (e.g., timeout and route to DLQ, escalate to human review)
- Update either the max_poll_interval_ms or the latency SLA to be consistent
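One possible answer to the second question is an explicit inference deadline in the consumer's handler. This is a sketch under assumptions: `withTimeout`, `routeWithDeadline`, the `infer` callback, and the 5000 ms default budget are all hypothetical, and escalating to human review is just one of the options listed above.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("inference timeout")), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Runs GNN inference for one message under a latency budget.
async function routeWithDeadline(message, infer, budgetMs = 5000) {
  try {
    const decision = await withTimeout(infer(message), budgetMs);
    return { outcome: "routed", decision };
  } catch (err) {
    // Timeouts and inference errors are both escalated to human review
    // instead of blocking the consumer loop.
    return { outcome: "escalated", reason: err.message };
  }
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying inference call, so a real handler would also need a cancellation signal or a server-side timeout on the model endpoint.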
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/specifications/workflow-ai-pro.xml` at line 1305, The doc has
conflicting latency expectations: the global Kafka consumer SLA "<5s end-to-end
event processing latency" conflicts with Document Router's consumer config
max_poll_interval_ms: 300000 and the GNN P99 <200ms target; update the spec to
(1) state which percentile the "<5s" SLA refers to (P50/P95/P99), (2) define
concrete behavior when GNN inference exceeds 5s for the Document Router Kafka
consumer (e.g., enforce an inference timeout of 5s in the Document Router's
inference handler, emit the message to DLQ or mark for human review and
increment a metric/alert), and (3) reconcile the config by either reducing
max_poll_interval_ms to match the chosen SLA (e.g., 5000ms if you require
consumer poll intervals to support a 5s E2E SLA) or relaxing the SLA to accept
longer tail latency; reference Document Router, max_poll_interval_ms, the GNN
inference path and the "<5s end-to-end event processing latency" SLA when making
the change.
PR Review 🔍
PR Code Suggestions ✨
No code suggestions found for PR.
User description
SPEC-WFAIPRO-001 — WorkflowAI Pro Technical Specification
Overview
Implementation-ready XML technical specification for WorkflowAI Pro, an enterprise-grade AI-powered workflow optimisation platform. Tri-model AI architecture: GNN document routing, collaborative filtering bottleneck prediction, active learning UI adaptation.
Document Reference: SPEC-WFAIPRO-001 v1.0.0
File: docs/specifications/workflow-ai-pro.xml (52,955 bytes, 1,323 lines)
Format: XML with CDATA-wrapped Markdown content
Classification: CONFIDENTIAL — CTO Office, VP Engineering, Lead Developers
Required Sections — All 6 Present and Validated
Section 4 — Implementation Specs Detail
Document Router
- OpenAPI 3.0 (3 endpoints: routeDocument, getRoutingStatus, getGraphHealth)
- Kafka (4 topics: doc.ingested, doc.routed, doc.routing.escalated, DLQ), exactly-once semantics
Approval Predictor
- OpenAPI 3.0 (2 endpoints: predictBottlenecks, getApproverLoad)
- MongoDB schema (2 collections: prediction_logs, cf_model_artifacts) with JSON Schema validation
- Kafka (3 topics: approval.requested, approval.predicted, approval.completed), 5-tier retry backoff
Adaptive UI Engine
- OpenAPI 3.0 (3 endpoints: getAdaptiveLayout, submitUIFeedback, getALStatus)
- MongoDB schema (2 collections: al_pool, al_experiments) with A/B experiment tracking
- Kafka (4 topics: ui.feedback, al.query, al.label.acquired, model.retrained)
Constraint Compliance
]]> sequence inside CDATA content
Validation Results
Files Changed
docs/specifications/workflow-ai-pro.xml (new, 1,323 lines)
Summary by Sourcery
Add a comprehensive XML technical specification document for the WorkflowAI Pro platform, defining its architecture, AI subsystems, data models, and integration contracts.
Documentation:
Summary by CodeRabbit
Description
Changes walkthrough 📝
workflow-ai-pro.xml
Comprehensive XML Specification for WorkflowAI Pro
docs/specifications/workflow-ai-pro.xml
Predictor.