feat(spec): SPEC-WFAIPRO-001 — WorkflowAI Pro Technical Specification #39

Merged: OneFineStarstuff merged 1 commit into main from genspark_ai_developer on Mar 20, 2026

OneFineStarstuff commented Mar 20, 2026

User description

SPEC-WFAIPRO-001 — WorkflowAI Pro Technical Specification

Overview

Implementation-ready XML technical specification for WorkflowAI Pro, an enterprise-grade AI-powered workflow optimisation platform. Tri-model AI architecture: GNN document routing, collaborative filtering bottleneck prediction, active learning UI adaptation.

Document Reference: SPEC-WFAIPRO-001 v1.0.0
File: docs/specifications/workflow-ai-pro.xml (52,955 bytes, 1,323 lines)
Format: XML with CDATA-wrapped Markdown content
Classification: CONFIDENTIAL — CTO Office, VP Engineering, Lead Developers


Required Sections — All 6 Present and Validated

| # | Section | Key Content |
|---|---|---|
| 1 | Executive Summary | Tri-model architecture overview, legacy vs WorkflowAI Pro differentiators table |
| 2 | System Architecture | Syntax-valid Mermaid.js C4 Container diagram (13 containers, 3 external systems, 27 relationships) |
| 3 | AI Components | HeteroGAT GNN (18M params, <200ms P99), NCF + temporal attention (72h lookahead, >91% precision), pool-based AL with BatchBALD (200 labels/day, MC Dropout T=20) |
| 4 | Implementation Specs | Deep dive on 3 entities — see below |
| 5 | Performance, Security & Compliance | Exactly 3 bullet points: SLAs, GDPR/SOC 2, RBAC |
| 6 | 18-Month Roadmap & Risks | Exactly 8 bullet points: Q1-Q6 milestones + 2 risk mitigations |

Section 4 — Implementation Specs Detail

Document Router

  • OpenAPI 3.0: 3 endpoints (routeDocument, getRoutingStatus, getGraphHealth)
  • PostgreSQL: 4 tables with RLS multi-tenancy, 16 hash partitions, monthly audit log partitions
  • Kafka: 4 topics (doc.ingested, doc.routed, doc.routing.escalated, DLQ), exactly-once semantics

Approval Predictor

  • OpenAPI 3.0: 2 endpoints (predictBottlenecks, getApproverLoad)
  • MongoDB: 2 collections (prediction_logs, cf_model_artifacts) with JSON Schema validation
  • Redis: 4 key patterns (user embeddings, stage embeddings, CF scores, temporal features), TTL policies
  • Kafka: 3 topics (approval.requested, approval.predicted, approval.completed), 5-tier retry backoff
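
The "5-tier retry backoff" mentioned for the Approval Predictor topics could look roughly like the sketch below. The summary does not give the actual delays, so the base delay, the multiplier, and the dead-letter fallback after the last tier are all hypothetical values chosen only to illustrate the shape of a tiered schedule.

```python
from typing import List, Optional, Tuple


def retry_delays(base_seconds: float = 1.0, factor: float = 5.0, tiers: int = 5) -> List[float]:
    """Exponential schedule with one delay per tier (values are assumptions)."""
    return [base_seconds * factor ** i for i in range(tiers)]


def route_failed_message(failures: int, delays: List[float]) -> Tuple[str, Optional[float]]:
    """Decide what to do with a message that has already failed `failures` times.

    While a tier remains, retry after that tier's delay; once every tier is
    used up, hand the message to dead-letter handling (assumed, not stated
    in the PR for this service).
    """
    if failures < len(delays):
        return ("retry", delays[failures])
    return ("dlq", None)


delays = retry_delays()
first = route_failed_message(0, delays)
exhausted = route_failed_message(5, delays)
```

A fixed schedule like this keeps retry timing predictable for consumers; whether the spec uses per-tier retry topics or in-process delays is not stated in this PR.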

Adaptive UI Engine

  • OpenAPI 3.0: 3 endpoints (getAdaptiveLayout, submitUIFeedback, getALStatus)
  • MongoDB: 2 collections (al_pool, al_experiments) with A/B experiment tracking
  • Kafka: 4 topics (ui.feedback, al.query, al.label.acquired, model.retrained)

Constraint Compliance

  • Output is raw XML (no markdown code block wrapper)
  • CDATA section wraps all Markdown content
  • No ]]> sequence inside CDATA content
  • Section 5: exactly 3 bullet points
  • Section 6: exactly 8 bullet points
  • Technical density prioritised over high-level explanation

Validation Results

XML Parse: OK (Python xml.etree.ElementTree)
All 6 sections: FOUND
Mermaid C4 diagram: PASS
OpenAPI 3.0: PASS
PostgreSQL schema: PASS
MongoDB schema: PASS
Redis schema: PASS
Kafka config: PASS
GNN detail (HeteroGAT): PASS
NCF detail: PASS
Active Learning (BatchBALD): PASS
CDATA wrapper: PASS
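
A minimal sketch of how checks like the ones above might be scripted, assuming the Markdown inside the CDATA section uses `## N.` headings and `-` or `*` bullets. The element names in the sample (`specification`, `content`) and the helper `check_spec` are hypothetical and are not taken from the spec file.

```python
import re
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def check_spec(raw_xml: str) -> dict:
    """Run a few structural checks on a CDATA-wrapped Markdown spec."""
    results = {}

    # 1. Well-formedness: ET.fromstring raises ParseError on malformed XML.
    try:
        ET.fromstring(raw_xml)
        results["xml_parse"] = True
    except ET.ParseError:
        results["xml_parse"] = False

    # 2. One CDATA wrapper with no terminator inside its body.
    start = raw_xml.find(CDATA_OPEN)
    end = raw_xml.rfind(CDATA_CLOSE)
    has_cdata = start != -1 and end > start
    body = raw_xml[start + len(CDATA_OPEN):end] if has_cdata else ""
    results["cdata_wrapper"] = has_cdata
    results["no_inner_terminator"] = has_cdata and CDATA_CLOSE not in body

    # 3. Count Markdown bullets under each numbered "## N." heading.
    counts = {}
    current = None
    for line in body.splitlines():
        heading = re.match(r"##\s+(\d+)\.", line)
        if heading:
            current = int(heading.group(1))
            counts[current] = 0
        elif current is not None and re.match(r"[-*]\s+", line):
            counts[current] += 1
    results["bullet_counts"] = counts
    return results


sample = (
    "<specification><content><![CDATA[\n"
    "## 5. Performance, Security & Compliance\n"
    "- SLAs\n- GDPR/SOC 2\n- RBAC\n"
    "]]></content></specification>"
)
report = check_spec(sample)
```

Note that `xml.etree.ElementTree` merges CDATA into ordinary text when parsing, so the wrapper and terminator checks have to run against the raw file contents rather than the parsed tree.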

Files Changed

  • docs/specifications/workflow-ai-pro.xml (new — 1,323 lines)

Summary by Sourcery

Add a comprehensive XML technical specification document for the WorkflowAI Pro platform, defining its architecture, AI subsystems, data models, and integration contracts.

Documentation:

  • Introduce an implementation-ready XML specification for WorkflowAI Pro, covering executive summary, architecture, AI components, implementation specs, performance/compliance, and roadmap.
  • Document detailed contracts and schemas for the Document Router, Approval Predictor, and Adaptive UI Engine, including APIs, storage models, and Kafka topologies.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive technical specification for the WorkflowAI Pro platform, including platform architecture, system components, data flow configurations, API contracts, database schemas, operational requirements, and product roadmap.

Description

  • Introduced a detailed XML technical specification for WorkflowAI Pro, outlining its tri-model AI architecture.
  • Specified implementation details for core components: Document Router, Approval Predictor, and Adaptive UI Engine.
  • Added OpenAPI 3.0 definitions for key service endpoints, enhancing API documentation.
  • Included sections on performance metrics, security compliance, and an 18-month development roadmap.

Changes walkthrough 📝

Relevant files
Enhancement
workflow-ai-pro.xml: Comprehensive XML Specification for WorkflowAI Pro

docs/specifications/workflow-ai-pro.xml (+1323/-0)

  • Added comprehensive XML technical specification for WorkflowAI Pro.
  • Defined tri-model AI architecture and detailed implementation specs.
  • Included OpenAPI 3.0 endpoints for Document Router and Approval Predictor.
  • Documented performance, security, compliance, and roadmap sections.

    Implementation-ready XML specification for WorkflowAI Pro enterprise workflow
    optimisation platform. Tri-model AI architecture: GNN document routing,
    collaborative filtering bottleneck prediction, active learning UI adaptation.
    
    Document: SPEC-WFAIPRO-001 v1.0.0 (52,955 bytes, 1,323 lines)
    Format: XML with CDATA-wrapped Markdown content
    
    6 Required Sections — All Present:
    1. Executive Summary — tri-model architecture overview, key differentiators table
    2. System Architecture — syntax-valid Mermaid.js C4 Container diagram (13 containers,
       3 external systems, 27 relationships)
    3. AI Components — HeteroGAT GNN (18M params, <200ms P99), NCF with temporal
       attention (72h lookahead, >91% precision), pool-based AL with BatchBALD
       (200 labels/day, MC Dropout T=20)
    4. Implementation Specs — Deep dive on 3 entities:
       - Document Router: OpenAPI 3.0 (3 endpoints), PostgreSQL schema (4 tables,
         RLS multi-tenancy, hash partitioning), Kafka (4 topics, exactly-once)
       - Approval Predictor: OpenAPI 3.0 (2 endpoints), MongoDB schema (2 collections
         with JSON Schema validation), Redis feature store (4 key patterns, TTL policy),
         Kafka (3 topics, 5-tier retry backoff)
       - Adaptive UI Engine: OpenAPI 3.0 (3 endpoints), MongoDB schema (2 collections:
         al_pool, al_experiments), Kafka (4 topics including model.retrained)
    5. Performance, Security & Compliance — exactly 3 bullet points:
       SLAs, GDPR/SOC 2, RBAC (6 roles, 23 permissions, OPA enforcement)
    6. 18-Month Roadmap & Risks — exactly 8 bullet points:
       Q1-Q6 milestones + 2 risk mitigations (model drift, Kafka backpressure)
    
    Validation: XML well-formed (Python ET parse), all section/content checks pass.

    vercel Bot commented Mar 20, 2026

    The latest updates on your projects.

    | Project | Deployment | Actions | Updated (UTC) |
    |---|---|---|---|
    | v0-one-fine-starstuff-github-io | Ready | Preview, Comment, Open in v0 | Mar 21, 2026 10:02am |

    sourcery-ai Bot commented Mar 20, 2026

    Reviewer's Guide

    Adds a new implementation-ready XML technical specification for WorkflowAI Pro, defining a tri-model AI workflow optimization platform with detailed architecture, AI models, APIs, data schemas, and messaging topologies for the Document Router, Approval Predictor, and Adaptive UI Engine services.

    Sequence diagram for document routing and bottleneck prediction

    sequenceDiagram
      actor User
      participant API_Gateway
      participant Document_Router_Service
      participant Redis_Feature_Store
      participant GNN_Inference_Engine
      participant PostgreSQL_DB
      participant Kafka_Broker
      participant Approval_Predictor_Service
    
      User->>API_Gateway: POST /api/v2/documents/route
      API_Gateway->>Document_Router_Service: routeDocument(document_id, tenant_id, content_hash, doc_type, metadata)
    
      Document_Router_Service->>Redis_Feature_Store: GET entity_embeddings(document_id, tenant_id)
      Redis_Feature_Store-->>Document_Router_Service: embeddings, graph_features
    
      Document_Router_Service->>GNN_Inference_Engine: gRPC infer_routing_paths(embeddings, graph_context)
      GNN_Inference_Engine-->>Document_Router_Service: top_paths, confidences
    
      Document_Router_Service->>PostgreSQL_DB: INSERT documents, routing_decisions, routing_paths
      Document_Router_Service->>Kafka_Broker: Produce doc.routed(document_id, routing_id, selected_path)
      Document_Router_Service->>Kafka_Broker: Produce doc.routing.escalated(if confidence < 0.75)
    
      Document_Router_Service-->>API_Gateway: 200 RoutingDecision or 202 EscalationResponse
      API_Gateway-->>User: Routing decision response
    
      Kafka_Broker-->>Approval_Predictor_Service: Consume approval.requested(document_id, approval_chain)
      Approval_Predictor_Service->>Redis_Feature_Store: GET user_stage_embeddings, temporal_features
      Redis_Feature_Store-->>Approval_Predictor_Service: feature_vectors
    
      Approval_Predictor_Service->>Approval_Predictor_Service: NCF_inference_for_chain
      Approval_Predictor_Service->>Kafka_Broker: Produce approval.predicted(document_id, stage_risks)
      Approval_Predictor_Service->>PostgreSQL_DB: UPDATE chain_risk_metadata(optional)
    
      Kafka_Broker-->>Document_Router_Service: approval.predicted(re_routing_triggers)
      Document_Router_Service->>PostgreSQL_DB: UPDATE routing_decisions_with_suggested_re_routes
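
The escalation branch in the diagram above can be sketched as follows. This is a minimal illustration, assuming a literal reading of the diagram: `doc.routed` is always produced and `doc.routing.escalated` is added when the best confidence is below 0.75, with the 202 response tied to that case. The `RoutingOutcome` type and `decide` function are hypothetical names, not part of the spec.

```python
from dataclasses import dataclass
from typing import List, Tuple

ESCALATION_THRESHOLD = 0.75  # from the diagram: escalate if confidence < 0.75


@dataclass
class RoutingOutcome:
    http_status: int      # 200 RoutingDecision or 202 EscalationResponse
    topics: List[str]     # Kafka topics the router would produce to
    selected_path: str
    confidence: float


def decide(paths: List[Tuple[str, float]]) -> RoutingOutcome:
    """Pick the highest-confidence path and decide whether to escalate."""
    if not paths:
        raise ValueError("no candidate paths returned by inference")
    path, confidence = max(paths, key=lambda p: p[1])
    topics = ["doc.routed"]
    status = 200
    if confidence < ESCALATION_THRESHOLD:
        # Low confidence: also notify the escalation topic and answer 202.
        topics.append("doc.routing.escalated")
        status = 202
    return RoutingOutcome(status, topics, path, confidence)


confident = decide([("legal>finance", 0.91), ("finance", 0.60)])
uncertain = decide([("finance", 0.50)])
```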
    

    Sequence diagram for adaptive UI layout resolution and active learning loop

    sequenceDiagram
      actor User
      participant API_Gateway
      participant Adaptive_UI_Engine
      participant Active_Learning_Service
      participant Kafka_Broker
      participant MongoDB_AL_Collections
    
      User->>API_Gateway: POST /api/v2/ui/layout(context)
      API_Gateway->>Adaptive_UI_Engine: getAdaptiveLayout(user_id, tenant_id, role_id, task_type, accessibility_flags)
    
      Adaptive_UI_Engine->>Active_Learning_Service: Resolve_layout(context_vector)
      Active_Learning_Service->>MongoDB_AL_Collections: INSERT al_pool_sample(context, predicted_layout_id)
      Active_Learning_Service-->>Adaptive_UI_Engine: LayoutConfig(layout_id, components, theme_overrides)
    
      Adaptive_UI_Engine-->>API_Gateway: 200 LayoutConfig
      API_Gateway-->>User: Rendered adaptive UI
    
      User->>API_Gateway: POST /api/v2/ui/feedback(session_id, layout_id, feedback)
      API_Gateway->>Adaptive_UI_Engine: submitUIFeedback(payload)
      Adaptive_UI_Engine->>Kafka_Broker: Produce ui.feedback(session_id, layout_id, metrics)
    
      Kafka_Broker-->>Active_Learning_Service: Consume ui.feedback
      Active_Learning_Service->>MongoDB_AL_Collections: UPDATE al_pool_sample_with_implicit_label
    
      Active_Learning_Service->>Active_Learning_Service: Periodic_MC_Dropout_uncertainty_estimation
      Active_Learning_Service->>MongoDB_AL_Collections: FIND top_entropy_diverse_samples
      Active_Learning_Service->>Kafka_Broker: Produce al.query(sample_id, context)
    
      Kafka_Broker-->>Active_Learning_Service: Consume al.label.acquired(sample_id, assigned_layout_id)
      Active_Learning_Service->>MongoDB_AL_Collections: UPDATE annotation_for_sample
      Active_Learning_Service->>Active_Learning_Service: Trigger_model_retrain_when_labels_threshold_reached
      Active_Learning_Service->>Kafka_Broker: Produce model.retrained(model_type=al_layout, version)
      Kafka_Broker-->>Adaptive_UI_Engine: model.retrained(al_layout, version)
      Adaptive_UI_Engine->>Adaptive_UI_Engine: Hot_reload_layout_model
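
The uncertainty-estimation and sample-selection steps in the diagram above can be illustrated with a small sketch. It ranks pool samples by the entropy of the class distribution averaged over T stochastic (MC Dropout) forward passes and takes the top of the ranking. This is plain greedy entropy ranking, not BatchBALD: it ignores the joint, diversity-aware term that BatchBALD adds. The function names and the toy pool are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(passes: Sequence[Sequence[float]]) -> float:
    """Entropy of the class distribution averaged over T stochastic passes."""
    t = len(passes)
    n_classes = len(passes[0])
    mean = [sum(p[c] for p in passes) / t for c in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def select_queries(pool: Dict[str, Sequence[Sequence[float]]], budget: int) -> List[str]:
    """Return the `budget` sample ids with the highest predictive entropy."""
    ranked = sorted(pool, key=lambda sid: predictive_entropy(pool[sid]), reverse=True)
    return ranked[:budget]


T = 20  # number of MC Dropout passes quoted in the PR description
pool = {
    # Every pass agrees on layout 0: low uncertainty.
    "confident": [[0.98, 0.02]] * T,
    # Passes disagree about the layout: high uncertainty.
    "uncertain": [[0.9, 0.1], [0.1, 0.9]] * (T // 2),
}
chosen = select_queries(pool, budget=1)
```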
    

    Entity relationship diagram for core WorkflowAI Pro data schemas

    erDiagram
      DOCUMENTS {
        uuid id
        uuid tenant_id
        char64 content_hash
        varchar32 doc_type
        varchar16 urgency
        text_array compliance_flags
        jsonb metadata
        timestamptz created_at
        timestamptz updated_at
      }
    
      ROUTING_DECISIONS {
        uuid id
        uuid document_id
        uuid tenant_id
        varchar16 decision
        numeric4_3 confidence
        uuid selected_path_id
        varchar32 model_version
        numeric8_2 inference_latency_ms
        timestamptz created_at
      }
    
      ROUTING_PATHS {
        uuid id
        uuid routing_decision_id
        smallint path_rank
        numeric6_2 total_predicted_duration_h
        numeric4_3 path_confidence
        jsonb stages
      }
    
      ROUTING_AUDIT_LOG {
        bigint id
        uuid document_id
        uuid tenant_id
        varchar64 stage_id
        uuid approver_id
        varchar16 action
        timestamptz occurred_at
        jsonb metadata
      }
    
      PREDICTION_LOGS {
        string prediction_id
        string tenant_id
        string document_id
        string model_version
        date created_at
        double inference_latency_ms
        array stages
        string overall_chain_risk
        bool feedback_received
      }
    
      CF_MODEL_ARTIFACTS {
        string model_id
        string version
        date created_at
        string status
        object hyperparams
        object metrics
        string artifact_path
        string training_data_snapshot
      }
    
      AL_POOL {
        string sample_id
        string tenant_id
        object context
        string predicted_layout_id
        double prediction_entropy
        double mc_dropout_variance
        string status
        date created_at
        date selected_at
        date annotated_at
        object annotation
      }
    
      AL_EXPERIMENTS {
        string experiment_id
        string incumbent_version
        string challenger_version
        string status
        date created_at
        date concluded_at
        double traffic_split
        object metrics
        string decision
      }
    
      DOCUMENTS ||--o{ ROUTING_DECISIONS : has
      DOCUMENTS ||--o{ ROUTING_AUDIT_LOG : has
      ROUTING_DECISIONS ||--o{ ROUTING_PATHS : includes
    
      DOCUMENTS ||--o{ PREDICTION_LOGS : has
      CF_MODEL_ARTIFACTS ||--o{ PREDICTION_LOGS : generates
    
      AL_EXPERIMENTS ||--o{ AL_POOL : evaluates
      CF_MODEL_ARTIFACTS ||--o{ AL_EXPERIMENTS : compared_in
    

    File-Level Changes

    Change: Introduce a structured XML specification document with CDATA-wrapped Markdown defining the complete WorkflowAI Pro system architecture and behaviour.
    Details:
    • Defines top-level XML metadata for the WorkflowAI Pro specification, including document reference, versioning, classification, and abstract.
    • Embeds a full Markdown technical spec inside a CDATA section, covering executive summary, C4 container architecture with Mermaid, AI component designs, implementation specs, performance/security/compliance, and roadmap and risks.
    • Ensures XML and CDATA constraints are respected (parsable XML, no forbidden CDATA terminators, required section counts, and validation notes).
    Files: docs/specifications/workflow-ai-pro.xml

    Change: Specify implementation details for the Document Router service, including external contracts and storage/event integration.
    Details:
    • Defines OpenAPI 3.0 routes for document routing, routing status, and graph/model health with associated schemas and JWT security.
    • Designs a PostgreSQL 16 schema with hash-partitioned multi-tenant tables, monthly-partitioned audit logs, and RLS policies for tenant isolation.
    • Describes Kafka topics and consumer/producer configuration for document ingestion, routing events, escalation, DLQ handling, and exactly-once processing semantics.
    Files: docs/specifications/workflow-ai-pro.xml

    Change: Specify implementation details for the Approval Predictor service, including APIs, persistence, caching, and messaging.
    Details:
    • Defines OpenAPI 3.0 endpoints for bottleneck prediction and approver load queries, including payloads, horizons, and response structures.
    • Details MongoDB collections with JSON Schema validation for prediction logs and NCF model artifacts, plus indexing strategy for monitoring and model lifecycle.
    • Describes Redis key patterns and TTL policies for user embeddings, stage embeddings, CF scores, and temporal features, along with Kafka topics and retry/backoff configuration for approval events.
    Files: docs/specifications/workflow-ai-pro.xml

    Change: Specify implementation details for the Adaptive UI Engine and its active learning loop.
    Details:
    • Defines OpenAPI 3.0 endpoints for adaptive layout resolution, UI feedback ingestion, and active learning status, with emphasis on accessibility flags and layout config schema.
    • Details MongoDB collections and validators for the active-learning pool and A/B experiment tracking, including status fields and metrics used for promotion decisions.
    • Describes Kafka topics and consumer configs for UI feedback, annotation queries, label acquisition, and model retrain notifications across services.
    Files: docs/specifications/workflow-ai-pro.xml

    Change: Formalise AI model designs, non-functional requirements, and roadmap/risk posture for the platform.
    Details:
    • Documents HeteroGAT GNN-based document routing, NCF with temporal attention for bottleneck prediction, and BatchBALD-driven active learning for UI layout selection, including model sizes, features, and training loops.
    • Captures performance, security, and compliance requirements such as latency SLOs, availability targets, GDPR/SOC 2 alignment, and RBAC model structure.
    • Outlines an 18‑month phased roadmap (Q1–Q6) and two key risk areas with mitigation strategies around model drift and Kafka backpressure/partition skew.
    Files: docs/specifications/workflow-ai-pro.xml

    coderabbitai Bot commented Mar 20, 2026

    Walkthrough

    A new XML technical specification document for WorkflowAI Pro platform has been added, detailing a C4 architecture with three core services (Document Router, Approval Predictor, Adaptive UI Engine), AI component architectures, end-to-end data flows over Kafka/Redis/PostgreSQL/MongoDB, OpenAPI endpoint contracts, schema definitions, and operational requirements.

    Changes

    | Cohort / File(s) | Summary |
    |---|---|
    | WorkflowAI Pro Specification: docs/specifications/workflow-ai-pro.xml | New technical specification (v1.0.0, DRAFT) defining platform architecture, AI component workflows, API contracts, persistence schemas, infrastructure topology, performance/security/compliance requirements, and 18-month roadmap. |

    Estimated code review effort

    🎯 2 (Simple) | ⏱️ ~10 minutes

    Poem

    🐰 A specification born today,
    With schemas bright and flows at play,
    AI routers dancing through the streams,
    Approval dreams and UI schemes,
    WorkflowAI Pro takes flight away! 📋✨

    🚥 Pre-merge checks: ✅ 3 passed

    | Check name | Status | Explanation |
    |---|---|---|
    | Title check | ✅ Passed | The title directly and specifically references the main change: adding SPEC-WFAIPRO-001, a technical specification document for WorkflowAI Pro, which aligns perfectly with the changeset containing a new XML specification file. |
    | Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
    | Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

    netlify Bot commented Mar 20, 2026

    Deploy Preview for onefinestarstuff failed.

    | Name | Link |
    |---|---|
    | 🔨 Latest commit | d6dae5b |
    | 🔍 Latest deploy log | https://app.netlify.com/projects/onefinestarstuff/deploys/69bd0c689fbbe7000849ab60 |

    sourcery-ai Bot left a comment

    Hey - I've found 2 issues, and left some high level feedback:

    • Kafka topic names and semantics are described in multiple places (e.g., high-level bullets vs detailed YAML sections); consider standardizing the exact topic names and DLQ naming across the entire spec to avoid ambiguity during implementation.
    • There are many timestamp and duration fields across APIs and schemas (PostgreSQL, MongoDB, OpenAPI); explicitly stating a global convention (e.g., all timestamps in ISO 8601 UTC, all durations in hours/ms) near the top of the spec would reduce the risk of subtle cross-service inconsistencies.
    Prompt for AI Agents
    Please address the comments from this code review:
    
    ## Overall Comments
    - Kafka topic names and semantics are described in multiple places (e.g., high-level bullets vs detailed YAML sections); consider standardizing the exact topic names and DLQ naming across the entire spec to avoid ambiguity during implementation.
    - There are many timestamp and duration fields across APIs and schemas (PostgreSQL, MongoDB, OpenAPI); explicitly stating a global convention (e.g., all timestamps in ISO 8601 UTC, all durations in hours/ms) near the top of the spec would reduce the risk of subtle cross-service inconsistencies.
    
    ## Individual Comments
    
    ### Comment 1
    <location path="docs/specifications/workflow-ai-pro.xml" line_range="475-484" />
    <code_context>
    +CREATE TABLE routing_audit_log (
    </code_context>
    <issue_to_address>
    **🚨 issue (security):** Apply consistent tenant isolation to `routing_audit_log` (and potentially `routing_paths`) to align with the multi-tenant security goals.
    
    `routing_audit_log` includes `tenant_id` but is not protected by RLS, unlike `documents` and `routing_decisions`. To maintain strict tenant isolation and long-term audit data safety, enable RLS here and add a `tenant_isolation` policy. Also consider adding `tenant_id` and RLS to `routing_paths` so it remains tenant-scoped even if accessed without joining through `routing_decision_id`.
    </issue_to_address>
    
    ### Comment 2
    <location path="docs/specifications/workflow-ai-pro.xml" line_range="464-471" />
    <code_context>
    +CREATE INDEX idx_routing_decisions_tenant_created
    +    ON routing_decisions (tenant_id, created_at DESC);
    +
    +CREATE TABLE routing_paths (
    +    id                          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    +    routing_decision_id         UUID NOT NULL
    +        REFERENCES routing_decisions(id) ON DELETE CASCADE,
    +    path_rank                   SMALLINT NOT NULL,  -- 0=selected, 1-2=alternatives
    +    total_predicted_duration_h  NUMERIC(6,2),
    +    path_confidence             NUMERIC(4,3),
    +    stages                      JSONB NOT NULL
    +    -- array of {stage_id, approver_id, predicted_duration_h, bottleneck_prob}
    +);
    </code_context>
    <issue_to_address>
    **suggestion (performance):** Add an index on `(routing_decision_id, path_rank)` in `routing_paths` to support common query patterns efficiently.
    
    Given how `RoutingDecision.selected_path` and `alternative_paths` will be used, queries will often filter/order by `routing_decision_id` and `path_rank` (e.g., rank 0 plus a few alternatives). With only a PK on `id`, these will devolve into table scans as data grows. Please add a non-unique index on `(routing_decision_id, path_rank)` to keep lookups efficient, especially under multi-tenant load.
    </issue_to_address>

    Comment on lines +475 to +484
    CREATE TABLE routing_audit_log (
        id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        document_id UUID NOT NULL,
        tenant_id   UUID NOT NULL,
        stage_id    VARCHAR(64) NOT NULL,
        approver_id UUID,
        action      VARCHAR(16) NOT NULL,
        -- approved | rejected | delegated | escalated
        occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        metadata    JSONB DEFAULT '{}'

    🚨 issue (security): Apply consistent tenant isolation to routing_audit_log (and potentially routing_paths) to align with the multi-tenant security goals.

    routing_audit_log includes tenant_id but is not protected by RLS, unlike documents and routing_decisions. To maintain strict tenant isolation and long-term audit data safety, enable RLS here and add a tenant_isolation policy. Also consider adding tenant_id and RLS to routing_paths so it remains tenant-scoped even if accessed without joining through routing_decision_id.
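
One way the suggested fix might be expressed is sketched below as a small helper that emits the PostgreSQL statements. The policy name `tenant_isolation` comes from the review comment; the session setting `app.current_tenant` and the helper `tenant_isolation_ddl` are assumptions made for illustration, since the spec's actual policy expression is not shown in this PR.

```python
from typing import List


def tenant_isolation_ddl(table: str, setting: str = "app.current_tenant") -> List[str]:
    """Build PostgreSQL statements enabling RLS with a tenant_isolation policy.

    `setting` is a hypothetical per-session variable holding the tenant UUID.
    """
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY tenant_isolation ON {table} "
        f"USING (tenant_id = current_setting('{setting}')::uuid);",
    ]


statements = tenant_isolation_ddl("routing_audit_log")
for stmt in statements:
    print(stmt)
```

In PostgreSQL the table owner bypasses row security unless `FORCE ROW LEVEL SECURITY` is also set on the table, so the role the services connect as matters as much as the policy itself.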

    Comment on lines +464 to +471
    CREATE TABLE routing_paths (
        id                          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        routing_decision_id         UUID NOT NULL
            REFERENCES routing_decisions(id) ON DELETE CASCADE,
        path_rank                   SMALLINT NOT NULL,  -- 0=selected, 1-2=alternatives
        total_predicted_duration_h  NUMERIC(6,2),
        path_confidence             NUMERIC(4,3),
        stages                      JSONB NOT NULL

    suggestion (performance): Add an index on (routing_decision_id, path_rank) in routing_paths to support common query patterns efficiently.

    Given how RoutingDecision.selected_path and alternative_paths will be used, queries will often filter/order by routing_decision_id and path_rank (e.g., rank 0 plus a few alternatives). With only a PK on id, these will devolve into table scans as data grows. Please add a non-unique index on (routing_decision_id, path_rank) to keep lookups efficient, especially under multi-tenant load.
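
The effect of the suggested composite index can be illustrated with an in-memory SQLite database standing in for PostgreSQL. This is only a sketch: SQLite's planner and types differ from PostgreSQL 16, the table is trimmed to the relevant columns, and the index name `idx_routing_paths_decision_rank` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE routing_paths (
        id                  TEXT PRIMARY KEY,
        routing_decision_id TEXT NOT NULL,
        path_rank           INTEGER NOT NULL,
        stages              TEXT NOT NULL
    );
    -- Non-unique composite index, as proposed in the review comment.
    CREATE INDEX idx_routing_paths_decision_rank
        ON routing_paths (routing_decision_id, path_rank);
    """
)

# Ask the planner how it would fetch all ranked paths for one decision.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT path_rank, stages
    FROM routing_paths
    WHERE routing_decision_id = ?
    ORDER BY path_rank
    """,
    ("00000000-0000-0000-0000-000000000001",),
).fetchall()

# The last column of each plan row holds the human-readable detail text.
plan_text = " ".join(row[-1] for row in plan)
uses_index = "idx_routing_paths_decision_rank" in plan_text
print(plan_text)
```

On PostgreSQL the equivalent check would be `EXPLAIN` on the same query after creating the index there.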

    coderabbitai Bot left a comment

    Actionable comments posted: 5

    🧹 Nitpick comments (8)
    docs/specifications/workflow-ai-pro.xml (8)

    550-550: Long max_poll_interval_ms may cause consumer group instability.

    The Document Router consumer configuration sets max_poll_interval_ms: 300000 (5 minutes) to accommodate GNN inference latency. However, this very long interval increases the risk of:

    • Delayed detection of consumer failures
    • Extended partition ownership during hung/slow consumers
    • Consumer group rebalancing delays

    Recommendation:

    • Verify that GNN inference P99 latency target (<200ms) is achieved in production
    • If inference occasionally exceeds 5 minutes, consider processing messages asynchronously (immediately commit offset after queuing message for background processing)
    • Implement consumer heartbeat monitoring to detect processing delays earlier than the 5-minute timeout
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` at line 550, The consumer
    configuration sets max_poll_interval_ms: 300000 which is too long and can cause
    consumer-group instability; update the consumer behavior by either lowering
    max_poll_interval_ms to a safer value (e.g., closer to expected GNN P99 <200ms)
    or change processing to be asynchronous: immediately commit offsets after
    enqueueing messages for background GNN inference and implement
    heartbeat/monitoring to detect slow consumers earlier; locate and modify the
    max_poll_interval_ms setting in workflow-ai-pro.xml and ensure any consumer loop
    (consumer poll/commit logic and background worker/queue) is changed to enqueue
    work and commit promptly while adding heartbeat/monitoring hooks.
    

    735-781: Consider adding TTL indexes to prevent unbounded collection growth.

    The prediction_logs collection stores every prediction but lacks a TTL (Time-To-Live) index. Without automatic expiration, this collection will grow indefinitely, potentially impacting performance and storage costs.

    Consider adding TTL indexes for time-series data:

    // Auto-delete prediction logs older than 180 days
    db.prediction_logs.createIndex(
      { created_at: 1 },
      { expireAfterSeconds: 15552000 }  // 180 days
    );
    
    // Similar for AL pool - expire samples in "expired" status after 30 days
    db.al_pool.createIndex(
      { created_at: 1 },
      { expireAfterSeconds: 2592000,  // 30 days
        partialFilterExpression: { status: "expired" } }
    );
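The two `expireAfterSeconds` literals above can be derived rather than hand-typed, which makes the retention periods easier to audit. This is a small sketch; the helper and constant names are illustrative and not part of the specification.

```python
from datetime import timedelta


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to a TTL value in seconds."""
    if days <= 0:
        raise ValueError("retention period must be positive")
    return int(timedelta(days=days).total_seconds())


# Retention periods taken from the review comment above.
PREDICTION_LOG_TTL = ttl_seconds(180)
AL_POOL_EXPIRED_TTL = ttl_seconds(30)

print(PREDICTION_LOG_TTL, AL_POOL_EXPIRED_TTL)
```

Note that MongoDB TTL indexes only expire documents whose indexed field holds a BSON date (or an array of dates), so `created_at` must be stored as a date rather than a string.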
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 735 - 781, The schema
    for the prediction_logs collection lacks TTL indexes so data will grow
    unbounded; add a TTL index on prediction_logs.created_at to expire old
    prediction documents (e.g., 180 days) by creating an index with
    expireAfterSeconds, and also add a TTL index on al_pool.created_at with a
    partialFilterExpression for status: "expired" (e.g., 30 days) to auto-remove
    expired AL samples; update the migration/schema diff to include
    db.prediction_logs.createIndex({ created_at: 1 }, { expireAfterSeconds:
    <seconds> }) and db.al_pool.createIndex({ created_at: 1 }, { expireAfterSeconds:
    <seconds>, partialFilterExpression: { status: "expired" } }) so retention is
    enforced.
    

    558-560: Clarify transactional_id implementation pattern.

    Line 560 specifies transactional_id: "doc-router-tx-{instance_id}" for exactly-once semantics, but doesn't explain how {instance_id} should be generated or managed. Each producer instance must have a unique transactional ID that persists across restarts.

    Document the implementation approach:

    • How is instance_id generated? (e.g., pod name, UUID, consumer group member ID)
    • Is it stable across pod restarts?
    • How to handle transactional ID exhaustion/cleanup?
    • Recovery procedure when a transactional producer fails mid-transaction
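One possible answer to the first two questions, assuming the Document Router runs as a Kubernetes StatefulSet (an assumption; the specification does not say so), is to derive `instance_id` from the pod ordinal. StatefulSet pods are named `<name>-<ordinal>` and keep that name across restarts, so the resulting transactional ID is stable and the set of IDs is bounded by the replica count. The function name below is hypothetical.

```python
import re

# StatefulSet pod names end in "-<ordinal>", e.g. "doc-router-3".
_ORDINAL_SUFFIX = re.compile(r"(?P<name>.+)-(?P<ordinal>\d+)")


def transactional_id_for(pod_name: str, prefix: str = "doc-router-tx") -> str:
    """Build a stable transactional ID from a StatefulSet pod name.

    Raises ValueError for names without a numeric ordinal suffix, which
    is what a Deployment-managed pod name usually looks like.
    """
    match = _ORDINAL_SUFFIX.fullmatch(pod_name)
    if match is None:
        raise ValueError(f"no StatefulSet ordinal in pod name: {pod_name!r}")
    return f"{prefix}-{int(match.group('ordinal'))}"


print(transactional_id_for("doc-router-3"))
```

In a pod the name would typically be read from the `HOSTNAME` environment variable. With a stable ID, a restarted producer that calls `init_transactions` fences its previous incarnation and the broker aborts that incarnation's open transaction, which addresses the mid-transaction failure case.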
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 558 - 560, Clarify how
    the transactional_id pattern "doc-router-tx-{instance_id}" must be implemented:
    specify generation strategies for instance_id (e.g., use Kubernetes pod name for
    stability, or a cluster-assigned persistent UUID stored in a volume/secret),
    state whether the chosen approach is stable across restarts, describe
    lifecycle/cleanup to avoid transactional ID exhaustion (e.g., reusing stable
    IDs, TTL/policy for ephemeral IDs, admin tooling to remove retired IDs), and
    document recovery steps for a producer that failed mid-transaction (how to
    detect in-doubt transactions, force abort or resume via broker/admin APIs, and
    recommended monitoring/alerting). Include references to transactional_id and
    instance_id so implementers know where to apply each guidance.
    

    834-846: Optimize embedding storage format for Redis.

    Storing 64-dimensional embeddings as JSON strings (e.g., "[0.12,-0.34,...,0.56]") in Redis hash fields is inefficient:

    • Parsing overhead when reading embeddings
    • Increased memory footprint compared to binary formats
    • Slower serialization/deserialization

    Consider alternatives:

    1. Use Redis vector data type (Redis Stack with RediSearch) for native vector storage and similarity search
    2. Store as binary-encoded float arrays using MessagePack or Protocol Buffers
    3. Use HSET with separate numeric fields if individual dimensions need independent access
    Example: Binary encoding with MessagePack
    import msgpack
    import numpy as np
    import redis
    
    # Placeholder connection and identifiers, for illustration only
    r = redis.Redis()
    tenant_id, user_id = "t-abc123", "u-789def"
    key = f"feat:{tenant_id}:user_emb:{user_id}"
    
    # Encode embedding (zeros stand in for the real 64-dim vector)
    embedding = np.zeros(64, dtype=np.float32)
    packed = msgpack.packb(embedding.tolist(), use_bin_type=True)
    r.hset(key, "embedding", packed)
    
    # Decode embedding
    packed = r.hget(key, "embedding")
    embedding = np.array(msgpack.unpackb(packed, raw=False), dtype=np.float32)
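To make the size argument concrete without any third-party dependency, the sketch below packs a 64-dimensional vector as raw little-endian float32 values with the standard library and compares it with the JSON encoding. This is a different encoding from the MessagePack example above, shown only as an illustration; the sample values are synthetic and chosen to be exactly representable in float32.

```python
import json
import struct

DIM = 64  # embedding dimensionality stated in the specification
_FORMAT = f"<{DIM}f"  # little-endian, 64 x 4-byte floats


def pack_embedding(values: list[float]) -> bytes:
    """Pack a DIM-length vector as raw float32 bytes (4 bytes per value)."""
    if len(values) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(values)}")
    return struct.pack(_FORMAT, *values)


def unpack_embedding(blob: bytes) -> list[float]:
    """Inverse of pack_embedding."""
    return list(struct.unpack(_FORMAT, blob))


# Synthetic vector: multiples of 1/8, which float32 stores exactly.
embedding = [((i % 7) - 3) / 8.0 for i in range(DIM)]
packed = pack_embedding(embedding)
as_json = json.dumps(embedding).encode("utf-8")

print(len(packed), len(as_json))
```

The packed form is always 256 bytes, whereas the JSON length grows with the number of printed digits per value.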
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 834 - 846, The current
    HSET usage storing the 64-dim embedding as a JSON string under the "embedding"
    field (key pattern feat:{tenant_id}:user_emb:{user_id}) is inefficient; update
    the write/read flows to store the embedding in a binary/vector-native
    format—either switch to Redis Vector/RediSearch native vectors for similarity
    use, or encode the float32 array with MessagePack/Protobuf before HSET and
    decode on read—so modify the code that calls HSET for "embedding" to pack the
    float32 array and the corresponding reader to unpack it (or replace HSET with
    the Redis vector API), and keep other hash fields (department_id, last_updated,
    etc.) unchanged.
    

    831-875: Add Redis Cluster hash tags to key patterns for optimal performance.

    The feature store keys lack Redis Cluster hash tags, which can lead to related keys being distributed across different cluster nodes, requiring cross-node multi-key operations.

    For example, fetching all features for a user (embedding + temporal features) might require multiple cross-node requests.

    Add hash tags to ensure related keys reside on the same hash slot:

    -# Key: feat:{tenant_id}:user_emb:{user_id}
    +# Key: feat:{tenant_id}:user_emb:{user_id}  -> use {tenant_id} or {user_id} as hash tag
    
    -HSET feat:t-abc123:user_emb:u-789def
    +HSET feat:{t-abc123}:user_emb:u-789def
    
    -HSET feat:t-abc123:temporal:u-789def
    +HSET feat:{t-abc123}:temporal:u-789def

    This ensures all keys with the same {tenant_id} hash to the same Redis Cluster node, enabling efficient MGET/pipeline operations.
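The effect of a hash tag can be checked offline. Redis Cluster assigns a key to slot `CRC16(key) mod 16384`, and when the key contains a non-empty `{...}` section only the text inside the first such section is hashed. The sketch below follows that rule using the standard library's CRC-CCITT routine with a zero initial value, which matches the CRC16 variant in the Redis Cluster specification; the function name is illustrative.

```python
import binascii

SLOT_COUNT = 16384


def hash_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key, honouring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag is used; "{}" hashes the whole key.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


tagged_emb = hash_slot("feat:{t-abc123}:user_emb:u-789def")
tagged_temporal = hash_slot("feat:{t-abc123}:temporal:u-789def")
print(tagged_emb == tagged_temporal)
```

One trade-off worth weighing: tagging on the tenant ID places every key of a tenant in a single slot, so a very large tenant can concentrate load on one node.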

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 831 - 875, The keys
    must include Redis Cluster hash tags around the tenant identifier so related
    keys land on the same hash slot; update all key patterns (e.g.,
    feat:{tenant_id}:user_emb:{user_id}, feat:{tenant_id}:stage_emb:{stage_id},
    feat:{tenant_id}:cf_score:{user_id}:{stage_id},
    feat:{tenant_id}:temporal:{user_id}) to wrap only the tenant_id in braces (e.g.,
    feat:{<tenant_id>}:user_emb:<user_id>) in every HSET/SET example and any related
    comments so pipelines/MGETs operate on a single node.
    

    1311-1320: Consider documenting additional risks and dependencies.

    The roadmap and risk section covers model drift and Kafka backpressure, which are well-mitigated. However, an 18-month platform development with three complex AI systems may benefit from addressing additional risk categories:

    Potential additional risks:

    • Team skill gaps: GNN, NCF, and Active Learning require specialized ML expertise. Mitigation: training plan, consultants, or hiring timeline
    • Data quality: AI models depend on high-quality training data. Mitigation: data validation pipeline, labeling quality checks
    • Cold start: New tenants without historical data. Mitigation: covered partially for NCF (line 148) but not for GNN routing
    • Regulatory changes: GDPR/SOC 2 requirements may evolve. Mitigation: quarterly compliance review
    • Infrastructure costs: ML infrastructure (GPUs, Redis cluster, Kafka) can be expensive. Mitigation: cost monitoring and optimization plan
    • Dependency on external systems: SharePoint, S3, IdP availability. Mitigation: graceful degradation, caching strategies

    Since this is marked CONFIDENTIAL for CTO/VP Engineering, including a more comprehensive risk register would strengthen the business case and resource planning.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 1311 - 1320, Add a
    comprehensive additional-risk subsection alongside the existing "Risk -- Model
    Drift / Data Distribution Shift" and "Risk -- Kafka Partition Skew /
    Backpressure Cascade" entries that enumerates risks and mitigations for: Team skill
    gaps (training plan, consultant/hiring timeline), Data quality (validation
    pipelines, labeling QA), Cold-start for GNN routing (seeded priors, transfer
    learning, rule-based fallback), Regulatory changes (quarterly compliance
    reviews, legal monitoring), Infrastructure costs (cost monitoring, GPU/instance
    rightsizing, spot/commitment strategies), and External system dependencies
    (graceful degradation, caching, SLA-based failover). Place this new "Risk --
    Additional: Team/Data/ColdStart/Compliance/Cost/Dependencies" block in the
    18-month roadmap/risk section and ensure each bullet pairs a clear mitigation
    with a measurable trigger or owner to match the existing style and tone.
    

    205-207: Add pattern constraint for SHA-256 content hash.

    The content_hash field is described as "SHA-256 hash of document content" but lacks a pattern constraint. SHA-256 hashes are exactly 64 hexadecimal characters.

    Add validation:

    content_hash:
      type: string
      pattern: '^[a-f0-9]{64}$'
      description: SHA-256 hash of document content
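The same constraint can be mirrored on the client side before a request is sent. A minimal sketch, assuming the lowercase-hex pattern suggested above; the function name and sample payload are made up for illustration.

```python
import hashlib
import re

# Same character class and length as the suggested OpenAPI pattern.
SHA256_HEX = re.compile(r"[a-f0-9]{64}")


def is_valid_content_hash(value: str) -> bool:
    """Return True if value is a 64-character lowercase hex digest."""
    return SHA256_HEX.fullmatch(value) is not None


digest = hashlib.sha256(b"example document bytes").hexdigest()
print(is_valid_content_hash(digest))
```

`hashlib` emits lowercase hex, so digests produced this way satisfy the pattern. Clients that upper-case their digests would be rejected, which may or may not be the intended behaviour.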
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 205 - 207, The
    content_hash schema currently lacks a pattern constraint; update the
    content_hash field definition in the schema (the content_hash property) to
    include a pattern that enforces exactly 64 lowercase hexadecimal characters (use
    the regex ^[a-f0-9]{64}$), keep type: string and the existing description, so
    the field validates as a SHA-256 hex digest.
    

    450-450: Document why the PostgreSQL confidence column uses NUMERIC(4,3).

    The confidence column is defined as NUMERIC(4,3): four digits in total, three of them after the decimal point. On its own that type accepts values from -9.999 to 9.999; the CHECK constraint narrows the range to 0 through 1:

    confidence           NUMERIC(4,3) NOT NULL CHECK (confidence BETWEEN 0 AND 1),

    NUMERIC(3,3) would be tighter but tops out at 0.999, so it cannot store a confidence of exactly 1.000. NUMERIC(4,3) with the existing CHECK is therefore correct. Consider adding a short comment that documents why this precision was chosen.
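The bounds of a `NUMERIC(precision, scale)` column follow from a simple formula: the largest storable magnitude is `10^(precision - scale) - 10^(-scale)`. The sketch below evaluates it for the two candidate types; the helper name is illustrative.

```python
from decimal import Decimal


def numeric_max(precision: int, scale: int) -> Decimal:
    """Largest magnitude storable in NUMERIC(precision, scale)."""
    if not 0 <= scale <= precision:
        raise ValueError("requires 0 <= scale <= precision")
    return Decimal(10) ** (precision - scale) - Decimal(10) ** -scale


print(numeric_max(4, 3))  # type used in the spec
print(numeric_max(3, 3))  # tighter alternative
```

Because `NUMERIC(3,3)` cannot represent 1.000, keeping `NUMERIC(4,3)` and relying on the CHECK constraint for the 0 to 1 range is the consistent choice.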

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` at line 450, The confidence column's
    precision is ambiguous: keep the type as NUMERIC(4,3) to allow 1.000 (since
    NUMERIC(3,3) maxes at 0.999) and retain the existing CHECK (confidence BETWEEN 0
    AND 1); update the schema line for the confidence column to include a short
    inline comment explaining why NUMERIC(4,3) was chosen (to permit 1.000) and
    ensure the CHECK constraint on confidence remains in place.
    
    🤖 Prompt for all review comments with AI agents
    Verify each finding against the current code and only fix it if needed.
    
    Inline comments:
    In `@docs/specifications/workflow-ai-pro.xml`:
    - Around line 534-538: The DLQ topic configuration doc.routing.dlq currently
    uses retention_ms: -1 which risks unbounded storage growth; change retention_ms
    to a large but finite value (e.g., 7776000000 for 90 days) instead of -1, keep
    or confirm cleanup_policy: compact as needed, and add operational controls:
    create automated DLQ monitoring/alerting for topic depth and storage, and add a
    runbook for inspecting/reprocessing messages (document procedures and thresholds
    alongside the doc.routing.dlq configuration).
    - Around line 492-498: Add documentation/comments explaining that the RLS
    policies tenant_isolation_documents and tenant_isolation_routing rely on
    current_setting('app.current_tenant')::UUID and that the application must set
    this via "SET LOCAL app.current_tenant = '<tenant_uuid>'" at the start of each
    transaction (or via middleware that runs per-transaction); note tradeoffs of
    per-transaction vs per-connection when using connection poolers, describe error
    handling if the setting is missing (e.g., detect and abort the transaction with
    a clear error or raise a custom NOTICE/ERROR), and include an example
    implementation pattern for middleware/connection wrapper that reads the
    authenticated tenant ID and issues the SET LOCAL before any DB statements for
    tables documents and routing_decisions.
    - Line 1305: The doc has conflicting latency expectations: the global Kafka
    consumer SLA "<5s end-to-end event processing latency" conflicts with Document
    Router's consumer config max_poll_interval_ms: 300000 and the GNN P99 <200ms
    target; update the spec to (1) state which percentile the "<5s" SLA refers to
    (P50/P95/P99), (2) define concrete behavior when GNN inference exceeds 5s for
    the Document Router Kafka consumer (e.g., enforce an inference timeout of 5s in
    the Document Router's inference handler, emit the message to DLQ or mark for
    human review and increment a metric/alert), and (3) reconcile the config by
    either reducing max_poll_interval_ms to match the chosen SLA (e.g., 5000ms if
    you require consumer poll intervals to support a 5s E2E SLA) or relaxing the SLA
    to accept longer tail latency; reference Document Router, max_poll_interval_ms,
    the GNN inference path and the "<5s end-to-end event processing latency" SLA
    when making the change.
    - Around line 455-457: The FK constraint fk_doc currently uses ON DELETE CASCADE
    which will remove routing decisions, paths, and audit logs when a documents row
    is deleted; change the constraint to use ON DELETE RESTRICT or ON DELETE SET
    NULL and implement a soft-delete pattern on the documents table (e.g., add a
    deleted_at flag and update application queries in routing logic to filter out
    soft-deleted documents) and update any code that deletes documents to set the
    soft-delete flag instead of issuing a hard DELETE; also add a migration to alter
    the fk_doc constraint and handle existing NULLability if you choose SET NULL.
    - Around line 176-416: The OpenAPI spec declares endpoints (operationIds:
    routeDocument, getRoutingStatus, getGraphHealth and the Approval Predictor/UI
    endpoints) and schemas (RoutingDecision, RoutingPath, EscalationResponse,
    RoutingStatus) that are not implemented in backend/server.js; either implement
    matching Express handlers for POST /api/v2/documents/route, GET
    /api/v2/documents/:document_id/routing-status, GET /api/v2/routing/graph/health,
    POST /api/v2/predictions/bottlenecks, GET
    /api/v2/predictions/approver-load/:approver_id, POST /api/v2/ui/layout, POST
    /api/v2/ui/feedback, GET /api/v2/ui/al/status in backend/server.js (hooking into
    your business logic and returning the declared response shapes) and add
    request/response JSON Schema validation middleware for the schemas
    RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus (or reuse your
    existing validation utilities), or alternatively prune/update the OpenAPI
    document to exactly match the two existing handlers (GET /api/wheel/stages and
    POST /api/wheel/progress) and remove the unused schema/type declarations so the
    spec and implementation stay synchronized.
    
    

    ℹ️ Review info
    ⚙️ Run configuration

    Configuration used: defaults

    Review profile: CHILL

    Plan: Pro

    Run ID: 5d128e43-fbcb-4f96-8e6e-1bd1b46c468f

    📥 Commits

    Reviewing files that changed from the base of the PR and between 1d3de7d and d6dae5b.

    📒 Files selected for processing (1)
    • docs/specifications/workflow-ai-pro.xml

    Comment on lines +176 to +416
    ```yaml
    openapi: 3.0.3
    info:
      title: Document Router Service API
      version: 2.1.0
      description: Intelligent document routing powered by heterogeneous GNN inference.

    paths:
      /api/v2/documents/route:
        post:
          operationId: routeDocument
          summary: Submit a document for AI-powered routing
          tags: [routing]
          security:
            - BearerAuth: []
          requestBody:
            required: true
            content:
              application/json:
                schema:
                  type: object
                  required: [document_id, tenant_id, content_hash, doc_type]
                  properties:
                    document_id:
                      type: string
                      format: uuid
                    tenant_id:
                      type: string
                      format: uuid
                    content_hash:
                      type: string
                      description: SHA-256 hash of document content
                    doc_type:
                      type: string
                      enum: [contract, invoice, policy, legal_brief, hr_form, engineering_spec, compliance_report]
                    urgency:
                      type: string
                      enum: [critical, high, standard, low]
                      default: standard
                    compliance_flags:
                      type: array
                      items:
                        type: string
                        enum: [gdpr, sox, hipaa, pci_dss, itar]
                    metadata:
                      type: object
                      additionalProperties: true
          responses:
            '200':
              description: Routing decision computed
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/RoutingDecision'
            '202':
              description: Low-confidence routing; escalated to human review
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/EscalationResponse'
            '422':
              description: Unprocessable document features
            '429':
              description: Rate limit exceeded

      /api/v2/documents/{document_id}/routing-status:
        get:
          operationId: getRoutingStatus
          summary: Retrieve current routing state and audit trail
          tags: [routing]
          security:
            - BearerAuth: []
          parameters:
            - name: document_id
              in: path
              required: true
              schema:
                type: string
                format: uuid
          responses:
            '200':
              description: Routing status with full path trace
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/RoutingStatus'
            '404':
              description: Document not found

      /api/v2/routing/graph/health:
        get:
          operationId: getGraphHealth
          summary: GNN model and graph index health check
          tags: [operations]
          security:
            - BearerAuth: []
          responses:
            '200':
              description: Graph and model health metrics
              content:
                application/json:
                  schema:
                    type: object
                    properties:
                      model_version:
                        type: string
                      graph_node_count:
                        type: integer
                      graph_edge_count:
                        type: integer
                      avg_inference_latency_ms:
                        type: number
                      p99_inference_latency_ms:
                        type: number
                      last_retrain_timestamp:
                        type: string
                        format: date-time
                      feature_store_status:
                        type: string
                        enum: [healthy, degraded, unavailable]

    components:
      securitySchemes:
        BearerAuth:
          type: http
          scheme: bearer
          bearerFormat: JWT

      schemas:
        RoutingDecision:
          type: object
          properties:
            document_id:
              type: string
              format: uuid
            routing_id:
              type: string
              format: uuid
            decision:
              type: string
              enum: [auto_routed, human_review]
            confidence:
              type: number
              minimum: 0
              maximum: 1
            selected_path:
              $ref: '#/components/schemas/RoutingPath'
            alternative_paths:
              type: array
              maxItems: 2
              items:
                $ref: '#/components/schemas/RoutingPath'
            model_version:
              type: string
            inference_latency_ms:
              type: number
            timestamp:
              type: string
              format: date-time

        RoutingPath:
          type: object
          properties:
            path_id:
              type: string
              format: uuid
            stages:
              type: array
              items:
                type: object
                properties:
                  stage_id:
                    type: string
                  stage_name:
                    type: string
                  assigned_approver_id:
                    type: string
                    format: uuid
                  predicted_duration_hours:
                    type: number
                  bottleneck_probability:
                    type: number
            total_predicted_duration_hours:
              type: number
            path_confidence:
              type: number

        EscalationResponse:
          type: object
          properties:
            document_id:
              type: string
              format: uuid
            escalation_id:
              type: string
              format: uuid
            reason:
              type: string
            top_candidate_paths:
              type: array
              items:
                $ref: '#/components/schemas/RoutingPath'
            escalated_to:
              type: string
              format: uuid

        RoutingStatus:
          type: object
          properties:
            document_id:
              type: string
              format: uuid
            current_stage:
              type: string
            overall_status:
              type: string
              enum: [in_progress, completed, rejected, escalated, stalled]
            path_trace:
              type: array
              items:
                type: object
                properties:
                  stage_id:
                    type: string
                  approver_id:
                    type: string
                    format: uuid
                  entered_at:
                    type: string
                    format: date-time
                  completed_at:
                    type: string
                    format: date-time
                    nullable: true
                  action:
                    type: string
                    enum: [approved, rejected, delegated, pending]
            sla_status:
              type: string
              enum: [on_track, at_risk, breached]
    ```

    ⚠️ Potential issue | 🔴 Critical

    🧩 Analysis chain

    🏁 Script executed:

    #!/bin/bash
    # Verify which API endpoints are actually implemented in the backend
    
    echo "=== Searching for declared endpoints in backend codebase ==="
    
    # Search for Document Router endpoints
    echo -e "\n--- Document Router endpoints ---"
    rg -n "'/api/v2/documents/route'" --type js
    rg -n "'/api/v2/routing/graph/health'" --type js
    
    # Search for Approval Predictor endpoints  
    echo -e "\n--- Approval Predictor endpoints ---"
    rg -n "'/api/v2/predictions/bottlenecks'" --type js
    rg -n "'/api/v2/predictions/approver-load'" --type js
    
    # Search for Adaptive UI endpoints
    echo -e "\n--- Adaptive UI Engine endpoints ---"
    rg -n "'/api/v2/ui/layout'" --type js
    rg -n "'/api/v2/ui/feedback'" --type js
    rg -n "'/api/v2/ui/al/status'" --type js
    
    # Show what endpoints actually exist
    echo -e "\n=== Actual endpoints found in backend ==="
    rg -n "app\.(get|post|put|delete|patch)\(" backend/server.js -A 1

    Repository: OneFineStarstuff/OneFineStarstuff.github.io

    Length of output: 495


    Specification-implementation mismatch: declared endpoints do not exist in the backend.

    The specification declares 8 OpenAPI endpoints across three services, but the backend (backend/server.js) only implements 2 endpoints:

    • GET /api/wheel/stages
    • POST /api/wheel/progress

    The specified but unimplemented endpoints are:

    Document Router Service:

    • POST /api/v2/documents/route
    • GET /api/v2/documents/{document_id}/routing-status
    • GET /api/v2/routing/graph/health

    Approval Predictor Service:

    • POST /api/v2/predictions/bottlenecks
    • GET /api/v2/predictions/approver-load/{approver_id}

    Adaptive UI Engine:

    • POST /api/v2/ui/layout
    • POST /api/v2/ui/feedback
    • GET /api/v2/ui/al/status

    Additionally, no validation schemas exist for the request/response types declared in this specification (RoutingDecision, RoutingPath, EscalationResponse, etc.).

    Either implement all declared endpoints and add corresponding validation middleware, or update the specification to match the actual implementation. Leaving this unresolved will block integration and create confusion about system capabilities.
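Keeping the two sides in sync can be automated with a small set comparison. The sketch below hard-codes the route lists from this comment purely for illustration; in practice the declared set would be parsed from the OpenAPI document and the implemented set extracted from the server code. The helper names are hypothetical.

```python
import re


def to_express_path(openapi_path: str) -> str:
    """Convert OpenAPI '{param}' segments to Express ':param' segments."""
    return re.sub(r"\{(\w+)\}", r":\1", openapi_path)


# Endpoints declared in the specification, as listed in this comment.
DECLARED = {
    ("POST", "/api/v2/documents/route"),
    ("GET", "/api/v2/documents/{document_id}/routing-status"),
    ("GET", "/api/v2/routing/graph/health"),
    ("POST", "/api/v2/predictions/bottlenecks"),
    ("GET", "/api/v2/predictions/approver-load/{approver_id}"),
    ("POST", "/api/v2/ui/layout"),
    ("POST", "/api/v2/ui/feedback"),
    ("GET", "/api/v2/ui/al/status"),
}

# Handlers found in backend/server.js by the script above.
IMPLEMENTED = {
    ("GET", "/api/wheel/stages"),
    ("POST", "/api/wheel/progress"),
}


def missing_routes(declared, implemented):
    """Return declared routes (in Express form) that have no handler."""
    normalized = {(method, to_express_path(path)) for method, path in declared}
    return sorted(normalized - implemented)


for method, path in missing_routes(DECLARED, IMPLEMENTED):
    print(method, path)
```

Run as a CI step, a non-empty result could fail the build until either the handlers or the specification are updated.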

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 176 - 416, The OpenAPI
    spec declares endpoints (operationIds: routeDocument, getRoutingStatus,
    getGraphHealth and the Approval Predictor/UI endpoints) and schemas
    (RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus) that are not
    implemented in backend/server.js; either implement matching Express handlers for
    POST /api/v2/documents/route, GET /api/v2/documents/:document_id/routing-status,
    GET /api/v2/routing/graph/health, POST /api/v2/predictions/bottlenecks, GET
    /api/v2/predictions/approver-load/:approver_id, POST /api/v2/ui/layout, POST
    /api/v2/ui/feedback, GET /api/v2/ui/al/status in backend/server.js (hooking into
    your business logic and returning the declared response shapes) and add
    request/response JSON Schema validation middleware for the schemas
    RoutingDecision, RoutingPath, EscalationResponse, RoutingStatus (or reuse your
    existing validation utilities), or alternatively prune/update the OpenAPI
    document to exactly match the two existing handlers (GET /api/wheel/stages and
    POST /api/wheel/progress) and remove the unused schema/type declarations so the
    spec and implementation stay synchronized.
    

    Comment on lines +455 to +457
    CONSTRAINT fk_doc FOREIGN KEY (document_id)
        REFERENCES documents(id) ON DELETE CASCADE
    );

    ⚠️ Potential issue | 🟠 Major

    Consider the impact of CASCADE deletion on routing decisions.

    The foreign key constraint uses ON DELETE CASCADE, meaning if a document is deleted, all associated routing decisions, paths, and audit logs will be automatically deleted. This could result in loss of historical routing data that may be valuable for model training or compliance audit trails.

    Consider using ON DELETE RESTRICT or ON DELETE SET NULL with a soft-delete pattern for documents to preserve routing history.

    🛡️ Alternative approach: soft delete pattern
     CREATE TABLE documents (
         id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
         tenant_id       UUID NOT NULL,
         content_hash    CHAR(64) NOT NULL,
         doc_type        VARCHAR(32) NOT NULL,
         urgency         VARCHAR(16) NOT NULL DEFAULT 'standard',
         compliance_flags TEXT[] DEFAULT '{}',
         metadata        JSONB DEFAULT '{}',
    +    deleted_at      TIMESTAMPTZ NULL,
         created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
         updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
     ) PARTITION BY HASH (tenant_id);

    Then change the foreign key:

         CONSTRAINT fk_doc FOREIGN KEY (document_id)
    -        REFERENCES documents(id) ON DELETE CASCADE
    +        REFERENCES documents(id) ON DELETE RESTRICT
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 455 - 457, The FK
    constraint fk_doc currently uses ON DELETE CASCADE, which removes routing
    decisions, paths, and audit logs when a documents row is deleted. Change the
    constraint to ON DELETE RESTRICT or ON DELETE SET NULL and implement a
    soft-delete pattern on the documents table: add a nullable deleted_at
    timestamp column, update the routing queries to filter out rows where
    deleted_at is set, and update any code that deletes documents to set
    deleted_at instead of issuing a hard DELETE. Also add a migration that alters
    the fk_doc constraint and, if you choose SET NULL, makes the referencing
    document_id column nullable.
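
The behaviour of the suggested pattern can be tried out locally. The sketch below is an illustration under stated assumptions, not the spec's schema: it uses SQLite through Python's standard `sqlite3` module as a stand-in for PostgreSQL, trims both tables to the columns that matter here, and the helper names `hard_delete` and `soft_delete` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE documents (
    id         TEXT PRIMARY KEY,
    deleted_at TEXT NULL              -- NULL means the document is live
);
CREATE TABLE routing_decisions (
    id          INTEGER PRIMARY KEY,
    document_id TEXT NOT NULL,
    CONSTRAINT fk_doc FOREIGN KEY (document_id)
        REFERENCES documents(id) ON DELETE RESTRICT
);
INSERT INTO documents (id) VALUES ('doc-1');
INSERT INTO routing_decisions (document_id) VALUES ('doc-1');
""")

def hard_delete(doc_id: str) -> bool:
    """Attempt a real DELETE; returns False when RESTRICT blocks it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
        return True
    except sqlite3.IntegrityError:
        return False

def soft_delete(doc_id: str) -> None:
    """Mark the document as deleted while keeping the row and its history."""
    with conn:
        conn.execute(
            "UPDATE documents SET deleted_at = datetime('now') "
            "WHERE id = ? AND deleted_at IS NULL",
            (doc_id,),
        )

blocked = not hard_delete("doc-1")   # RESTRICT refuses the hard delete
soft_delete("doc-1")
live_docs = conn.execute(
    "SELECT count(*) FROM documents WHERE deleted_at IS NULL"
).fetchone()[0]
history = conn.execute("SELECT count(*) FROM routing_decisions").fetchone()[0]
print(blocked, live_docs, history)
```

The point of the design is visible in the last three values: the hard delete is refused, the document disappears from "live" queries, and the routing decision row survives for audit or training use.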
    

    Comment on lines +492 to +498
    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation_documents ON documents
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

    ALTER TABLE routing_decisions ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation_routing ON routing_decisions
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

    ⚠️ Potential issue | 🟠 Major

    Document the required session variable for Row-Level Security policies.

    The RLS policies rely on current_setting('app.current_tenant')::UUID, but the specification doesn't document how this session variable should be set. The application must execute SET LOCAL app.current_tenant = '<tenant_uuid>' at the beginning of each transaction for RLS to work correctly.

    Without proper documentation, developers may encounter authorization failures or inadvertently expose data across tenant boundaries.

    Add a comment in the SQL schema or a dedicated section explaining:

    • How to set the session variable (e.g., via middleware, connection pooler)
    • When it must be set (per transaction vs per connection)
    • Error handling when the variable is not set
    • Example implementation pattern
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 492 - 498, Add
    documentation/comments explaining that the RLS policies
    tenant_isolation_documents and tenant_isolation_routing rely on
    current_setting('app.current_tenant')::UUID and that the application must set
    this via "SET LOCAL app.current_tenant = '<tenant_uuid>'" at the start of each
    transaction (or via middleware that runs per-transaction); note tradeoffs of
    per-transaction vs per-connection when using connection poolers, describe error
    handling if the setting is missing (e.g., detect and abort the transaction with
    a clear error or raise a custom NOTICE/ERROR), and include an example
    implementation pattern for middleware/connection wrapper that reads the
    authenticated tenant ID and issues the SET LOCAL before any DB statements for
    tables documents and routing_decisions.
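
One possible shape for the per-transaction wrapper requested above is sketched here. It assumes a DB-API 2.0 connection to PostgreSQL with autocommit off (for example psycopg2, which uses `%s` placeholders); the name `tenant_transaction` is hypothetical. It calls PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which has the same transaction-scoped effect as `SET LOCAL` but accepts a bind parameter, so the tenant ID is never interpolated into SQL text.

```python
import uuid
from contextlib import contextmanager

@contextmanager
def tenant_transaction(conn, tenant_id):
    """Run one transaction with app.current_tenant set for the RLS policies.

    `conn` is assumed to be a DB-API 2.0 PostgreSQL connection with
    autocommit disabled, so the first execute() opens the transaction.
    """
    # Validate before touching the database; raises ValueError if malformed.
    tenant = str(uuid.UUID(str(tenant_id)))
    cur = conn.cursor()
    try:
        # Transaction-local: the value is discarded at COMMIT or ROLLBACK,
        # so a pooled connection does not carry it to the next request.
        cur.execute(
            "SELECT set_config('app.current_tenant', %s, true)", (tenant,)
        )
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

A request handler would then wrap its queries as `with tenant_transaction(conn, tenant_id) as cur: cur.execute("SELECT id FROM documents")`. On the error-handling question: the single-argument `current_setting('app.current_tenant')` raises an error when the parameter has never been set in the session, so a request that bypasses the wrapper should fail rather than see another tenant's rows; this is worth confirming against the PostgreSQL version and pooler in use.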
    

    Comment on lines +534 to +538
    doc.routing.dlq:
    partitions: 6
    replication_factor: 3
    retention_ms: -1 # infinite retention for DLQ
    cleanup_policy: compact

    ⚠️ Potential issue | 🟠 Major

    Risk: Infinite retention on DLQ topic can cause unbounded storage growth.

    The doc.routing.dlq topic uses retention_ms: -1 (infinite retention) combined with cleanup_policy: compact. While this ensures failed messages are never lost, it can lead to unbounded storage growth if DLQ messages are not actively monitored and remediated.

    Consider:

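    • Setting a large but finite retention (e.g., 90 days: 7776000000ms). Kafka applies time-based retention only when the cleanup policy includes delete, so this also requires changing cleanup_policy to delete or compact,delete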
    • Implementing automated DLQ monitoring with alerting when depth exceeds thresholds
    • Creating a runbook for DLQ message inspection and reprocessing
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` around lines 534 - 538, The DLQ
    topic configuration doc.routing.dlq currently uses retention_ms: -1, which risks
    unbounded storage growth. Change retention_ms to a large but finite value (e.g.,
    7776000000 for 90 days) and check cleanup_policy: time-based retention only
    takes effect when the policy includes delete, so use delete or compact,delete
    rather than compact alone. Add operational controls as well: automated DLQ
    monitoring/alerting for topic depth and storage, and a runbook for
    inspecting/reprocessing messages (document procedures and thresholds alongside
    the doc.routing.dlq configuration).
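
As a quick check of the suggested value and a starting point for the alerting rule, here is a small sketch. The retention arithmetic is exact; the depth thresholds (`warn_at`, `page_at`) and the function names are hypothetical placeholders that the runbook would need to set.

```python
from datetime import timedelta

def retention_ms(days: int) -> int:
    """Convert a retention period in days to the milliseconds Kafka expects."""
    return timedelta(days=days) // timedelta(milliseconds=1)

def dlq_alert_level(depth: int, warn_at: int = 100, page_at: int = 1000) -> str:
    """Map the number of unprocessed DLQ messages to an alert level.

    The default thresholds are illustrative only, not taken from the spec.
    """
    if depth >= page_at:
        return "page"
    if depth >= warn_at:
        return "warn"
    return "ok"

print(retention_ms(90))        # 7776000000, the value quoted in the comment
print(dlq_alert_level(250))    # warn
```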
    


    # 5. Performance, Security and Compliance

    - **SLAs:** Document routing P99 <200ms; bottleneck prediction P95 <500ms; UI layout resolution P95 <100ms; system-wide availability 99.95% (43.8 min/month max downtime); all Kafka consumers maintain <5s end-to-end event processing latency; model retraining completes within 4-hour nightly batch window.
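
The availability figure in the SLA line above is worth a quick arithmetic check. Assuming an average month of 365/12 days (43,800 minutes), 99.95% availability allows about 21.9 minutes of downtime per month, while 43.8 minutes corresponds to 99.9%. The sketch below only reproduces that calculation; which of the two numbers the spec intends is not stated in the PR.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43,800 minutes in an average month

def max_downtime_minutes(availability_pct: float) -> float:
    """Monthly downtime budget implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_MONTH

for pct in (99.9, 99.95):
    print(f"{pct}% -> {max_downtime_minutes(pct):.1f} min/month")
```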

    ⚠️ Potential issue | 🟠 Major

    Conflicting latency requirements: Kafka consumer SLA vs GNN inference timeout.

    The performance requirements state "all Kafka consumers maintain <5s end-to-end event processing latency," but the Document Router Kafka consumer configuration (line 550) sets max_poll_interval_ms: 300000 (5 minutes) to accommodate GNN inference.

    This creates a conflict:

    • If GNN inference takes >5 seconds, the <5s end-to-end latency SLA is breached
    • The 5-minute timeout suggests inference could take much longer than 5 seconds
    • The stated GNN P99 latency target of <200ms should normally satisfy the 5s SLA, but the large timeout indicates potential outliers

    Clarify:

    1. Is the <5s latency requirement for P50, P95, or P99?
    2. How should the system behave when GNN inference exceeds 5 seconds? (e.g., timeout and route to DLQ, escalate to human review)
    3. Update either the max_poll_interval_ms or the latency SLA to be consistent
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@docs/specifications/workflow-ai-pro.xml` at line 1305, The doc has
    conflicting latency expectations: the global Kafka consumer SLA "<5s end-to-end
    event processing latency" conflicts with Document Router's consumer config
    max_poll_interval_ms: 300000, even though the GNN P99 target is <200ms. Update
    the spec to (1) state which percentile the "<5s" SLA refers to (P50/P95/P99);
    (2) define concrete behavior when GNN inference exceeds 5s for the Document
    Router Kafka consumer (e.g., enforce an inference timeout of 5s in the Document
    Router's inference handler, emit the message to the DLQ or mark it for human
    review, and increment a metric/alert); and (3) reconcile the config by either
    reducing max_poll_interval_ms or relaxing the SLA to accept longer tail
    latency. Because max_poll_interval_ms bounds the gap between poll() calls for a
    whole batch of records, not a single inference, size it from the per-message
    timeout and the batch size rather than setting it equal to the SLA. Reference
    Document Router, max_poll_interval_ms, the GNN inference path and the "<5s
    end-to-end event processing latency" SLA when making the change.
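
Option (2) above, a hard inference budget with a defined fallback, could look roughly like the following. This is a sketch under assumptions: `route_with_budget` and the returned dictionary shape are hypothetical, the 5-second default mirrors the SLA figure under discussion, and `infer` stands in for the GNN call. What happens to an escalated message (publishing to doc.routing.escalated or the DLQ, incrementing a metric) is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

INFERENCE_BUDGET_S = 5.0  # assumed to match the "<5s" SLA discussed above

_pool = ThreadPoolExecutor(max_workers=4)

def route_with_budget(infer, document, budget_s: float = INFERENCE_BUDGET_S):
    """Run `infer(document)` but give up waiting after `budget_s` seconds.

    On timeout the caller receives an 'escalated' result and can forward the
    message for human review. Note that Python cannot interrupt a running
    thread: the worker keeps going in the background, so a real model server
    should enforce its own timeout as well.
    """
    future = _pool.submit(infer, document)
    try:
        decision = future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # only effective if the task has not started yet
        return {"status": "escalated", "reason": "inference_timeout"}
    return {"status": "routed", "decision": decision}

# A fast stand-in model returns well within the budget.
print(route_with_budget(lambda doc: "approver-42", {"id": "doc-1"}))
```

With a bound like this in place, the worst-case time to process one polled batch is roughly the batch size times the budget, which gives a concrete basis for choosing max_poll_interval_ms.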
    

    @penify-dev penify-dev Bot added the enhancement New feature or request label Mar 20, 2026
    @difflens

    difflens Bot commented Mar 20, 2026

    View changes in DiffLens

    @penify-dev

    penify-dev Bot commented Mar 20, 2026

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    5, because the PR introduces a substantial XML document with detailed technical specifications, including complex architecture, API definitions, and database schemas. The review will require careful examination of the entire document to ensure accuracy and completeness.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Possible Bug: The XML structure must be validated against the appropriate XML schema to ensure it adheres to the expected format and standards.

    Documentation Clarity: Some sections may require clearer explanations or examples to ensure that all stakeholders can understand the technical specifications.

    🔒 Security concerns

    No

    @penify-dev

    penify-dev Bot commented Mar 20, 2026

    PR Code Suggestions ✨

    No code suggestions found for PR.

    @OneFineStarstuff OneFineStarstuff merged commit 4ffe209 into main Mar 20, 2026
    26 of 95 checks passed