
Feature: Evaluation history and behavioral regression detection across runs #165

@nanookclaw

Problem

Rogue evaluates agents at a single point in time: business policy compliance, red team vulnerability scores, and CVSS risk metrics. This works well for CI/CD gating and security audits, but it leaves a blind spot: behavioral regression between evaluation runs.

An agent that passes all 75 vulnerability categories today can gradually weaken over time — model updates, prompt drift, tool chain changes, fine-tuning decay. Current Rogue runs are stateless: each evaluation stands alone. There is no mechanism to answer "is this agent getting more vulnerable over time?" or "did this model update regress our policy compliance?"

Proposed Solution

1. Run-over-run comparison

Store evaluation results with agent ID + timestamp. On subsequent runs, compare against previous baselines:

  • New vulnerabilities introduced (category appeared where it was previously clean)
  • Score regression (CVSS score increased on a previously tested category)
  • Policy compliance drift (scenarios that previously passed now fail)
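
To make the comparison concrete, here is a minimal sketch of diffing two stored runs. The `EvalRun` schema, the field names, and the `compare_runs` helper are all hypothetical and are not part of Rogue's current API. The sketch also assumes that a category missing from the baseline counts as clean, and it keeps the first two buckets disjoint by reporting a score regression only for categories that were already vulnerable.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EvalRun:
    """One stored evaluation result (hypothetical schema, not Rogue's)."""
    agent_id: str
    timestamp: datetime
    cvss_by_category: dict[str, float]  # category -> CVSS score, 0.0 = clean
    scenario_passed: dict[str, bool]    # policy scenario -> passed?


def compare_runs(baseline: EvalRun, current: EvalRun) -> dict[str, list[str]]:
    """Diff two runs of the same agent into the three regression kinds."""
    if baseline.agent_id != current.agent_id:
        raise ValueError("runs belong to different agents")
    if current.timestamp <= baseline.timestamp:
        raise ValueError("current run must be newer than the baseline")

    old, new = baseline.cvss_by_category, current.cvss_by_category

    # Assumption: a category absent from the baseline counts as clean (0.0).
    new_vulnerabilities = sorted(
        cat for cat, score in new.items()
        if score > 0 and old.get(cat, 0.0) == 0
    )
    # Categories that were already vulnerable and whose CVSS score went up.
    score_regressions = sorted(
        cat for cat, score in new.items()
        if 0 < old.get(cat, 0.0) < score
    )
    # Scenarios that passed in the baseline and fail now.
    compliance_drift = sorted(
        name for name, passed in current.scenario_passed.items()
        if not passed and baseline.scenario_passed.get(name, False)
    )
    return {
        "new_vulnerabilities": new_vulnerabilities,
        "score_regressions": score_regressions,
        "compliance_drift": compliance_drift,
    }
```

Keying stored runs by agent ID and timestamp, as proposed above, is what lets the newest earlier run be picked as the baseline.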

2. Trend analysis

With ≥3 evaluation runs for the same agent, compute:

  • Drift velocity: rate of score change per run/time period
  • Regression alerts: flag when the cumulative score crosses a threshold or the trend direction reverses
  • Category-level tracking: which specific vulnerability categories are improving vs degrading
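
One possible reading of these metrics, sketched under stated assumptions: drift velocity is taken as the least-squares slope of a category's CVSS score against the run index (score change per run), a reversal is a sign change between the last two steps, and a category is labeled by comparing its slope to a tolerance. The function names and the default `tolerance` value are illustrative, not an existing Rogue interface, and higher CVSS is treated as worse.

```python
def drift_velocity(scores: list[float]) -> float:
    """Least-squares slope of score vs. run index (score change per run)."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least 3 runs to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def direction_reversed(scores: list[float]) -> bool:
    """True if the latest step moves opposite to the step before it."""
    if len(scores) < 3:
        return False
    previous_step = scores[-2] - scores[-3]
    latest_step = scores[-1] - scores[-2]
    return previous_step * latest_step < 0


def category_trends(
    history: dict[str, list[float]], tolerance: float = 0.05
) -> dict[str, str]:
    """Label each category; a rising CVSS score means the agent is degrading."""
    trends = {}
    for category, scores in history.items():
        velocity = drift_velocity(scores)
        if velocity > tolerance:
            trends[category] = "degrading"
        elif velocity < -tolerance:
            trends[category] = "improving"
        else:
            trends[category] = "stable"
    return trends
```

A per-time-period velocity would replace the run index with timestamps; the slope formula stays the same but needs the general mean of the x values.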

3. CI/CD integration

Beyond pass/fail, add a regression gate: "fail if any category regressed more than X from the baseline run." This catches gradual degradation that stays within absolute thresholds but represents a concerning trend.
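
A regression gate of this kind could look roughly like the sketch below. `regression_gate`, `gate_exit_code`, and the `max_increase` parameter (the "X" in the rule above) are hypothetical names, and the inputs are assumed to be plain category-to-CVSS mappings for the baseline and current runs.

```python
def regression_gate(
    baseline: dict[str, float],
    current: dict[str, float],
    max_increase: float,
) -> dict[str, float]:
    """Return {category: score increase} for categories over the limit."""
    violations = {}
    for category, base_score in baseline.items():
        if category not in current:
            continue  # category not tested in the current run; nothing to compare
        delta = current[category] - base_score
        if delta > max_increase:
            violations[category] = delta
    return violations


def gate_exit_code(
    baseline: dict[str, float],
    current: dict[str, float],
    max_increase: float,
) -> int:
    """Print each violation and return a process exit code (0 = pass)."""
    violations = regression_gate(baseline, current, max_increase)
    for category, delta in sorted(violations.items()):
        print(f"REGRESSION {category}: +{delta:.1f} over baseline")
    return 1 if violations else 0
```

A CI step could end with `sys.exit(gate_exit_code(baseline, current, max_increase=1.0))`. A category that moves from 3.0 to 5.5 would fail this gate even if an absolute threshold of, say, 7.0 still passes.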

Why This Matters

I have been running behavioral drift measurement on autonomous agents for 28 days in production. Key finding: agents that pass all point-in-time evaluations can still show 7% cumulative behavioral divergence when measured longitudinally. The degradation is often non-monotonic — scores improve, then degrade, then partially recover — making it invisible to single-run evaluations.

Detailed methodology and pilot data:

Rogue's existing CVSS scoring, category taxonomy, and CI/CD pipeline integration make it an ideal surface for longitudinal behavioral tracking. The infrastructure is already there — it just needs historical state and comparison logic.

Prior Art

Related work in adjacent projects:

  • NexusGuard AIP: Integrated PDR behavioral scoring alongside vouch chain trust (live endpoint)
  • OWASP Agent Observability Standard: Issue #73 proposing longitudinal behavioral drift as a new dimension
  • Exgentic: Issue #23 proposing run-over-run regression detection for universal evaluation
