Security Incident Response Runbook

This runbook provides step-by-step procedures for responding to security incidents in the pi extension runtime. It maps directly to implemented controls in src/extensions.rs and src/config.rs.

Severity Classification

Level	Label	Example	Response Time	Escalation
P0	Critical	Policy boundary breached, data exfiltration detected	< 1 hour	Immediate kill-switch + rollback
P1	High	Risk controller degraded, sustained deny escalation	< 4 hours	Investigation + potential rollback
P2	Medium	SLO budget at risk, elevated FP rate	< 1 sprint	Root cause analysis + tuning
P3	Low	Informational, single transient anomaly	Next sprint	Log and backlog

Decision Tree: Incoming Alert

Alert received
├── Is extension quarantined (Terminate state)?
│   └── YES → Go to: Quarantine Investigation (P0/P1)
├── Is it a policy denial (exec/env denied)?
│   ├── Single occurrence → Log, monitor (P3)
│   └── Repeated burst (>3 in 60s) → Go to: Burst Denial Triage (P1)
├── Is it an enforcement state transition?
│   ├── Escalation (Allow→Harden→Deny) → Go to: Escalation Review (P2)
│   └── De-escalation → Log, verify cooldown worked (P3)
├── Is it a rollback trigger event?
│   └── YES → Go to: Automatic Rollback Verification (P1)
└── Is it a ledger integrity failure?
    └── YES → Go to: Ledger Corruption Recovery (P0)

Procedure 1: Quarantine Investigation (P0/P1)

Trigger: Extension reaches Terminate enforcement state (3+ consecutive unsafe evaluations).

Step 1 — Verify quarantine is active

Check the extension's enforcement state via the runtime risk ledger:

// Programmatic: ExtensionManager API
let state = manager.runtime_risk_config();
assert!(state.enabled, "risk controller must be enabled");

// Check per-extension state
let ledger = manager.runtime_risk_ledger_artifact();
let ext_entries: Vec<_> = ledger.entries.iter()
    .filter(|e| e.extension_id.as_deref() == Some("suspect-ext-id"))
    .collect();
// Last entry should show action = "terminate"

Expected artifact: RuntimeRiskLedgerEntry with action: "terminate" and consecutive unsafe count >= 3.

Step 2 — Assess threat severity

Review the hostcall telemetry for the quarantined extension:

let telemetry = manager.runtime_hostcall_telemetry_artifact();
let suspicious: Vec<_> = telemetry.entries.iter()
    .filter(|e| e.extension_id.as_deref() == Some("suspect-ext-id"))
    .collect();
// Look for: capability="exec", unusual param patterns, resource_target_class

Questions to answer:

What capabilities was the extension requesting? (exec, http, fs/write)
Were the targets unusual? (system paths, external URLs)
Was there a burst pattern? (many calls in short window)

Step 3 — Execute response

For confirmed threats:

Activate kill-switch: manager.set_kill_switch("suspect-ext-id", true, "incident-ref")
Verify: Check that subsequent calls from the extension are blocked
Export evidence bundle for forensic review

For false positives:

Document the FP in the incident log
Consider tuning alpha or window_size (see Policy Tuning Guide)
Reset extension state if appropriate

Step 4 — Verify and close

Confirm kill-switch is active in trust state
Verify ledger hash chain integrity: verify_runtime_risk_ledger_artifact(&ledger)
Create post-incident bead with linked evidence

Procedure 2: Burst Denial Triage (P1)

Trigger: 3+ policy denials from a single extension within 60 seconds.

Step 1 — Identify the extension and denied capabilities

Check security alerts:

let alerts = manager.security_alerts();
let recent: Vec<_> = alerts.iter()
    .filter(|a| a.extension_id.as_deref() == Some("ext-id"))
    .filter(|a| a.ts_ms > (now_ms - 60_000))
    .collect();

Step 2 — Determine if legitimate

Is the extension newly installed? May need capability grants
Did policy change recently? Check policy source (cli, env, config)
Is the extension attempting capabilities it has never used before?

Step 3 — Respond

If legitimate need:

Add per-extension capability override in policy config
Verify the override takes effect
Monitor for 24 hours

If suspicious:

Let denial continue (policy is working correctly)
Escalate to P0 if extension appears to be probing permissions
Consider activating kill-switch preemptively

Procedure 3: Escalation Review (P2)

Trigger: Enforcement state escalated (e.g., Allow → Harden, or Harden → Deny).

Step 1 — Review the transition

// The enforcement transition is recorded in telemetry
let telemetry = manager.runtime_hostcall_telemetry_artifact();
// Look for entries where risk_action changed

Step 2 — Evaluate if expected

Check the risk score that triggered escalation against score band thresholds
Review the extension's recent hostcall pattern
Verify hysteresis is working (no rapid flapping)

Step 3 — Tune if needed

If escalation is too aggressive:

Increase score band thresholds (see Policy Tuning Guide)
Increase cooldown_calls for slower de-escalation
Consider switching to a more permissive policy profile

If escalation is correct:

Log and monitor
No action needed — the state machine is working as designed

Procedure 4: Automatic Rollback Verification (P1)

Trigger: Rollback trigger fired, rollout phase reverted to Shadow.

Step 1 — Verify rollback occurred

let state = manager.rollout_state();
assert_eq!(state.phase, RolloutPhase::Shadow);
assert!(state.rolled_back_from.is_some());
// rolled_back_from tells you what phase we were in

Step 2 — Identify trigger cause

Check the window statistics:

let stats = state.window_stats;
// Check which threshold was breached:
// - stats.false_positive_count / stats.total_decisions > max_false_positive_rate?
// - stats.error_count / stats.total_decisions > max_error_rate?
// - stats.avg_latency_ms > max_latency_ms?

Step 3 — Root cause analysis

Trigger	Likely Cause	Remediation
High FP rate	Score thresholds too aggressive	Raise band thresholds or alpha
High error rate	Controller bug or resource exhaustion	Check logs, fix bug, increase timeout
High latency	Window size too large or system load	Reduce window_size or decision_timeout_ms

Step 4 — Recovery

Fix the root cause
Gradually re-advance rollout: manager.advance_rollout()
Monitor window stats at each phase before advancing further
Document in incident bead

Procedure 5: Ledger Corruption Recovery (P0)

Trigger: Hash chain verification fails on the runtime risk ledger.

Step 1 — Confirm corruption

let ledger = manager.runtime_risk_ledger_artifact();
let report = verify_runtime_risk_ledger_artifact(&ledger);
assert!(!report.valid, "corruption confirmed");
// report.first_invalid_index tells you where the chain broke

Step 2 — Assess impact

How many entries are affected?
When did corruption start? (check timestamps)
Is the risk controller still making correct decisions?

Step 3 — Immediate response

Roll back enforcement to Shadow mode: manager.set_rollout_phase(RolloutPhase::Shadow)
Export the corrupted ledger as evidence
Restart the extension manager to get a fresh ledger

Step 4 — Investigation

Check for concurrent modification bugs
Check for memory corruption indicators
Review recent code changes to ledger-touching paths

Expected artifacts:

Corrupted ledger export (JSON)
RuntimeRiskLedgerVerificationReport showing valid: false
Incident evidence bundle

Procedure 6: Evidence Bundle Export

When to use: After any P0-P2 incident, or when forensic evidence must be preserved for audit or sharing.

Step 1 -- Collect raw artifacts

let ledger = manager.runtime_risk_ledger_artifact();
let alerts = manager.security_alert_artifact();
let telemetry = manager.runtime_hostcall_telemetry_artifact();
let exec = manager.exec_mediation_artifact();
let secret = manager.secret_broker_artifact();
let quota_breaches = manager.quota_breach_events();

Step 2 -- Define scope with a filter

use pi::extensions::{
    IncidentBundleFilter, SecurityAlertCategory, SecurityAlertSeverity,
};

let filter = IncidentBundleFilter {
    start_ms: Some(incident_start_ms),   // Time window start
    end_ms: Some(incident_end_ms),       // Time window end
    extension_id: Some("suspect-ext".into()),  // Scope to one extension
    alert_categories: Some(vec![         // Limit to relevant categories
        SecurityAlertCategory::ExecMediation,
        SecurityAlertCategory::AnomalyDenial,
    ]),
    min_severity: Some(SecurityAlertSeverity::Warning),
};

Step 3 -- Set redaction policy

For external sharing (redact all hashes):

use pi::extensions::IncidentBundleRedactionPolicy;

let redaction = IncidentBundleRedactionPolicy {
    redact_params_hash: true,
    redact_context_hash: true,
    redact_args_shape_hash: true,
    redact_command_hash: true,
    redact_name_hash: true,
    redact_remediation: false,  // Keep human-readable remediation text
};

For internal investigation (no redaction):

let redaction = IncidentBundleRedactionPolicy {
    redact_params_hash: false,
    redact_context_hash: false,
    redact_args_shape_hash: false,
    redact_command_hash: false,
    redact_name_hash: false,
    redact_remediation: false,
};

Step 4 -- Build the bundle

use pi::extensions::build_incident_evidence_bundle;

let bundle = build_incident_evidence_bundle(
    &ledger, &alerts, &telemetry, &exec, &secret,
    &quota_breaches, &filter, &redaction, now_ms,
);

Step 5 -- Verify integrity

use pi::extensions::verify_incident_evidence_bundle;

let report = verify_incident_evidence_bundle(&bundle);
assert!(report.valid, "Bundle integrity check failed: {:?}", report.errors);
assert!(report.ledger_chain_intact, "Ledger hash chain broken");
assert_eq!(report.bundle_hash, report.recomputed_hash, "Hash mismatch");

Step 6 -- Review bundle summary

let summary = &bundle.summary;
println!("Ledger entries: {}", summary.ledger_entry_count);
println!("Alerts: {}", summary.alert_count);
println!("Telemetry events: {}", summary.telemetry_event_count);
println!("Exec mediation: {}", summary.exec_mediation_count);
println!("Secret broker: {}", summary.secret_broker_count);
println!("Quota breaches: {}", summary.quota_breach_count);
println!("Distinct extensions: {}", summary.distinct_extensions);
println!("Peak risk score: {:.3}", summary.peak_risk_score);
println!("Deny/Terminate count: {}", summary.deny_or_terminate_count);
println!("Ledger chain intact: {}", summary.ledger_chain_intact);

Step 7 -- Forensic replay (optional)

Reconstruct the decision sequence step-by-step:

use pi::extensions::replay_runtime_risk_ledger_artifact;

let replay = replay_runtime_risk_ledger_artifact(&ledger)?;
for step in &replay.steps {
    println!("[{}] ext={} cap={}.{} score={:.3} action={:?} state={:?}",
        step.ts_ms, step.extension_id, step.capability, step.method,
        step.risk_score, step.selected_action, step.derived_state);
}

Expected artifacts:

IncidentEvidenceBundle (schema pi.ext.incident_evidence_bundle.v1)
IncidentBundleVerificationReport confirming valid: true
Optional RuntimeRiskReplayArtifact for decision reconstruction

Evidence Collection Checklist

All incidents should produce an evidence bundle containing:

Runtime risk ledger -- hash-chained decision history
Hostcall telemetry -- per-call feature vectors and scores
Security alerts -- alert stream filtered to incident timeframe
Exec mediation log -- command classifications and decisions
Secret broker log -- redaction decisions
Quota breach events -- resource limit violations
Rollout state snapshot -- phase, enforce flag, window stats
Extension trust state -- current trust level and kill-switch audit

Bundle integrity is verified via SHA-256 hashing. The verify_incident_evidence_bundle() function confirms bundle hash and ledger chain validity.

Post-Incident Checklist

Incident severity classified (P0-P3)
Immediate response executed per procedure above
Evidence bundle collected and hash verified
Root cause identified
Remediation applied (config tuning, bug fix, policy update)
Verification steps confirm remediation worked
Incident bead created with linked evidence artifacts
Rollout phase restored (if rolled back) with monitoring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security Incident Response Runbook

Severity Classification

Decision Tree: Incoming Alert

Procedure 1: Quarantine Investigation (P0/P1)

Procedure 2: Burst Denial Triage (P1)

Procedure 3: Escalation Review (P2)

Procedure 4: Automatic Rollback Verification (P1)

Procedure 5: Ledger Corruption Recovery (P0)

Procedure 6: Evidence Bundle Export

Evidence Collection Checklist

Post-Incident Checklist

FilesExpand file tree

incident-response-runbook.md

Latest commit

History

incident-response-runbook.md

File metadata and controls

Security Incident Response Runbook

Severity Classification

Decision Tree: Incoming Alert

Procedure 1: Quarantine Investigation (P0/P1)

Procedure 2: Burst Denial Triage (P1)

Procedure 3: Escalation Review (P2)

Procedure 4: Automatic Rollback Verification (P1)

Procedure 5: Ledger Corruption Recovery (P0)

Procedure 6: Evidence Bundle Export

Evidence Collection Checklist

Post-Incident Checklist