diff --git a/docs/blog/v0.9.x-milestone.md b/docs/blog/v0.9.x-milestone.md new file mode 100644 index 0000000..78eeffd --- /dev/null +++ b/docs/blog/v0.9.x-milestone.md @@ -0,0 +1,290 @@ +# v0.9.x milestone: NC kernel honesty, RTA→WCTT coupling, and a peek at v0.10.0 trace-topology + +Status: **draft** — milestone wrap-up for the v0.9.x cycle. +Last update: 2026-05-10. +Audience: spar users tracking releases, OEMs comparing against +commercial NC arms, anyone hunting the v0.10.0 trace-topology +foundation. + +> **Why care.** Between 2026-04-29 (v0.9.0) and 2026-05-10 (v0.9.3), +> spar grew an MCP server, a runtime-trace assistant, and four +> reviewer-driven tightness passes on the Network-Calculus kernel. +> The output is byte-identical to v0.8.1 when the new properties are +> unset — every change is opt-in or only flips a previously +> optimistic bound. End-state: 2,900+ tests across 19 crates, 131 +> predeclared properties, four Lean theorem-statement skeletons +> filed for v1.0.0 discharge, and the v0.10.0 trace-topology +> foundation merged with four runtime-artefact parsers in tree. + +--- + +## What landed in v0.9.x + +Four releases, each with a sharp focus. + +### v0.9.0 — MCP + spar-insight + +(see [CHANGELOG.md](../../CHANGELOG.md#090--2026-04-29)) + +LLM agents got first-class access to the move-oracle via a JSON-RPC +2.0 MCP server (`spar mcp serve`, or the standalone `spar-mcp` +binary). Three tools, all `readOnlyHint: true` and +`idempotentHint: true`: + +- `spar.verify_move` — single hypothetical-rebinding check. +- `spar.enumerate_moves` — design-space exploration with + multi-objective ranking (`max-response | total-load | + total-power | total-weight | balanced`). +- `spar.check_chain` — end-to-end latency breakdown for a flow + chain. + +The deterministic-apply path stays CLI-exclusive by design: there is +no `spar.apply_move` over MCP, ever. LLM agents propose, spar +verifies deterministically, the apply path is replayable from the +command line. The certification chain stays inside spar. + +`spar-insight` (Track G) shipped alongside as the runtime-trace +discrepancy assistant. Five rules on probe-point CTF traces from +Zephyr: `WcetViolated`, `BcetUnderestimated`, `MeanDrift`, +`MissingProbe`, `UnobservedProbe`. Binary CTF and ITM/SWO ingest +deferred to a v0.9.x follow-up. + +### v0.9.1 — NC kernel soundness (Tier-2 reviewer items) + +(see [CHANGELOG.md](../../CHANGELOG.md#091--2026-04-29)) + +The v0.8.1 TSN bounds were *technically optimistic* in two ways. PR +[#186](https://github.com/pulseengine/spar/pull/186) closed both: + +- **gPTP sync-error ε.** The new `Spar_TSN::Sync_Error` property + feeds the TAS dispatch in `wctt.rs`: ε subtracts from the + effective open time and inflates the worst-case gate latency. + Without ε, a frame could miss its window by ε in silicon — an + unsound NC reading. +- **Atomic-frame quantization.** `wctt.rs` now adds + `ceil(max_frame_bytes · 8e12 / link_rate_bps)` ps per hop on the + TAS and FIFO/Priority arms. Bytes-level NC under-counts by up to + one MTU per hop because frames are atomic. + +Both changes are byte-identical to v0.8.1 when the new properties +are unset (ε = 0, default `Max_Frame_Size`). Users who calibrated +against v0.8.1 numbers will see ≈ +12.144 µs per 1 Gbps hop with +1518 B MTU — the `WcttFrameQuantization` Info diagnostic flags +exactly where. + +### v0.9.2 — Honesty + tightness, sensitivity, ARINC severity + +(see [CHANGELOG.md](../../CHANGELOG.md#092--2026-05-03)) + +Six PRs. The highlights for users: + +- **RTA→WCTT release-jitter coupling + ([#199](https://github.com/pulseengine/spar/pull/199)).** When a + stream source declares `Timing_Properties::Dispatch_Jitter`, + that J is now treated as ingress release-jitter and inflates the + arrival burst σ by `ρ·J` bytes. Single biggest credibility lift + in the v0.9.x cycle — closes the reviewer's NC top-5 item #4. +- **WCTT sensitivity output + ([#196](https://github.com/pulseengine/spar/pull/196)).** Every + `WcttBound` Info is now followed by a `WcttSensitivity` Info + carrying worst-hop partial derivatives + `∂σ_self / ∂ρ_competing / ∂T_link`. Pure post-processing on the + closed-form. Reviewer NC top-5 item #13: cheapest workflow win, + turns spar from judge into design partner. +- **Context_Switch_Time in RTA recurrence + ([#198](https://github.com/pulseengine/spar/pull/198)).** v0.8.x + emitted a STPA-REQ-022 advisory if `Context_Switch_Time` was + unset, then quietly used B=0 when it was set. Now the value + inflates each thread's WCET by `2 × Context_Switch_Time` per + Buttazzo §7 (one preemption-in + one preemption-out). +- **CBS user-tunable Hi/Lo Credit + ([#195](https://github.com/pulseengine/spar/pull/195)).** v0.8.1 + hardcoded both credits to `Max_Frame_Size`. Real Qcc/YANG configs + carry these per traffic class; now they are honoured. +- **ARINC-PARTITION-ISOLATION promoted to Error.** Per DO-297 + spatial-isolation invariant. New `--allow + arinc-partition-isolation` for legitimate IMA bypasses. + +Tier A reviewer items #5/#6/#8/#9/#13 closed in one release. + +### v0.9.3 — Piecewise-affine arrival curves + PMOO/LUDB + +(see [CHANGELOG.md](../../CHANGELOG.md#093--2026-05-10)) + +Two PRs, both reviewer NC top-5 closeouts: + +- **Piecewise-affine arrival curves + ([#204](https://github.com/pulseengine/spar/pull/204)).** New + `PiecewiseAffineArrivalCurve` alongside the affine form. + Multi-leaky-bucket `(σ₁, ρ₁) ∧ (σ₂, ρ₂) ∧ …` exactly captures + "burst of N frames at line rate, then sustained at 100 Mbps" — + the canonical T-SPEC pattern in real ADAS/TSN traffic. Operators + take the min over buckets, which is the mathematically tight + choice. Reviewer NC top-5 item #1. +- **PMOO/LUDB tree-shaped multiplexing + ([#203](https://github.com/pulseengine/spar/pull/203)).** New + `ludb_bound(tagged, competing, services)` for the + zonal/automotive pattern. Numerical confirmation: **27.4 % + tighter than SFA** on a 3-hop / 3-competing test, **23.7 % + tighter** on 5-source zonal aggregation. Closes the "spar says + 31.7 ms vs RTaW says 8 ms" credibility gap on tree topologies. + Reviewer NC top-5 item #2. + +That makes 5/5 reviewer NC top-5 items closed across v0.9.1 / v0.9.2 +/ v0.9.3. The post-v0.9.0 reviewer narrative is now: NC arm has +parity with RTaW-Pegase on the workloads where bounds matter most. + +--- + +## DSLTrans-style cutoff analysis on modal filtering + +Lucio's recent Cutoff Theorem (arXiv +[2604.18792](https://arxiv.org/abs/2604.18792)) proves that bounded +model checking is *complete* for a precise DSLTrans fragment plus +positive existence / traceability properties — turning an +otherwise-infinite search into a finite computable bound. The +benchmark line in that paper is sharp: 552 properties proved, 345 +counterexamples, 2 undecided across 29 transformations. + +We asked: does this apply to spar's **modal filtering** pass — the +helpers in `crates/spar-analysis/src/modal.rs` that project a +`SystemInstance` to the components and connections active in a given +System Operational Mode? + +Short answer: **yes for the positive case, no for the rest.** The +filter is non-recursive, has no negation in matchers, and is +layered (component filter and connection filter are independent), +so it fits the fragment. For the property *"every element declared +`in modes (s)` is preserved in the SOM-projection"*, the cutoff +bound is concrete and small: + +``` +K_filter = 1 + max(C, E) +``` + +where `C` is the largest `in_modes` list on any component and `E` +the same for any connection. On `test-data/parser/modes_test.aadl` +that gives **K = 3** — three mode-name labels per element suffice to +exhaust the filter's behavioural equivalence classes. + +The theorem does **not** apply to: negative properties ("no inactive +connection appears"); the cartesian-product SOM enumeration itself +(`|SOMs| = ∏ |modes(c)|` is exponential in the input, not a +verification artefact); fixed-point downstream passes (RTA +recurrence, WCTT chain compose); and mode reachability (a graph +reachability problem, deliberately exported to NuSMV via +`mode_reachability::export_smv`). + +Full analysis with discharge plan (Kani harness vs capped proptest): +[`docs/research/cutoff-theorem-modal-filtering.md`](../research/cutoff-theorem-modal-filtering.md). + +--- + +## Forge-style navigation evidence — protocol shipped, trials pending + +Jin et al. (arXiv +[2604.13108](https://arxiv.org/abs/2604.13108)) reported that +conditioning an LLM agent with an architecture descriptor reduces +navigation steps by **33–44 %** (Wilcoxon p=0.009, Cohen's d=0.92) +on a controlled 24-task × 4-condition benchmark. Auto-generated +descriptors hit 100 % accuracy vs 80 % blind on a separate 15-task +artefact-vs-process arm. S-expressions detected all structural +completeness errors where JSON failed atomically and YAML silently +corrupted ~50 % of errors. + +We mirror Jin's protocol on spar itself: 24 paired tasks, two arms +(C0: no descriptor; C1: with `descriptor.sexp`), Wilcoxon +signed-rank with α=0.05. The scaffolding is in tree: + +- `docs/research/forge-replication/descriptor.sexp` — S-expression + descriptor of the spar workspace. +- `docs/research/forge-replication/tasks.json` — 24 code-localization + tasks pinned to commit `bad85e6`. +- `docs/research/forge-replication/run_trial.sh` — bash trial runner + that drives `claude --print --output-format stream-json` and + scores transcripts. +- `docs/research/forge-replication/RESULTS.md` — empty results + template with pre-registration prompts. + +We have not run trials. Numbers will land in `RESULTS.md` once the +user (or a community contributor) drives the scaffolding. The +protocol document is honest about deviations from Jin's setup +(self-target confound, hand-curated descriptor, no field-study arm). +Pre-registration is the discipline; we record it in `RESULTS.md` +before any trial fires. + +Full protocol: +[`docs/research/forge-replication-protocol.md`](../research/forge-replication-protocol.md). + +--- + +## What's next: v0.10.0 trace-topology + +(design: +[`docs/designs/v0.10.0-trace-topology.md`](../designs/v0.10.0-trace-topology.md)) + +v0.10.0 ships the foundation for `spar trace topology`: the +runtime/declared reconciler that asks *"does the AADL declaration +match the wire?"* on a real OEM deployment. + +Four parsers, all now landed: + +- **PCAPNG FrameSource** + ([#210](https://github.com/pulseengine/spar/pull/210)) — Tier 1 + L2 frame ingest from `tcpdump -w` / Wireshark captures. +- **LLDP JSON TopologySource** + ([#208](https://github.com/pulseengine/spar/pull/208)) — neighbor + adjacency from `lldpctl -f json`. +- **Qcc YANG ConfigSource** — TSN switch config from + NETCONF/RESTCONF. +- **gPTP JSON PtpTimeSource** + ([#211](https://github.com/pulseengine/spar/pull/211)) — + per-port ε from gPTP logs. + +Five deterministic checks on the parsed-vs-declared diff: + +- `IdentityUnknown` — runtime saw an identity (MAC, + chassis-id/port-id, stream-handle, multicast group) the AADL + didn't declare. +- `TopologyMissingWiring` — LLDP saw an adjacency without a + corresponding AADL `bus access` connection. +- `ConfigDrift` — `Spar_TSN::Gate_Control_List` (and 7 sibling + properties) on the wire disagree with the declaration. +- `GptpOutOfBudget` — observed ε exceeds the declared + `Spar_TSN::Sync_Error`. This is the soundness guard for v0.9.1's + TAS bounds: if it fires, the WCTT readings were optimistic. +- `BinaryMismatch` — running image digest disagrees with the + declared one. + +Output: SARIF 2.1.0 + a signed in-toto v1.0 attestation. Predicate +URL (the integration anchor for rivet, witness, and OEM internal +certification kits): + +``` +https://pulseengine.eu/spar-trace-topology/v1 +``` + +Phasing: v0.10.0 = foundation + parsers; v0.11.0 = engine + SARIF; +v1.0 = signed attestation. Estimated total scope ~2 weeks per the +external reviewer's v1 spec. + +--- + +## Call to action + +- **Repo.** [pulseengine/spar](https://github.com/pulseengine/spar). +- **v0.10.0 design doc.** + [`docs/designs/v0.10.0-trace-topology.md`](../designs/v0.10.0-trace-topology.md). +- **Forge-trial contributors welcome.** The scaffold in + `docs/research/forge-replication/` is ready to run. If you have + cycles to drive 48 paired trials (24 × 2 arms) on a fresh agent, + please file the results.csv + transcripts as a PR against + `docs/research/forge-replication/RESULTS.md`. +- **NC reviewer items.** Five of the post-v0.9.0 top-5 closed; the + follow-up list (multi-grouping LUDB LP, full RTA→WCTT + auto-propagation, `Interrupt_Overhead` per ISR firing) is logged + in the CHANGELOG's "Deferred" sections. + +Track the v0.10.x trace-topology rollout via the +[milestone](https://github.com/pulseengine/spar/milestones) and the +five-check reconciler engine landing in v0.11.0. diff --git a/docs/research/cutoff-theorem-modal-filtering.md b/docs/research/cutoff-theorem-modal-filtering.md new file mode 100644 index 0000000..f75ae47 --- /dev/null +++ b/docs/research/cutoff-theorem-modal-filtering.md @@ -0,0 +1,318 @@ +# Cutoff-theorem reasoning applied to spar's modal filtering pass + +Status: **research note** — applies Lucio's DSLTrans Cutoff Theorem +(arXiv [2604.18792v2](https://arxiv.org/abs/2604.18792)) to spar's +per-System-Operational-Mode (SOM) analysis pipeline. +Last update: 2026-05-10. +Audience: spar maintainers thinking about Kani / proptest harnesses, +external readers comparing spar's verification posture against the +DSLTrans tractable-verification line. + +> **Executive summary.** spar's modal filter — the helpers in +> `crates/spar-analysis/src/modal.rs` that decide whether a component +> or connection is active in a given SOM — fits the DSLTrans +> *positive-existence* fragment for the property "every element +> declared `in modes (s)` is preserved in the SOM-projection". For +> that property the cutoff bound is concrete and small: **K = 1 + C + +> E** where C is the maximum number of mode names listed in any one +> component's `in_modes` and E the same for any one connection. The +> theorem does **not** apply to (i) negative properties like "no +> inactive connection is preserved", (ii) downstream passes that +> close fixed-points (RTA recurrence, WCTT propagation, mode +> reachability), or (iii) the cartesian-product enumeration that +> builds the SOM set in the first place. The bound is therefore +> useful for verifying the filter, not the entire per-SOM pipeline. + +--- + +## §1 What the modal filtering pass does in spar + +The flattened AADL instance model in spar carries two pieces of modal +state: + +- `ComponentInstance::in_modes: Vec` + (`crates/spar-hir-def/src/instance.rs:84`). +- The list of `SystemOperationMode { name, mode_selections }` on the + root `SystemInstance` (`crates/spar-hir-def/src/instance.rs:62`). + Each SOM is one element of the cartesian product of every modal + subcomponent's declared modes (AS5506 §12, + `crates/spar-hir-def/src/instance.rs:31-43`). + +The filter is three pure-function predicates in `modal.rs`: + +- `is_component_active_in_som(instance, comp_idx, som)` + (`crates/spar-analysis/src/modal.rs:42-75`). +- `is_connection_active_in_som(instance, conn_owner, conn_in_modes, + som)` (`crates/spar-analysis/src/modal.rs:82-105`). +- `is_active_in_mode(in_modes, current_mode)` — single-mode variant + used by analyses with no SOM context + (`crates/spar-analysis/src/modal.rs:107-120`). + +Each predicate is a constant-depth boolean expression over (a) the +element's declared `in_modes`, (b) the SOM's `mode_selections`, and +(c) case-insensitive name equality. There is no recursion, no +fixed-point, and no negation other than the early-return +`!in_modes.is_empty()` guard. + +Eight `ModalAnalysis` impls consume these predicates today +(`grep "impl ModalAnalysis for" crates/spar-analysis/src`): +connectivity, binding-check, memory-budget, resource-budget, +weight-power, ARINC-653, bus-bandwidth, scheduling. The orchestrator +`AnalysisRunner::run_all_per_som` +(`crates/spar-analysis/src/lib.rs:290-311`) calls +`analyze_in_mode(instance, som)` once per SOM and prefixes every +emitted diagnostic with `[mode: ]`. + +## §2 DSLTrans-fragment fit + +Lucio's Cutoff Theorem (arXiv 2604.18792, abstract; theorem +statements in the paper's §3-§4) admits a DSLTrans fragment +characterised by: no negation in match patterns, no recursion, and +layered rules. The theorem proves that bounded model checking is +**complete** for *positive existence* and *traceability* properties +within that fragment — i.e. any counterexample is witnessed at some +finite bound K computable from the rule set. + +The modal filter is not literally a DSLTrans transformation, but +viewed as a model-to-model projection +`filter : (Instance, SOM) → Instance` it satisfies the structural +preconditions: + +| DSLTrans fragment requirement | Modal filter status | +|---|---| +| No negation in matchers | Match: filter uses positive predicates (`a.eq_ignore_ascii_case(b)`); the `is_empty()` guard short-circuits to *accept*, not reject. | +| No recursion | Match: each predicate is constant-depth. | +| Layered rules | Match: component-filter and connection-filter are independent passes over disjoint arenas. | +| Bounded fan-out per rule | Match: a component sees at most `\|in_modes\|` mode names and the SOM has at most one selection per parent; both are statically bounded by the input. | + +The hierarchy-walk inside `is_component_active_in_som` +(`modal.rs:55-71`) reads the parent index once and scans the SOM's +`mode_selections` linearly — no recursive descent on the component +tree. That is the structural property the cutoff theorem leans on. + +## §3 The positive-existence property and its bound K + +**Property to verify.** Pseudo-DSLTrans notation, against a fixed SOM +`s`: + +``` +∀ component c ∈ instance.components : + c.in_modes = ∅ + ∨ ∃ (parent_sel, mode_inst) ∈ s.mode_selections : + mode_inst.owner = c.parent + ∧ ∃ m ∈ c.in_modes : m ≈ mode_inst.name + ⟹ c is active in projection(instance, s) +``` + +This is a positive existence statement: "for every element matching a +positive antecedent, the projection contains the witness". No +universally-quantified negative ("no inactive element appears") and +no recursion (the antecedent is depth-1 over the parent edge). + +**The bound K.** The cutoff theorem says we can witness any +counterexample at some K derivable from per-class bounds (paper §3, +"per-class bounds"). For our property, the only unbounded inputs are +(a) the size of `in_modes`, (b) the number of `mode_selections` in +the SOM. Constructing a counterexample requires: + +- one component with `in_modes ≠ ∅` (1 element); +- enough mode names in `c.in_modes` to exhibit every + case-insensitive mismatch class against the SOM's owner mode + (≤ C distinct names); +- enough entries in the SOM to exhibit at least one selection on a + non-matching owner (≤ 1 entry is enough for the falsifying case). + +Define + +``` +C = max over components c of |c.in_modes| +E = max over connections k of |k.in_modes| +M = max over modal components of |modes| (used for the SOM-set bound, see §5) +P = total number of modal subcomponents +``` + +Then the cutoff for the **filter-correctness property** alone is + +``` +K_filter = 1 + max(C, E) +``` + +A counterexample to "active-in-SOM ⟹ preserved" or to "preserved ⟹ +active-in-SOM" exists in the full model iff it exists in some +sub-instance with at most `K_filter` mode-name labels per element. +This follows because all comparisons are case-insensitive equality on +`Name`s; once we exhaust the equivalence classes of strings present +in the input, adding a (`K_filter` + 1)-th name only repeats an +existing equivalence class. + +## §4 Worked estimate of K on a real spar model + +Take `test-data/parser/modes_test.aadl` (the canonical multi-mode +example): + +- `Modal_System.impl` (top): 3 declared modes + (`nominal`, `degraded`, `emergency`) on the parent + (`modes_test.aadl:9-13`). +- `normal_proc` with `in modes (nominal)` ⇒ C contribution 1. +- `backup_proc` with `in modes (degraded, emergency)` ⇒ C + contribution 2. +- Connections `c1` ⇒ E contribution 1; `c2` ⇒ E contribution 2. + +So `C = 2`, `E = 2`, **K_filter = 3**. Three labels per element +suffice to exhaustively test the filter on any instance with the same +mode-name vocabulary. A Kani harness or capped proptest that +enumerates components with `|in_modes| ∈ {0, 1, 2, 3}` against SOMs +with `|mode_selections| ∈ {0, 1}` covers every behavioural +equivalence class of the filter. + +For the larger `test-data/aadl2rust/connection_in_modes.aadl` fixture +(`ConnectionInModes`, +`test-data/aadl2rust/connection_in_modes.aadl:1-25`) we have 2 modes +on the parent and 1 mode-name per connection, so `K_filter = 2`. + +In practice every modal fixture in `test-data/` runs comfortably +inside K = 3. The bound is small because AADL `in modes (…)` lists +are short by convention — they enumerate semantic modes, not +arbitrary identifiers. + +## §5 What this buys us — a discharge plan + +With `K_filter` in hand the filter can be verified two ways: + +**Option A — capped proptest.** A `proptest!` harness in +`crates/spar-analysis/src/modal.rs` under `#[cfg(test)]` that +generates `ComponentInstance` / `SystemOperationMode` pairs with +`|in_modes| ≤ K_filter` and asserts: + +``` +filter(instance, som).contains(c) + ⟺ is_component_active_in_som(instance, c, som) +``` + +The cutoff theorem says the proptest is **complete** at K, not just +randomly sound. Today the file has only example-based tests +(`modal.rs:165-234`); promoting to property-based with the cutoff +gives qualitative coverage. + +**Option B — Kani harness.** Bound the proof at K and ask Kani to +exhaust the symbolic state space. The proof effort is small because +the filter is straight-line code with no loops over arenas (the only +loop, `for &(_sel_comp, mode_inst_idx) in &som.mode_selections`, +walks a `Vec` bounded by `P`). + +The recommended cargo command is + +``` +cargo kani -p spar-analysis --harness modal::proofs::filter_consistency +``` + +once a harness is added; or simply + +``` +cargo test -p spar-analysis modal::proptest_filter_at_cutoff +``` + +for the proptest path. The verification.yaml entry would be a +`type: feature` `VERIF-MODAL-CUTOFF` linking to `ARCH-ANALYSIS` and +the `STPA-REQ-017` already cited at the top of `modal.rs:1`. + +## §6 What does NOT work + +The theorem has clean edges. Honest restatements of each gap: + +**Gap 1: negative properties.** "No inactive connection appears in +the SOM-projection" is universally negative. Lucio's theorem +(arXiv 2604.18792 §3) only covers positive existence / traceability. +For negatives we would need either (a) a separate completeness +argument or (b) reformulation as a positive property over the +complement instance. spar's filter does emit only the active subset, +so the negative property holds *by construction* — but that is a +constructive argument, not a cutoff-theorem one. + +**Gap 2: cartesian-product SOM enumeration.** The set of SOMs is +`|system_operation_modes|` and is itself the cartesian product +`∏ |modes(c)|` over modal subcomponents `c`. For `P` modal +subcomponents each with `M` modes, `|SOMs| = M^P`. The filter's K is +small *per SOM*; the cost of running per-SOM analyses is **not** +bounded by K. The cutoff theorem says nothing about reducing the +number of SOMs — and shouldn't, because each SOM is a semantically +distinct point in the design space. The exponential blow-up is in +the input, not the verification. + +**Gap 3: fixed-point downstream passes.** `RtaAnalysis` +(`crates/spar-analysis/src/rta.rs`) computes a response-time +recurrence `R_i^{n+1} = C_i + B_i + Σ_j ⌈R_i^n / T_j⌉·C_j`. The +fixed-point iteration is unbounded a priori (terminates only when +converged or unschedulable). `WcttAnalysis` +(`crates/spar-analysis/src/wctt.rs`) composes residual service +curves over a chain of hops — also a fold whose depth equals the +hop count. Neither pass has a per-class cutoff in the sense Lucio +uses; the convergence guarantees come from real-time scheduling +theory (Joseph-Pandya, Liu-Layland) and Network Calculus min-plus +algebra (`proofs/Proofs/Network/MinPlusPwa.lean` skeleton), not from +DSLTrans-style structural induction. **The cutoff does not compose** +through these passes. + +**Gap 4: mode reachability.** `ModeReachabilityAnalysis` +(`crates/spar-analysis/src/mode_reachability.rs`) computes reachable +SOMs from an initial SOM via a labelled transition system. That is a +graph reachability problem — exactly the recursive structure the +DSLTrans fragment excludes. The Kripke-style export to NuSMV +(`mode_reachability::export_smv`, +`crates/spar-cli/src/main.rs:1208`) routes this analysis to an +external model checker by design. + +**Gap 5: case-insensitive name equality as a hidden quantifier.** A +strict reading of the DSLTrans fragment requires equality predicates +to be decidable in constant time per pair. `a.eq_ignore_ascii_case(b)` +is constant-time per character pair, but if we admit arbitrarily long +identifiers the per-comparison cost is not O(1). In the spar AADL +parser identifiers are bounded by the lexer +(`crates/spar-parser/src/lexer.rs`), so practical inputs are fine, +but a formal proof would need to discharge that bound. + +## §7 Open questions + +- Could the same analysis apply to spar's *connection-walk* passes + (e.g. `flow_check.rs`)? Those follow `connections[i].dst → next + component`, which is non-recursive only if the flow graph has + bounded depth — i.e. only on acyclic flow declarations. Worth a + follow-up note. +- The DSLTrans CEGAR-based fragment verification (arXiv 2604.18792 + abstract) supports automatic refinement when a property does not + fit the fragment. It is unclear whether the modal-filter Kani path + benefits from CEGAR; the filter is small enough that the SMT + encoding is cheap regardless. +- `is_component_active_in_som` returns `true` when the parent has no + mode selection in the SOM (`modal.rs:73-74`). This is a + *permissive* default and matches AS5506 semantics — but the + bidirectional cutoff property in §3 only covers the case + `c.parent.has_selection ⟹ active iff matches`. The default-active + branch is trivially correct but worth flagging in the + formalisation. + +## §8 What we did not change + +This document does not modify any code in `crates/spar-analysis`. +The cutoff property is asserted on the existing filter as it stands +post-merge of #140. No bugs were found during the read; the modal +helpers in `modal.rs` are tight. + +--- + +### References + +- Lucio, L. *Tractable Verification of Model Transformations: A + Cutoff-Theorem Approach for DSLTrans.* arXiv:2604.18792v2 (2026). + Abstract: Cutoff Theorem for positive existence / traceability; + Z3-based implementation; 552/899 properties proved, 345 + counterexamples, 2 undecided across ATL Zoo benchmarks. +- spar source references (all under + `/Users/r/git/pulseengine/spar/`): + - `crates/spar-analysis/src/modal.rs:42-105` — filter predicates. + - `crates/spar-analysis/src/lib.rs:290-311` — per-SOM driver. + - `crates/spar-hir-def/src/instance.rs:31-85` — instance + SOM + types. + - `test-data/parser/modes_test.aadl:1-37` — worked-example fixture. + - `test-data/aadl2rust/connection_in_modes.aadl:1-25` — second + fixture for connection filter. diff --git a/docs/research/forge-replication-protocol.md b/docs/research/forge-replication-protocol.md new file mode 100644 index 0000000..d83a40a --- /dev/null +++ b/docs/research/forge-replication-protocol.md @@ -0,0 +1,260 @@ +# Forge-style architecture-descriptor replication protocol on spar + +Status: **research protocol** — scaffolds a replication of Jin's +"Formal Architecture Descriptors" controlled experiment +(arXiv [2604.13108](https://arxiv.org/abs/2604.13108)) using spar +itself as the target codebase. The scaffold is here; the trials are +the user's job. +Last update: 2026-05-10. +Audience: spar maintainers, Forge-paper replicators, anyone trying to +quantify whether an AADL-style architecture descriptor reduces agent +navigation steps on a Rust codebase. + +> **TL;DR.** spar already produces architecture descriptors (rivet +> YAML, AADL, SysML v2 round-trip) about itself. That gives us a +> rare eat-your-own-dogfood replication target: agent runs `find +> where X happens` queries against the spar repo with and without the +> existing `artifacts/architecture.yaml` + `descriptor.sexp` in +> context. We mirror Jin's protocol (N=24 tasks × 2 conditions, +> Wilcoxon signed-rank, Cohen's d) and ship the scripted runner, +> tasks file, descriptor, and an empty RESULTS template. **No +> numbers are reported here — only the protocol.** + +--- + +## §1 Target choice and rationale + +We pick **spar itself** as the target codebase. + +| Candidate | LoC | Why considered | Why we rejected (if so) | +|---|---|---|---| +| `tokio-rs/mini-redis` | ~3k Rust | Tutorial-grade, very clean. | Too small; navigation already trivial, no headroom for a 33-44 % reduction to be observable. | +| `astral-sh/ruff` | ~280k Rust | Real-world Rust, well-organised. | Too large; descriptor generation cost dominates, and the "ground truth file:line" for tasks is harder to nail down without an existing architectural artifact. | +| **spar (self)** | ~120k Rust across 19 crates | Has a hand-curated `artifacts/architecture.yaml` (rivet) + an AADL fixture (`artifacts/spar.aadl` if we add one) describing its own crate-graph. Mid-size enough to make navigation cost real, small enough that ground truth is auditable. | Self-experimentation introduces a confound (the model may have been trained on this repo or similar). We accept the confound and pre-register the analysis. | + +A self-referential replication is methodologically weaker than +testing on an unrelated repo, but stronger than no replication. The +core claim under test — *"agents handed a structured architecture +descriptor navigate faster than agents reading code blind"* — is +codebase-agnostic, so a single-target run can produce a usable effect +size estimate before the user commissions a multi-repo replication. + +The user can swap targets by adjusting `run_trial.sh`'s +`TARGET_REPO` and supplying a different descriptor; the protocol is +target-independent. + +## §2 Descriptor production + +Three artefacts live under `docs/research/forge-replication/`: + +1. **`descriptor.sexp`** — the canonical descriptor format. S-expression + over the spar crate graph and the top-level types in each crate. + Per Jin §"format comparison" (arXiv 2604.13108), S-expressions + *"detect all structural completeness errors"* while JSON fails + atomically and YAML silently corrupts ~50 % of errors. We make + the S-expression form authoritative; JSON / YAML are derived if + the user wants a format-comparison arm. +2. **`descriptor.aadl`** — an AADL stub describing spar's crate graph + as nested `system` declarations. Not used by the trial runner + directly, but kept for cross-checking with the rivet + `artifacts/architecture.yaml` already in the repo. +3. **`tasks.json`** — N=24 code-localization tasks. Each task has a + natural-language question, a ground-truth `file:line` answer, and + a difficulty tag (`easy`, `medium`, `hard`). + +The descriptor is hand-curated, not auto-generated. Auto-generation +of a Rust-source→architecture descriptor is a spar **roadmap** item +(SysML v2 emitter is currently rivet→SysML, see +`crates/spar-sysml2/src/generate.rs:153`); for the replication +protocol we accept the hand-curated form. Jin's §"artifact vs +process" used auto-generated descriptors against a Forge tool, so +this is a known deviation; the user can re-run with an +auto-generated descriptor once spar gains the emitter, and the +effect-size delta is itself an interesting reading. + +## §3 Tasks + +`tasks.json` carries 24 entries, balanced by difficulty (8 easy / 8 +medium / 8 hard) and by spar subsystem (parser / analysis / network / +codegen / CLI / verify / docs). A representative subset: + +- *Easy.* "Find where AADL `in modes (…)` lists are parsed into the + HIR." → `crates/spar-hir-def/src/instance.rs:84` (the + `in_modes` field on `ComponentInstance`). +- *Medium.* "Find the function that computes the WCET inflation for + context switches." → `crates/spar-analysis/src/rta.rs` (the + `Context_Switch_Time` integration introduced in #198, v0.9.2). +- *Hard.* "Find the predicate that decides whether a connection is + active in a given SOM." → + `crates/spar-analysis/src/modal.rs:82` (`is_connection_active_in_som`). + +Each task is a one-liner — Jin's protocol is explicit that tasks +must be answerable from the descriptor alone if the descriptor +captures the right surface (arXiv 2604.13108 §"controlled +experiment"). Hard tasks are tasks where the answer is in a deep +sub-module that descriptor-blind navigation would have to grep its +way to. + +## §4 Conditions + +Minimum two arms: + +- **C0 — no-descriptor (control).** Agent is given the task prompt + and a clean working copy of the spar repo. No additional context. +- **C1 — with-descriptor (treatment).** Agent is given the task + prompt, a clean working copy, and the `descriptor.sexp` file as + pre-context. + +Optional format arms (mirror Jin §"format comparison"): + +- **C2 — JSON descriptor.** Same content as C1, JSON-serialised. +- **C3 — YAML descriptor.** Same content as C1, YAML-serialised. +- **C4 — Markdown descriptor.** Same content as C1, prose form. + +We default to C0 vs C1 only; the user enables C2-C4 by adding their +descriptor variants and re-running `run_trial.sh`. Jin found +*"no significant format difference between S-expression / JSON / +YAML / Markdown"* on the navigation-step metric, but a significant +difference on error detection — so the format arms are optional and +the user's choice. + +## §5 Metric + +**Navigation steps.** Count of tool calls the agent emits before +producing a final answer with the correct `file:line`. Tools that +count: + +- `Read` (full or partial file read); +- `Bash` calls that include `grep`, `rg`, `find`, `ls` (heuristic: + any Bash invocation whose first token after `cd …;` is one of + those); +- `Glob` calls. + +Tools that do **not** count: `Write`, `Edit`, the final reply itself, +and meta-calls (`gh`, `cargo`). This matches Jin §"controlled +experiment", which counts only navigation-class tool calls. + +The runner outputs `n_tool_calls` per trial. Wall time is recorded +as a secondary metric. + +## §6 Statistical plan + +Per Jin §"controlled experiment" we adopt Wilcoxon signed-rank as +the primary test, with N=24 paired observations (each task is run +once under C0 and once under C1; the pair is the unit of analysis). + +For Wilcoxon signed-rank with N=24 paired observations, p<0.05 +two-sided requires the signed-rank sum to lie outside the critical +region (T ≤ 81 for N=24, two-sided α=0.05; standard tables). In +practical terms this corresponds to roughly 16-20 of the 24 task +pairs showing a reduction in the same direction. The paper reports +**d=0.92** — a large effect; if our replication produces **d=0.5** (a +medium effect) we still have a publishable signal with N=24. + +A power calculation against d=0.5, α=0.05 two-sided, suggests N≥27 +for 80 % power on a paired t-test (Cohen 1988 tables). N=24 is +slightly under-powered for a medium effect; we either accept that or +pre-register a second batch. Either is fine — pre-registration is +the discipline. + +Pre-register the analysis plan **before** the user runs trials. +Record the pre-registration in `RESULTS.md` (template included). + +## §7 Trial-runner scaffold + +`run_trial.sh` is a small bash driver that: + +1. Reads `tasks.json` and selects task `--task-id N` (1..24). +2. Reads `--condition C0|C1|C2|...` and assembles the prompt: + - C0: just the task question. + - C1+: the task question prefixed by the matching descriptor + file's content. +3. Launches a Claude Code session in headless mode (`claude + --print` style) against the worktree, with the trial prompt. +4. Parses the session transcript for tool calls (counting only the + navigation tools listed in §5). +5. Compares the agent's emitted `file:line` against the task's + `ground_truth` field; sets `found_correct=true|false`. +6. Appends a row to `results.csv` with the schema + `task_id,condition,n_tool_calls,found_correct,wall_time_seconds,trial_id,timestamp`. + +The script is included as `run_trial.sh` (executable). It does not +require root, network, or any cargo work; it expects `claude`, +`jq`, and a clean `git status` on the worktree before each trial. + +There is no automated harness for *correctness* of the agent's +answer — the user spot-checks `found_correct=false` rows manually, +because the ground-truth `file:line` can drift if the repo evolves +between trial batches. To minimise drift, pin all trials to a single +commit SHA. + +## §8 Reporting template + +`RESULTS.md` lives under `docs/research/forge-replication/` as an +empty template. Sections: + +- **Pre-registration.** Date, hypothesis, primary statistical test. +- **Conditions.** Which arms were run, descriptor commit SHA, agent + model + temperature, pinned spar SHA. +- **N.** Tasks per arm, paired or unpaired, exclusions. +- **Wilcoxon p.** Signed-rank statistic, p-value, effect direction. +- **Cohen's d.** Computed on paired differences. +- **Discussion.** What replicated, what didn't, what's a methods + limitation. + +The template explicitly does **not** include placeholder numbers. +Empty fields stay empty until the user runs trials. A +"pending replication" pointer in the v0.9.x milestone blog post +(`docs/blog/v0.9.x-milestone.md`) links here. + +## §9 Honest deviations from Jin's protocol + +- **Self-target.** Jin used Forge against unfamiliar repos; we use + spar against spar. The confound is acknowledged in §1. +- **Hand-curated descriptor.** Jin's "artifact vs process" arm + required auto-generated descriptors; we have only a hand-curated + one until spar's Rust-source→descriptor emitter ships. +- **N=24 with 2 arms, not 4.** Jin's controlled experiment was 24×4. + We default to 24×2 (paired); the 4-arm format comparison is + optional and unbalanced for default runs. +- **No field study.** Jin reports 7,012 Claude Code sessions with + 52 % variance reduction. We have no comparable instrumentation; + the field-study arm is not replicated. + +These deviations weaken the replication's external validity but do +not invalidate the controlled-experiment arm. The user can close +each gap as the toolchain matures. + +## §10 Artefacts shipped in this PR + +Under `docs/research/forge-replication/`: + +- `descriptor.sexp` — S-expression descriptor of spar's crate graph + (the v0.9.3-tip snapshot). +- `descriptor.aadl` — AADL stub of the same surface, cross-check + artefact. +- `tasks.json` — 24 code-localization tasks with ground-truth + `file:line` answers. +- `run_trial.sh` — executable bash trial runner. +- `RESULTS.md` — empty results template, with pre-registration + prompts. + +Together these are the minimum scaffold the user needs to run a +Forge-style replication on spar. No trial has been run; no number is +reported. + +--- + +### References + +- Jin et al. *Formal Architecture Descriptors for AI Agent Code + Localization.* arXiv:2604.13108 (2026). Controlled experiment + (24 × 4, Wilcoxon p=0.009, d=0.92); artifact-vs-process (15 × 3, + 100% vs 80%, p=0.002, d=1.04); field study (7,012 sessions, + 52% variance reduction); S-expression best for error detection. +- Cohen, J. *Statistical Power Analysis for the Behavioral Sciences* + (1988) — N≥27 tables used in §6. +- spar internal: + `crates/spar-sysml2/src/generate.rs:153` — current SysML v2 + emitter target (rivet → SysML, not Rust → architecture). + `artifacts/architecture.yaml:1-60` — existing rivet self-description. diff --git a/docs/research/forge-replication/RESULTS.md b/docs/research/forge-replication/RESULTS.md new file mode 100644 index 0000000..f7a0204 --- /dev/null +++ b/docs/research/forge-replication/RESULTS.md @@ -0,0 +1,93 @@ +# Forge-replication results — spar + +Status: **template, pending trials.** +Last update: 2026-05-10. + +This file is the reporting template for a Forge-style controlled +experiment on spar. The protocol is documented in +[../forge-replication-protocol.md](../forge-replication-protocol.md). +The scaffolding (descriptor, tasks, runner) is shipped under +`docs/research/forge-replication/`. **No trials have been run.** Fill +in each section after the user runs `run_trial.sh` across all tasks +and conditions. + +--- + +## Pre-registration + +- Date pre-registered: _pending_ +- Hypothesis (H1): Conditioning a Claude Code agent with the + `descriptor.sexp` (Condition C1) reduces the number of + navigation-class tool calls relative to the no-descriptor control + (Condition C0) when locating a known `file:line` in the spar repo. +- Null (H0): No difference in tool-call count between C0 and C1. +- Primary test: Wilcoxon signed-rank, two-sided, α = 0.05. +- Secondary metrics: `found_correct` (binary), wall time (seconds). +- Direction: one-tailed expectation that C1 has fewer steps; we still + report two-sided p as primary. + +## Conditions + +- Agent: _pending_ (record exact model + temperature, e.g. Claude + Sonnet 4.6 / 4.7, temp=0). +- spar commit pin (`tasks.json::target_commit`): `bad85e6` + (v0.9.3-tip). +- Descriptor format(s) tested: + - C0 — no descriptor. + - C1 — `descriptor.sexp`. + - Optional: C2/C3/C4 — JSON / YAML / Markdown variants. +- Number of trials per (task, condition) cell: _pending_ (default 1). + +## N + +- Tasks: 24 (per `tasks.json`). +- Conditions: _pending_ (default C0 + C1). +- Total observations: _pending_. +- Exclusions: _pending_ (record any task where ground truth proved + ambiguous or the agent crashed). + +## Wilcoxon p + +- Test statistic T: _pending_. +- Sample size N: _pending_. +- p-value (two-sided): _pending_. +- Effect direction: _pending_. + +## Cohen's d + +- Paired-difference mean: _pending_. +- Paired-difference SD: _pending_. +- d = _pending_. +- Interpretation: _pending_ (Cohen 1988: small=0.2, medium=0.5, + large=0.8; Jin paper reports d=0.92 on the analogous arm). + +## Accuracy + +- C0 mean `found_correct`: _pending_. +- C1 mean `found_correct`: _pending_. +- Difference (% points): _pending_. + +## Discussion + +_pending — fill in after results are in_ + +Suggested topics: + +- Did the effect direction match Jin's published 33–44 % reduction? +- Was the effect concentrated in `hard` tasks (the descriptor + presumably matters most where deep navigation would otherwise be + required)? +- Were there tasks where C1 was *worse* than C0? If so, why — was + the descriptor's claim for that surface inaccurate or stale? +- Threats to validity (self-target confound; possible training-set + contamination of the model on spar; hand-curated descriptor). +- Recommended next replication batch: different repo, different + agent, auto-generated descriptor. + +## Raw data + +The trial runner writes `results.csv` in this directory with one row +per trial. Transcripts (one per trial) live under `transcripts/`. +Neither is committed by default — re-running trials regenerates them. +The user should commit a frozen `results.csv` + `transcripts/` +tarball when reporting numbers. diff --git a/docs/research/forge-replication/descriptor.aadl b/docs/research/forge-replication/descriptor.aadl new file mode 100644 index 0000000..5d9935a --- /dev/null +++ b/docs/research/forge-replication/descriptor.aadl @@ -0,0 +1,155 @@ +-- Forge-replication descriptor (AADL form) for the spar workspace. +-- +-- This is a cross-check artefact for docs/research/forge-replication/ +-- descriptor.sexp. It models the spar crate graph as nested AADL +-- `system` declarations. It is NOT a runnable spec for spar itself +-- (the workspace is a Rust toolchain, not an embedded system); it is +-- a structural descriptor for Forge-protocol trials. +-- +-- See docs/research/forge-replication-protocol.md §2 for the role of +-- this file. + +package SparWorkspace +public + + -- The entire workspace. + system Spar + features + cli_in : in event port; + wasm_in : in event port; + mcp_in : in event port; + end Spar; + + system implementation Spar.v0_9_3 + subcomponents + -- Parsing & syntax layer + parser : system Parser.impl; + syntax : system Syntax.impl; + annex : system Annex.impl; + + -- HIR / instance model + base_db : system BaseDb.impl; + hir_def : system HirDef.impl; + hir : system Hir.impl; + + -- Analyses + analysis : system Analysis.impl; + + -- Network calculus / solver + network : system Network.impl; + solver : system Solver.impl; + + -- Transform / codegen / variants + transform : system Transform.impl; + codegen : system Codegen.impl; + variants : system Variants.impl; + + -- Verification + verify : system Verify.impl; + verify_macros : system VerifyMacros.impl; + + -- External integration + sysml2 : system Sysml2.impl; + render : system Render.impl; + trace_topology : system TraceTopology.impl; + + -- Top surface + insight : system Insight.impl; + mcp : system Mcp.impl; + cli : system Cli.impl; + wasm : system Wasm.impl; + + connections + c01 : port parser.out -> syntax.in; + c02 : port syntax.out -> annex.in; + c03 : port syntax.out -> hir_def.in; + c04 : port hir_def.out -> hir.in; + c05 : port hir.out -> analysis.in; + c06 : port analysis.out -> verify.in; + c07 : port analysis.out -> codegen.in; + c08 : port analysis.out -> render.in; + c09 : port analysis.out -> insight.in; + c10 : port analysis.out -> mcp.in; + c11 : port hir.out -> sysml2.in; + c12 : port network.out -> analysis.network_in; + c13 : port solver.out -> network.solver_in; + c14 : port hir.out -> transform.in; + c15 : port transform.out -> variants.in; + c16 : port verify_macros.out -> verify.in; + c17 : port hir.out -> trace_topology.in; + c18 : port mcp.out -> cli.mcp_in; + c19 : port cli.in -> Spar.cli_in; + c20 : port wasm.in -> Spar.wasm_in; + c21 : port mcp.in -> Spar.mcp_in; + end Spar.v0_9_3; + + -- Stub system types (no implementations beyond marker subcomponents) + system Parser features out : out event port; end Parser; + system implementation Parser.impl end Parser.impl; + + system Syntax features in : in event port; out : out event port; end Syntax; + system implementation Syntax.impl end Syntax.impl; + + system Annex features in : in event port; end Annex; + system implementation Annex.impl end Annex.impl; + + system BaseDb end BaseDb; + system implementation BaseDb.impl end BaseDb.impl; + + system HirDef features in : in event port; out : out event port; end HirDef; + system implementation HirDef.impl end HirDef.impl; + + system Hir features in : in event port; out : out event port; end Hir; + system implementation Hir.impl end Hir.impl; + + system Analysis + features + in : in event port; + out : out event port; + network_in : in event port; + end Analysis; + system implementation Analysis.impl end Analysis.impl; + + system Network features out : out event port; solver_in : in event port; end Network; + system implementation Network.impl end Network.impl; + + system Solver features out : out event port; end Solver; + system implementation Solver.impl end Solver.impl; + + system Transform features in : in event port; out : out event port; end Transform; + system implementation Transform.impl end Transform.impl; + + system Codegen features in : in event port; end Codegen; + system implementation Codegen.impl end Codegen.impl; + + system Variants features in : in event port; end Variants; + system implementation Variants.impl end Variants.impl; + + system Verify features in : in event port; end Verify; + system implementation Verify.impl end Verify.impl; + + system VerifyMacros features out : out event port; end VerifyMacros; + system implementation VerifyMacros.impl end VerifyMacros.impl; + + system Sysml2 features in : in event port; end Sysml2; + system implementation Sysml2.impl end Sysml2.impl; + + system Render features in : in event port; end Render; + system implementation Render.impl end Render.impl; + + system TraceTopology features in : in event port; end TraceTopology; + system implementation TraceTopology.impl end TraceTopology.impl; + + system Insight features in : in event port; end Insight; + system implementation Insight.impl end Insight.impl; + + system Mcp features in : in event port; out : out event port; end Mcp; + system implementation Mcp.impl end Mcp.impl; + + system Cli features in : in event port; mcp_in : in event port; end Cli; + system implementation Cli.impl end Cli.impl; + + system Wasm features in : in event port; end Wasm; + system implementation Wasm.impl end Wasm.impl; + +end SparWorkspace; diff --git a/docs/research/forge-replication/descriptor.sexp b/docs/research/forge-replication/descriptor.sexp new file mode 100644 index 0000000..d602420 --- /dev/null +++ b/docs/research/forge-replication/descriptor.sexp @@ -0,0 +1,175 @@ +;; spar architecture descriptor (S-expression form) +;; +;; Forge-paper-style architecture descriptor of the spar Rust workspace. +;; Generated by hand at v0.9.3-tip (commit bad85e6). Update when the +;; crate graph changes. +;; +;; Top-level structure: +;; (workspace NAME (crate ...) ...) +;; (crate NAME (role ...) (deps ...) (modules ...) (key-types ...)) +;; +;; Mirrors artifacts/architecture.yaml semantically; uses S-expression +;; form per Jin et al. arXiv:2604.13108 (§"format comparison" — +;; S-expressions detect all structural completeness errors). + +(workspace spar + (version "0.9.3") + (purpose "AADL v2.2 / SAE AS5506D toolchain with NC/RTA/STPA analyses") + (test-count 2900) + + ;; ── Parsing & syntax layer ──────────────────────────────────────── + (crate spar-parser + (role "AADL v2.2 hand-written recursive descent parser, lossless CST") + (deps rowan) + (key-types Parser Lexer SyntaxKind)) + + (crate spar-syntax + (role "Typed CST wrapper over rowan green/red trees") + (deps spar-parser rowan) + (key-types SourceFile AstNode)) + + (crate spar-annex + (role "Embedded annex sub-language parsers (EMV2, BLESS, ARINC653)") + (deps spar-syntax)) + + ;; ── HIR / instance model ────────────────────────────────────────── + (crate spar-base-db + (role "Salsa incremental computation foundation") + (deps salsa) + (key-types FileId SourceDatabase)) + + (crate spar-hir-def + (role "HIR definition layer — item trees, properties, resolver") + (deps spar-base-db spar-syntax la-arena) + (modules item_tree property_eval resolver instance overlay) + (key-types ItemTree PropertyMap SystemInstance ComponentInstance + ConnectionInstance SystemOperationMode ModeInstance)) + + (crate spar-hir + (role "HIR query layer — packages, instance nodes, salsa queries") + (deps spar-hir-def spar-base-db) + (key-types Package InstanceNode)) + + ;; ── Analyses ────────────────────────────────────────────────────── + (crate spar-analysis + (role "Pluggable analysis framework — 30+ passes on SystemInstance") + (deps spar-hir spar-hir-def spar-network spar-solver) + (modules + modal modal_rules mode_check mode_rules mode_reachability + connectivity hierarchy completeness binding_check binding_rules + flow_check flow_rules property_rules connection_rules + subcomponent_rules direction_rules classifier_match + category_check feature_group_check naming_rules legality + scheduling scheduling_verified rta wctt latency + memory_budget resource_budget weight_power bus_bandwidth + arinc653 emv2_analysis emv2_stpa_bridge extends_rules + wrpc_binding ai_ml) + (key-types + Analysis ModalAnalysis AnalysisRunner AnalysisDiagnostic + Severity) + (traits + ;; Modal filter: pure predicates over instance + SOM + (Modal pure-fn is_component_active_in_som modal.rs:42) + (Modal pure-fn is_connection_active_in_som modal.rs:82) + (Modal pure-fn is_active_in_mode modal.rs:107)) + (per-som-driver lib.rs:290 run_all_per_som) + (modal-impls + ;; The 8 analyses that fork per SOM + ConnectivityAnalysis BindingCheckAnalysis MemoryBudgetAnalysis + ResourceBudgetAnalysis WeightPowerAnalysis Arinc653Analysis + BusBandwidthAnalysis SchedulingAnalysis)) + + ;; ── Network calculus kernel ─────────────────────────────────────── + (crate spar-network + (role "Min-plus algebra kernel: arrival/service curves, TSN shaping") + (deps good_lp) + (modules pmoo tsn arrival service) + (key-types + ArrivalCurve ServiceCurve PiecewiseAffineArrivalCurve + GateSchedule CreditPool CbsReservation + WcttBound LudbBound)) + + ;; ── Solvers ─────────────────────────────────────────────────────── + (crate spar-solver + (role "LP / MILP wrappers over good_lp + HiGHS") + (deps good_lp)) + + ;; ── Transform / codegen / variants ──────────────────────────────── + (crate spar-transform + (role "Hypothetical-rebinding (Move) oracle, design-space enumeration")) + + (crate spar-codegen + (role "AADL → Rust / Wasm / WIT / doc / proof generators") + (modules rust_gen wit_gen test_gen workspace_gen proof_gen + config_gen doc_gen)) + + (crate spar-variants + (role "Variant-bridge between AADL annexes and rivet artifacts")) + + ;; ── Verification ────────────────────────────────────────────────── + (crate spar-verify + (role "AADL/property → Lean / Kani harness lowering")) + + (crate spar-verify-macros + (role "Macro layer for spar-verify")) + + ;; ── External integration ────────────────────────────────────────── + (crate spar-sysml2 + (role "SysML v2 ↔ AADL round-trip (rivet YAML → SysML v2 source)") + (modules grammar lower extract generate parser) + (key-fns generate_sysml2 lower_to_aadl extract_all_yaml)) + + (crate spar-render + (role "SVG / HTML renderer for instance models")) + + (crate spar-trace-topology + (role "Runtime/declared reconciliation (v0.10.x)") + (parsers + pcapng-frame-source ;; #210, v0.10.x B-2 + lldp-topology-source ;; #208, v0.10.x B-3 + qcc-yang-config-source ;; v0.10.x B-4 + gptp-ptp-time-source) ;; #211, v0.10.x B-5 + (checks + IdentityUnknown TopologyMissingWiring ConfigDrift + GptpOutOfBudget BinaryMismatch)) + + ;; ── Insight / MCP / CLI / WASM ──────────────────────────────────── + (crate spar-insight + (role "Runtime-trace discrepancy assistant (Tier 1 CTF, v0.9.0)")) + + (crate spar-mcp + (role "MCP server (JSON-RPC 2.0 over stdio) — 3 read-only tools") + (tools spar.verify_move spar.enumerate_moves spar.check_chain)) + + (crate spar-cli + (role "CLI binary — parse/items/instance/analyze/allocate/diff/modes/render/verify/codegen/sysml2/mcp") + (subcommands + parse items instance analyze allocate diff modes render + verify codegen sysml2 extract generate lsp mcp)) + + (crate spar-wasm + (role "Wasm bindings for browser / wRPC dispatch"))) + +;; ── Cross-cutting concerns ─────────────────────────────────────────── + +(properties + (predeclared-count 131) + (key-property-sets + Spar_TSN Spar_Identity Spar_Network Spar_Trace Spar_STPA)) + +(stpa + (requirements REQ-PARSE-001 STPA-REQ-016 STPA-REQ-017 STPA-REQ-022)) + +(proofs + (lean-skeletons + proofs/Proofs/Network/MinPlus.lean + proofs/Proofs/Network/MinPlusPwa.lean + proofs/Proofs/Scheduling/JosephPandya.lean) + (sorry-count 4)) + +(milestones + (v0.9.0 "MCP + spar-insight Tier 1 CTF (PR #178, #179)") + (v0.9.1 "gPTP Sync_Error + atomic-frame quantization (PR #186)") + (v0.9.2 "RTA→WCTT release-jitter, Context_Switch_Time, sensitivity (PR #195-#199)") + (v0.9.3 "Piecewise-affine arrival curves + PMOO/LUDB (PR #203-#204)") + (v0.10.0 "Trace-topology foundation + 4 parsers (PR #208-#211)")) diff --git a/docs/research/forge-replication/run_trial.sh b/docs/research/forge-replication/run_trial.sh new file mode 100644 index 0000000..528c042 --- /dev/null +++ b/docs/research/forge-replication/run_trial.sh @@ -0,0 +1,174 @@ +#!/usr/bin/env bash +# Forge-replication trial runner for spar. +# +# Usage: +# ./run_trial.sh --task-id <1..24> --condition [--trial-id N] +# +# Reads tasks.json + descriptor (.sexp/.json/.yaml/.md) and launches a +# Claude Code headless session against the spar worktree pinned at the +# task set's target_commit. Counts navigation-class tool calls in the +# session transcript and appends a row to results.csv. +# +# This script does NOT run trials by default. It is the scaffold the +# user invokes to produce the dataset. +# +# Required tools on PATH: claude (Claude Code CLI), jq, git. +# Optional: rg. + +set -euo pipefail + +HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +SPAR_ROOT="${SPAR_ROOT:-$(cd "$HERE/../../.." && pwd)}" +TASKS_FILE="$HERE/tasks.json" +RESULTS_CSV="$HERE/results.csv" +TRANSCRIPTS_DIR="$HERE/transcripts" + +# ── Argument parsing ───────────────────────────────────────────────── +TASK_ID="" +CONDITION="" +TRIAL_ID="1" + +while [[ $# -gt 0 ]]; do + case "$1" in + --task-id) TASK_ID="$2"; shift 2;; + --condition) CONDITION="$2"; shift 2;; + --trial-id) TRIAL_ID="$2"; shift 2;; + -h|--help) + cat <<'EOF' +Usage: run_trial.sh --task-id <1..24> --condition [--trial-id N] + +Conditions: + C0 no descriptor (control) + C1 with descriptor.sexp (treatment, default S-expr form) + C2 with descriptor.json + C3 with descriptor.yaml + C4 with descriptor.md + +Output: appends one row to results.csv: + task_id,condition,n_tool_calls,found_correct,wall_time_seconds,trial_id,timestamp + +Transcripts saved under transcripts/--.jsonl +EOF + exit 0;; + *) echo "unknown arg: $1" >&2; exit 2;; + esac +done + +if [[ -z "$TASK_ID" || -z "$CONDITION" ]]; then + echo "missing --task-id or --condition" >&2 + exit 2 +fi + +# ── Sanity checks ──────────────────────────────────────────────────── +command -v claude >/dev/null 2>&1 || { echo "claude CLI not on PATH" >&2; exit 3; } +command -v jq >/dev/null 2>&1 || { echo "jq not on PATH" >&2; exit 3; } + +# Pin to commit to avoid drift between trials. +PIN="$(jq -r .target_commit "$TASKS_FILE")" +HEAD_SHA="$(cd "$SPAR_ROOT" && git rev-parse --short HEAD)" +if [[ "$PIN" != "$HEAD_SHA" ]]; then + echo "WARN: tasks.json pinned to $PIN but worktree is at $HEAD_SHA" >&2 + echo " either check out $PIN or update tasks.json target_commit" >&2 +fi + +# ── Build prompt ───────────────────────────────────────────────────── +TASK_JSON="$(jq -c ".tasks[] | select(.id == ${TASK_ID})" "$TASKS_FILE")" +if [[ -z "$TASK_JSON" ]]; then + echo "no task with id=$TASK_ID" >&2 + exit 2 +fi +TASK_Q="$(echo "$TASK_JSON" | jq -r .question)" +TASK_GT="$(echo "$TASK_JSON" | jq -r .ground_truth)" + +case "$CONDITION" in + C0) DESCRIPTOR_FILE="";; + C1) DESCRIPTOR_FILE="$HERE/descriptor.sexp";; + C2) DESCRIPTOR_FILE="$HERE/descriptor.json";; + C3) DESCRIPTOR_FILE="$HERE/descriptor.yaml";; + C4) DESCRIPTOR_FILE="$HERE/descriptor.md";; + *) echo "unknown condition: $CONDITION" >&2; exit 2;; +esac + +PROMPT_FILE="$(mktemp -t spar-forge-prompt.XXXXXX)" +trap 'rm -f "$PROMPT_FILE"' EXIT + +if [[ -n "$DESCRIPTOR_FILE" ]]; then + if [[ ! -f "$DESCRIPTOR_FILE" ]]; then + echo "descriptor not found: $DESCRIPTOR_FILE" >&2 + exit 4 + fi + cat <"$PROMPT_FILE" +Below is an architecture descriptor of the spar workspace. Use it to +locate the answer. + +--- BEGIN DESCRIPTOR --- +$(cat "$DESCRIPTOR_FILE") +--- END DESCRIPTOR --- + +Task: $TASK_Q + +Reply with exactly one line in the form "file:line" (no extra prose). +EOF +else + cat <"$PROMPT_FILE" +Task: $TASK_Q + +Reply with exactly one line in the form "file:line" (no extra prose). +EOF +fi + +# ── Run session ────────────────────────────────────────────────────── +mkdir -p "$TRANSCRIPTS_DIR" +TRANSCRIPT="$TRANSCRIPTS_DIR/${TASK_ID}-${CONDITION}-${TRIAL_ID}.jsonl" +START_TS="$(date +%s)" + +# claude --print runs one-shot headless. --output-format stream-json +# emits one JSON event per line including each tool call. +( cd "$SPAR_ROOT" && \ + claude --print --output-format stream-json --verbose \ + --dangerously-skip-permissions \ + < "$PROMPT_FILE" > "$TRANSCRIPT" ) || true + +END_TS="$(date +%s)" +WALL=$((END_TS - START_TS)) + +# ── Score the transcript ───────────────────────────────────────────── +# Navigation-class tools: Read, Glob, Bash (with grep/rg/find/ls token), +# and Grep (if the harness exposes it directly). +N_CALLS=$(jq -s ' + [.[] + | select(.type == "tool_use") + | .name as $n + | (.input.command // "") as $cmd + | select( + $n == "Read" + or $n == "Glob" + or $n == "Grep" + or ($n == "Bash" + and ($cmd | test("^\\s*(cd [^;]+;\\s*)?(grep|rg|find|ls)\\b"))) + )] + | length +' < "$TRANSCRIPT" 2>/dev/null || echo "NaN") + +# Extract the final assistant message and check substring match against +# ground truth. Loose match (substring) — manual spot-check needed for +# borderline cases. +LAST_TEXT=$(jq -s ' + [.[] | select(.type == "assistant") | .message.content[]? | select(.type=="text") | .text] + | last // "" +' < "$TRANSCRIPT" 2>/dev/null || echo '""') + +if echo "$LAST_TEXT" | grep -qF "$TASK_GT"; then + CORRECT="true" +else + CORRECT="false" +fi + +# ── Append to CSV ──────────────────────────────────────────────────── +if [[ ! -f "$RESULTS_CSV" ]]; then + echo "task_id,condition,n_tool_calls,found_correct,wall_time_seconds,trial_id,timestamp" > "$RESULTS_CSV" +fi +echo "${TASK_ID},${CONDITION},${N_CALLS},${CORRECT},${WALL},${TRIAL_ID},$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ + >> "$RESULTS_CSV" + +echo "trial done: task=$TASK_ID cond=$CONDITION calls=$N_CALLS correct=$CORRECT wall=${WALL}s" diff --git a/docs/research/forge-replication/tasks.json b/docs/research/forge-replication/tasks.json new file mode 100644 index 0000000..469ccab --- /dev/null +++ b/docs/research/forge-replication/tasks.json @@ -0,0 +1,176 @@ +{ + "$schema_note": "Forge-replication task set for spar. 24 paired tasks. Each task has a one-line natural-language question and a ground-truth file:line answer. Ground truths pinned to spar v0.9.3-tip (commit bad85e6). Re-validate when the commit pin changes. See docs/research/forge-replication-protocol.md.", + "version": "1.0", + "target_repo": "pulseengine/spar", + "target_commit": "bad85e6", + "tasks": [ + { + "id": 1, + "difficulty": "easy", + "subsystem": "hir-def", + "question": "Find where AADL `in modes (...)` lists are stored on a component instance in the HIR.", + "ground_truth": "crates/spar-hir-def/src/instance.rs:84" + }, + { + "id": 2, + "difficulty": "easy", + "subsystem": "hir-def", + "question": "Find the definition of the SystemOperationMode struct.", + "ground_truth": "crates/spar-hir-def/src/instance.rs:37" + }, + { + "id": 3, + "difficulty": "medium", + "subsystem": "analysis-modal", + "question": "Find the predicate that decides whether a component is active in a given SOM.", + "ground_truth": "crates/spar-analysis/src/modal.rs:42" + }, + { + "id": 4, + "difficulty": "medium", + "subsystem": "analysis-modal", + "question": "Find the predicate that decides whether a connection is active in a given SOM.", + "ground_truth": "crates/spar-analysis/src/modal.rs:82" + }, + { + "id": 5, + "difficulty": "medium", + "subsystem": "analysis-driver", + "question": "Find the runner method that invokes mode-dependent analyses once per SOM.", + "ground_truth": "crates/spar-analysis/src/lib.rs:290" + }, + { + "id": 6, + "difficulty": "hard", + "subsystem": "analysis-rta", + "question": "Find where Context_Switch_Time inflates each thread's WCET by 2x in the RTA recurrence.", + "ground_truth": "crates/spar-analysis/src/rta.rs" + }, + { + "id": 7, + "difficulty": "hard", + "subsystem": "analysis-wctt", + "question": "Find where the WCTT pass adds atomic-frame quantization (ceil(max_frame_bytes * 8e12 / link_rate_bps) ps per hop).", + "ground_truth": "crates/spar-analysis/src/wctt.rs" + }, + { + "id": 8, + "difficulty": "hard", + "subsystem": "network", + "question": "Find the PMOO/LUDB closed-form bound function for tree-shaped multiplexing.", + "ground_truth": "crates/spar-network/src/pmoo.rs" + }, + { + "id": 9, + "difficulty": "medium", + "subsystem": "network", + "question": "Find the PiecewiseAffineArrivalCurve type definition.", + "ground_truth": "crates/spar-network/src/lib.rs" + }, + { + "id": 10, + "difficulty": "easy", + "subsystem": "cli", + "question": "Find the top-level command dispatch table in the spar CLI.", + "ground_truth": "crates/spar-cli/src/main.rs" + }, + { + "id": 11, + "difficulty": "medium", + "subsystem": "cli", + "question": "Find the cmd_analyze function in the spar CLI.", + "ground_truth": "crates/spar-cli/src/main.rs:514" + }, + { + "id": 12, + "difficulty": "medium", + "subsystem": "sysml2", + "question": "Find the function that converts rivet YAML artifacts to SysML v2 source.", + "ground_truth": "crates/spar-sysml2/src/generate.rs:153" + }, + { + "id": 13, + "difficulty": "hard", + "subsystem": "mcp", + "question": "Find the MCP tool that performs a single hypothetical-rebinding (verify_move).", + "ground_truth": "crates/spar-mcp" + }, + { + "id": 14, + "difficulty": "easy", + "subsystem": "parser", + "question": "Find the AADL lexer.", + "ground_truth": "crates/spar-parser/src/lexer.rs" + }, + { + "id": 15, + "difficulty": "medium", + "subsystem": "analysis-bus", + "question": "Find where the bus_bandwidth analysis sums per-stream throughput.", + "ground_truth": "crates/spar-analysis/src/bus_bandwidth.rs" + }, + { + "id": 16, + "difficulty": "hard", + "subsystem": "analysis-arinc", + "question": "Find where ARINC-PARTITION-ISOLATION is enforced at Error severity per DO-297.", + "ground_truth": "crates/spar-analysis/src/arinc653.rs" + }, + { + "id": 17, + "difficulty": "easy", + "subsystem": "docs", + "question": "Find the v0.10.0 trace-topology design document.", + "ground_truth": "docs/designs/v0.10.0-trace-topology.md" + }, + { + "id": 18, + "difficulty": "easy", + "subsystem": "docs", + "question": "Find the rivet self-description architecture artifact.", + "ground_truth": "artifacts/architecture.yaml" + }, + { + "id": 19, + "difficulty": "medium", + "subsystem": "trace-topology", + "question": "Find the PCAPNG FrameSource implementation.", + "ground_truth": "crates/spar-trace-topology" + }, + { + "id": 20, + "difficulty": "medium", + "subsystem": "trace-topology", + "question": "Find the LLDP TopologySource implementation.", + "ground_truth": "crates/spar-trace-topology" + }, + { + "id": 21, + "difficulty": "hard", + "subsystem": "analysis-mode-reach", + "question": "Find the function that exports mode reachability to NuSMV.", + "ground_truth": "crates/spar-analysis/src/mode_reachability.rs" + }, + { + "id": 22, + "difficulty": "medium", + "subsystem": "render", + "question": "Find the entry point that renders an instance model to SVG.", + "ground_truth": "crates/spar-render/src/lib.rs:28" + }, + { + "id": 23, + "difficulty": "hard", + "subsystem": "proofs", + "question": "Find the min-plus Lean skeleton for piecewise-affine arrival curves.", + "ground_truth": "proofs/Proofs/Network/MinPlusPwa.lean" + }, + { + "id": 24, + "difficulty": "medium", + "subsystem": "analysis-mode-check", + "question": "Find where the analysis emits an Error when a component has modes but no initial mode declared.", + "ground_truth": "crates/spar-analysis/src/mode_check.rs:58" + } + ] +}