Merged
20 changes: 10 additions & 10 deletions docs/API_SURFACE.md
@@ -2,7 +2,7 @@

## Purpose

This document describes the stable Comptextv7 API, dashboard, export, and report
This document describes the stable CompText V7 API, dashboard, export, and report
surfaces that future benchmark/regression summaries can reference. It also
records integration expectations for sanitized outputs from
`ProfRandom92/Comptext-Daimler-Experiment-` without introducing runtime coupling.
@@ -32,7 +32,7 @@ available. This guardrail is intentionally lightweight: it validates helper
scripts, available local checks, and contract examples while
benchmark/regression evidence remains a sanitized report handoff from
`ProfRandom92/Comptext-Daimler-Experiment-`, not a runtime dependency of
Comptextv7.
CompText V7.

Branch discipline remains part of the API contract process: create a feature
branch from `main` when available, open a PR, request review, and never push
@@ -77,7 +77,7 @@ summaries include:
## Export/report endpoints

`/export.json` and `/export.csv` are the primary report handoff endpoints inside
Comptextv7. They should remain deterministic enough for review, small enough for
CompText V7. They should remain deterministic enough for review, small enough for
PR artifacts, and explicit about any schema changes.
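
As an illustration of what "deterministic enough for review" could mean for `/export.json` and `/export.csv`, here is a minimal sketch. It is not taken from the repository: the helper names `export_json` and `export_csv` and the `route`/`status` row fields are hypothetical, and the only assumption is that rows are plain dictionaries. The idea is that sorting rows and keys, and fixing the line terminator, makes repeated exports byte-identical so a PR diff shows only real changes.

```python
import csv
import io
import json


def export_json(rows):
    """Serialize rows in a stable order with sorted keys (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: r["route"])
    return json.dumps(ordered, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def export_csv(rows, fieldnames):
    """Write rows as CSV with a fixed column order and '\\n' line endings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["route"]):
        writer.writerow(row)
    return buf.getvalue()


# Example rows are synthetic, in line with the documentation's own rule.
rows = [
    {"route": "/export.json", "status": "ok"},
    {"route": "/export.csv", "status": "ok"},
]
json_text = export_json(rows)
csv_text = export_csv(rows, ["route", "status"])
```

Input order does not affect the output here, which is the property a reviewer would likely care about when comparing exports across commits.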

`docs/reports/dashboard-health-summary.json` is the dashboard-facing static
@@ -94,14 +94,14 @@ Future report endpoints such as `/reports/benchmark-summary`,
`/reports/regression-summary`, or `/reports/sanitization-summary` should not be
added until a schema, security review, and issue scope approve them. If added,
they should consume sanitized summaries only and should not execute experiment
repository workloads from the Comptextv7 runtime.
repository workloads from the CompText V7 runtime.

## Payload and report contract expectations

Accepted report summaries should be:

- Synthetic in documentation examples.
- Sanitized before being copied into Comptextv7.
- Sanitized before being copied into CompText V7.
- Small enough to review in a pull request.
- Text-based: Markdown, JSON, or CSV.
- Explicit about `source_repo`, `target_repo`, `report_type`, status, timestamp
@@ -115,7 +115,7 @@ Machine-readable contract schemas now live under `contracts/` and are written as
lightweight JSON Schema-like documents that future agents and CI can inspect
without adding runtime dependencies:

- `contracts/api-dashboard.schema.json` describes Comptextv7 API routes,
- `contracts/api-dashboard.schema.json` describes CompText V7 API routes,
dashboard views, export formats, sanitized report integration points, and
security notes.
- `contracts/benchmark-summary.schema.json` describes synthetic benchmark
@@ -183,9 +183,9 @@ of reimplementing release-readiness logic.

## Compatibility with benchmark/regression reports

Benchmark summaries should map to Comptextv7 review surfaces this way:
Benchmark summaries should map to CompText V7 review surfaces this way:

| Report type | Expected Comptextv7 use |
| Report type | Expected CompText V7 use |
| --- | --- |
| `benchmark_summary` | Compare p50/p95/p99, RPS, error rate, and payload size for dashboard/API routes. |
| `regression_summary` | Decide whether a PR should merge, require remediation, or split into smaller changes. |
@@ -198,14 +198,14 @@ or high/critical forensic findings.

## Dashboard/API boundaries

Comptextv7 may:
CompText V7 may:

- Display sanitized benchmark/regression status.
- Export local dashboard evidence as JSON or CSV.
- Document future report contracts.
- Add small schema-version fields in future PRs.

Comptextv7 should not yet:
CompText V7 should not yet:

- Import code from `ProfRandom92/Comptext-Daimler-Experiment-`.
- Run experiment repository workloads as part of normal dashboard/API startup.
30 changes: 15 additions & 15 deletions docs/research_positioning.md
@@ -2,9 +2,9 @@

CompText V7 is the deterministic replay-integrity layer for compressed operational agent traces. It asks whether compact or reconstructed operational state can preserve the evidence, constraints, blockers, dependencies, recovery paths, capability boundaries, and tool-order signals needed to replay a safe operational trajectory after compression. The project is complementary to learned context-compression research, RAG evaluation, vector-memory systems, serving-layer cache optimization, and durable workflow infrastructure, but it does not replace those systems or claim solved AI memory.

## What CompTextv7 measures
## What CompText V7 measures

CompTextv7 measures fixture-bound replay survivability with deterministic artifacts. Current metrics and labels are intended to show whether explicitly encoded operational fields survive replay pressure, not whether a model answer is useful or semantically complete.
CompText V7 measures fixture-bound replay survivability with deterministic artifacts. Current metrics and labels are intended to show whether explicitly encoded operational fields survive replay pressure, not whether a model answer is useful or semantically complete.

Measured signals include:

@@ -20,15 +20,15 @@ Measured signals include:

## Replay-survivability evaluator brief

CompTextv7 evaluates replay survivability of compact operational state: whether fixture-authored operational fields can be compacted, reconstructed, replayed, and audited without relying on an LLM judge. The current prototype measures field survival, evidence survival, operational drift, and deterministic failure labels against checked-in fixtures. Its claims are therefore fixture-bound and prototype-scoped: it can show what the current validators detect under replay/compression pressure, not whether a deployed agent or memory product will succeed in the wild.
CompText V7 evaluates replay survivability of compact operational state: whether fixture-authored operational fields can be compacted, reconstructed, replayed, and audited without relying on an LLM judge. The current prototype measures field survival, evidence survival, operational drift, and deterministic failure labels against checked-in fixtures. Its claims are therefore fixture-bound and prototype-scoped: it can show what the current validators detect under replay/compression pressure, not whether a deployed agent or memory product will succeed in the wild.

Adjacent benchmark ecosystems include long-term memory benchmarks, RAG evaluation, long-horizon agent evaluation, software-agent/task benchmarks, and context-compression evaluation. Those ecosystems often evaluate task success, retrieval, answer quality, memory recall, or downstream performance. CompTextv7 is complementary: it evaluates whether compact operational state remains replayable and auditable, and it identifies which blockers, constraints, evidence, dependencies, recovery paths, or tool-order signals fail under compression/replay pressure.
Adjacent benchmark ecosystems include long-term memory benchmarks, RAG evaluation, long-horizon agent evaluation, software-agent/task benchmarks, and context-compression evaluation. Those ecosystems often evaluate task success, retrieval, answer quality, memory recall, or downstream performance. CompText V7 is complementary: it evaluates whether compact operational state remains replayable and auditable, and it identifies which blockers, constraints, evidence, dependencies, recovery paths, or tool-order signals fail under compression/replay pressure.

Why this matters: fluent summaries can lose blockers, constraints, evidence, dependencies, or recovery paths while still reading well. CompTextv7 treats that as deterministic replay degradation, not subjective text quality. The review path is the current trust chain: fixtures, generators, committed artifacts, Markdown summaries, README/doc values, artifact drift validation, and CI checks. See [Iterative Replay Degradation](iterative_replay_degradation.md), [Benchmark Explanation](BENCHMARK_EXPLANATION.md), the committed [iterative replay degradation summary](../artifacts/iterative_replay_degradation_results.summary.md), and [`scripts/validate_replay_artifact_drift.py`](../scripts/validate_replay_artifact_drift.py).
Why this matters: fluent summaries can lose blockers, constraints, evidence, dependencies, or recovery paths while still reading well. CompText V7 treats that as deterministic replay degradation, not subjective text quality. The review path is the current trust chain: fixtures, generators, committed artifacts, Markdown summaries, README/doc values, artifact drift validation, and CI checks. See [Iterative Replay Degradation](iterative_replay_degradation.md), [Benchmark Explanation](BENCHMARK_EXPLANATION.md), the committed [iterative replay degradation summary](../artifacts/iterative_replay_degradation_results.summary.md), and [`scripts/validate_replay_artifact_drift.py`](../scripts/validate_replay_artifact_drift.py).

## What CompTextv7 does not measure
## What CompText V7 does not measure

CompTextv7 does not measure general intelligence, answer quality, production readiness, or universal memory. It intentionally avoids:
CompText V7 does not measure general intelligence, answer quality, production readiness, or universal memory. It intentionally avoids:

- LLM judges or subjective scoring;
- embeddings, vector databases, graph stores, and external APIs;
@@ -50,23 +50,23 @@ The core contribution is a small deterministic review layer for operational repl

## Operational state vs raw chat history

CompTextv7 focuses on operational state, not raw chat-history retention. Rather than preserving every dialogue turn, it extracts, compacts, reconstructs, and verifies the fields that fixtures declare operationally relevant: tasks, constraints, blockers, evidence, dependencies, tool order, recovery actions, and continuation requirements.
CompText V7 focuses on operational state, not raw chat-history retention. Rather than preserving every dialogue turn, it extracts, compacts, reconstructs, and verifies the fields that fixtures declare operationally relevant: tasks, constraints, blockers, evidence, dependencies, tool order, recovery actions, and continuation requirements.

This framing is intentionally narrower than semantic memory. A replay can pass only for the fields represented in the fixture and checked by the deterministic validator.
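
To make the fixture-bound framing concrete, the following is a rough sketch of a field-survival check. It is not the project's validator: the function name `field_survival`, the result shape, and the sample field values are all assumptions made for illustration. It compares only the fields named by the caller, so anything outside the fixture is neither rewarded nor penalized.

```python
def field_survival(fixture, replayed, fields):
    """Compare fixture-declared values with replayed values, field by field.

    Hypothetical helper: each field maps to a list of hashable items
    (for example blocker IDs). Fields absent from `fields` are ignored.
    """
    report = {}
    for field in fields:
        expected = set(fixture.get(field, []))
        actual = set(replayed.get(field, []))
        lost = sorted(expected - actual)
        # An empty expectation counts as fully survived.
        survival = 1.0 if not expected else (len(expected) - len(lost)) / len(expected)
        report[field] = {"survival": survival, "lost": lost}
    passed = all(entry["survival"] == 1.0 for entry in report.values())
    return {"fields": report, "passed": passed}


# Synthetic fixture and replay state, for illustration only.
fixture = {"blockers": ["b1", "b2"], "constraints": ["c1"]}
replayed = {"blockers": ["b1"], "constraints": ["c1"], "notes": ["fluent summary"]}
result = field_survival(fixture, replayed, ["blockers", "constraints"])
```

In this example the replay keeps every constraint but drops blocker `b2`, so the check fails deterministically even though the extra `notes` field might read well.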

## How deterministic replay validation differs from adjacent categories

| Category | What that category usually evaluates or provides | CompTextv7 boundary |
| Category | What that category usually evaluates or provides | CompText V7 boundary |
| --- | --- | --- |
| RAG evaluation | Retrieval quality, answer grounding, citation coverage, or generated-answer quality. | CompTextv7 does not retrieve documents or judge generated answers. It checks whether fixture-defined operational state survives compact/replay cycles. |
| Vector memory | Embedding-based recall and similarity search over stored memories. | CompTextv7 does not use embeddings or vector databases. It compares explicit fixture IDs, fields, attachments, and normalized values. |
| KV-cache compression | Serving-layer efficiency for model attention/cache reuse. | CompTextv7 does not optimize model internals or inference caches. It emits reviewable replay artifacts and field-survival metrics. |
| Workflow orchestration | Durable execution, retries, scheduling, state machines, and tool execution. | CompTextv7 does not run autonomous workflows. It validates whether replayed operational state still contains fixture-defined continuation requirements. |
| Learned context compression | Model-learned summaries or compressed prompts optimized for downstream performance. | CompTextv7 does not train or evaluate a learned compressor. It measures deterministic replay preservation under controlled fixtures. |
| RAG evaluation | Retrieval quality, answer grounding, citation coverage, or generated-answer quality. | CompText V7 does not retrieve documents or judge generated answers. It checks whether fixture-defined operational state survives compact/replay cycles. |
| Vector memory | Embedding-based recall and similarity search over stored memories. | CompText V7 does not use embeddings or vector databases. It compares explicit fixture IDs, fields, attachments, and normalized values. |
| KV-cache compression | Serving-layer efficiency for model attention/cache reuse. | CompText V7 does not optimize model internals or inference caches. It emits reviewable replay artifacts and field-survival metrics. |
| Workflow orchestration | Durable execution, retries, scheduling, state machines, and tool execution. | CompText V7 does not run autonomous workflows. It validates whether replayed operational state still contains fixture-defined continuation requirements. |
| Learned context compression | Model-learned summaries or compressed prompts optimized for downstream performance. | CompText V7 does not train or evaluate a learned compressor. It measures deterministic replay preservation under controlled fixtures. |
Comment on lines +61 to +65

medium

The repetition of the project name in every cell of this column is redundant since it is already specified in the column header. Removing it would improve readability and align with the style used in other tables and lists in the documentation (e.g., the 'Not claimed' section on line 83 and the 'Stable API routes' table in docs/API_SURFACE.md).

Suggested change
| RAG evaluation | Retrieval quality, answer grounding, citation coverage, or generated-answer quality. | CompText V7 does not retrieve documents or judge generated answers. It checks whether fixture-defined operational state survives compact/replay cycles. |
| Vector memory | Embedding-based recall and similarity search over stored memories. | CompText V7 does not use embeddings or vector databases. It compares explicit fixture IDs, fields, attachments, and normalized values. |
| KV-cache compression | Serving-layer efficiency for model attention/cache reuse. | CompText V7 does not optimize model internals or inference caches. It emits reviewable replay artifacts and field-survival metrics. |
| Workflow orchestration | Durable execution, retries, scheduling, state machines, and tool execution. | CompText V7 does not run autonomous workflows. It validates whether replayed operational state still contains fixture-defined continuation requirements. |
| Learned context compression | Model-learned summaries or compressed prompts optimized for downstream performance. | CompText V7 does not train or evaluate a learned compressor. It measures deterministic replay preservation under controlled fixtures. |
| RAG evaluation | Retrieval quality, answer grounding, citation coverage, or generated-answer quality. | Does not retrieve documents or judge generated answers. It checks whether fixture-defined operational state survives compact/replay cycles. |
| Vector memory | Embedding-based recall and similarity search over stored memories. | Does not use embeddings or vector databases. It compares explicit fixture IDs, fields, attachments, and normalized values. |
| KV-cache compression | Serving-layer efficiency for model attention/cache reuse. | Does not optimize model internals or inference caches. It emits reviewable replay artifacts and field-survival metrics. |
| Workflow orchestration | Durable execution, retries, scheduling, state machines, and tool execution. | Does not run autonomous workflows. It validates whether replayed operational state still contains fixture-defined continuation requirements. |
| Learned context compression | Model-learned summaries or compressed prompts optimized for downstream performance. | Does not train or evaluate a learned compressor. It measures deterministic replay preservation under controlled fixtures. |


## Artifact-backed JSON and CI checks

CompTextv7 uses artifact-backed JSON and deterministic Markdown summaries so reviewers can inspect the exact replay evidence for a commit. CI artifacts are evidence records for tested fixtures; they are not universal guarantees.
CompText V7 uses artifact-backed JSON and deterministic Markdown summaries so reviewers can inspect the exact replay evidence for a commit. CI artifacts are evidence records for tested fixtures; they are not universal guarantees.
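
One way such artifact-backed evidence could be checked for drift is sketched below. This is an assumption-laden illustration and not the contents of `scripts/validate_replay_artifact_drift.py`: the helper names `canonical_digest` and `has_drifted` are hypothetical, and the sketch assumes artifacts are JSON-serializable dictionaries. It hashes a canonical serialization so that key order alone never registers as drift.

```python
import hashlib
import json


def canonical_digest(payload):
    """Return a SHA-256 hex digest of a canonical JSON encoding (hypothetical helper)."""
    data = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def has_drifted(committed, regenerated):
    """True when the regenerated artifact differs from the committed one."""
    return canonical_digest(committed) != canonical_digest(regenerated)


# Synthetic artifacts, for illustration only.
committed = {"fixture": "demo", "survived_fields": ["blockers", "constraints"]}
same_content = {"survived_fields": ["blockers", "constraints"], "fixture": "demo"}
changed = {"fixture": "demo", "survived_fields": ["constraints"]}
```

A CI step built on this idea would regenerate the artifact from fixtures and fail when `has_drifted` returns true, which keeps the committed JSON an accurate record for that commit rather than a general guarantee.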

## Fixture-bound baseline interpretation
