Merged
4 changes: 4 additions & 0 deletions .gitignore
@@ -145,6 +145,10 @@ workspaces/**/artifacts/**
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-seed177-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-label-shuffle-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-label-shuffle-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-label-shuffle-20260525.json
 !workspaces/black-box/artifacts/beans-lora-member-denoising-loss-scout-20260513.json
 !workspaces/black-box/artifacts/clid-image-identity-boundary-20260511.json
 !workspaces/black-box/artifacts/midfreq-same-noise-residual-cache-audit-20260512.json
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -28,7 +28,7 @@ Do not start from memory or old chat context. Re-anchor on repository files.

 ## Current Operating State

-- Active work: `2026-05-25 H2 output-cloud geometry cache review is the latest metric verdict. It is a strong Research-side candidate on the existing H2 response cache (seed 176 logistic AUC = 0.961529, TPR@1%FPR = 0.333984, TPR@0.1%FPR = 0.117188; seed 177 AUC = 0.961048; label-shuffle AUC = 0.507595), but it is not admitted because the source cache used class-ordered sample offsets and needs a reseeded or interleaved order-control cache before any promotion. Do not create Platform/Runtime schema, bundle export, UI type, runner, KDE/shadow-density/repeat-count sweeps, or same-cache feature sweeps from this result. active_gpu_question = none; next_gpu_candidate = none; CPU sidecar = none selected after H2 output-cloud cache review. Feature-packet consumer lane remains deferred. LeakyCLIP remains CLIP / multimodal privacy watch-plus. ReDiffuse DDPM/STL-10 remains closed by default after the weak bounded scout (AUC = 0.4996337890625) and weak SimA-style score-norm scorer (AUC = 0.5052947998046875).`
+- Active work: `2026-05-25 H2 output-cloud geometry is the latest metric verdict. It is a strong Research-side candidate on the existing H2 response cache (seed 176 logistic AUC = 0.961529, TPR@1%FPR = 0.333984, TPR@0.1%FPR = 0.117188; seed 177 AUC = 0.961048; label-shuffle AUC = 0.507595). The bounded 256/256 shared-position order-control scout preserved the signal (AUC = 0.967819, TPR@1%FPR = 0.410156, TPR@0.1%FPR = 0.132812; label-shuffle AUC = 0.464066), so class-ordered seed offset is not a sufficient explanation. It is still not admitted because this remains Research-side H2 response-cache geometry, not a second public asset or Platform/Runtime contract. Do not create Platform/Runtime schema, bundle export, UI type, runner, KDE/shadow-density/repeat-count sweeps, same-cache feature sweeps, or a full 512/512 rerun just to complete a table. active_gpu_question = none; next_gpu_candidate = none; CPU sidecar = none selected after H2 output-cloud order-control scout. Feature-packet consumer lane remains deferred. LeakyCLIP remains CLIP / multimodal privacy watch-plus. ReDiffuse DDPM/STL-10 remains closed by default after the weak bounded scout (AUC = 0.4996337890625) and weak SimA-style score-norm scorer (AUC = 0.5052947998046875).`
 - Next GPU candidate: none selected
 - Long-horizon control: follow `ROADMAP.md` section
 `Long-Horizon Research Task Board(2026-05-13 起)` before reopening any
29 changes: 18 additions & 11 deletions ROADMAP.md
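@@ -4,11 +4,12 @@

 ## 2026-05-25 H2 output-cloud geometry candidate signal

-Latest decision: the existing `512 / 512` H2 response-strength response cache exposes a strong
-output-output geometry candidate signal, but until order-control passes it is not promoted, not released for product consumption,
-and no further same-cache feature engineering is done. This review only reads the existing
-`workspaces/black-box/runs/h2-response-strength-512-20260501-r1/response-cache.npz`;
-it generated no new responses, downloaded no assets, and ran no GPU.
+Latest decision: H2 response-strength output-cloud geometry is a strong Research-side candidate
+and has passed one bounded `256 / 256` shared-position order-control scout; it is still not promoted,
+not released for product consumption, and not extended with same-cache feature engineering, and a full `512 / 512`
+shared-position rerun is not run by default. The first review pass read the existing
+`workspaces/black-box/runs/h2-response-strength-512-20260501-r1/response-cache.npz`;
+the control pass generated a local `256 / 256` shared-position cache and downloaded no assets.

 The scorer deliberately excludes seed-to-output distance and uses only the RMSE between repeats at the same timestep,
 the centroid RMSE across different timesteps, and response-cloud Gram/PCA features. The main result is
@@ -19,13 +20,19 @@ seed `177` stability remains `AUC = 0.961048`, `TPR@1%FPR = 0.353516`,
 `TPR@0.1%FPR = 0.130859`; the label-shuffle sanity check returns to random level,
 `AUC = 0.507595`.

-Key caveat: when the source cache was generated, the member side used `sample_offset = 0` and the nonmember side used
-`sample_offset = len(member_indices)`; output-output geometry is sensitive to sampling seeds and response-cloud shape,
-so the current strong signal may be mixed with a class-ordered sampling effect. This result can only serve as a
-strong Research-side candidate; the next re-evaluation can only be one bounded reseeded / interleaved
-order-control response-cache scout. Current slots remain:
+Completed order-control: `--seed-offset-policy shared-position` makes member / nonmember
+use the same per-position seed offset. On this `256 / 256` control cache, the output-cloud
+logistic scorer still gives `AUC = 0.967819`, `ASR = 0.923828`,
+`TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`; label shuffle
+returns to random level, `AUC = 0.464066`. The same-size old class-ordered subset gives
+`AUC = 0.967438`, `TPR@1%FPR = 0.179688`,
+`TPR@0.1%FPR = 0.105469`. Class-ordered seed offset is therefore no longer a sufficient explanation for the strong signal.
+
+This result can only serve as a strong Research-side candidate; the next step is neither a same-cache sweep nor a
+full `512 / 512` shared-position run just to fill out a table. Reopening should be driven only by formal mechanism promotion, a second public asset, or an independent consumption contract.
+Current slots remain:
 `active_gpu_question = none`, `next_gpu_candidate = none`,
-`CPU sidecar = none selected after H2 output-cloud cache review`.
+`CPU sidecar = none selected after H2 output-cloud order-control scout`.
 See
 [docs/evidence/h2-output-cloud-geometry-20260525.md](docs/evidence/h2-output-cloud-geometry-20260525.md).
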
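The ROADMAP entry names the scorer's observables (same-timestep repeat RMSE, cross-timestep centroid RMSE, Gram/PCA features) but not how they are computed. A minimal sketch of the two RMSE features follows, assuming each sample's cached responses are available as a mapping `timestep -> [repeat_0, repeat_1, ...]` of flattened float vectors. `output_cloud_features`, `rmse`, `centroid`, and this data layout are hypothetical illustrations, not the code in `scripts/run_h2_response_strength_validation.py`, and the Gram/PCA features are omitted. Like the described scorer, it never reads the seed or input, only distances between outputs.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

# Hypothetical layout: timestep -> list of repeated, flattened responses.
ResponseCloud = Dict[int, List[Sequence[float]]]


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square difference between two flattened responses."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("responses must be non-empty and equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of several flattened responses."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def output_cloud_features(cloud: ResponseCloud) -> Dict[str, float]:
    """Output-output geometry features for one sample (sketch).

    Only output-to-output distances are used; no seed-to-output distance.
    """
    timesteps = sorted(cloud)
    if len(timesteps) < 2 or any(len(cloud[t]) < 2 for t in timesteps):
        raise ValueError("need >= 2 timesteps and >= 2 repeats per timestep")
    # Spread of the repeats generated at the same timestep.
    repeat_rmses = [
        rmse(a, b) for t in timesteps for a, b in combinations(cloud[t], 2)
    ]
    # Separation between the per-timestep centroids of the response cloud.
    centroids = [centroid(cloud[t]) for t in timesteps]
    centroid_rmses = [rmse(a, b) for a, b in combinations(centroids, 2)]
    return {
        "repeat_rmse_mean": sum(repeat_rmses) / len(repeat_rmses),
        "centroid_rmse_mean": sum(centroid_rmses) / len(centroid_rmses),
    }


if __name__ == "__main__":
    toy = {40: [[0.0, 0.0], [2.0, 0.0]], 80: [[1.0, 1.0], [1.0, 3.0]]}
    print(output_cloud_features(toy))
```

Per-sample feature dictionaries like this would then be stacked into a matrix for a logistic scorer; averaging the pairwise RMSEs, as assumed here, is one simple aggregation and is not confirmed by the diff.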
100 changes: 77 additions & 23 deletions docs/evidence/h2-output-cloud-geometry-20260525.md
@@ -1,16 +1,17 @@
 # H2 Output-Cloud Geometry Cache Review

 > Date: 2026-05-25
-> Status: candidate complementary signal / CPU-only cache review / order-control required before promotion / no GPU release / no admitted row
+> Status: candidate complementary signal / order-control scout passed / no admitted row / no 512/512 rerun selected

 ## Question

 On the existing H2 response-strength cache, does the geometry among outputs carry a membership signal
 that differs from seed-to-output distance?

-This round only reuses the existing
+The first round only reuses the existing
 `workspaces/black-box/runs/h2-response-strength-512-20260501-r1/response-cache.npz`.
-It generated no new responses, downloaded no assets, ran no GPU, and did not extend the same route with KDE, shadow
+Afterwards, only one bounded `256 / 256` shared-position order-control scout was released, to answer the
+class-ordered seed-offset caveat. No assets were downloaded, and the same route was not extended with KDE, shadow
 density, repeat-count, or feature sweeps.

 ## Contract
@@ -87,46 +88,99 @@ Label-shuffle sanity:

 This indicates that the scorer/evaluation pipeline has no obvious direct label leakage.

-## Critical Caveat
+## Shared-Position Order-Control Scout

-This result still cannot be promoted. Response generation for the source cache has a class-ordered seed offset:
-`scripts/run_h2_response_strength_validation.py` uses, on the member side,
-`sample_offset = 0`, and on the nonmember side `sample_offset = len(member_indices)`.
-Output-output geometry is sensitive to sampling seeds and response-cloud shape, so the current strong signal may be mixed with a
+Response generation for the source `512 / 512` cache has a class-ordered seed offset:
+the historical default behavior of `scripts/run_h2_response_strength_validation.py` is member
+`sample_offset = 0`, nonmember `sample_offset = len(member_indices)`.
+Output-output geometry is sensitive to sampling seeds and response-cloud shape, so we must check whether the strong signal is merely a
 class-ordered sampling effect.

-This is not about filling out more tables on the same cache; it only defines a very narrow next step:
-if progress is needed, release at most one bounded order-control / reseeded / interleaved
-response-cache scout to determine whether the strong signal survives a class-order control.
-
-The minimal script change currently allowed is limited to generating this control cache:
+This round adds only one narrow seed-policy control:
 `scripts/run_h2_response_strength_validation.py --seed-offset-policy shared-position`.
-This mode makes member / nonmember use the same per-position seed offset, and
-marks `order_control_scout = true` in `summary.json`. It is only for re-evaluating the
-class-ordered sampling effect; it does not imply admission and must not directly produce a Platform /
-Runtime row.
+This mode makes member / nonmember use the same per-position seed offset, and
+marks `order_control_scout = true` in `summary.json`. The run bounds are
+`256 / 256`, timesteps `40 / 80 / 120 / 160`, repeats `2`, seed `176`,
+holdout repeats `7`, bootstrap iters `100`. The GPU scout took `208.866516s`.
+
+The runner summary's H2 distance scorer remains positive under shared-position, but its tail is weak:
+
+| Metric | Raw H2 logistic | Lowpass H2 logistic |
+| --- | ---: | ---: |
+| AUC | `0.906967` | `0.898102` |
+| ASR | `0.837891` | `0.828125` |
+| TPR@1%FPR | `0.058594` | `0.066406` |
+| TPR@0.1%FPR | `0.003906` | `0.003906` |
+
+Output-cloud geometry review on the same shared-position cache:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-20260525.json`
+
+| Metric | Shared-position `256 / 256` |
+| --- | ---: |
+| AUC | `0.967819` |
+| ASR | `0.923828` |
+| TPR@1%FPR | `0.410156` |
+| TPR@0.1%FPR | `0.132812` |
+
+Label-shuffle sanity for the shared-position cache:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-label-shuffle-20260525.json`
+
+| Metric | Shared-position label shuffle |
+| --- | ---: |
+| AUC | `0.464066` |
+| ASR | `0.505859` |
+| TPR@1%FPR | `0.003906` |
+| TPR@0.1%FPR | `0.0` |
+
+Same-size historical class-ordered subset from the old cache:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-20260525.json`
+
+| Metric | Class-ordered subset `256 / 256` |
+| --- | ---: |
+| AUC | `0.967438` |
+| ASR | `0.916016` |
+| TPR@1%FPR | `0.179688` |
+| TPR@0.1%FPR | `0.105469` |
+
+Class-ordered subset label shuffle:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-label-shuffle-20260525.json`
+
+| Metric | Class-ordered subset label shuffle |
+| --- | ---: |
+| AUC | `0.427902` |
+| ASR | `0.5` |
+| TPR@1%FPR | `0.0` |
+| TPR@0.1%FPR | `0.0` |
+
+Interpretation: shared-position order-control did not collapse the output-cloud
+geometry signal, and its label-shuffle check returns to random level. This removes
+the previous class-ordered seed-offset caveat as a sufficient explanation for
+the signal. The result still does not imply product admission: it is one
+controlled scout on H2 DDPM/CIFAR10 response-cache geometry, not a second
+public asset or Platform/Runtime contract.

 ## Decision

-`candidate complementary signal / order-control required / no admitted row`.
+`candidate complementary signal / order-control scout passed / no admitted row`.

 Kept as a strong Research-side candidate because it meets three valuable conditions:

 - It is a different observable: output-output cloud geometry rather than seed-to-output distance.
 - On the same H2 cache it is clearly stronger than the raw/lowpass H2 logistic scorers.
 - It passed the seed-177 stability check and the label-shuffle sanity check.
+- It did not collapse under seed-offset control in the `256 / 256` shared-position order-control scout.

 The following are currently not done:

 - No upgrade to a Platform/Runtime admitted bundle.
 - No new product schema, Runtime runner, UI type, or bundle row.
 - No KDE, shadow density, repeat-count, feature-family, or fusion sweeps on the same cache.
-- No GPU release or large downloads.
+- No full `512 / 512` shared-position GPU rerun or large downloads; the current `256 / 256`
+  order-control scout has already answered the caveat that would have changed the route.

-The next re-evaluation is allowed only on the basis of one order-control cache result. If the reseeded /
-interleaved cache still keeps a strong AUC and strict-tail recovery, then discuss whether to enter a more formal H2
-output-cloud mechanism line; if it does not, close the candidate directly as a class-ordered response-cache
-artifact.
+The next re-evaluation should not be a same-cache feature sweep or a `512 / 512` rerun just to make the tables look complete.
+Only when the mechanism line needs formal promotion, a second public asset is found, or an independent consumption contract must be built
+should a higher-cost validation task be defined.

 ## Platform and Runtime Impact

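The seed-offset caveat in the evidence document is easiest to see with concrete seeds. The sketch below contrasts the historical class-ordered layout (member `sample_offset = 0`, nonmember `sample_offset = len(member_indices)`) with the shared-position layout. It assumes, hypothetically, that a sample's seed is `base_seed + sample_offset + position`; the real script's seed arithmetic and the `sample_seeds` helper are assumptions, not taken from the diff.

```python
from typing import Dict, List


def sample_seeds(base_seed: int, n_members: int, n_nonmembers: int,
                 policy: str) -> Dict[str, List[int]]:
    """Per-sample seeds under two offset policies (hypothetical sketch).

    Assumption: seed = base_seed + sample_offset + position.
    - "class-ordered": members use sample_offset = 0, nonmembers use
      sample_offset = n_members (i.e. len(member_indices)).
    - "shared-position": both classes use sample_offset = 0, so the i-th
      member and the i-th nonmember share a seed.
    """
    if policy == "class-ordered":
        member_offset, nonmember_offset = 0, n_members
    elif policy == "shared-position":
        member_offset, nonmember_offset = 0, 0
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return {
        "member": [base_seed + member_offset + i for i in range(n_members)],
        "nonmember": [base_seed + nonmember_offset + i for i in range(n_nonmembers)],
    }


if __name__ == "__main__":
    for name in ("class-ordered", "shared-position"):
        print(name, sample_seeds(176, 3, 3, name))
```

Under class-ordered seeds, membership lines up exactly with which block of the seed range a sample was drawn from, so any seed-dependent response-cloud shape could carry the label. Under shared-position seeds the i-th member and i-th nonmember see the same seed, which removes that confound; this is the comparison the `256 / 256` scout makes.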
1 change: 1 addition & 0 deletions docs/evidence/reproduction-status.md
@@ -32,6 +32,7 @@ Smoke tests and dry runs are engineering validation, not benchmark claims.
 | Track | Status | Notes |
 | --- | --- | --- |
 | Black-box `recon` | `evidence-ready` | Strongest black-box method and admitted non-CLiD product row. Public data limits strict paper-aligned claims. The bounded public-100 step30 rerun plus unified artifact summary yields the promoted coherent packet: `AUC = 0.837`, `ASR = 0.74`, `TPR@1%FPR = 0.22`, `TPR@0.1%FPR = 0.11`. See [non-clid-black-box-reselection.md](non-clid-black-box-reselection.md), [recon-product-validation-contract.md](recon-product-validation-contract.md), [recon-product-validation-result.md](recon-product-validation-result.md), and [../product-bridge/recon-product-validation-handoff.md](../product-bridge/recon-product-validation-handoff.md). |
+| Black-box `H2 output-cloud geometry` | `hold-candidate` | Strong Research-side output-output geometry signal on H2 response caches, but not an admitted Platform/Runtime row. Existing `512 / 512` cache review gives `AUC = 0.961529`, `TPR@1%FPR = 0.333984`, `TPR@0.1%FPR = 0.117188`; seed `177` is stable and label shuffle is random-level. The `256 / 256` shared-position order-control scout preserves the signal (`AUC = 0.967819`, `TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`), so class-ordered seed offset is not a sufficient explanation. Do not promote, add schema/runner/UI/bundle rows, run same-cache feature sweeps, or schedule a full `512 / 512` rerun by default. See [h2-output-cloud-geometry-20260525.md](h2-output-cloud-geometry-20260525.md). |
 | Black-box `CLiD` | `hold-candidate` | Selected as a bounded black-box lane after H2 SD/CelebA text-to-image transfer was protocol-blocked. The official CPU `inter_output/*` replay is strong (`AUC = 0.961277`, `TPR@1%FPR = 0.675470`, `ASR = 0.891957`) and now has a machine-readable candidate-only card, but row identity remains blocked because the public score rows are numeric-only and the 2026-05-15 authenticated HF `mia_COCO.zip` `HEAD`/`Range` recheck still returned `403`. Earlier local prompt-conditioned packets were strong and repeat-stable, but prompt-neutral perturbation collapses the signal, swapped-prompt control is degraded, within-split prompt shuffle is weak and seed-sensitive, prompt-text-only review is moderate AUC but weak strict-tail, and control attribution shows auxiliary-feature instability under prompt controls. Current evidence supports a prompt-conditioned diagnostic claim only, not admitted general black-box evidence. No next CLiD GPU task is selected. See [../product-bridge/clid-candidate-evidence-card.md](../product-bridge/clid-candidate-evidence-card.md), [clid-official-inter-output-replay-20260515.md](clid-official-inter-output-replay-20260515.md), [clid-identity-manifest-gate-20260515.md](clid-identity-manifest-gate-20260515.md), [black-box-next-lane-selection.md](black-box-next-lane-selection.md), [clid-bridge-contract.md](clid-bridge-contract.md), [clid-score-schema-gate.md](clid-score-schema-gate.md), [clid-tiny-score-bridge.md](clid-tiny-score-bridge.md), [clid-100-score-packet.md](clid-100-score-packet.md), [clid-candidate-integrity-review.md](clid-candidate-integrity-review.md), [clid-repeat-stability.md](clid-repeat-stability.md), [clid-prompt-perturbation.md](clid-prompt-perturbation.md), [clid-prompt-conditioning-boundary.md](clid-prompt-conditioning-boundary.md), [clid-swapped-prompt-control.md](clid-swapped-prompt-control.md), [clid-within-split-shuffle-control.md](clid-within-split-shuffle-control.md), [clid-prompt-text-only-review.md](clid-prompt-text-only-review.md), and [clid-control-attribution.md](clid-control-attribution.md). |
 | Black-box `variation` | `code-ready` | API-only support method; needs real query data for stronger claims. |
 | Feature-packet consumer lane | `deferred-candidate` | 2026-05-25 consumer verdict keeps the gray-box feature-packet lane out of Platform/Runtime. Tracing the Roots remains positive Research evidence (`AUC = 0.815826`, `TPR@1%FPR = 0.134000`), but live narrow public-surface recheck found no second non-source-equivalent public feature-packet and no raw checkpoint/sample/regeneration assets. Do not add feature-packet schema, bundle export, validators, tests, Platform UI type, Runtime runner, GPU task, or download from this singleton. See [feature-packet-channel-consumer-verdict-20260525.md](feature-packet-channel-consumer-verdict-20260525.md) and [../product-bridge/feature-packet-lane.md](../product-bridge/feature-packet-lane.md). |
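The status rows quote `AUC`, `TPR@1%FPR`, `TPR@0.1%FPR`, and a label-shuffle sanity check. The sketch below shows one common way to compute these from per-sample membership scores, assuming higher scores mean "more likely member". It is a generic, hypothetical implementation (`roc_auc`, `tpr_at_fpr`, and `label_shuffle_auc` are illustrative names), not the repository's evaluation code. It omits `ASR`, whose exact definition the diff does not give, and its label shuffle permutes labels against fixed scores rather than refitting a scorer on shuffled labels.

```python
import random
from typing import List, Sequence, Tuple


def _split(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Separate scores into members (label 1) and nonmembers (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one nonmember")
    return pos, neg


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a member outscores a nonmember (ties count 0.5)."""
    pos, neg = _split(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Best TPR over thresholds 'score >= t' whose FPR stays <= max_fpr."""
    pos, neg = _split(scores, labels)
    best = 0.0
    for t in set(scores):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


def label_shuffle_auc(scores: Sequence[float], labels: Sequence[int],
                      n_shuffles: int = 200, seed: int = 0) -> float:
    """Mean AUC after permuting labels; should sit near 0.5 without leakage."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += roc_auc(scores, shuffled)
    return total / n_shuffles
```

Under this sketch's definition, with `n` nonmembers the smallest nonzero FPR is `1/n`, so whenever `n` is below 1000 a 0.1% FPR budget admits zero false positives; strict-tail numbers on a few hundred nonmembers are therefore coarse.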
14 changes: 9 additions & 5 deletions docs/evidence/workspace-evidence-index.md
@@ -6,15 +6,19 @@ This index separates current track state from archived research history.

 Latest Research update:
 [h2-output-cloud-geometry-20260525.md](h2-output-cloud-geometry-20260525.md)
-records a CPU-only metric verdict on the existing H2 response-strength cache.
+records a metric verdict on the H2 response-strength cache plus a bounded
+`256 / 256` shared-position order-control scout.
 The output-output geometry scorer is a strong Research-side candidate
 (`AUC = 0.961529`, `TPR@1%FPR = 0.333984`,
 `TPR@0.1%FPR = 0.117188`) and is stable under seed `177`
 (`AUC = 0.961048`), while label-shuffle sanity returns random-level
-(`AUC = 0.507595`). It is not admitted because the source cache used
-class-ordered sample offsets and needs a reseeded or interleaved order-control
-cache before promotion. Decision: `candidate complementary signal /
-order-control required / no admitted row / no download / no GPU release`.
+(`AUC = 0.507595`). The shared-position order-control scout also stays strong
+(`AUC = 0.967819`, `TPR@1%FPR = 0.410156`,
+`TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`).
+It is not admitted because this remains a Research-side H2 response-cache
+geometry candidate, not a second public asset or Platform/Runtime contract.
+Decision: `candidate complementary signal / order-control scout passed /
+no admitted row / no download / no 512/512 rerun selected`.

 Previous Research update:
 [feature-packet-channel-consumer-verdict-20260525.md](feature-packet-channel-consumer-verdict-20260525.md)