From 7886254380bd340a2d3dc78021756621feb5dc24 Mon Sep 17 00:00:00 2001
From: Delicious233
Date: Mon, 25 May 2026 06:22:25 +0800
Subject: [PATCH] docs: record h2 order-control seed stability

---
 .gitignore                                    |   2 +
 AGENTS.md                                     |   2 +-
 ROADMAP.md                                    |  11 +-
 .../h2-output-cloud-geometry-20260525.md      |  52 +++++-
 docs/evidence/reproduction-status.md          |   2 +-
 docs/evidence/workspace-evidence-index.md     |   9 +-
 workspaces/black-box/README.md                |   7 +-
 ...-shared-position-seed177-256-20260525.json | 166 ++++++++++++++++++
 ...on-seed177-256-label-shuffle-20260525.json | 142 +++++++++++++++
 workspaces/black-box/plan.md                  |   9 +-
 10 files changed, 385 insertions(+), 17 deletions(-)
 create mode 100644 workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json
 create mode 100644 workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json

diff --git a/.gitignore b/.gitignore
index 63fd119d..f0ad526a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -147,6 +147,8 @@ workspaces/**/artifacts/**
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-label-shuffle-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-256-label-shuffle-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json
+!workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-20260525.json
 !workspaces/black-box/artifacts/h2-output-cloud-geometry-class-ordered-subset-256-label-shuffle-20260525.json
 !workspaces/black-box/artifacts/beans-lora-member-denoising-loss-scout-20260513.json
diff --git a/AGENTS.md b/AGENTS.md
index 1d833ab9..e25fa8f1 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -28,7 +28,7 @@ Do not start from memory or old chat
context. Re-anchor on repository files. ## Current Operating State -- Active work: `2026-05-25 H2 output-cloud geometry is the latest metric verdict. It is a strong Research-side candidate on the existing H2 response cache (seed 176 logistic AUC = 0.961529, TPR@1%FPR = 0.333984, TPR@0.1%FPR = 0.117188; seed 177 AUC = 0.961048; label-shuffle AUC = 0.507595). The bounded 256/256 shared-position order-control scout preserved the signal (AUC = 0.967819, TPR@1%FPR = 0.410156, TPR@0.1%FPR = 0.132812; label-shuffle AUC = 0.464066), so class-ordered seed offset is not a sufficient explanation. It is still not admitted because this remains Research-side H2 response-cache geometry, not a second public asset or Platform/Runtime contract. Do not create Platform/Runtime schema, bundle export, UI type, runner, KDE/shadow-density/repeat-count sweeps, same-cache feature sweeps, or a full 512/512 rerun just to complete a table. active_gpu_question = none; next_gpu_candidate = none; CPU sidecar = none selected after H2 output-cloud order-control scout. Feature-packet consumer lane remains deferred. LeakyCLIP remains CLIP / multimodal privacy watch-plus. ReDiffuse DDPM/STL-10 remains closed by default after the weak bounded scout (AUC = 0.4996337890625) and weak SimA-style score-norm scorer (AUC = 0.5052947998046875).` +- Active work: `2026-05-25 H2 output-cloud geometry is the latest metric verdict. It is a strong Research-side candidate on the existing H2 response cache (seed 176 logistic AUC = 0.961529, TPR@1%FPR = 0.333984, TPR@0.1%FPR = 0.117188; seed 177 AUC = 0.961048; label-shuffle AUC = 0.507595). The bounded 256/256 shared-position order-control scout preserved the signal (AUC = 0.967819, TPR@1%FPR = 0.410156, TPR@0.1%FPR = 0.132812; label-shuffle AUC = 0.464066), so class-ordered seed offset is not a sufficient explanation. 
The same controlled boundary at seed 177 remains strong (AUC = 0.956192, TPR@1%FPR = 0.285156, TPR@0.1%FPR = 0.109375; label-shuffle AUC = 0.484070), so the controlled signal is not single-seed. It is still not admitted because this remains Research-side H2 response-cache geometry, not a second public asset or Platform/Runtime contract. Do not create Platform/Runtime schema, bundle export, UI type, runner, KDE/shadow-density/repeat-count sweeps, same-cache feature sweeps, or a full 512/512 rerun just to complete a table. active_gpu_question = none; next_gpu_candidate = none; CPU sidecar = none selected after H2 output-cloud order-control seed-stability scout. Feature-packet consumer lane remains deferred. LeakyCLIP remains CLIP / multimodal privacy watch-plus. ReDiffuse DDPM/STL-10 remains closed by default after the weak bounded scout (AUC = 0.4996337890625) and weak SimA-style score-norm scorer (AUC = 0.5052947998046875).` - Next GPU candidate: none selected - Long-horizon control: follow `ROADMAP.md` section `Long-Horizon Research Task Board(2026-05-13 起)` before reopening any diff --git a/ROADMAP.md b/ROADMAP.md index 45421444..22359198 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -5,11 +5,13 @@ ## 2026-05-25 H2 output-cloud geometry 候选信号 最新决策:H2 response-strength 的 output-cloud geometry 是 Research-side 强候选, -并且已通过一个有界 `256 / 256` shared-position order-control scout;但它仍不晋升、 +并且已通过有界 `256 / 256` shared-position order-control 和 seed-stability scout; +但它仍不晋升、 不释放产品消费、不扩展同 cache 特征工程,也不默认补跑完整 `512 / 512` shared-position。第一轮复查读取既有 `workspaces/black-box/runs/h2-response-strength-512-20260501-r1/response-cache.npz`; -控制轮生成了本地 `256 / 256` shared-position cache,没有下载资产。 +控制轮生成了本地 `256 / 256` shared-position cache,稳定性轮只把 seed 从 `176` +改成 `177`,没有下载资产。 该 scorer 刻意排除 seed-to-output distance,只使用同 timestep repeat 间 RMSE、 不同 timestep centroid RMSE 和 response-cloud Gram/PCA 特征。主结果为 @@ -27,12 +29,15 @@ logistic 仍为 `AUC = 0.967819`,`ASR = 0.923828`, 回到随机级 `AUC = 0.464066`。同尺寸旧 
class-ordered subset 为 `AUC = 0.967438`,`TPR@1%FPR = 0.179688`, `TPR@0.1%FPR = 0.105469`。因此 class-ordered seed offset 不再是该强信号的充分解释。 +同边界 seed `177` shared-position scout 继续保持强信号:output-cloud logistic +`AUC = 0.956192`,`ASR = 0.896484`,`TPR@1%FPR = 0.285156`, +`TPR@0.1%FPR = 0.109375`;label-shuffle 回到随机级 `AUC = 0.484070`。 该结果只能作为 Research-side 强候选;下一步不是同 cache sweep,也不是为了补表格跑 完整 `512 / 512` shared-position。重新打开只应基于正式机制晋升、第二公开资产或独立消费合约。 当前 slots 仍为: `active_gpu_question = none`,`next_gpu_candidate = none`, -`CPU sidecar = none selected after H2 output-cloud order-control scout`。 +`CPU sidecar = none selected after H2 output-cloud order-control seed-stability scout`。 See [docs/evidence/h2-output-cloud-geometry-20260525.md](docs/evidence/h2-output-cloud-geometry-20260525.md)。 diff --git a/docs/evidence/h2-output-cloud-geometry-20260525.md b/docs/evidence/h2-output-cloud-geometry-20260525.md index 9487f8a6..2c8e20c0 100644 --- a/docs/evidence/h2-output-cloud-geometry-20260525.md +++ b/docs/evidence/h2-output-cloud-geometry-20260525.md @@ -1,7 +1,7 @@ # H2 Output-Cloud Geometry Cache Review > Date: 2026-05-25 -> Status: candidate complementary signal / order-control scout passed / no admitted row / no 512/512 rerun selected +> Status: candidate complementary signal / order-control scout passed / shared-position seed-stable / no admitted row / no 512/512 rerun selected ## Question @@ -11,8 +11,9 @@ seed-to-output distance 的 membership 信号? 第一轮只复用现有 `workspaces/black-box/runs/h2-response-strength-512-20260501-r1/response-cache.npz`。 随后只释放一个有界 `256 / 256` shared-position order-control scout,用来回答 -class-ordered seed-offset caveat。没有下载资产,也没有扩展同一路线的 KDE、shadow -density、repeat-count 或特征 sweep。 +class-ordered seed-offset caveat;再释放一个同边界的 seed `177` 稳定性 scout, +用来判断 order-control 后的强信号是否只是单 seed 现象。没有下载资产,也没有扩展 +同一路线的 KDE、shadow density、repeat-count 或特征 sweep。 ## Contract @@ -159,9 +160,51 @@ the signal. 
 The result still does not imply product admission: it is one controlled scout on H2 DDPM/CIFAR10 response-cache geometry, not a second
 public asset or Platform/Runtime contract.

+## Shared-Position Seed-177 Stability Scout
+
+为避免把 order-control 结论建立在单个 random seed 上,下一步只跑同边界
+`256 / 256` shared-position seed `177`。运行边界不扩大:timesteps
+`40 / 80 / 120 / 160`,repeats `2`,holdout repeats `7`,bootstrap iters
+`100`。GPU scout 用时 `185.470864s`。
+
+Runner summary 的 H2 distance scorer:
+
+| Metric | Raw H2 logistic | Lowpass H2 logistic |
+| --- | ---: | ---: |
+| AUC | `0.911255` | `0.896698` |
+| ASR | `0.851562` | `0.828125` |
+| TPR@1%FPR | `0.113281` | `0.093750` |
+| TPR@0.1%FPR | `0.0` | `0.062500` |
+
+Output-cloud geometry review:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json`
+
+| Metric | Shared-position seed `177` |
+| --- | ---: |
+| AUC | `0.956192` |
+| ASR | `0.896484` |
+| TPR@1%FPR | `0.285156` |
+| TPR@0.1%FPR | `0.109375` |
+
+Label-shuffle sanity:
+`workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json`
+
+| Metric | Seed `177` label shuffle |
+| --- | ---: |
+| AUC | `0.484070` |
+| ASR | `0.513672` |
+| TPR@1%FPR | `0.023438` |
+| TPR@0.1%FPR | `0.011719` |
+
+Interpretation: the shared-position output-cloud signal remains strong under
+seed `177`, and label shuffle stays random-level. Together with seed `176`, this
+supports the narrower conclusion that output-cloud geometry is a stable H2
+mechanism candidate after seed-offset control. It still does not create a
+second public asset, a product contract, or an admitted row.
+
 ## Decision

-`candidate complementary signal / order-control scout passed / no admitted row`。
+`candidate complementary signal / order-control scout passed / seed-stable / no admitted row`。

 保留为 Research-side 强候选,因为它满足三个有价值条件:

@@ -169,6 +212,7 @@ public asset or Platform/Runtime contract.
- 它在同一 H2 cache 上明显强于 raw/lowpass H2 logistic。 - 它通过了 seed-177 稳定性和 label-shuffle sanity。 - 它在 `256 / 256` shared-position order-control scout 中没有因 seed-offset 控制而坍塌。 +- 它在 shared-position seed `177` scout 中仍保持强 AUC 和非零严格尾部恢复。 但当前不做以下事情: diff --git a/docs/evidence/reproduction-status.md b/docs/evidence/reproduction-status.md index cb9e5edb..8773324a 100644 --- a/docs/evidence/reproduction-status.md +++ b/docs/evidence/reproduction-status.md @@ -32,7 +32,7 @@ Smoke tests and dry runs are engineering validation, not benchmark claims. | Track | Status | Notes | | --- | --- | --- | | Black-box `recon` | `evidence-ready` | Strongest black-box method and admitted non-CLiD product row. Public data limits strict paper-aligned claims. The bounded public-100 step30 rerun plus unified artifact summary yields the promoted coherent packet: `AUC = 0.837`, `ASR = 0.74`, `TPR@1%FPR = 0.22`, `TPR@0.1%FPR = 0.11`. See [non-clid-black-box-reselection.md](non-clid-black-box-reselection.md), [recon-product-validation-contract.md](recon-product-validation-contract.md), [recon-product-validation-result.md](recon-product-validation-result.md), and [../product-bridge/recon-product-validation-handoff.md](../product-bridge/recon-product-validation-handoff.md). | -| Black-box `H2 output-cloud geometry` | `hold-candidate` | Strong Research-side output-output geometry signal on H2 response caches, but not an admitted Platform/Runtime row. Existing `512 / 512` cache review gives `AUC = 0.961529`, `TPR@1%FPR = 0.333984`, `TPR@0.1%FPR = 0.117188`; seed `177` is stable and label shuffle is random-level. The `256 / 256` shared-position order-control scout preserves the signal (`AUC = 0.967819`, `TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`), so class-ordered seed offset is not a sufficient explanation. Do not promote, add schema/runner/UI/bundle rows, run same-cache feature sweeps, or schedule a full `512 / 512` rerun by default. 
See [h2-output-cloud-geometry-20260525.md](h2-output-cloud-geometry-20260525.md). | +| Black-box `H2 output-cloud geometry` | `hold-candidate` | Strong Research-side output-output geometry signal on H2 response caches, but not an admitted Platform/Runtime row. Existing `512 / 512` cache review gives `AUC = 0.961529`, `TPR@1%FPR = 0.333984`, `TPR@0.1%FPR = 0.117188`; seed `177` is stable and label shuffle is random-level. The `256 / 256` shared-position order-control scout preserves the signal (`AUC = 0.967819`, `TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`), so class-ordered seed offset is not a sufficient explanation. The same controlled boundary at seed `177` remains strong (`AUC = 0.956192`, `TPR@1%FPR = 0.285156`, `TPR@0.1%FPR = 0.109375`) with random-level label shuffle (`AUC = 0.484070`), so the controlled signal is not single-seed. Do not promote, add schema/runner/UI/bundle rows, run same-cache feature sweeps, or schedule a full `512 / 512` rerun by default. See [h2-output-cloud-geometry-20260525.md](h2-output-cloud-geometry-20260525.md). | | Black-box `CLiD` | `hold-candidate` | Selected as a bounded black-box lane after H2 SD/CelebA text-to-image transfer was protocol-blocked. The official CPU `inter_output/*` replay is strong (`AUC = 0.961277`, `TPR@1%FPR = 0.675470`, `ASR = 0.891957`) and now has a machine-readable candidate-only card, but row identity remains blocked because the public score rows are numeric-only and the 2026-05-15 authenticated HF `mia_COCO.zip` `HEAD`/`Range` recheck still returned `403`. Earlier local prompt-conditioned packets were strong and repeat-stable, but prompt-neutral perturbation collapses the signal, swapped-prompt control is degraded, within-split prompt shuffle is weak and seed-sensitive, prompt-text-only review is moderate AUC but weak strict-tail, and control attribution shows auxiliary-feature instability under prompt controls. 
Current evidence supports a prompt-conditioned diagnostic claim only, not admitted general black-box evidence. No next CLiD GPU task is selected. See [../product-bridge/clid-candidate-evidence-card.md](../product-bridge/clid-candidate-evidence-card.md), [clid-official-inter-output-replay-20260515.md](clid-official-inter-output-replay-20260515.md), [clid-identity-manifest-gate-20260515.md](clid-identity-manifest-gate-20260515.md), [black-box-next-lane-selection.md](black-box-next-lane-selection.md), [clid-bridge-contract.md](clid-bridge-contract.md), [clid-score-schema-gate.md](clid-score-schema-gate.md), [clid-tiny-score-bridge.md](clid-tiny-score-bridge.md), [clid-100-score-packet.md](clid-100-score-packet.md), [clid-candidate-integrity-review.md](clid-candidate-integrity-review.md), [clid-repeat-stability.md](clid-repeat-stability.md), [clid-prompt-perturbation.md](clid-prompt-perturbation.md), [clid-prompt-conditioning-boundary.md](clid-prompt-conditioning-boundary.md), [clid-swapped-prompt-control.md](clid-swapped-prompt-control.md), [clid-within-split-shuffle-control.md](clid-within-split-shuffle-control.md), [clid-prompt-text-only-review.md](clid-prompt-text-only-review.md), and [clid-control-attribution.md](clid-control-attribution.md). | | Black-box `variation` | `code-ready` | API-only support method; needs real query data for stronger claims. | | Feature-packet consumer lane | `deferred-candidate` | 2026-05-25 consumer verdict keeps the gray-box feature-packet lane out of Platform/Runtime. Tracing the Roots remains positive Research evidence (`AUC = 0.815826`, `TPR@1%FPR = 0.134000`), but live narrow public-surface recheck found no second non-source-equivalent public feature-packet and no raw checkpoint/sample/regeneration assets. Do not add feature-packet schema, bundle export, validators, tests, Platform UI type, Runtime runner, GPU task, or download from this singleton. 
See [feature-packet-channel-consumer-verdict-20260525.md](feature-packet-channel-consumer-verdict-20260525.md) and [../product-bridge/feature-packet-lane.md](../product-bridge/feature-packet-lane.md). | diff --git a/docs/evidence/workspace-evidence-index.md b/docs/evidence/workspace-evidence-index.md index 685fcb71..43d7d160 100644 --- a/docs/evidence/workspace-evidence-index.md +++ b/docs/evidence/workspace-evidence-index.md @@ -6,8 +6,8 @@ This index separates current track state from archived research history. Latest Research update: [h2-output-cloud-geometry-20260525.md](h2-output-cloud-geometry-20260525.md) -records a metric verdict on the H2 response-strength cache plus a bounded -`256 / 256` shared-position order-control scout. +records a metric verdict on the H2 response-strength cache plus bounded +`256 / 256` shared-position order-control and seed-stability scouts. The output-output geometry scorer is a strong Research-side candidate (`AUC = 0.961529`, `TPR@1%FPR = 0.333984`, `TPR@0.1%FPR = 0.117188`) and is stable under seed `177` @@ -15,9 +15,12 @@ The output-output geometry scorer is a strong Research-side candidate (`AUC = 0.507595`). The shared-position order-control scout also stays strong (`AUC = 0.967819`, `TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`). +The same controlled boundary at seed `177` remains strong (`AUC = 0.956192`, +`TPR@1%FPR = 0.285156`, `TPR@0.1%FPR = 0.109375`) with random-level label +shuffle (`AUC = 0.484070`). It is not admitted because this remains a Research-side H2 response-cache geometry candidate, not a second public asset or Platform/Runtime contract. -Decision: `candidate complementary signal / order-control scout passed / +Decision: `candidate complementary signal / order-control scout passed / seed-stable / no admitted row / no download / no 512/512 rerun selected`. 
Previous Research update: diff --git a/workspaces/black-box/README.md b/workspaces/black-box/README.md index 6f409072..686c50d7 100644 --- a/workspaces/black-box/README.md +++ b/workspaces/black-box/README.md @@ -11,8 +11,11 @@ label-shuffle sanity 回到随机级。后续 `256 / 256` shared-position order-control scout 仍为 `AUC = 0.967819`、`TPR@1%FPR = 0.410156`、 `TPR@0.1%FPR = 0.132812`,label-shuffle `AUC = 0.464066`,因此 - class-ordered seed offset 不是充分解释。但它仍只是 Research-side H2 - response-cache geometry 候选,不是第二公开资产或产品合约。 + class-ordered seed offset 不是充分解释。同边界 seed `177` shared-position + scout 仍为 `AUC = 0.956192`、`TPR@1%FPR = 0.285156`、 + `TPR@0.1%FPR = 0.109375`,label-shuffle `AUC = 0.484070`,说明该候选 + 在 order-control 后不是单 seed 现象。但它仍只是 Research-side H2 response-cache + geometry 候选,不是第二公开资产或产品合约。 不要把它扩成 KDE、shadow density、repeat-count 或同 cache feature sweep; 不要补跑完整 `512 / 512` 只为表格好看;不要新增 Platform/Runtime schema、 runner 或 admitted bundle row。 diff --git a/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json b/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json new file mode 100644 index 00000000..9b51ebfc --- /dev/null +++ b/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-20260525.json @@ -0,0 +1,166 @@ +{ + "status": "ready", + "track": "black-box", + "method": "H2 output-cloud geometry scorer", + "mode": "cpu-cache-review", + "response_cache": "workspaces\\black-box\\runs\\h2-response-strength-256-shared-position-seed177-20260525-r1\\response-cache.npz", + "inputs": { + "sample_count": 512, + "member_count": 256, + "nonmember_count": 256, + "timesteps": [ + 40, + 80, + 120, + 160 + ], + "repeat_count": 2, + "feature_count": 17, + "feature_names": [ + "within_timestep_pair_rmse_40", + "within_timestep_pair_rmse_80", + "within_timestep_pair_rmse_120", + "within_timestep_pair_rmse_160", + "within_timestep_pair_rmse_mean", + "within_timestep_pair_rmse_std", + 
"within_timestep_pair_rmse_slope", + "centroid_rmse_40_80", + "centroid_rmse_40_120", + "centroid_rmse_40_160", + "centroid_rmse_80_120", + "centroid_rmse_80_160", + "centroid_rmse_120_160", + "centroid_rmse_mean", + "centroid_rmse_std", + "cloud_pca_trace", + "cloud_pca_top_share" + ], + "seed": 177, + "label_mode": "original", + "holdout_repeats": 7, + "bootstrap_iters": 100 + }, + "simple": { + "best_by_auc": { + "name": "centroid_rmse_40_160", + "orientation": "negative_higher_is_member", + "metrics": { + "auc": 0.794266, + "asr": 0.755859, + "tpr_at_1pct_fpr": 0.015625, + "tpr_at_0_1pct_fpr": 0.007812, + "member_score_mean": -0.035444, + "nonmember_score_mean": -0.045246 + } + }, + "best_by_low_fpr": { + "name": "cloud_pca_top_share", + "orientation": "negative_higher_is_member", + "metrics": { + "auc": 0.626831, + "asr": 0.619141, + "tpr_at_1pct_fpr": 0.058594, + "tpr_at_0_1pct_fpr": 0.042969, + "member_score_mean": -0.264214, + "nonmember_score_mean": -0.273958 + } + } + }, + "logistic": { + "aggregate_metrics": { + "auc": 0.956192, + "asr": 0.896484, + "tpr_at_1pct_fpr": 0.285156, + "tpr_at_0_1pct_fpr": 0.109375, + "member_score_mean": 0.797283, + "nonmember_score_mean": 0.200697 + }, + "aggregate_ci95": { + "auc": { + "p025": 0.941745, + "p975": 0.970779 + }, + "asr": { + "p025": 0.877881, + "p975": 0.917041 + }, + "tpr_at_1pct_fpr": { + "p025": 0.116406, + "p975": 0.678613 + }, + "tpr_at_0_1pct_fpr": { + "p025": 0.07793, + "p975": 0.364649 + } + }, + "mean_coefficients": [ + 2.492539, + 0.229399, + 0.14844, + 0.910024, + 0.794222, + -0.250528, + -0.167823, + -0.138342, + -3.268879, + -3.724328, + 0.825758, + 0.516866, + 2.113911, + -0.684683, + -0.41095, + -0.702947, + -0.039379 + ], + "prediction_count": { + "min": 7, + "max": 7, + "mean": 7.0 + } + }, + "comparison": { + "raw_h2_logistic": { + "auc": 0.911255, + "asr": 0.851562, + "tpr_at_1pct_fpr": 0.113281, + "tpr_at_0_1pct_fpr": 0.0, + "member_score_mean": 0.737363, + "nonmember_score_mean": 0.263165 
+ }, + "lowpass_h2_logistic": { + "auc": 0.896698, + "asr": 0.828125, + "tpr_at_1pct_fpr": 0.09375, + "tpr_at_0_1pct_fpr": 0.0625, + "member_score_mean": 0.727362, + "nonmember_score_mean": 0.273117 + }, + "output_cloud_minus_raw_h2": { + "auc": 0.044937, + "asr": 0.044922, + "tpr_at_1pct_fpr": 0.171875, + "tpr_at_0_1pct_fpr": 0.109375 + }, + "output_cloud_minus_lowpass_h2": { + "auc": 0.059494, + "asr": 0.068359, + "tpr_at_1pct_fpr": 0.191406, + "tpr_at_0_1pct_fpr": 0.046875 + } + }, + "decision_gate": { + "uses_only_output_output_geometry": true, + "does_not_generate_new_responses": true, + "nonzero_strict_tail": true, + "beats_best_simple_low_fpr": true, + "reopen_allowed": false, + "requires_reseeded_or_interleaved_cache_before_promotion": true + }, + "verdict": "candidate_complementary_output_cloud_geometry", + "notes": [ + "This is a CPU-only scorer review on an existing H2 response cache.", + "It intentionally excludes seed-to-output distance features so it cannot collapse back into H2 simple distance.", + "A positive result is candidate-only until reseeded or interleaved response-cache controls rule out class-ordered sampling effects.", + "Do not expand this cache into KDE, shadow density, repeat-count, or same-cache feature sweeps." 
+ ] +} \ No newline at end of file diff --git a/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json b/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json new file mode 100644 index 00000000..d391cb21 --- /dev/null +++ b/workspaces/black-box/artifacts/h2-output-cloud-geometry-shared-position-seed177-256-label-shuffle-20260525.json @@ -0,0 +1,142 @@ +{ + "status": "ready", + "track": "black-box", + "method": "H2 output-cloud geometry scorer", + "mode": "cpu-cache-review", + "response_cache": "workspaces\\black-box\\runs\\h2-response-strength-256-shared-position-seed177-20260525-r1\\response-cache.npz", + "inputs": { + "sample_count": 512, + "member_count": 256, + "nonmember_count": 256, + "timesteps": [ + 40, + 80, + 120, + 160 + ], + "repeat_count": 2, + "feature_count": 17, + "feature_names": [ + "within_timestep_pair_rmse_40", + "within_timestep_pair_rmse_80", + "within_timestep_pair_rmse_120", + "within_timestep_pair_rmse_160", + "within_timestep_pair_rmse_mean", + "within_timestep_pair_rmse_std", + "within_timestep_pair_rmse_slope", + "centroid_rmse_40_80", + "centroid_rmse_40_120", + "centroid_rmse_40_160", + "centroid_rmse_80_120", + "centroid_rmse_80_160", + "centroid_rmse_120_160", + "centroid_rmse_mean", + "centroid_rmse_std", + "cloud_pca_trace", + "cloud_pca_top_share" + ], + "seed": 177, + "label_mode": "shuffled_seed_177", + "holdout_repeats": 7, + "bootstrap_iters": 100 + }, + "simple": { + "best_by_auc": { + "name": "cloud_pca_top_share", + "orientation": "negative_higher_is_member", + "metrics": { + "auc": 0.518509, + "asr": 0.527344, + "tpr_at_1pct_fpr": 0.0, + "tpr_at_0_1pct_fpr": 0.0, + "member_score_mean": -0.268277, + "nonmember_score_mean": -0.269895 + } + }, + "best_by_low_fpr": { + "name": "centroid_rmse_40_160", + "orientation": "negative_higher_is_member", + "metrics": { + "auc": 0.515167, + "asr": 0.544922, + 
"tpr_at_1pct_fpr": 0.023438, + "tpr_at_0_1pct_fpr": 0.003906, + "member_score_mean": -0.040165, + "nonmember_score_mean": -0.040525 + } + } + }, + "logistic": { + "aggregate_metrics": { + "auc": 0.48407, + "asr": 0.513672, + "tpr_at_1pct_fpr": 0.023438, + "tpr_at_0_1pct_fpr": 0.011719, + "member_score_mean": 0.500029, + "nonmember_score_mean": 0.50166 + }, + "aggregate_ci95": { + "auc": { + "p025": 0.442889, + "p975": 0.53199 + }, + "asr": { + "p025": 0.511719, + "p975": 0.554785 + }, + "tpr_at_1pct_fpr": { + "p025": 0.003906, + "p975": 0.046875 + }, + "tpr_at_0_1pct_fpr": { + "p025": 0.003906, + "p975": 0.035156 + } + }, + "mean_coefficients": [ + -0.17842, + -0.255633, + 0.287435, + -0.2456, + -0.087512, + 0.110718, + -0.05195, + -0.018815, + -0.308328, + 0.260794, + -0.079668, + -0.443812, + 0.404135, + -0.018786, + -0.248044, + 0.792548, + -0.037794 + ], + "prediction_count": { + "min": 7, + "max": 7, + "mean": 7.0 + } + }, + "comparison": { + "raw_h2_logistic": null, + "lowpass_h2_logistic": null, + "output_cloud_minus_raw_h2": null, + "output_cloud_minus_lowpass_h2": null + }, + "decision_gate": { + "uses_only_output_output_geometry": true, + "does_not_generate_new_responses": true, + "nonzero_strict_tail": true, + "beats_best_simple_low_fpr": false, + "reopen_allowed": false, + "requires_reseeded_or_interleaved_cache_before_promotion": true + }, + "verdict": "label_shuffle_sanity_random_level", + "notes": [ + "This is a CPU-only scorer review on an existing H2 response cache.", + "It intentionally excludes seed-to-output distance features so it cannot collapse back into H2 simple distance.", + "A positive result is candidate-only until reseeded or interleaved response-cache controls rule out class-ordered sampling effects.", + "Do not expand this cache into KDE, shadow density, repeat-count, or same-cache feature sweeps." 
+ ] +} \ No newline at end of file diff --git a/workspaces/black-box/plan.md b/workspaces/black-box/plan.md index 05d420f4..e42a8243 100644 --- a/workspaces/black-box/plan.md +++ b/workspaces/black-box/plan.md @@ -42,7 +42,10 @@ sanity check. A bounded `256 / 256` shared-position order-control scout preserved the signal (`AUC = 0.967819`, `TPR@1%FPR = 0.410156`, `TPR@0.1%FPR = 0.132812`) with random-level label shuffle (`AUC = 0.464066`), - so class-ordered seed offset is not a sufficient explanation. It remains + so class-ordered seed offset is not a sufficient explanation. The same + boundary at seed `177` remains strong (`AUC = 0.956192`, + `TPR@1%FPR = 0.285156`, `TPR@0.1%FPR = 0.109375`) with random-level label + shuffle (`AUC = 0.484070`), so it is not a single-seed artifact. It remains candidate-only: do not promote it into Platform or Runtime runners from this cache family. - `simple image-to-image distance`: bounded single-asset evidence on @@ -78,8 +81,8 @@ ## Next Action No black-box GPU or CPU sidecar is selected. The H2 output-cloud order-control -scout has answered the seed-offset caveat at decision value; do not turn the -candidate into same-cache feature work or a full `512 / 512` rerun just to +and seed-stability scouts have answered the current caveats at decision value; +do not turn the candidate into same-cache feature work or a full `512 / 512` rerun just to complete a table. The broader root long-horizon queue still continues Lane A only with a non-duplicate asset that has exact target identity, member/nonmember split artifacts, and response or score coverage.
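
The `comparison` block in the seed-177 shared-position artifact above can be cross-checked by hand. The following is a minimal sketch, assuming the `output_cloud_minus_*` fields are plain differences between the output-cloud logistic metrics and the H2 baselines, rounded to six decimals. The metric values are copied from the artifact in this patch; the helper names `metric_deltas` and `deltas_match` are hypothetical and not part of the repository.

```python
# Metrics copied from the seed-177 shared-position artifact in this patch.
OUTPUT_CLOUD = {"auc": 0.956192, "asr": 0.896484,
                "tpr_at_1pct_fpr": 0.285156, "tpr_at_0_1pct_fpr": 0.109375}
RAW_H2 = {"auc": 0.911255, "asr": 0.851562,
          "tpr_at_1pct_fpr": 0.113281, "tpr_at_0_1pct_fpr": 0.0}
LOWPASS_H2 = {"auc": 0.896698, "asr": 0.828125,
              "tpr_at_1pct_fpr": 0.09375, "tpr_at_0_1pct_fpr": 0.0625}

# Deltas as recorded in the artifact's "comparison" block.
RECORDED_MINUS_RAW = {"auc": 0.044937, "asr": 0.044922,
                      "tpr_at_1pct_fpr": 0.171875, "tpr_at_0_1pct_fpr": 0.109375}
RECORDED_MINUS_LOWPASS = {"auc": 0.059494, "asr": 0.068359,
                          "tpr_at_1pct_fpr": 0.191406, "tpr_at_0_1pct_fpr": 0.046875}


def metric_deltas(candidate, baseline, ndigits=6):
    """Per-metric candidate minus baseline, rounded (hypothetical helper).

    The six-decimal rounding is an assumption inferred from the artifact values.
    """
    return {name: round(candidate[name] - baseline[name], ndigits)
            for name in candidate}


def deltas_match(computed, recorded, tol=1e-6):
    """True when both dicts share the same keys and agree within tol."""
    if computed.keys() != recorded.keys():
        return False
    return all(abs(computed[name] - recorded[name]) <= tol for name in computed)


if __name__ == "__main__":
    print("raw:", deltas_match(metric_deltas(OUTPUT_CLOUD, RAW_H2),
                               RECORDED_MINUS_RAW))
    print("lowpass:", deltas_match(metric_deltas(OUTPUT_CLOUD, LOWPASS_H2),
                                   RECORDED_MINUS_LOWPASS))
```

If `deltas_match` returns `False` for a later artifact, either the rounding assumption does not hold or the comparison block was produced from different inputs.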