Commit f91ab18

docs(release): align 1.3.0 docs with baseline gating contract, segment report-only behavior, and cache-path legacy alias
1 parent 2df02fd commit f91ab18

6 files changed

Lines changed: 130 additions & 188 deletions

CHANGELOG.md

Lines changed: 51 additions & 124 deletions
Original file line number | Diff line number | Diff line change
@@ -4,146 +4,73 @@
44

55
### Overview
66

7-
This release improves clone-detection precision and explainability with deterministic
8-
normalization and CFG upgrades, adds segment-level internal clone reporting, refreshes
9-
the HTML report UI, and introduces baseline versioning.
10-
11-
**Breaking change:** CI workflows that reuse old baselines must regenerate them.
12-
13-
### Clone Detection Accuracy
14-
15-
- **Commutative normalization**
16-
Canonicalized operand order for `+`, `*`, `|`, `&`, `^` only for provably safe constant
17-
domains. Symbolic operands are no longer reordered.
18-
19-
- **Local logical equivalence**
20-
Normalized `not (x in y)` to `x not in y` and `not (x is y)` to `x is not y` without
21-
De Morgan transformations or broader boolean rewrites.
22-
23-
- **Call-target preservation**
24-
Kept symbolic call targets during normalization to avoid conflating different APIs
25-
(for example, `load_user(...)` vs `delete_user(...)`).
26-
27-
### CFG Precision
28-
29-
- **Short‑circuit modeling**
30-
Represented `and`/`or` as micro‑CFGs with explicit branch splits after each operand.
31-
32-
- **Exception linking**
33-
Linked `try/except` only to statements that may raise (calls, attribute access, indexing,
34-
`await`, `yield from`, `raise`) instead of blanket links.
35-
36-
### Detection Integrity
37-
38-
- **Internal CFG marker hardening**
39-
Switched CFG metadata markers to an internal namespace (`__CC_META__::...`) emitted as
40-
synthetic AST names, preventing collisions with user string literals.
41-
42-
- **Ordered control-flow semantics**
43-
Modeled `break`/`continue` as terminating loop transitions, added correct `for/while ... else`
44-
semantics, preserved `match case` evaluation order, and preserved `except` handler order.
45-
46-
- **Deterministic traversal order**
47-
Sorted Python file discovery to stabilize processing and report ordering across runs/platforms.
48-
49-
### Segment‑Level Detection
50-
51-
- **Window fingerprints**
52-
Added deterministic segment windows inside functions for internal clone discovery.
53-
54-
- **Candidate generation**
55-
Used an order‑insensitive signature for candidate grouping and a strict segment hash for
56-
final confirmation. Segment matches do not affect baseline or CI failure logic.
57-
58-
- **Noise reduction (report‑only)**
59-
Merged overlapping segment windows into a single span per function and suppressed
60-
boilerplate-only groups (attribute assignment wiring) with deterministic AST criteria.
7+
This release improves detection precision, determinism, and auditability, adds
8+
segment-level reporting, refreshes the HTML report UI, and hardens baseline/cache
9+
contracts for CI usage.
10+
11+
**Breaking (CI):** baseline contract checks are stricter. Legacy or mismatched baselines
12+
must be regenerated.
13+
14+
### Detection Engine
15+
16+
- Safe normalization upgrades: local logical equivalence, proven-domain commutative
17+
canonicalization, and preserved symbolic call targets.
18+
- Internal CFG metadata markers were moved to the `__CC_META__::...` namespace and emitted
19+
as synthetic AST names to prevent collisions with user string literals.
20+
- CFG precision upgrades: short-circuit micro-CFG, selective `try/except` raise-linking,
21+
loop `break`/`continue` jump semantics, `for/while ... else`, and ordered `match`/`except`.
22+
- Deterministic traversal and ordering improvements for stable clone grouping/report output.
23+
- Segment-level internal detection added with strict candidate->hash confirmation; remains
24+
report-only (not part of baseline/CI fail criteria).
25+
- Segment report noise reduction: overlapping windows are merged and boilerplate-only groups
26+
are suppressed using deterministic AST criteria.
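
The local logical equivalence named in the list above (`not (x in y)` to `x not in y`, `not (x is y)` to `x is not y`, with no De Morgan or broader boolean rewrites) could be sketched with Python's `ast` module roughly as follows. This is an illustrative sketch, not CodeClone's actual normalizer: the class name `LocalLogicalEquivalence` and the helper `normalize` are hypothetical, and `ast.unparse` assumes Python 3.9 or newer.

```python
import ast


class LocalLogicalEquivalence(ast.NodeTransformer):
    """Hypothetical normalizer: rewrites only `not (a in b)` and `not (a is b)`."""

    # Only these two operators are flipped; everything else is left alone.
    _FLIP = {ast.In: ast.NotIn, ast.Is: ast.IsNot}

    def visit_UnaryOp(self, node: ast.UnaryOp) -> ast.AST:
        self.generic_visit(node)  # normalize nested expressions first
        operand = node.operand
        if (
            isinstance(node.op, ast.Not)
            and isinstance(operand, ast.Compare)
            and len(operand.ops) == 1  # skip chained comparisons
            and type(operand.ops[0]) in self._FLIP
        ):
            flipped = ast.Compare(
                left=operand.left,
                ops=[self._FLIP[type(operand.ops[0])]()],
                comparators=operand.comparators,
            )
            return ast.copy_location(flipped, node)
        return node  # no De Morgan, no broader boolean rewrites


def normalize(source: str) -> str:
    """Parse, apply the local rewrite, and render the result back to source."""
    tree = LocalLogicalEquivalence().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Restricting the rewrite to single, non-chained comparisons keeps it local: `not (a and b)` and `not (a in b in c)` pass through unchanged.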
6127

6228
### Baseline & CI
6329

64-
- Baselines are now **versioned** and include a schema version.
65-
- Mismatched baseline versions **fail fast** and require regeneration.
66-
- Added baseline tamper-evident integrity for v1.3+ files (`generator`, `payload_sha256`)
67-
while keeping legacy baseline behavior as explicit regeneration-required fail-fast.
68-
- Added configurable size guards (`--max-baseline-size-mb`, `--max-cache-size-mb`):
69-
oversized cache is ignored with warning; oversized/invalid/untrusted baseline is ignored
70-
outside gating mode and treated as empty baseline.
71-
- Behavioral hardening (CLI): baseline validation is now an explicit contract
72-
(legacy/version/schema/python/integrity/size states). In `--fail-on-new`/`--ci`,
73-
untrusted baseline states fail fast with deterministic exit codes.
30+
- Baseline format is versioned (`baseline_version`, `schema_version`) and legacy baselines
31+
fail fast with regeneration guidance.
32+
- Added tamper-evident baseline integrity for v1.3+ (`generator`, `payload_sha256`).
33+
- Added configurable size guards: `--max-baseline-size-mb`, `--max-cache-size-mb`.
34+
- Behavioral hardening: in normal mode, untrusted baseline states are ignored with warning
35+
and compared as empty; in `--fail-on-new` / `--ci`, they fail fast with deterministic exit codes.
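
To make the tamper-evident integrity fields above more concrete, here is a minimal sketch of how a `payload_sha256` check could work. Only the field names `generator` and `payload_sha256` and the state names (`generator_mismatch`, `integrity_missing`, `integrity_failed`, taken from the README changes in this commit) come from the source. The file layout (a `payload` key), the canonical JSON encoding, and the function names are assumptions for illustration, not CodeClone's actual format.

```python
import hashlib
import json


def payload_digest(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed: sorted keys, compact separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(payload: dict, generator: str = "codeclone") -> dict:
    """Build a baseline document carrying the integrity fields (hypothetical layout)."""
    return {
        "generator": generator,
        "payload": payload,
        "payload_sha256": payload_digest(payload),
    }


def verify(baseline: dict, expected_generator: str = "codeclone") -> str:
    """Return "ok" or one of the untrusted state names."""
    if "generator" not in baseline or "payload_sha256" not in baseline:
        return "integrity_missing"
    if baseline["generator"] != expected_generator:
        return "generator_mismatch"
    if baseline["payload_sha256"] != payload_digest(baseline.get("payload", {})):
        return "integrity_failed"
    return "ok"
```

A plain digest like this detects accidental or casual edits, but anyone who can rewrite the payload can also recompute the hash, which is why such a scheme is tamper-evident rather than tamper-proof.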
7436

75-
**Breaking (CI):** baseline version mismatch now fails hard; CI requires baseline regeneration on upgrade.
76-
77-
Update the baseline:
37+
Update baseline after upgrade:
7838

7939
```bash
8040
codeclone . --update-baseline
8141
```
8242

83-
### CLI UX (CI)
84-
85-
- Added `--version` for standard version output.
86-
- Added `--cache-path` (legacy alias: `--cache-dir`) and clarified cache help text.
87-
- Added `--ci` preset (`--fail-on-new --no-color --quiet`).
88-
- Improved `--fail-on-new` output with aggregated counts and clear next steps.
89-
- Added strict report output extension validation (`.html`, `.json`, `.txt`).
90-
- Centralized user-facing CLI strings in `codeclone/ui_messages.py` to keep text contracts
91-
consistent and maintainable.
92-
- Refined Summary output: a single compact table with deterministic metric order and
93-
explicit `Files analyzed` semantics (cache-aware), plus stable compact output for
94-
`--quiet/--ci`.
95-
96-
### HTML Report UI
97-
98-
- **Visual refresh**
99-
Introduced a modernized HTML report layout with a sticky top bar and improved spacing.
100-
101-
- **Interactive tooling**
102-
Added a command palette, keyboard shortcuts, toast notifications, and quick actions
103-
(export, stats, charts, navigation).
43+
### CLI & Reports
10444

105-
- **Reporting widgets**
106-
Added a stats dashboard and chart container for high-level clone metrics.
45+
- Added `--version`, `--cache-path` (legacy alias: `--cache-dir`), and `--ci` preset.
46+
- Added strict output extension validation for `--html/.html`, `--json/.json`, `--text/.txt`.
47+
- Summary output was redesigned for deterministic, cache-aware metrics across standard and CI modes.
48+
- User-facing CLI messages were centralized in `codeclone/ui_messages.py`.
49+
- HTML/TXT/JSON reports now include consistent provenance metadata (baseline/cache status fields).
50+
- Clone group/report ordering is deterministic and aligned across HTML/TXT/JSON outputs.
10751

108-
- **Icon system**
109-
Replaced emoji glyphs with inline SVG icons for consistent rendering and a fully
110-
self-contained UI.
52+
### HTML UI
11153

112-
- **Segment reporting**
113-
Added a dedicated “Segment clones” section and summary metric in HTML/TXT/JSON outputs.
54+
- Refreshed layout with improved navigation and dashboard widgets.
55+
- Added command palette and keyboard shortcuts.
56+
- Replaced emoji icons with inline SVG icons.
57+
- Hardened escaping (text + attribute context) and snippet fallback behavior.
11458

115-
- **Escaping and snippet resilience**
116-
Hardened HTML escaping for text and attribute contexts, and added a safe fallback when
117-
source snippets are unavailable during report rendering.
118-
119-
### Cache & Internals
120-
121-
- Extended cache schema to store segment fingerprints (cache version bump).
122-
- Default cache location moved to `<root>/.cache/codeclone/cache.json` (project‑local).
123-
- Added a legacy cache warning for `~/.cache/codeclone/cache.json` with guidance to
124-
delete it and add `.cache/` to `.gitignore`.
125-
- Strengthened cache integrity handling with constant-time signature checks and explicit
126-
warnings for oversized cache files.
127-
- Added deterministic deep-schema cache entry validation (`stat/units/blocks/segments`);
128-
invalid cache entries are ignored instead of affecting analysis results.
129-
130-
### Packaging
131-
132-
- Removed an invalid PyPI classifier from the package metadata.
133-
134-
### Documentation
59+
### Cache & Security
13560

136-
- Updated architecture and CFG documentation to reflect new normalization, CFG, and
137-
segment‑level detection behavior.
138-
- Updated README, SECURITY, and CONTRIBUTING guidance for 1.3.0.
61+
- Cache default moved to `<root>/.cache/codeclone/cache.json` with legacy path warning.
62+
- Cache schema was extended to include segment data (`CACHE_VERSION=1.1`).
63+
- Cache integrity uses constant-time signature checks and deep schema validation.
64+
- Invalid/oversized cache is ignored deterministically and rebuilt from source.
65+
- Added security regressions for traversal safety, report escaping, baseline/cache integrity,
66+
and deterministic report ordering across formats.
67+
- Fixed POSIX parser CPU guard to avoid lowering `RLIMIT_CPU` hard limit.
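
The `RLIMIT_CPU` item above concerns a general POSIX rule: an unprivileged process that lowers its hard limit cannot raise it again. A guard that only adjusts the soft limit might look like the sketch below. The function names are hypothetical and this is not CodeClone's code; it only illustrates the idea of leaving the hard limit untouched, using the standard `resource` module (POSIX only).

```python
import resource
from typing import Tuple


def new_cpu_limits(wanted_soft: int, current: Tuple[int, int]) -> Tuple[int, int]:
    """Compute (soft, hard) for RLIMIT_CPU without ever changing the hard limit."""
    _, hard = current
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed a finite hard limit.
        wanted_soft = min(wanted_soft, hard)
    return (wanted_soft, hard)


def apply_cpu_guard(seconds: int) -> None:
    """Set a soft CPU-time limit for the current process; the hard limit is preserved."""
    current = resource.getrlimit(resource.RLIMIT_CPU)
    resource.setrlimit(resource.RLIMIT_CPU, new_cpu_limits(seconds, current))
```

Because the hard limit is passed through unchanged, later code in the same process can still raise the soft limit again if it needs more CPU time.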
13968

140-
### Testing & Security
69+
### Documentation & Packaging
14170

142-
- Expanded security tests (HTML escaping and safety checks).
143-
- Added regression tests for deterministic report ordering across HTML/TXT/JSON,
144-
baseline/cache integrity edge cases, and symlink traversal/loop safety.
145-
- Fixed POSIX parser CPU guard to avoid lowering `RLIMIT_CPU` hard limit, preventing
146-
potential process termination in long CI test sessions.
71+
- Updated README and docs (`architecture`, `cfg`, `SECURITY`, `CONTRIBUTING`) to reflect
72+
current contracts and behaviors.
73+
- Removed an invalid PyPI classifier from package metadata.
14774

14875
---
14976

CONTRIBUTING.md

Lines changed: 7 additions & 4 deletions
@@ -96,7 +96,10 @@ Such changes often require design-level discussion and may be staged across vers
9696

9797
- Baselines are **versioned**. Regenerate with `codeclone . --update-baseline`
9898
when detection logic or CodeClone version changes.
99+
- Baselines in 1.3+ are tamper-evident (`generator`, `payload_sha256`).
99100
- Baseline verification must use the same Python `major.minor` version.
101+
- In `--fail-on-new` / `--ci`, untrusted baseline states fail fast. Outside gating
102+
mode, baseline is ignored with warning and comparison proceeds against an empty baseline.
100103

101104
---
102105

@@ -113,15 +116,15 @@ pip install -e .[dev]
113116
Run tests:
114117

115118
```bash
116-
pytest
119+
uv run pytest
117120
```
118121

119122
Static checks:
120123

121124
```bash
122-
mypy
123-
ruff check .
124-
ruff format .
125+
uv run mypy .
126+
uv run ruff check .
127+
uv run ruff format .
125128
```
126129

127130
---

README.md

Lines changed: 32 additions & 25 deletions
@@ -84,7 +84,9 @@ Typical use cases:
8484
- Current CFG semantics (v1):
8585
- `and` / `or` are modeled as short‑circuit micro‑CFG branches,
8686
- `try/except` links only from statements that may raise,
87-
- `break` and `continue` are treated as statements (no jump targets),
87+
- `break` / `continue` are modeled as terminating loop transitions with explicit targets,
88+
- `for/while ... else` semantics are preserved structurally,
89+
- `match case` and `except` handler order is preserved structurally,
8890
- after-blocks are explicit and always present,
8991
- focus is on **structural similarity**, not precise runtime semantics.
9092

@@ -176,7 +178,12 @@ Baselines are **versioned**. If CodeClone is upgraded, regenerate the baseline t
176178
CI deterministic and explainable.
177179
Baseline format in 1.3+ is tamper-evident (`generator`, `payload_sha256`) and validated
178180
before baseline comparison.
179-
Invalid or oversized baseline files fail fast in CI mode and must be regenerated.
181+
182+
Trusted vs untrusted baseline behavior (`invalid`, `too_large`, `generator_mismatch`,
183+
`integrity_missing`, `integrity_failed`):
184+
185+
- ignored with warning in non-gating mode (comparison falls back to empty baseline),
186+
- fail-fast in `--fail-on-new` / `--ci` (exit code `2`).
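
The two-mode behavior described in the list above can be sketched as a small decision function. The state names and exit code `2` come from the text; the function name `resolve_baseline`, its return shape, and the warning wording are assumptions made for illustration.

```python
from typing import FrozenSet, Optional, Set, Tuple

# Untrusted states as listed in the README text.
UNTRUSTED_STATES: FrozenSet[str] = frozenset(
    {"invalid", "too_large", "generator_mismatch", "integrity_missing", "integrity_failed"}
)
EXIT_UNTRUSTED_BASELINE = 2  # exit code stated for --fail-on-new / --ci


def resolve_baseline(
    status: str, baseline_groups: Set[str], gating: bool
) -> Tuple[Set[str], Optional[str], Optional[int]]:
    """Return (groups to compare against, warning message, exit code).

    `gating` is True when --fail-on-new or --ci is active (hypothetical flag name).
    """
    if status not in UNTRUSTED_STATES:
        return set(baseline_groups), None, None
    if gating:
        # Fail fast: the caller should exit with this code without comparing.
        return set(), None, EXIT_UNTRUSTED_BASELINE
    warning = f"baseline ignored ({status}); comparing against an empty baseline"
    return set(), warning, None
```

Returning the exit code instead of calling `sys.exit` directly keeps the decision testable; a CLI wrapper would print the warning or exit as appropriate.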
180187

181188
### 2. Use in CI
182189

@@ -296,29 +303,29 @@ See full design and semantics:
296303
297304
## CLI Options
298305
299-
| Option | Description | Default |
300-
|-------------------------------|------------------------------------------------------------------|--------------------------------------|
301-
| `root` | Project root directory to scan | `.` |
302-
| `--version` | Print CodeClone version and exit | - |
303-
| `--min-loc` | Minimum function LOC to analyze | `15` |
304-
| `--min-stmt` | Minimum AST statements to analyze | `6` |
305-
| `--processes` | Number of worker processes | `4` |
306-
| `--cache-path FILE` | Cache file path | `<root>/.cache/codeclone/cache.json` |
307-
| `--cache-dir FILE` | Legacy alias for `--cache-path` | - |
308-
| `--max-cache-size-mb MB` | Max cache size before ignore + warning | `50` |
309-
| `--baseline FILE` | Baseline file path | `codeclone.baseline.json` |
310-
| `--max-baseline-size-mb MB` | Max baseline size before fail-fast | `5` |
311-
| `--update-baseline` | Regenerate baseline from current results | `False` |
312-
| `--fail-on-new` | Fail if new function/block clone groups appear vs baseline | `False` |
313-
| `--fail-threshold MAX_CLONES` | Fail if total clone groups (`function + block`) exceed threshold | `-1` (disabled) |
314-
| `--ci` | CI preset: `--fail-on-new --no-color --quiet` | `False` |
315-
| `--html FILE` | Write HTML report (`.html`) | - |
316-
| `--json FILE` | Write JSON report (`.json`) | - |
317-
| `--text FILE` | Write text report (`.txt`) | - |
318-
| `--no-progress` | Disable progress bar output | `False` |
319-
| `--no-color` | Disable ANSI colors | `False` |
320-
| `--quiet` | Minimize output (warnings/errors still shown) | `False` |
321-
| `--verbose` | Show hash details for new clone groups in fail output | `False` |
306+
| Option | Description | Default |
307+
|-------------------------------|----------------------------------------------------------------------|--------------------------------------|
308+
| `root` | Project root directory to scan | `.` |
309+
| `--version` | Print CodeClone version and exit | - |
310+
| `--min-loc` | Minimum function LOC to analyze | `15` |
311+
| `--min-stmt` | Minimum AST statements to analyze | `6` |
312+
| `--processes` | Number of worker processes | `4` |
313+
| `--cache-path FILE` | Cache file path | `<root>/.cache/codeclone/cache.json` |
314+
| `--cache-dir FILE` | Legacy alias for `--cache-path` | - |
315+
| `--max-cache-size-mb MB` | Max cache size before ignore + warning | `50` |
316+
| `--baseline FILE` | Baseline file path | `codeclone.baseline.json` |
317+
| `--max-baseline-size-mb MB` | Max baseline size; untrusted baseline fails in CI, ignored otherwise | `5` |
318+
| `--update-baseline` | Regenerate baseline from current results | `False` |
319+
| `--fail-on-new` | Fail if new function/block clone groups appear vs baseline | `False` |
320+
| `--fail-threshold MAX_CLONES` | Fail if total clone groups (`function + block`) exceed threshold | `-1` (disabled) |
321+
| `--ci` | CI preset: `--fail-on-new --no-color --quiet` | `False` |
322+
| `--html FILE` | Write HTML report (`.html`) | - |
323+
| `--json FILE` | Write JSON report (`.json`) | - |
324+
| `--text FILE` | Write text report (`.txt`) | - |
325+
| `--no-progress` | Disable progress bar output | `False` |
326+
| `--no-color` | Disable ANSI colors | `False` |
327+
| `--quiet` | Minimize output (warnings/errors still shown) | `False` |
328+
| `--verbose` | Show hash details for new clone groups in fail output | `False` |
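
As an illustration of how a preset such as `--ci` can expand into `--fail-on-new --no-color --quiet`, here is a minimal `argparse` sketch. It covers only a handful of the options in the table and is not CodeClone's actual parser; the function names are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="codeclone")
    parser.add_argument("root", nargs="?", default=".")
    parser.add_argument("--fail-on-new", action="store_true")
    parser.add_argument("--no-color", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--ci", action="store_true")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.ci:
        # The preset switches on the three flags it stands for.
        args.fail_on_new = True
        args.no_color = True
        args.quiet = True
    return args
```

Expanding the preset after parsing means downstream code only has to check the individual flags and never needs to know that `--ci` exists.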
322329

323330
## License
324331

SECURITY.md

Lines changed: 6 additions & 1 deletion
@@ -39,7 +39,12 @@ Additional safeguards:
3939
- HTML report content is escaped in both text and attribute contexts to prevent script injection.
4040
- Reports are static and do not execute analyzed code.
4141
- Scanner traversal is root-confined and prevents symlink-based path escape.
42-
- Baseline files are schema/type validated with size limits; invalid baselines fail fast.
42+
- Baseline files are schema/type validated with size limits and tamper-evident integrity fields
43+
(`generator`, `payload_sha256` for v1.3+).
44+
- Baseline integrity is tamper-evident (audit signal), not tamper-proof cryptographic signing.
45+
An actor who can rewrite baseline content and recompute `payload_sha256` can still alter it.
46+
- In `--fail-on-new` / `--ci`, untrusted baseline states fail fast; otherwise baseline is ignored
47+
with explicit warning and comparison proceeds against an empty baseline.
4348
- Cache files are HMAC-signed (constant-time comparison), size-limited, and ignored on mismatch.
4449
- Cache secrets are stored next to the cache (`.cache_secret`) and must not be committed.
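
The cache safeguard above (HMAC signature, constant-time comparison, ignore on mismatch) can be illustrated with the standard `hmac` module. This is a sketch under assumptions: the envelope keys `payload` and `signature`, the canonical JSON encoding, and the function names are hypothetical and do not describe CodeClone's real cache format.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign(payload: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def load_verified(envelope: dict, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature matches, else None (cache is ignored)."""
    signature = envelope.get("signature")
    payload = envelope.get("payload")
    if not isinstance(signature, str) or not isinstance(payload, dict):
        return None
    expected = sign(payload, secret)
    # compare_digest runs in time independent of where the inputs differ.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return payload
```

Returning `None` on any mismatch matches the "ignored on mismatch" behavior: the caller simply rebuilds the cache from source instead of failing.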
4550
