Commit 985e5c2

feat(baseline): tamper-evident baseline contract
1 parent 03fcb6f commit 985e5c2

30 files changed

Lines changed: 2931 additions & 945 deletions

CHANGELOG.md

Lines changed: 19 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Changelog

-## [1.3.0] - 2026-02-05
+## [1.3.0] - 2026-02-08

 ### Overview

@@ -63,6 +63,14 @@ the HTML report UI, and introduces baseline versioning.

 - Baselines are now **versioned** and include a schema version.
 - Mismatched baseline versions **fail fast** and require regeneration.
+- Baseline loading is now strict: invalid schema/types or oversized baseline files
+  fail fast to preserve CI integrity.
+- Added baseline tamper-evident integrity for v1.3+ files (`generator`, `payload_sha256`)
+  while keeping legacy baseline behavior as explicit regeneration-required fail-fast.
+- Added configurable size guards (`--max-baseline-size-mb`, `--max-cache-size-mb`):
+  oversized baseline fails fast, oversized cache is ignored with warning.
+- Behavioral hardening (CLI): baseline validation is now an explicit contract
+  (legacy/version/schema/python/integrity/size states) with deterministic fail-fast behavior.

 **Breaking (CI):** baseline version mismatch now fails hard; CI requires baseline regeneration on upgrade.

@@ -104,12 +112,20 @@ codeclone . --update-baseline
 - **Segment reporting**
   Added a dedicated “Segment clones” section and summary metric in HTML/TXT/JSON outputs.

+- **Escaping and snippet resilience**
+  Hardened HTML escaping for text and attribute contexts, and added a safe fallback when
+  source snippets are unavailable during report rendering.
+
 ### Cache & Internals

 - Extended cache schema to store segment fingerprints (cache version bump).
 - Default cache location moved to `<root>/.cache/codeclone/cache.json` (project‑local).
 - Added a legacy cache warning for `~/.cache/codeclone/cache.json` with guidance to
   delete it and add `.cache/` to `.gitignore`.
+- Strengthened cache integrity handling with constant-time signature checks and explicit
+  warnings for oversized cache files.
+- Added deterministic deep-schema cache entry validation (`stat/units/blocks/segments`);
+  invalid cache entries are ignored instead of affecting analysis results.

 ### Packaging

@@ -124,6 +140,8 @@ codeclone . --update-baseline
 ### Testing & Security

 - Expanded security tests (HTML escaping and safety checks).
+- Added regression tests for deterministic report ordering across HTML/TXT/JSON,
+  baseline/cache integrity edge cases, and symlink traversal/loop safety.

 ---
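
Note on the "explicit contract" wording in the changelog: one way to make a fixed set of validation states explicit is a closed enumeration that rejects anything outside the documented set. The sketch below is an illustration only. The class `BaselineStatus` and the helper `parse_status` are hypothetical names, not taken from this commit; the string values are the `baseline_status` values listed in the README hunk of this same commit.

```python
from enum import Enum


class BaselineStatus(str, Enum):
    """Hypothetical enum; the values mirror the README's `baseline_status` list."""

    OK = "ok"
    MISSING = "missing"
    LEGACY = "legacy"
    INVALID = "invalid"
    MISMATCH_VERSION = "mismatch_version"
    MISMATCH_SCHEMA = "mismatch_schema"
    MISMATCH_PYTHON = "mismatch_python"
    GENERATOR_MISMATCH = "generator_mismatch"
    INTEGRITY_MISSING = "integrity_missing"
    INTEGRITY_FAILED = "integrity_failed"
    TOO_LARGE = "too_large"


def parse_status(raw: str) -> BaselineStatus:
    """Map a report string to a known status; unknown strings raise ValueError."""
    try:
        return BaselineStatus(raw)
    except ValueError:
        raise ValueError(f"unknown baseline_status: {raw!r}") from None
```

A consumer of the JSON report could call `parse_status(meta["baseline_status"])` and branch on the result, so a misspelled or future status fails loudly instead of being treated as `ok`.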

README.md

Lines changed: 13 additions & 0 deletions
@@ -144,6 +144,9 @@ All report formats include provenance metadata for auditability:
 `codeclone_version`, `python_version`, `baseline_path`, `baseline_version`,
 `baseline_schema_version`, `baseline_python_version`, `baseline_loaded`,
 `baseline_status` (and cache metadata when available).
+`baseline_status` values: `ok`, `missing`, `legacy`, `invalid`,
+`mismatch_version`, `mismatch_schema`, `mismatch_python`,
+`generator_mismatch`, `integrity_missing`, `integrity_failed`, `too_large`.

 Generate an HTML report:

@@ -171,6 +174,9 @@ Commit the generated baseline file to the repository.

 Baselines are **versioned**. If CodeClone is upgraded, regenerate the baseline to keep
 CI deterministic and explainable.
+Baseline format in 1.3+ is tamper-evident (`generator`, `payload_sha256`) and validated
+before baseline comparison.
+Invalid or oversized baseline files fail fast in CI mode and must be regenerated.

 ### 2. Use in CI

@@ -207,6 +213,11 @@ You can override this path with `--cache-path` (`--cache-dir` is a legacy alias)
 If you used an older version of CodeClone, delete the legacy cache file at
 `~/.cache/codeclone/cache.json` and add `.cache/` to `.gitignore`.

+Cache integrity checks are strict: signature mismatch or oversized cache files are ignored
+with an explicit warning, then rebuilt from source.
+Cache entries are validated against expected structure/types; invalid entries are ignored
+deterministically.
+
 ### Python Version Consistency for Baseline Checks

 Due to inherent differences in Python’s AST between interpreter versions, baseline
@@ -294,7 +305,9 @@ See full design and semantics:
 | `--processes` | Number of worker processes | `4` |
 | `--cache-path FILE` | Cache file path | `<root>/.cache/codeclone/cache.json` |
 | `--cache-dir FILE` | Legacy alias for `--cache-path` | - |
+| `--max-cache-size-mb MB` | Max cache size before ignore + warning | `50` |
 | `--baseline FILE` | Baseline file path | `codeclone.baseline.json` |
+| `--max-baseline-size-mb MB` | Max baseline size before fail-fast | `5` |
 | `--update-baseline` | Regenerate baseline from current results | `False` |
 | `--fail-on-new` | Fail if new function/block clone groups appear vs baseline | `False` |
 | `--fail-threshold MAX_CLONES` | Fail if total clone groups (`function + block`) exceed threshold | `-1` (disabled) |
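
The two size flags in the table behave asymmetrically: an oversized baseline is a hard failure, while an oversized cache is skipped with a warning. A minimal sketch of that asymmetry follows. The function names, the `BaselineTooLarge` exception, and the choice of 1 MB = 1024 * 1024 bytes are assumptions made for illustration; the commit's actual loader is not shown in this excerpt.

```python
import warnings
from pathlib import Path
from typing import Optional

_BYTES_PER_MB = 1024 * 1024  # assumption: the tool may define a megabyte differently


class BaselineTooLarge(Exception):
    """Hypothetical error type for the fail-fast path."""


def _exceeds(path: Path, max_mb: int) -> bool:
    # Check the size on disk before reading anything into memory.
    return path.stat().st_size > max_mb * _BYTES_PER_MB


def read_baseline_bytes(path: Path, max_mb: int = 5) -> bytes:
    """Fail fast: an oversized baseline is an error, never silently skipped."""
    if _exceeds(path, max_mb):
        raise BaselineTooLarge(f"{path} exceeds {max_mb} MB")
    return path.read_bytes()


def read_cache_bytes(path: Path, max_mb: int = 50) -> Optional[bytes]:
    """Degrade gracefully: an oversized cache is ignored with a warning."""
    if _exceeds(path, max_mb):
        warnings.warn(f"ignoring cache {path}: larger than {max_mb} MB")
        return None
    return path.read_bytes()
```

The difference in handling follows from what each file means: the cache can always be rebuilt from source, so dropping it only costs time, whereas silently ignoring a baseline would change the CI verdict.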

SECURITY.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -36,9 +36,11 @@ ongoing security review.
3636

3737
Additional safeguards:
3838

39-
- HTML report content is escaped to prevent script injection.
39+
- HTML report content is escaped in both text and attribute contexts to prevent script injection.
4040
- Reports are static and do not execute analyzed code.
41-
- Cache files are HMAC-signed; signature mismatches are ignored.
41+
- Scanner traversal is root-confined and prevents symlink-based path escape.
42+
- Baseline files are schema/type validated with size limits; invalid baselines fail fast.
43+
- Cache files are HMAC-signed (constant-time comparison), size-limited, and ignored on mismatch.
4244
- Cache secrets are stored next to the cache (`.cache_secret`) and must not be committed.
4345

4446
---
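
The "HMAC-signed (constant-time comparison)" safeguard can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the project's cache code: SHA-256 as the digest, the hex encoding, and the names `sign_cache` and `verify_cache` are all hypothetical. Only `hmac.new` and `hmac.compare_digest` are real standard-library calls.

```python
import hashlib
import hmac
import secrets


def sign_cache(secret: bytes, payload: bytes) -> str:
    """Return a hex HMAC over the serialized cache payload (SHA-256 assumed)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_cache(secret: bytes, payload: bytes, signature: str) -> bool:
    """Compare signatures without leaking how many leading characters match."""
    expected = sign_cache(secret, payload)
    # Encoding both sides to bytes keeps compare_digest valid for any input string.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


if __name__ == "__main__":
    secret = secrets.token_bytes(32)  # stands in for the contents of `.cache_secret`
    payload = b'{"files": {}}'
    sig = sign_cache(secret, payload)
    print(verify_cache(secret, payload, sig))
    print(verify_cache(secret, payload + b" ", sig))
```

A plain `==` on the two strings would return the same boolean, but it can stop at the first differing character; `compare_digest` is designed so that its running time does not depend on where the inputs differ.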

codeclone.baseline.json

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,12 @@
11
{
2-
"functions": [],
2+
"functions": [
3+
"15f730bde91427ef6cc7878fd98d8f9d2ca3b57e|20-49"
4+
],
35
"blocks": [],
46
"python_version": "3.13",
57
"baseline_version": "1.3.0",
6-
"schema_version": 1
8+
"schema_version": 1,
9+
"generator": "codeclone",
10+
"payload_sha256": "ab5bbbe4c098b6aae44202f69c7f26398d4d85ae759ca096418d6fc054571cf1",
11+
"created_at": "2026-02-08T09:06:10+00:00"
712
}
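
The new `generator` and `payload_sha256` fields suggest a check along the following lines. This excerpt does not show which fields feed the hash or how they are serialized, so everything about the canonical form here (sorted lists, compact JSON, the three fields chosen) is an assumption, and the sketch will not reproduce the digest in the file above. The function names are hypothetical; the returned strings are `baseline_status` values from the README, and the order of the checks is also assumed.

```python
import hashlib
import hmac
import json
from typing import Any


def payload_digest(doc: dict[str, Any]) -> str:
    """Hash a canonical form of the payload fields (field choice is an assumption)."""
    canonical = json.dumps(
        {
            "functions": sorted(doc.get("functions", [])),
            "blocks": sorted(doc.get("blocks", [])),
            "python_version": doc.get("python_version"),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_integrity(doc: dict[str, Any]) -> str:
    """Return one of the documented status strings for a parsed baseline."""
    if doc.get("generator") != "codeclone":
        return "generator_mismatch"
    stored = doc.get("payload_sha256")
    if not isinstance(stored, str):
        return "integrity_missing"
    expected = payload_digest(doc)
    if hmac.compare_digest(expected.encode("utf-8"), stored.encode("utf-8")):
        return "ok"
    return "integrity_failed"
```

An unkeyed hash like this is tamper-evident rather than tamper-proof: it catches hand edits and merge damage, but anyone who can edit the file can also recompute the digest.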

codeclone/_cli_args.py

Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
9+
from __future__ import annotations
10+
11+
import argparse
12+
from typing import cast
13+
14+
from . import ui_messages as ui
15+
16+
17+
class _HelpFormatter(argparse.ArgumentDefaultsHelpFormatter):
18+
def _get_help_string(self, action: argparse.Action) -> str:
19+
if action.dest == "cache_path":
20+
return action.help or ""
21+
return cast(str, super()._get_help_string(action))
22+
23+
24+
def build_parser(version: str) -> argparse.ArgumentParser:
25+
ap = argparse.ArgumentParser(
26+
prog="codeclone",
27+
description="AST and CFG-based code clone detector for Python.",
28+
formatter_class=_HelpFormatter,
29+
)
30+
ap.add_argument(
31+
"--version",
32+
action="version",
33+
version=ui.version_output(version),
34+
help=ui.HELP_VERSION,
35+
)
36+
37+
core_group = ap.add_argument_group("Target")
38+
core_group.add_argument(
39+
"root",
40+
nargs="?",
41+
default=".",
42+
help=ui.HELP_ROOT,
43+
)
44+
45+
tune_group = ap.add_argument_group("Analysis Tuning")
46+
tune_group.add_argument(
47+
"--min-loc",
48+
type=int,
49+
default=15,
50+
help=ui.HELP_MIN_LOC,
51+
)
52+
tune_group.add_argument(
53+
"--min-stmt",
54+
type=int,
55+
default=6,
56+
help=ui.HELP_MIN_STMT,
57+
)
58+
tune_group.add_argument(
59+
"--processes",
60+
type=int,
61+
default=4,
62+
help=ui.HELP_PROCESSES,
63+
)
64+
tune_group.add_argument(
65+
"--cache-path",
66+
dest="cache_path",
67+
metavar="FILE",
68+
default=None,
69+
help=ui.HELP_CACHE_PATH,
70+
)
71+
tune_group.add_argument(
72+
"--cache-dir",
73+
dest="cache_path",
74+
metavar="FILE",
75+
default=None,
76+
help=ui.HELP_CACHE_DIR_LEGACY,
77+
)
78+
tune_group.add_argument(
79+
"--max-cache-size-mb",
80+
type=int,
81+
default=50,
82+
metavar="MB",
83+
help=ui.HELP_MAX_CACHE_SIZE_MB,
84+
)
85+
86+
ci_group = ap.add_argument_group("Baseline & CI/CD")
87+
ci_group.add_argument(
88+
"--baseline",
89+
default="codeclone.baseline.json",
90+
help=ui.HELP_BASELINE,
91+
)
92+
ci_group.add_argument(
93+
"--max-baseline-size-mb",
94+
type=int,
95+
default=5,
96+
metavar="MB",
97+
help=ui.HELP_MAX_BASELINE_SIZE_MB,
98+
)
99+
ci_group.add_argument(
100+
"--update-baseline",
101+
action="store_true",
102+
help=ui.HELP_UPDATE_BASELINE,
103+
)
104+
ci_group.add_argument(
105+
"--fail-on-new",
106+
action="store_true",
107+
help=ui.HELP_FAIL_ON_NEW,
108+
)
109+
ci_group.add_argument(
110+
"--fail-threshold",
111+
type=int,
112+
default=-1,
113+
metavar="MAX_CLONES",
114+
help=ui.HELP_FAIL_THRESHOLD,
115+
)
116+
ci_group.add_argument(
117+
"--ci",
118+
action="store_true",
119+
help=ui.HELP_CI,
120+
)
121+
122+
out_group = ap.add_argument_group("Reporting")
123+
out_group.add_argument(
124+
"--html",
125+
dest="html_out",
126+
metavar="FILE",
127+
help=ui.HELP_HTML,
128+
)
129+
out_group.add_argument(
130+
"--json",
131+
dest="json_out",
132+
metavar="FILE",
133+
help=ui.HELP_JSON,
134+
)
135+
out_group.add_argument(
136+
"--text",
137+
dest="text_out",
138+
metavar="FILE",
139+
help=ui.HELP_TEXT,
140+
)
141+
out_group.add_argument(
142+
"--no-progress",
143+
action="store_true",
144+
help=ui.HELP_NO_PROGRESS,
145+
)
146+
out_group.add_argument(
147+
"--no-color",
148+
action="store_true",
149+
help=ui.HELP_NO_COLOR,
150+
)
151+
out_group.add_argument(
152+
"--quiet",
153+
action="store_true",
154+
help=ui.HELP_QUIET,
155+
)
156+
out_group.add_argument(
157+
"--verbose",
158+
action="store_true",
159+
help=ui.HELP_VERBOSE,
160+
)
161+
return ap
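
The parser above registers `--cache-path` and `--cache-dir` with the same `dest`, which is how the legacy alias is implemented. The standalone sketch below reproduces only that pattern so it can be run without the `ui_messages` module; `build_demo_parser` and the `demo` program name are hypothetical, while the flag names and defaults are copied from the diff.

```python
import argparse


def build_demo_parser() -> argparse.ArgumentParser:
    """Two option strings writing to one destination, plus one int-typed guard."""
    ap = argparse.ArgumentParser(prog="demo")
    ap.add_argument("--cache-path", dest="cache_path", metavar="FILE", default=None)
    # Legacy alias: a separate action with the same dest, so it can carry its own help.
    ap.add_argument("--cache-dir", dest="cache_path", metavar="FILE", default=None)
    ap.add_argument("--max-cache-size-mb", type=int, default=50, metavar="MB")
    return ap


if __name__ == "__main__":
    args = build_demo_parser().parse_args(["--cache-dir", "old/cache.json"])
    print(args.cache_path)  # prints "old/cache.json"
```

If both flags are passed, the one that appears later on the command line wins, because each action simply assigns to the same namespace attribute. Declaring the alias as its own `add_argument` call (rather than a second option string on one call) is also what lets `_HelpFormatter` special-case the `cache_path` destination in help output.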

codeclone/_cli_meta.py

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
9+
from __future__ import annotations
10+
11+
import sys
12+
from pathlib import Path
13+
from typing import Any
14+
15+
from .baseline import Baseline
16+
17+
18+
def _current_python_version() -> str:
19+
return f"{sys.version_info.major}.{sys.version_info.minor}"
20+
21+
22+
def _build_report_meta(
23+
*,
24+
codeclone_version: str,
25+
baseline_path: Path,
26+
baseline: Baseline,
27+
baseline_loaded: bool,
28+
baseline_status: str,
29+
cache_path: Path,
30+
cache_used: bool,
31+
) -> dict[str, Any]:
32+
return {
33+
"codeclone_version": codeclone_version,
34+
"python_version": _current_python_version(),
35+
"baseline_path": str(baseline_path),
36+
"baseline_version": baseline.baseline_version,
37+
"baseline_schema_version": baseline.schema_version,
38+
"baseline_python_version": baseline.python_version,
39+
"baseline_loaded": baseline_loaded,
40+
"baseline_status": baseline_status,
41+
"cache_path": str(cache_path),
42+
"cache_used": cache_used,
43+
}

codeclone/_cli_paths.py

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
"""
2+
CodeClone — AST and CFG-based code clone detector for Python
3+
focused on architectural duplication.
4+
5+
Copyright (c) 2026 Den Rozhnovskiy
6+
Licensed under the MIT License.
7+
"""
8+
9+
from __future__ import annotations
10+
11+
import sys
12+
from collections.abc import Callable
13+
from pathlib import Path
14+
15+
from rich.console import Console
16+
17+
18+
def expand_path(p: str) -> Path:
19+
return Path(p).expanduser().resolve()
20+
21+
22+
def _validate_output_path(
23+
path: str,
24+
*,
25+
expected_suffix: str,
26+
label: str,
27+
console: Console,
28+
invalid_message: Callable[..., str],
29+
) -> Path:
30+
out = Path(path).expanduser()
31+
if out.suffix.lower() != expected_suffix:
32+
console.print(
33+
invalid_message(label=label, path=out, expected_suffix=expected_suffix)
34+
)
35+
sys.exit(2)
36+
return out.resolve()
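
To see how the suffix check in `_validate_output_path` behaves without installing `rich`, here is a dependency-free approximation. It is a sketch, not the committed function: `validate_output_path` (no leading underscore) prints to stderr instead of using a `Console`, and the message text is made up.

```python
import sys
from pathlib import Path


def validate_output_path(path: str, *, expected_suffix: str, label: str) -> Path:
    """Exit with status 2 unless the path ends in the expected (lowercase) suffix."""
    out = Path(path).expanduser()
    # Lowercasing makes "report.HTML" acceptable when ".html" is expected.
    if out.suffix.lower() != expected_suffix:
        print(f"{label} output must end with {expected_suffix}: {out}", file=sys.stderr)
        sys.exit(2)
    return out.resolve()


if __name__ == "__main__":
    print(validate_output_path("report.HTML", expected_suffix=".html", label="HTML"))
```

Two details worth noting: the comparison only lowercases the path side, so `expected_suffix` is expected to be passed in lowercase, and a path with no extension has an empty `suffix` and is therefore rejected. Exit status 2 matches the code `argparse` itself uses for usage errors.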
