@@ -68,7 +68,7 @@ Typical use cases:
 ### Segment-level internal clone detection

 - Detects repeated **segment windows** inside the same function.
-- Uses a two‑step deterministic match (candidate signature → strict hash).
+- Uses a two-step deterministic match (candidate signature → strict hash).
 - Included in reports for explainability, **not** in baseline/CI failure logic.

 ### Control-Flow Awareness (CFG v1)
@@ -82,7 +82,7 @@ Typical use cases:
   - `with` / `async with`
   - `match` / `case` (Python 3.10+)
 - Current CFG semantics (v1):
-  - `and` / `or` are modeled as short‑circuit micro‑CFG branches,
+  - `and` / `or` are modeled as short-circuit micro-CFG branches,
   - `try/except` links only from statements that may raise,
   - `break` / `continue` are modeled as terminating loop transitions with explicit targets,
   - `for/while ... else` semantics are preserved structurally,
@@ -115,9 +115,7 @@ This design keeps clone detection **stable, deterministic, and low-noise**.
 pip install codeclone
 ```

-Python **3.10+** is required.
-
----
+Python 3.10+ is required.

 ## Quick Start

@@ -142,14 +140,6 @@ codeclone . \
   --text .cache/codeclone/report.txt
 ```

-All report formats include provenance metadata for auditability:
-`codeclone_version`, `python_version`, `baseline_path`, `baseline_version`,
-`baseline_schema_version`, `baseline_python_version`, `baseline_loaded`,
-`baseline_status` (and cache metadata when available).
-`baseline_status` values: `ok`, `missing`, `legacy`, `invalid`,
-`mismatch_version`, `mismatch_schema`, `mismatch_python`,
-`generator_mismatch`, `integrity_missing`, `integrity_failed`, `too_large`.
-
 Generate an HTML report:

 ```bash
@@ -162,9 +152,35 @@ Check version:
 codeclone --version
 ```

+---
+
+## Reports and Metadata
+
+All report formats include provenance metadata for auditability:
+
+`codeclone_version`, `python_version`, `baseline_path`, `baseline_version`,
+`baseline_schema_version`, `baseline_python_version`, `baseline_loaded`,
+`baseline_status` (and cache metadata when available).
+
+baseline_status values:
+
+- `ok`
+- `missing`
+- `legacy`
+- `invalid`
+- `mismatch_version`
+- `mismatch_schema`
+- `mismatch_python`
+- `generator_mismatch`
+- `integrity_missing`
+- `integrity_failed`
+- `too_large`
+
+---
+
 ## Baseline Workflow (Recommended)

-### 1. Create a baseline
+1. Create a baseline

 Run once on your current codebase:

@@ -174,18 +190,28 @@ codeclone . --update-baseline

 Commit the generated baseline file to the repository.

-Baselines are **versioned**. If CodeClone is upgraded, regenerate the baseline to keep
+Baselines are versioned. If CodeClone is upgraded, regenerate the baseline to keep
 CI deterministic and explainable.
-Baseline format in 1.3+ is tamper-evident (`generator`, `payload_sha256`) and validated
+
+Baseline format in 1.3+ is tamper-evident (generator, payload_sha256) and validated
 before baseline comparison.

-Trusted vs untrusted baseline behavior (`invalid`, `too_large`, `generator_mismatch`,
-`integrity_missing`, `integrity_failed`):
+2. Trusted vs untrusted baseline behavior

-- ignored with warning in non-gating mode (comparison falls back to empty baseline),
-- fail-fast in `--fail-on-new` / `--ci` (exit code `2`).
+Baseline states considered untrusted:

-### 2. Use in CI
+- `invalid`
+- `too_large`
+- `generator_mismatch`
+- `integrity_missing`
+- `integrity_failed`
+
+Behavior:
+
+- in normal mode, untrusted baseline is ignored with a warning (comparison falls back to empty baseline);
+- in `--fail-on-new` / `--ci`, untrusted baseline fails fast (exit code 2).
+
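The trusted/untrusted rule described in this hunk can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not CodeClone's actual code: `resolve_baseline`, its parameters, and the use of a plain `set` for the baseline are hypothetical names chosen for the example. Only the status strings and exit code 2 come from the README text.

```python
import sys

# Statuses the README lists as untrusted.
UNTRUSTED_STATUSES = frozenset({
    "invalid",
    "too_large",
    "generator_mismatch",
    "integrity_missing",
    "integrity_failed",
})


def resolve_baseline(status, baseline, gating):
    """Hypothetical helper: pick the baseline to compare against.

    `gating` stands for running with --fail-on-new or --ci.
    Only the untrusted states are modeled; any other status
    passes the loaded baseline through unchanged.
    """
    if status not in UNTRUSTED_STATUSES:
        return baseline
    if gating:
        # Gating mode: fail fast with exit code 2.
        raise SystemExit(2)
    # Normal mode: warn and fall back to an empty baseline.
    print(f"warning: baseline ignored ({status})", file=sys.stderr)
    return set()
```

With this shape, a gating run never compares against a baseline it cannot trust, while a local run still produces a report, with every clone counted as new.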
+3. Use in CI

 ```bash
 codeclone . --ci
@@ -199,21 +225,23 @@ codeclone . --ci --html .cache/codeclone/report.html

 `--ci` is equivalent to `--fail-on-new --no-color --quiet`.

----
-
 Behavior:

 - existing clones are allowed,
-- the build fails if *new* clones appear,
+- the build fails if new clones appear,
 - refactoring that removes duplication is always allowed.

-`--fail-on-new` exits with a non-zero code when new clones are detected.
+`--fail-on-new` / `--ci` exits with a non-zero code when new clones are detected.
+
+---

 ### Cache

 By default, CodeClone stores the cache per project at:

-`<root>/.cache/codeclone/cache.json`
+```bash
+<root>/.cache/codeclone/cache.json
+```

 You can override this path with `--cache-path` (`--cache-dir` is a legacy alias).

@@ -222,10 +250,13 @@ If you used an older version of CodeClone, delete the legacy cache file at

 Cache integrity checks are strict: signature mismatch or oversized cache files are ignored
 with an explicit warning, then rebuilt from source.
+
 Cache entries are validated against expected structure/types; invalid entries are ignored
 deterministically.

-### Python Version Consistency for Baseline Checks
+---
+
+## Python Version Consistency for Baseline Checks

 Due to inherent differences in Python’s AST between interpreter versions, baseline
 generation and verification must be performed using the same Python version.
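A minimal sketch of the version-consistency check described above. It assumes the baseline records its interpreter as a `major.minor` string; the README does not state the granularity, so that is an assumption. `python_tag` and `check_baseline_python` are hypothetical names. Only the `ok` and `mismatch_python` status values come from the README.

```python
import sys


def python_tag(version_info=sys.version_info):
    """Format an interpreter version as 'major.minor' (assumed granularity)."""
    return f"{version_info[0]}.{version_info[1]}"


def check_baseline_python(baseline_python_version, current=None):
    """Hypothetical check: compare the baseline's recorded Python version
    with the running interpreter and return a baseline_status-style value."""
    current = current or python_tag()
    if baseline_python_version == current:
        return "ok"
    return "mismatch_python"
```

For example, `check_baseline_python("3.10", current="3.12")` returns `"mismatch_python"`. Under this sketch, the baseline would then need to be regenerated under the interpreter that CI uses.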
@@ -256,27 +287,25 @@ repos:

 ## What CodeClone Is (and Is Not)

-### CodeClone **is**
+### CodeClone Is

 - an architectural analysis tool,
 - a duplication radar,
 - a CI guard against copy-paste,
 - a control-flow-aware clone detector.

-### CodeClone **is not**
+### CodeClone Is Not

 - a linter,
 - a formatter,
 - a semantic equivalence prover,
 - a runtime analyzer.

----
-
 ## How It Works (High Level)

 1. Parse Python source into AST.
 2. Normalize AST (names, constants, attributes, annotations).
-3. Build a **Control Flow Graph (CFG)** per function.
+3. Build a Control Flow Graph (CFG) per function.
 4. Compute stable CFG fingerprints.
 5. Extract segment windows for internal clone discovery.
 6. Detect function-level, block-level, and segment-level clones.
@@ -290,10 +319,10 @@ See the architectural overview:

 ## Control Flow Graph (CFG)

-Starting from **version 1.1.0**, CodeClone uses a **Control Flow Graph (CFG)**
+Starting from version 1.1.0, CodeClone uses a Control Flow Graph (CFG)
 to improve structural clone detection robustness.

-The CFG is a **structural abstraction**, not a runtime execution model.
+The CFG is a structural abstraction, not a runtime execution model.

 See full design and semantics:
