Commit 452f5a7
Add MinHash fingerprinting and SIMILAR_TO edges for near-clone detection
Compute K=64 MinHash signatures from normalized AST node-type trigrams
during function extraction, then generate SIMILAR_TO edges via LSH
(b=32, r=2) for function pairs with Jaccard >= 0.95.
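The signature step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/simhash/minhash.c: the function names, the representation of node types as 16-bit IDs, and the splitmix64-style mixer are all hypothetical choices made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MH_K 64 /* signature length, matching K=64 in the commit message */

/* splitmix64 finalizer, used here as a generic 64-bit mixer (assumption:
 * the real implementation may use a different hash family). */
static uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Hypothetical helper: build a K-slot MinHash signature from a sequence of
 * AST node-type IDs by hashing every consecutive trigram with K seeded
 * hash functions and keeping the minimum per slot.
 * Returns 0 on success, -1 if the sequence is too short for one trigram. */
static int minhash_from_types(const uint16_t *types, size_t n,
                              uint32_t sig[MH_K]) {
    if (n < 3) return -1;
    for (int k = 0; k < MH_K; k++) sig[k] = UINT32_MAX;
    for (size_t i = 0; i + 2 < n; i++) {
        /* Pack three 16-bit node types into one 64-bit trigram key. */
        uint64_t tri = ((uint64_t)types[i] << 32) |
                       ((uint64_t)types[i + 1] << 16) |
                       (uint64_t)types[i + 2];
        for (int k = 0; k < MH_K; k++) {
            uint64_t seed = mix64((uint64_t)k + 1);
            uint32_t h = (uint32_t)(mix64(tri ^ seed) >> 32);
            if (h < sig[k]) sig[k] = h;
        }
    }
    return 0;
}

/* Estimated Jaccard similarity: fraction of slots where both signatures
 * hold the same minimum. */
static double minhash_jaccard(const uint32_t a[MH_K],
                              const uint32_t b[MH_K]) {
    int same = 0;
    for (int k = 0; k < MH_K; k++) same += (a[k] == b[k]);
    return (double)same / MH_K;
}
```

With K=64 the estimate moves in steps of 1/64, so a threshold of 0.95 corresponds to at least 61 of the 64 slots matching.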
- src/simhash/minhash.{h,c}: MinHash compute, Jaccard, hex encode/decode,
LSH index with band hashing for O(n) candidate generation
- src/pipeline/pass_similarity.c: post-pass reads fingerprints from node
properties, builds LSH index, emits SIMILAR_TO edges with jaccard and
same_file metadata. Same-language only, max 10 edges per node.
- internal/cbm/cbm.h: fingerprint fields on CBMDefinition
- internal/cbm/extract_defs.c: compute_fingerprint() hook at 3 extraction
sites after complexity, skip functions with < 10 AST body nodes
- pass_definitions.c + pass_parallel.c: serialize fingerprint to "fp" hex
in properties_json for both sequential and parallel pipeline paths
- pipeline.c + pipeline_incremental.c: register pass_similarity in both
full and incremental post-pass lists
- tests/test_simhash.c: 28 tests across 4 suites (core, LSH, edge gen,
pipeline integration with generated Go project + incremental)
1 parent e07443b commit 452f5a7
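The band hashing mentioned for the LSH index can be pictured with the sketch below. It assumes a 64-slot signature of 32-bit values split into b=32 bands of r=2 rows; the names `band_key`, `lsh_is_candidate`, and `lsh_candidate_prob` are hypothetical and the FNV-style key is only one plausible choice, not necessarily what minhash.c does.

```c
#include <stdbool.h>
#include <stdint.h>

#define MH_K  64              /* signature length */
#define LSH_B 32              /* number of bands */
#define LSH_R (MH_K / LSH_B)  /* rows per band: 2 */

/* Bucket key for one band: an FNV-1a-style fold over the band's rows,
 * seeded with the band index so different bands use distinct buckets.
 * A real index would insert each function into LSH_B hash buckets keyed
 * this way and only compare functions that share a bucket. */
static uint64_t band_key(const uint32_t sig[MH_K], int band) {
    uint64_t h = 0xCBF29CE484222325ULL ^ (uint64_t)band;
    for (int r = 0; r < LSH_R; r++) {
        h ^= sig[band * LSH_R + r];
        h *= 0x100000001B3ULL;
    }
    return h;
}

/* Two signatures are a candidate pair if they agree on every row of at
 * least one band. Rows are compared directly, so this check has no
 * false positives from key collisions. */
static bool lsh_is_candidate(const uint32_t a[MH_K],
                             const uint32_t b[MH_K]) {
    for (int band = 0; band < LSH_B; band++) {
        bool all_equal = true;
        for (int r = 0; r < LSH_R; r++) {
            int i = band * LSH_R + r;
            if (a[i] != b[i]) { all_equal = false; break; }
        }
        if (all_equal) return true;
    }
    return false;
}

/* Probability that a pair with true Jaccard similarity s shares at least
 * one band: 1 - (1 - s^r)^b. Written with loops to avoid linking libm. */
static double lsh_candidate_prob(double s) {
    double band_match = 1.0;
    for (int r = 0; r < LSH_R; r++) band_match *= s;
    double all_miss = 1.0;
    for (int b = 0; b < LSH_B; b++) all_miss *= (1.0 - band_match);
    return 1.0 - all_miss;
}
```

With r=2 the banding is permissive: pairs well below the 0.95 threshold are still likely to share a band, so in a setup like this the exact Jaccard check on each candidate pair does the real filtering, while the index mainly avoids comparing every pair of functions.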
14 files changed, +1946 -4 lines changed
- internal/cbm
- src
  - pipeline
  - simhash
- tests