Commit a4aa6dd

feat: add similarity_analyzer: Fast code similarity detection tool
Features:
- High-performance similarity detection using rapidfuzz (C++)
- Multi-process parallel comparison (bypasses Python GIL)
- Smart pattern-based file grouping
- Real-time progress display with ETA
- Configurable similarity threshold

Includes:
- Full test suite (27 tests)
- GitHub Actions CI workflow
- Comprehensive README documentation

Signed-off-by: VIFEX <vifextech@foxmail.com>
1 parent bfbd9b9 commit a4aa6dd

6 files changed

Lines changed: 1177 additions & 0 deletions

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
name: Similarity Analyzer CI

on:
  push:
    branches: [ main, master ]
    paths:
      - 'similarity_analyzer/**'
  pull_request:
    branches: [ main, master ]
    paths:
      - 'similarity_analyzer/**'
  workflow_dispatch:

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest pytest-cov
        pip install -r similarity_analyzer/requirements.txt

    - name: Run tests
      run: |
        cd similarity_analyzer
        python -m pytest test_similarity_analyzer.py -v --cov=similarity_analyzer --cov-report=xml

    - name: Upload coverage
      uses: codecov/codecov-action@v3
      if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.11'
      with:
        files: ./similarity_analyzer/coverage.xml
        fail_ci_if_error: false

  lint:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install linting tools
      run: |
        python -m pip install --upgrade pip
        pip install flake8 black isort mypy

    - name: Check formatting with black
      run: |
        black --check similarity_analyzer/*.py || echo "Formatting issues found"

    - name: Check imports with isort
      run: |
        isort --check-only similarity_analyzer/*.py || echo "Import order issues found"

    - name: Lint with flake8
      run: |
        flake8 similarity_analyzer/*.py --max-line-length=120 --ignore=E501,W503

  benchmark:
    runs-on: ubuntu-latest
    needs: test
    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install rapidfuzz

    - name: Create test files
      run: |
        mkdir -p test_files
        for i in $(seq 1 50); do
          cat > test_files/File${i}_RGB565.cpp << 'EOF'
        /**
         * Auto-generated test file
         */
        class File${i}_RGB565 {
            void draw() {
                for (int x = 0; x < width; x++) {
                    for (int y = 0; y < height; y++) {
                        pixel = getRGB565(x, y);
                        buffer[y * width + x] = pixel;
                    }
                }
            }
        };
        EOF
        done
        for i in $(seq 1 50); do
          cat > test_files/File${i}_RGB888.cpp << 'EOF'
        /**
         * Auto-generated test file
         */
        class File${i}_RGB888 {
            void draw() {
                for (int x = 0; x < width; x++) {
                    for (int y = 0; y < height; y++) {
                        pixel = getRGB888(x, y);
                        buffer[y * width + x] = pixel;
                    }
                }
            }
        };
        EOF
        done

    - name: Run benchmark
      run: |
        cd similarity_analyzer
        echo "=== Benchmark: 100 files ==="
        time python similarity_analyzer.py ../test_files --find-pairs -q
        echo ""
        echo "=== Benchmark complete ==="

similarity_analyzer/README.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
# Similarity Analyzer

A high-performance tool for detecting similar source code files that can be unified using macros or templates.

## Features

- **Fast**: Uses [rapidfuzz](https://github.com/maxbachmann/rapidfuzz) (C++ implementation), ~100x faster than Python's difflib
- **Parallel**: Multi-process execution bypasses the Python GIL for true parallelism
- **Smart Grouping**: Automatically groups files by naming patterns
- **Progress Display**: Real-time progress bar with ETA
- **Configurable**: Adjustable similarity threshold, file extensions, and worker count
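
The rapidfuzz/difflib split mentioned above can be sketched roughly as follows. This is a minimal illustration and not the tool's actual `calc_similarity`: the function name `similarity` is hypothetical, and the only library calls assumed are `rapidfuzz.fuzz.ratio` (which scores in the range 0 to 100) and the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

try:
    # Optional fast path: rapidfuzz's fuzz.ratio returns a score in [0, 100].
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        """Hypothetical helper: similarity in [0.0, 1.0] via rapidfuzz."""
        return fuzz.ratio(a, b) / 100.0

except ImportError:

    def similarity(a: str, b: str) -> float:
        """Hypothetical helper: pure-Python fallback via difflib."""
        return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(f"{similarity('getRGB565(x, y)', 'getRGB888(x, y)'):.0%}")
```

The two backends use different algorithms, so scores for non-identical inputs may differ slightly between them; only the 0.0 to 1.0 scale is kept consistent here.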

## Installation

```bash
# Install dependencies
pip install rapidfuzz

# Or install all requirements
pip install -r requirements.txt
```

## Usage

### Basic Usage

```bash
# Analyze a source directory
python similarity_analyzer.py ./src

# With implementation directory (to track which files are done)
python similarity_analyzer.py ./src --impl-dir ./impl

# Find all similar pairs globally
python similarity_analyzer.py ./src --find-pairs

# Custom similarity threshold (default: 80%)
python similarity_analyzer.py ./src -t 0.85 -p
```

### Command Line Options

| Option | Short | Description |
|--------|-------|-------------|
| `src_dir` | | Source directory to analyze (required) |
| `--impl-dir` | `-i` | Implementation directory to check progress |
| `--threshold` | `-t` | Similarity threshold 0.0-1.0 (default: 0.80) |
| `--find-pairs` | `-p` | Find all similar file pairs globally |
| `--workers` | `-w` | Number of worker processes (0=auto) |
| `--ext` | | File extensions (default: .cpp .c .h .hpp) |
| `--quiet` | `-q` | Minimal output |
| `--version` | `-v` | Show version |
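
For readers who want to see how such an option table maps onto code, here is a minimal `argparse` sketch that mirrors it. It is an illustration only, not the tool's actual parser: `build_parser` is a hypothetical name, the default of `0` for `--workers` is an assumption based on the "0=auto" note, and the version string is taken from `__version__` in `__init__.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the option table above."""
    parser = argparse.ArgumentParser(prog="similarity_analyzer.py")
    parser.add_argument("src_dir", help="Source directory to analyze")
    parser.add_argument("-i", "--impl-dir",
                        help="Implementation directory to check progress")
    parser.add_argument("-t", "--threshold", type=float, default=0.80,
                        help="Similarity threshold 0.0-1.0")
    parser.add_argument("-p", "--find-pairs", action="store_true",
                        help="Find all similar file pairs globally")
    # Assumption: 0 is the default and means "pick a worker count automatically".
    parser.add_argument("-w", "--workers", type=int, default=0,
                        help="Number of worker processes (0=auto)")
    parser.add_argument("--ext", nargs="+",
                        default=[".cpp", ".c", ".h", ".hpp"],
                        help="File extensions to include")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Minimal output")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s 1.0.0")
    return parser


# Equivalent to: python similarity_analyzer.py ./src -t 0.85 -p
args = build_parser().parse_args(["./src", "-t", "0.85", "-p"])
print(args.threshold, args.find_pairs, args.ext)
```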

### Examples

```bash
# Analyze C files only
python similarity_analyzer.py ./src --ext .c .h

# Use 8 worker processes
python similarity_analyzer.py ./src -w 8 -p

# Find files with 90%+ similarity
python similarity_analyzer.py ./src -t 0.90 --find-pairs

# Quiet mode for scripting
python similarity_analyzer.py ./src -q
```

## Output

### Pattern-based Groups

Files are automatically grouped by naming patterns:

```
✅ Painter*Bitmap (15 files, 92% similar)
   Lines: 9611 | Impl: 0, Not: 15
   ○ PainterRGB565Bitmap.cpp (638 lines)
   ○ PainterRGB888Bitmap.cpp (550 lines)
   ...
```

- ✅ = 90%+ similarity (excellent template candidate)
- 🔶 = 70-90% similarity (good candidate)
- ❌ = <70% similarity (probably too different)
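
The marker legend above amounts to a simple threshold lookup. The sketch below is one way to express it; `marker` is a hypothetical name, and treating exactly 90% as ✅ and exactly 70% as 🔶 is an assumption, since the legend does not say which side the boundaries fall on.

```python
def marker(similarity: float) -> str:
    """Map a similarity in [0.0, 1.0] to the legend's marker (boundaries assumed inclusive)."""
    if similarity >= 0.90:
        return "✅"  # excellent template candidate
    if similarity >= 0.70:
        return "🔶"  # good candidate
    return "❌"      # probably too different


for value in (0.92, 0.75, 0.50):
    print(f"{marker(value)} {value:.0%}")
```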

### Template Candidates

Shows groups that exceed the similarity threshold:

```
📦 LCD8*DebugPrinter (99%, saves ~435 lines)
   ○ LCD8ABGR2222DebugPrinter.cpp
   ○ LCD8ARGB2222DebugPrinter.cpp
```

### Global Similar Pairs

When using `--find-pairs`, shows similar files not caught by pattern grouping:

```
Found 11 additional similar pairs:
  89%: ✓Box.cpp <-> ✓PixelDataWidget.cpp
  85%: ✓Image.cpp <-> ✓Button.cpp
```

## Algorithm

1. **Load Phase**: Files are loaded in parallel using multiprocessing
2. **Normalization**: Code is normalized by:
   - Removing block comments
   - Replacing class names with placeholders
   - Replacing variant keywords (RGB565, ARGB8888, etc.)
   - Compressing whitespace
3. **Grouping**: Files are grouped by filename patterns
4. **Comparison**:
   - Group analysis: All pairs within each group
   - Global search: All pairs across files (with line-count filtering)
5. **Output**: Results sorted by similarity
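
The normalization and grouping steps above can be sketched as follows. This is an illustrative approximation, not the package's `normalize_code` or `group_by_pattern`: the names `normalize`, `pattern_key`, and `VARIANTS`, the placeholder strings, and the variant list itself are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical variant keywords; the real tool ships its own DEFAULT_VARIANTS.
VARIANTS = ["ARGB8888", "ARGB2222", "ABGR2222", "RGB565", "RGB888"]
# Longest first, so "ARGB8888" is replaced before its substring "RGB888".
_ORDERED = sorted(VARIANTS, key=len, reverse=True)


def normalize(code: str, class_name: str) -> str:
    """Apply the four normalization steps to one file's source text."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)         # block comments
    code = re.sub(rf"\b{re.escape(class_name)}\b", "CLASS", code)  # class name
    for variant in _ORDERED:                                       # variant keywords
        code = code.replace(variant, "VARIANT")
    return re.sub(r"\s+", " ", code).strip()                       # whitespace


def pattern_key(filename: str) -> str:
    """Group key for a file: its stem with any variant keyword replaced by '*'."""
    stem = Path(filename).stem
    for variant in _ORDERED:
        stem = stem.replace(variant, "*")
    return stem


a = normalize("/* 565 */ class PainterRGB565Bitmap { int p = getRGB565(); }",
              "PainterRGB565Bitmap")
b = normalize("class PainterRGB888Bitmap {\n  int p = getRGB888();\n}",
              "PainterRGB888Bitmap")
print(a == b, pattern_key("PainterRGB565Bitmap.cpp"))
```

After normalization the two snippets become identical text, which is why files that differ only in their pixel-format variant score highly and end up under a shared key such as `Painter*Bitmap`.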

## Performance

| Backend | Speed | Notes |
|---------|-------|-------|
| rapidfuzz | ~600/s | C++ implementation |
| difflib | ~5/s | Python fallback |

For a typical project with 143 files (4200 pairs):
- rapidfuzz: ~14 seconds
- difflib: ~4+ hours
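
As a back-of-the-envelope check on figures like these, the sketch below computes how many unordered pairs a set of files produces and how long a given comparison rate would take. The helper names are hypothetical, and reading the table's "Speed" column as comparisons per second is an assumption.

```python
def pair_count(n_files: int) -> int:
    """Number of unordered file pairs: n * (n - 1) / 2."""
    return n_files * (n_files - 1) // 2


def estimate_seconds(pairs: int, pairs_per_second: float) -> float:
    """Comparison time only; ignores loading, normalization, and process startup."""
    return pairs / pairs_per_second


print(pair_count(143))             # all possible pairs before any filtering
print(estimate_seconds(4200, 600)) # rapidfuzz-style rate
print(estimate_seconds(4200, 5))   # difflib-style rate
```

With 143 files there are 10,153 possible pairs, so the 4200 figure is presumably the count left after line-count filtering. At the table's rates, 4200 comparisons would take about 7 seconds and about 14 minutes respectively, which does not match the quoted totals (~14 seconds and 4+ hours); the rates and the project totals are best read as separate rough measurements.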

## API Usage

```python
from similarity_analyzer import (
    normalize_code,
    calc_similarity,
    find_similar_pairs,
    FileData
)

# Compare two strings
sim = calc_similarity("code1", "code2")
print(f"Similarity: {sim:.0%}")

# Analyze files
# normalized1 / normalized2 are placeholders for each file's normalized
# source text (e.g. produced with normalize_code) and must be defined first.
file_data = {
    'file1.cpp': FileData('file1.cpp', 100, normalized1),
    'file2.cpp': FileData('file2.cpp', 120, normalized2),
}
pairs = find_similar_pairs(file_data, threshold=0.8, num_workers=4)
```

## License

MIT License - See [LICENSE](../LICENSE)

similarity_analyzer/__init__.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Similarity Analyzer Python Package

from .similarity_analyzer import (
    normalize_code,
    calc_similarity,
    load_file,
    compare_pair,
    group_by_pattern,
    analyze_group,
    find_similar_pairs,
    format_time,
    FileData,
    SimilarityResult,
    GroupAnalysis,
    DEFAULT_VARIANTS,
    DEFAULT_PATTERNS,
)

__version__ = "1.0.0"
__all__ = [
    'normalize_code',
    'calc_similarity',
    'load_file',
    'compare_pair',
    'group_by_pattern',
    'analyze_group',
    'find_similar_pairs',
    'format_time',
    'FileData',
    'SimilarityResult',
    'GroupAnalysis',
    'DEFAULT_VARIANTS',
    'DEFAULT_PATTERNS',
]
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
rapidfuzz>=2.0.0
