Commit cc443a1

feat(analyzer): implement normalized scoring with confidence weights

- Introduce weighted average scoring to reduce noise in action selection
- Add 'normalizeScoring' and 'signalWeights' to configuration
- Normalize signal sources (branch, diff-stat, keywords, patterns) to 0.0–1.0
- Implement fallback threshold for heuristic action determination
- Add comprehensive tests for normalized scoring logic
- Update documentation in ALGORITHMS.md and CONFIGURATION.md
1 parent 56932c9 commit cc443a1

5 files changed

Lines changed: 449 additions & 55 deletions

ALGORITHMS.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# Gitmit Algorithms

## Overview
Gitmit generates Conventional Commit messages by combining git diff parsing, heuristic analysis, weighted scoring, and template selection. The pipeline is fully offline and deterministic, with optional AI as a separate layer.

```
Git status/diff → Parser → Analyzer → Templater → Formatter → Commit message
```

## 1. Change Collection (Parser)
**Location:** `internal/parser/git.go`

1. **Staged file discovery:** `git status --porcelain` is scanned to identify staged files and their actions (A/M/D/R/C).
2. **Per-file diff extraction:** For each staged file, `git diff --cached -U0 -- <file>` is streamed.
3. **Line stats:** Added/removed lines are counted by diff prefixes (`+`/`-`).
4. **Major change flag:** A file is marked `IsMajor` when added+removed lines ≥ 500.

The parser returns a list of `Change` objects and aggregates totals for diff-stat analysis.

## 2. Analyzer: Feature & Context Extraction
**Location:** `internal/analyzer/analyzer.go`

### 2.1 File/Topic/Item Detection
- **Topic** is inferred from directory path with configurable overrides (`topicMappings`).
- **Item** defaults to the filename without extension.
- **Purpose** is inferred from keyword mappings and built-in keyword heuristics.

### 2.2 Symbol Extraction
Regex-based extraction detects structures from added lines:
- **Functions** (Go, JS/TS, Python, Java)
- **Structs/Classes**
- **Methods** (receiver-based Go methods)

These symbols are used to populate `{item}` placeholders and improve specificity.
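
As an illustration of the regex-based approach, here is a small sketch covering Go sources only. The patterns and the `extractSymbols` helper are assumptions made for this example; the analyzer's real expressions are not shown in this document. Input lines are assumed to have the leading `+` diff marker already removed.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for Go code; the real analyzer may use different ones.
var (
	// Receiver-based method: func (r *Type) Name(
	goMethodRe = regexp.MustCompile(`^\s*func\s+\(\s*\w+\s+\*?([A-Za-z_]\w*)\s*\)\s+([A-Za-z_]\w*)\s*\(`)
	// Plain function: func Name(
	goFuncRe = regexp.MustCompile(`^\s*func\s+([A-Za-z_]\w*)\s*\(`)
	// Struct declaration: type Name struct
	goStructRe = regexp.MustCompile(`^\s*type\s+([A-Za-z_]\w*)\s+struct\b`)
)

// extractSymbols returns the symbol names found in the given added lines,
// in order of appearance. Methods are reported as "Type.Method".
func extractSymbols(addedLines []string) []string {
	var out []string
	for _, line := range addedLines {
		if m := goMethodRe.FindStringSubmatch(line); m != nil {
			out = append(out, m[1]+"."+m[2])
		} else if m := goFuncRe.FindStringSubmatch(line); m != nil {
			out = append(out, m[1])
		} else if m := goStructRe.FindStringSubmatch(line); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	lines := []string{
		"func (a *Analyzer) Score(changes []Change) string {",
		"func normalize(x float64) float64 {",
		"type Change struct {",
		"x := 1",
	}
	fmt.Println(extractSymbols(lines))
}
```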

### 2.3 Change Pattern Detection
Single-file patterns include:
- error handling, tests, imports, docs/comments, refactors
- API/database/performance/security indicators
- validation, logging, middleware, DI, CLI changes

### 2.4 Multi-file Pattern Detection
Across all changes, Gitmit detects patterns such as:
- **feature-addition** (many new files)
- **bug-fix-cascade** (many modified files with fix keywords)
- **refactor-sweep** (mixed A/M/D)
- **test-suite-update** / **config-update**
- **api-redesign** / **database-migration**

### 2.5 Special-Case Fallbacks
Early exits provide deterministic messages for clear cases:
- Single added file → `feat`
- Single deleted file → `chore`
- Only docs/config/deps → `docs`/`ci`/`chore(deps)`

## 3. Action (Type) Scoring Algorithm
The commit **action** is determined by a weighted score map, with support for normalized confidence weights (default).

### 3.1 Normalized Scoring (Default)
Gitmit uses **normalized confidence weights** to reduce noise when multiple signals compete.

1. **Normalize signals (0–1):**
   - **Branch hint:** 1.0 if branch name matches an action, 0.0 otherwise.
   - **Diff-stat:** 0–1 based on distance from thresholds (added/removed ratio).
   - **Keywords:** Raw keyword scores are normalized relative to the highest-scoring action.
   - **Multi-file patterns:** 1.0 if a relevant pattern is detected, 0.0 otherwise.
2. **Apply confidence weights:**
   - branch: 0.35
   - diff-stat: 0.25
   - keywords: 0.25
   - multi-file patterns: 0.15
3. **Final score:** `sum(weight × normalized_signal)` per action.
4. **Selection:** The action with the highest final score is selected.
5. **Fallback:** If top action score < 0.35, Gitmit falls back to file-based heuristics.
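
The five steps above can be sketched as below. This is an illustration under assumptions, not the analyzer's implementation: the `Signals` type and the `normalizeToMax` and `selectAction` functions are hypothetical names, and the alphabetical tie-break is an assumption added so the sketch stays deterministic.

```go
package main

import "fmt"

// Signals maps an action (e.g. "feat", "fix") to a normalized value in [0, 1].
// The type name is hypothetical.
type Signals map[string]float64

// defaultWeights holds the documented confidence weights per signal source.
var defaultWeights = map[string]float64{
	"branch":   0.35,
	"diffStat": 0.25,
	"keywords": 0.25,
	"patterns": 0.15,
}

// fallbackThreshold is the documented cutoff below which heuristics take over.
const fallbackThreshold = 0.35

// normalizeToMax rescales raw keyword scores relative to the highest-scoring action.
func normalizeToMax(raw map[string]float64) Signals {
	top := 0.0
	for _, v := range raw {
		if v > top {
			top = v
		}
	}
	out := Signals{}
	if top <= 0 {
		return out
	}
	for action, v := range raw {
		out[action] = v / top
	}
	return out
}

// selectAction computes sum(weight × normalized_signal) per action and returns
// the best action, its score, and whether the score clears the threshold.
func selectAction(sources map[string]Signals) (string, float64, bool) {
	final := map[string]float64{}
	for source, signals := range sources {
		w := defaultWeights[source]
		for action, v := range signals {
			final[action] += w * v
		}
	}
	best, bestScore := "", 0.0
	for action, score := range final {
		// Assumed tie-break: alphabetical, so the result does not depend on map order.
		if best == "" || score > bestScore || (score == bestScore && action < best) {
			best, bestScore = action, score
		}
	}
	return best, bestScore, bestScore >= fallbackThreshold
}

func main() {
	action, score, ok := selectAction(map[string]Signals{
		"branch":   {"fix": 1.0},
		"diffStat": {"feat": 0.6},
		"keywords": normalizeToMax(map[string]float64{"fix": 4, "feat": 2}),
	})
	fmt.Printf("%s %.2f confident=%v\n", action, score, ok)
}
```

In this example `fix` collects 0.35 (branch) + 0.25 (keywords) = 0.60, while `feat` collects 0.25 × 0.6 + 0.25 × 0.5 = 0.275. A lone keyword signal can reach at most 0.25, which stays under the 0.35 threshold and would trigger the fallback.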

### 3.2 Legacy Additive Scoring
If `normalizeScoring` is disabled in config, Gitmit falls back to raw score aggregation:
1. **Branch name hints:** +3 to matching action.
2. **Diff-stat ratio:** +2 to `feat` or `refactor`.
3. **Keyword scoring:** per-action weights are added directly.
4. **Multi-file patterns:** +3 or +4 to relevant actions.

## 4. Scope Selection
- Single topic → that topic
- Single directory → directory name
- 2–3 topics → combined scope (sorted)
- Many topics → most common or `core`
- Commit history can override scope when consistent across recent commits
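
A rough sketch of the topic-based rules follows. Several details are assumptions made for illustration only: the `selectScope` name, the comma used to join a combined scope, returning `core` when no topic is strictly most common, and returning `core` for empty input. The single-directory and commit-history rules are left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectScope picks a scope from the per-file topics of a commit.
// It is a hypothetical helper covering only the topic-count rules.
func selectScope(topics []string) string {
	counts := map[string]int{}
	for _, t := range topics {
		counts[t]++
	}
	unique := make([]string, 0, len(counts))
	for t := range counts {
		unique = append(unique, t)
	}
	sort.Strings(unique)

	switch {
	case len(unique) == 0:
		return "core" // assumption: nothing to go on
	case len(unique) == 1:
		return unique[0]
	case len(unique) <= 3:
		return strings.Join(unique, ",") // assumption: comma-separated
	}

	// Many topics: use the most common one, or "core" when there is a tie.
	best, bestN, tie := "", 0, false
	for _, t := range unique {
		n := counts[t]
		if n > bestN {
			best, bestN, tie = t, n, false
		} else if n == bestN {
			tie = true
		}
	}
	if tie {
		return "core"
	}
	return best
}

func main() {
	fmt.Println(selectScope([]string{"ui", "api", "ui"}))
	fmt.Println(selectScope([]string{"a", "b", "c", "d", "a"}))
}
```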

## 5. Template Selection & Scoring
**Location:** `internal/templater/templater.go`

1. **Template group resolution:** action → template group (A/M/D/R/DOC/SECURITY/MISC).
2. **Topic match:** exact → fuzzy → `_default`.
3. **Template scoring:**
   - Base score 1.0
   - +2.0 for matching detected patterns
   - +1.5 for using detected symbols
   - +1.0 for meaningful purpose placeholders
   - +0.5–1.5 for file-type relevance
   - +1.0 for major change templates
   - -0.5 for generic templates when specifics exist
4. **History de-dup:** recent messages are avoided when possible.

The highest-scoring template is selected, and placeholders (`{topic}`, `{item}`, `{purpose}`, `{source}`, `{target}`) are replaced.
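
To make the scoring and placeholder step concrete, here is a small sketch that applies three of the listed adjustments and then fills placeholders. The `Template` struct, its boolean fields, and the `scoreTemplate` and `fillTemplate` helpers are hypothetical; the templater presumably derives these properties itself rather than storing them as flags.

```go
package main

import (
	"fmt"
	"strings"
)

// Template is a hypothetical, simplified template record.
type Template struct {
	Text           string
	MatchesPattern bool // matches a detected change pattern
	UsesSymbol     bool // references a detected symbol
	Generic        bool // carries no specific placeholders
}

// scoreTemplate applies a subset of the documented adjustments to the base score.
func scoreTemplate(t Template, haveSpecifics bool) float64 {
	score := 1.0 // base score
	if t.MatchesPattern {
		score += 2.0
	}
	if t.UsesSymbol {
		score += 1.5
	}
	if t.Generic && haveSpecifics {
		score -= 0.5
	}
	return score
}

// fillTemplate replaces {name} placeholders with the supplied values.
func fillTemplate(text string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for name, value := range values {
		pairs = append(pairs, "{"+name+"}", value)
	}
	return strings.NewReplacer(pairs...).Replace(text)
}

func main() {
	t := Template{Text: "add {item} to {topic}", MatchesPattern: true, UsesSymbol: true}
	fmt.Println(scoreTemplate(t, true))
	fmt.Println(fillTemplate(t.Text, map[string]string{"item": "parser", "topic": "cli"}))
}
```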

## 6. Alternative Suggestions (Diversity Algorithm)
When regenerating suggestions:
- Used messages are filtered out.
- Similarity is computed using:
  - **Word-level Jaccard similarity (60%)**
  - **Character position matching (40%)**
- A diversity bonus favors less similar suggestions.
- A small random factor introduces controlled variation.
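
The 60/40 similarity blend might look like the sketch below. The exact definition of "character position matching" is not given here, so this example assumes it means the fraction of positions holding the same character, measured against the longer message. The function names are hypothetical, and the diversity bonus and random factor are omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet lowercases a message and returns its set of whitespace-separated words.
func wordSet(s string) map[string]bool {
	set := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| over the word sets of a and b.
func jaccard(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for w := range setA {
		if setB[w] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// positional returns the share of positions with identical characters,
// relative to the longer message (an assumed definition).
func positional(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longer, shorter := len(ra), len(rb)
	if shorter > longer {
		longer, shorter = shorter, longer
	}
	if longer == 0 {
		return 1
	}
	same := 0
	for i := 0; i < shorter; i++ {
		if ra[i] == rb[i] {
			same++
		}
	}
	return float64(same) / float64(longer)
}

// similarity blends both measures with the documented 60/40 split.
func similarity(a, b string) float64 {
	return 0.6*jaccard(a, b) + 0.4*positional(a, b)
}

func main() {
	fmt.Printf("%.2f\n", similarity("fix parser bug", "fix parser bug"))
	fmt.Printf("%.2f\n", similarity("add tests", "remove config"))
}
```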

## 7. Configuration Influence
**Location:** `internal/config/config.go` + `docs/CONFIGURATION.md`

Configuration can adjust:
- Topic mappings
- Keyword mappings and weights
- Diff-stat thresholds
- Project-specific defaults

This allows the algorithm’s weighting to be tuned without code changes.

docs/CONFIGURATION.md

Lines changed: 29 additions & 0 deletions
@@ -104,6 +104,35 @@ Controls the threshold for the diff stat analysis algorithm. This ratio determin
}
```

### Normalized Scoring

**`normalizeScoring`** (boolean, default: true)

Enables normalized confidence weights for action selection. This algorithm reduces noise when multiple weak signals compete by calculating a weighted average instead of a raw additive score.

**`signalWeights`** (object)

Defines the confidence weights for different signal sources. Only used when `normalizeScoring` is `true`.

**Default weights:**
- `branch`: 0.35 (strongest signal)
- `diffStat`: 0.25
- `keywords`: 0.25
- `patterns`: 0.15 (multi-file patterns)

**Example:**
```json
{
  "normalizeScoring": true,
  "signalWeights": {
    "branch": 0.5,
    "diffStat": 0.2,
    "keywords": 0.2,
    "patterns": 0.1
  }
}
```
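
One way such a config could be read in Go while preserving the documented defaults is sketched below. The `Config` and `SignalWeights` types and the `loadConfig` function are hypothetical, not the contents of `internal/config/config.go`. Whether Gitmit validates that the weights sum to 1.0 is not stated here, so the sketch does not check it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SignalWeights mirrors the documented keys of the signalWeights object.
type SignalWeights struct {
	Branch   float64 `json:"branch"`
	DiffStat float64 `json:"diffStat"`
	Keywords float64 `json:"keywords"`
	Patterns float64 `json:"patterns"`
}

// Config is a hypothetical subset of the full configuration.
type Config struct {
	NormalizeScoring bool          `json:"normalizeScoring"`
	SignalWeights    SignalWeights `json:"signalWeights"`
}

// defaultConfig returns the documented defaults.
func defaultConfig() Config {
	return Config{
		NormalizeScoring: true,
		SignalWeights:    SignalWeights{Branch: 0.35, DiffStat: 0.25, Keywords: 0.25, Patterns: 0.15},
	}
}

// loadConfig decodes JSON on top of the defaults, so keys that are absent
// from the input keep their default values.
func loadConfig(data []byte) (Config, error) {
	cfg := defaultConfig()
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig([]byte(`{"signalWeights": {"branch": 0.5}}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Decoding into a pre-populated struct is what lets a partial `signalWeights` object override a single weight while the others stay at their defaults.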

### Topic Mappings

**`topicMappings`** (object)
