| 1 | +# Gitmit Algorithms |
| 2 | + |
| 3 | +## Overview |
| 4 | +Gitmit generates Conventional Commit messages by combining git diff parsing, heuristic analysis, weighted scoring, and template selection. The pipeline runs fully offline and is deterministic; optional AI support is a separate layer. |
| 5 | + |
| 6 | +``` |
| 7 | +Git status/diff → Parser → Analyzer → Templater → Formatter → Commit message |
| 8 | +``` |
| 9 | + |
| 10 | +## 1. Change Collection (Parser) |
| 11 | +**Location:** `internal/parser/git.go` |
| 12 | + |
| 13 | +1. **Staged file discovery:** `git status --porcelain` is scanned to identify staged files and their actions (A/M/D/R/C). |
| 14 | +2. **Per-file diff extraction:** For each staged file, `git diff --cached -U0 -- <file>` is streamed. |
| 15 | +3. **Line stats:** Added/removed lines are counted by diff prefixes (`+`/`-`). |
| 16 | +4. **Major change flag:** A file is marked `IsMajor` when added+removed lines ≥ 500. |
| 17 | + |
| 18 | +The parser returns a list of `Change` objects and aggregates totals for diff-stat analysis. |
| 19 | + |
| 20 | +## 2. Analyzer: Feature & Context Extraction |
| 21 | +**Location:** `internal/analyzer/analyzer.go` |
| 22 | + |
| 23 | +### 2.1 File/Topic/Item Detection |
| 24 | +- **Topic** is inferred from directory path with configurable overrides (`topicMappings`). |
| 25 | +- **Item** defaults to the filename without extension. |
| 26 | +- **Purpose** is inferred from keyword mappings and built-in keyword heuristics. |
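As a rough sketch of the topic and item rules above: the helper names `itemFor` and `topicFor` are hypothetical, and the longest-prefix matching of `topicMappings` keys against the directory is an assumption about how overrides might be applied, not something the document specifies.

```go
package main

import (
	"path"
	"strings"
)

// itemFor returns the filename without its extension.
// Dotfiles such as ".gitignore" are returned unchanged.
func itemFor(filePath string) string {
	base := path.Base(filePath)
	item := strings.TrimSuffix(base, path.Ext(base))
	if item == "" {
		return base
	}
	return item
}

// topicFor infers a topic from the file's directory. The overrides map
// plays the role of topicMappings; the longest matching directory prefix
// wins (an assumption). Without a match, the immediate parent directory
// name is used. Files at the repository root yield an empty topic.
func topicFor(filePath string, overrides map[string]string) string {
	dir := path.Dir(filePath)
	bestPrefix, topic := "", ""
	for prefix, mapped := range overrides {
		matches := dir == prefix || strings.HasPrefix(dir, prefix+"/")
		if matches && len(prefix) > len(bestPrefix) {
			bestPrefix, topic = prefix, mapped
		}
	}
	if bestPrefix != "" {
		return topic
	}
	if dir == "." {
		return ""
	}
	return path.Base(dir)
}
```

The `path` package is used instead of `path/filepath` because git reports paths with forward slashes on every platform.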
| 27 | + |
| 28 | +### 2.2 Symbol Extraction |
| 29 | +Regex-based extraction detects structures from added lines: |
| 30 | +- **Functions** (Go, JS/TS, Python, Java) |
| 31 | +- **Structs/Classes** |
| 32 | +- **Methods** (receiver-based Go methods) |
| 33 | + |
| 34 | +These symbols are used to populate `{item}` placeholders and improve specificity. |
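To illustrate the idea for Go sources only, the sketch below pulls function, method, and struct names out of added diff lines. The regular expressions and the `Symbols` type are illustrative assumptions; the patterns in `internal/analyzer/analyzer.go` may differ and also cover other languages.

```go
package main

import (
	"regexp"
	"strings"
)

// Illustrative patterns for Go declarations (assumed, not Gitmit's own).
var (
	goFuncRe   = regexp.MustCompile(`^func\s+([A-Za-z_]\w*)\s*[\[(]`)
	goMethodRe = regexp.MustCompile(`^func\s+\([^)]*\)\s+([A-Za-z_]\w*)\s*[\[(]`)
	goStructRe = regexp.MustCompile(`^type\s+([A-Za-z_]\w*)\s+struct\b`)
)

// Symbols groups the names found in added lines. Hypothetical type.
type Symbols struct {
	Functions []string
	Methods   []string
	Structs   []string
}

// extractSymbols scans only added lines ("+" prefix, excluding the
// "+++" file header) and records declared names.
func extractSymbols(diff string) Symbols {
	var out Symbols
	for _, line := range strings.Split(diff, "\n") {
		if !strings.HasPrefix(line, "+") || strings.HasPrefix(line, "+++") {
			continue
		}
		code := strings.TrimSpace(line[1:])
		if m := goMethodRe.FindStringSubmatch(code); m != nil {
			out.Methods = append(out.Methods, m[1])
		} else if m := goFuncRe.FindStringSubmatch(code); m != nil {
			out.Functions = append(out.Functions, m[1])
		} else if m := goStructRe.FindStringSubmatch(code); m != nil {
			out.Structs = append(out.Structs, m[1])
		}
	}
	return out
}
```

Regex matching of this kind is line-based, so declarations split across several lines are missed; that trade-off keeps extraction fast and dependency-free.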
| 35 | + |
| 36 | +### 2.3 Change Pattern Detection |
| 37 | +Single-file patterns include: |
| 38 | +- error handling, tests, imports, docs/comments, refactors |
| 39 | +- API/database/performance/security indicators |
| 40 | +- validation, logging, middleware, DI, CLI changes |
| 41 | + |
| 42 | +### 2.4 Multi-file Pattern Detection |
| 43 | +Across all changes, Gitmit detects patterns such as: |
| 44 | +- **feature-addition** (many new files) |
| 45 | +- **bug-fix-cascade** (many modified files with fix keywords) |
| 46 | +- **refactor-sweep** (mixed A/M/D) |
| 47 | +- **test-suite-update** / **config-update** |
| 48 | +- **api-redesign** / **database-migration** |
| 49 | + |
| 50 | +### 2.5 Special-Case Fallbacks |
| 51 | +Early exits provide deterministic messages for clear cases: |
| 52 | +- Single added file → `feat` |
| 53 | +- Single deleted file → `chore` |
| 54 | +- Only docs/config/deps → `docs`/`ci`/`chore(deps)` |
| 55 | + |
| 56 | +## 3. Action (Type) Scoring Algorithm |
| 56 | +The commit **action** is determined by a weighted score map. By default, scores are computed from normalized confidence weights. |
| 58 | + |
| 59 | +### 3.1 Normalized Scoring (Default) |
| 60 | +Gitmit uses **normalized confidence weights** to reduce noise when multiple signals compete. |
| 61 | + |
| 62 | +1. **Normalize signals (0–1):** |
| 63 | + - **Branch hint:** 1.0 if branch name matches an action, 0.0 otherwise. |
| 64 | + - **Diff-stat:** 0–1 based on distance from thresholds (added/removed ratio). |
| 65 | + - **Keywords:** Raw keyword scores are normalized relative to the highest-scoring action. |
| 66 | + - **Multi-file patterns:** 1.0 if a relevant pattern is detected, 0.0 otherwise. |
| 67 | +2. **Apply confidence weights:** |
| 68 | + - branch: 0.35 |
| 69 | + - diff-stat: 0.25 |
| 70 | + - keywords: 0.25 |
| 71 | + - multi-file patterns: 0.15 |
| 72 | +3. **Final score:** `sum(weight × normalized_signal)` per action. |
| 73 | +4. **Selection:** The action with the highest final score is selected. |
| 74 | +5. **Fallback:** If the top action's score is below 0.35, Gitmit falls back to file-based heuristics. |
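The five steps above amount to a weighted sum followed by a threshold check. The sketch below uses the weights and the 0.35 threshold from this section; the `Signals` type and the function names are hypothetical, and breaking ties by alphabetical action name is an assumption added to keep the result deterministic.

```go
package main

import "sort"

// Confidence weights and fallback threshold as listed above.
const (
	weightBranch      = 0.35
	weightDiffStat    = 0.25
	weightKeywords    = 0.25
	weightPatterns    = 0.15
	fallbackThreshold = 0.35
)

// Signals holds one action's normalized inputs, each in [0, 1].
type Signals struct {
	Branch, DiffStat, Keywords, Patterns float64
}

// normalizeKeywords scales raw keyword scores relative to the highest one.
func normalizeKeywords(raw map[string]float64) map[string]float64 {
	highest := 0.0
	for _, v := range raw {
		if v > highest {
			highest = v
		}
	}
	out := make(map[string]float64, len(raw))
	for action, v := range raw {
		if highest > 0 {
			out[action] = v / highest
		} else {
			out[action] = 0
		}
	}
	return out
}

// finalScore computes sum(weight × normalized_signal) for one action.
func finalScore(s Signals) float64 {
	return weightBranch*s.Branch + weightDiffStat*s.DiffStat +
		weightKeywords*s.Keywords + weightPatterns*s.Patterns
}

// selectAction returns the highest-scoring action and its score. The
// boolean is false when the score is below the fallback threshold, in
// which case the caller would use file-based heuristics instead.
func selectAction(perAction map[string]Signals) (string, float64, bool) {
	names := make([]string, 0, len(perAction))
	for name := range perAction {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic tie-break (assumption)

	action, best := "", -1.0
	for _, name := range names {
		if s := finalScore(perAction[name]); s > best {
			action, best = name, s
		}
	}
	if action == "" {
		return "", 0, false
	}
	return action, best, best >= fallbackThreshold
}
```

For example, an action with a matching branch (1.0), a diff-stat signal of 0.8, the top keyword score (1.0), and no multi-file pattern scores 0.35·1.0 + 0.25·0.8 + 0.25·1.0 + 0.15·0 = 0.80, well above the fallback threshold.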
| 75 | + |
| 76 | +### 3.2 Legacy Additive Scoring |
| 77 | +If `normalizeScoring` is disabled in config, Gitmit falls back to raw score aggregation: |
| 78 | +1. **Branch name hints:** +3 to matching action. |
| 79 | +2. **Diff-stat ratio:** +2 to `feat` or `refactor`. |
| 80 | +3. **Keyword scoring:** per-action weights are added directly. |
| 81 | +4. **Multi-file patterns:** +3 or +4 to relevant actions. |
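For comparison, the legacy mode can be sketched as plain accumulation. The function name and parameters are hypothetical; which action receives the diff-stat bonus and the exact pattern bonuses (+3 or +4) are passed in by the caller because this section does not spell them out.

```go
package main

// additiveScores sums raw contributions per action: +3 for a branch
// hint, +2 for the diff-stat action, plus keyword weights and pattern
// bonuses as supplied. Empty action names are ignored.
func additiveScores(branchAction, diffStatAction string,
	keywordScores, patternBonus map[string]float64) map[string]float64 {

	scores := make(map[string]float64)
	if branchAction != "" {
		scores[branchAction] += 3
	}
	if diffStatAction != "" {
		scores[diffStatAction] += 2
	}
	for action, w := range keywordScores {
		scores[action] += w
	}
	for action, bonus := range patternBonus {
		scores[action] += bonus
	}
	return scores
}
```

Because nothing is normalized here, one source with many hits (for instance a long run of keyword matches) can outweigh all the others, which is the noise the default mode is meant to reduce.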
| 82 | + |
| 83 | +## 4. Scope Selection |
| 84 | +- Single topic → that topic |
| 85 | +- Single directory → directory name |
| 86 | +- 2–3 topics → combined scope (sorted) |
| 87 | +- Many topics → most common or `core` |
| 88 | +- Commit history can override scope when consistent across recent commits |
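A minimal sketch of the topic-count rules, leaving out the single-directory and commit-history cases. `selectScope` is a hypothetical name, and several details are assumptions: combined scopes are joined with a comma, "many" means more than three distinct topics, and `core` is used when no single topic is strictly the most common.

```go
package main

import (
	"sort"
	"strings"
)

// selectScope takes one topic per changed file and picks a scope.
func selectScope(fileTopics []string) string {
	counts := make(map[string]int)
	for _, t := range fileTopics {
		if t != "" {
			counts[t]++
		}
	}
	topics := make([]string, 0, len(counts))
	for t := range counts {
		topics = append(topics, t)
	}
	sort.Strings(topics)

	switch {
	case len(topics) == 0:
		return "core" // assumption: nothing to go on
	case len(topics) == 1:
		return topics[0]
	case len(topics) <= 3:
		return strings.Join(topics, ",") // separator is an assumption
	}

	// Many topics: use the strictly most common one, else "core".
	best, bestCount, tied := "", 0, false
	for _, t := range topics {
		c := counts[t]
		if c > bestCount {
			best, bestCount, tied = t, c, false
		} else if c == bestCount {
			tied = true
		}
	}
	if tied {
		return "core"
	}
	return best
}
```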
| 89 | + |
| 90 | +## 5. Template Selection & Scoring |
| 91 | +**Location:** `internal/templater/templater.go` |
| 92 | + |
| 93 | +1. **Template group resolution:** action → template group (A/M/D/R/DOC/SECURITY/MISC). |
| 94 | +2. **Topic match:** exact → fuzzy → `_default`. |
| 95 | +3. **Template scoring:** |
| 96 | + - Base score 1.0 |
| 97 | + - +2.0 for matching detected patterns |
| 98 | + - +1.5 for using detected symbols |
| 99 | + - +1.0 for meaningful purpose placeholders |
| 100 | + - +0.5–1.5 for file-type relevance |
| 101 | + - +1.0 for major change templates |
| 102 | + - -0.5 for generic templates when specifics exist |
| 103 | +4. **History de-dup:** recent messages are avoided when possible. |
| 104 | + |
| 105 | +The highest-scoring template is selected, and placeholders (`{topic}`, `{item}`, `{purpose}`, `{source}`, `{target}`) are replaced. |
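The scoring adjustments and the placeholder step can be sketched as below. `TemplateFeatures`, `scoreTemplate`, and `render` are hypothetical names, not the API of `internal/templater/templater.go`. How each feature is detected is out of scope here, so the sketch takes them as precomputed flags.

```go
package main

import "strings"

// TemplateFeatures describes how a candidate template relates to the
// analyzed change.
type TemplateFeatures struct {
	MatchesPattern          bool
	UsesSymbol              bool
	HasPurpose              bool
	FileTypeRelevance       float64 // 0 when irrelevant, else 0.5 to 1.5
	ForMajorChange          bool
	GenericDespiteSpecifics bool
}

// scoreTemplate applies the bonuses and the penalty listed above.
func scoreTemplate(f TemplateFeatures) float64 {
	score := 1.0 // base score
	if f.MatchesPattern {
		score += 2.0
	}
	if f.UsesSymbol {
		score += 1.5
	}
	if f.HasPurpose {
		score += 1.0
	}
	score += f.FileTypeRelevance
	if f.ForMajorChange {
		score += 1.0
	}
	if f.GenericDespiteSpecifics {
		score -= 0.5
	}
	return score
}

// placeholders lists the names documented above.
var placeholders = []string{"topic", "item", "purpose", "source", "target"}

// render substitutes known placeholders that have a value; placeholders
// without a value are left in place.
func render(tmpl string, values map[string]string) string {
	var pairs []string
	for _, name := range placeholders {
		if v, ok := values[name]; ok {
			pairs = append(pairs, "{"+name+"}", v)
		}
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}
```

For instance, `render("add {item} to {topic}", map[string]string{"item": "Parser", "topic": "parser"})` yields `add Parser to parser`.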
| 106 | + |
| 107 | +## 6. Alternative Suggestions (Diversity Algorithm) |
| 108 | +When regenerating suggestions: |
| 109 | +- Used messages are filtered out. |
| 110 | +- Similarity is computed using: |
| 111 | + - **Word-level Jaccard similarity (60%)** |
| 112 | + - **Character position matching (40%)** |
| 113 | +- A diversity bonus favors less similar suggestions. |
| 114 | +- A small random factor introduces controlled variation. |
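The 60/40 similarity blend can be sketched as follows. The document does not define "character position matching", so this sketch assumes it means the fraction of positions holding the same character, divided by the longer message's length. The diversity bonus and random factor are left out.

```go
package main

import "strings"

// wordSet returns the set of lowercase whitespace-separated words.
func wordSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// jaccard is |A ∩ B| / |A ∪ B| over the two messages' word sets.
func jaccard(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for w := range setA {
		if setB[w] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// positional is the share of character positions that hold the same
// rune in both strings, relative to the longer string (assumed meaning).
func positional(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	if longest == 0 {
		return 1
	}
	match := 0
	for i := 0; i < len(ra) && i < len(rb); i++ {
		if ra[i] == rb[i] {
			match++
		}
	}
	return float64(match) / float64(longest)
}

// similarity blends the two measures with the 60/40 split given above.
func similarity(a, b string) float64 {
	return 0.6*jaccard(a, b) + 0.4*positional(a, b)
}
```

As a quick check, "add parser tests" and "add parser docs" share two of four distinct words, giving a Jaccard component of 0.5 before the positional term is blended in.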
| 115 | + |
| 116 | +## 7. Configuration Influence |
| 117 | +**Location:** `internal/config/config.go` + `docs/CONFIGURATION.md` |
| 118 | + |
| 119 | +Configuration can adjust: |
| 120 | +- Topic mappings |
| 121 | +- Keyword mappings and weights |
| 122 | +- Diff-stat thresholds |
| 123 | +- Project-specific defaults |
| 124 | + |
| 125 | +This allows the algorithm’s weighting to be tuned without code changes. |