|
| 1 | +# Plan: Create Git History Domain |
| 2 | + |
| 3 | +## TL;DR |
| 4 | + |
| 5 | +Create `domains/git-history/` as a new vertical-slice domain covering all git history analysis — directory commit statistics, co-changed files, pairwise file metrics, author analysis, and git import orchestration. Copy all 42 GitLog Cypher queries (organized as enrichment/statistics/validation), the import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh), and GitHistoryCsv.sh reporting logic. Convert GitHistoryGeneral.ipynb charts (~20 treemaps, bar charts, histograms) into a standalone Python script. Copy both notebooks into explore/ with validation disabled. Create a Markdown summary report. No moves or deletions of originals. |
| 6 | + |
| 7 | +## Decisions |
| 8 | + |
| 9 | +- **Domain name**: `git-history` |
| 10 | +- **Cypher organization**: Three subdirectories — `enrichment/` (import, indexing, relationship creation, property setting), `statistics/` (listing/querying for reports), `validation/` (verification queries) |
| 11 | +- **importGit.sh handling**: Copy into domain `import/` directory. Keep original. Add a TODO comment at the line in `scripts/resetAndScan.sh` that sources `importGit.sh`, noting that the dependency direction (core → domain) should be revisited. |
| 12 | +- **createGitLogCsv.sh + createAggregatedGitLogCsv.sh**: Copy into domain `import/` |
| 13 | +- **Report output directory**: `reports/git-history` (matches domain name, breaking change vs. old `reports/git-history-csv/`) |
| 14 | +- **GitHistoryGeneral.ipynb**: All ~20 charts converted to Python script |
| 15 | +- **GitHistoryExploration.ipynb**: Exploration notebook only (correlation analysis not in report) |
| 16 | +- **Wordcloud**: Git author wordcloud included (cypher query `Words_for_git_author_Wordcloud_with_frequency.cypher` copied) |
| 17 | +- **Entry point naming**: `gitHistoryCsv.sh`, `gitHistoryPython.sh`, `gitHistoryMarkdown.sh` (no Visualization entry point — no GraphViz graph visualizations in git history) |
| 18 | +- **No-git-data handling**: The analyzed codebase may have no git history at all. All entry points must handle this gracefully: `gitHistoryCsv.sh` produces no output files (cleanup removes empty CSVs → no report dir created); `gitHistoryCharts.py` skips chart generation if input CSVs are absent; `gitHistoryMarkdown.sh` detects absence of the report dir and renders `report_no_git_data.template.md` instead. |
| 19 | + |
| 20 | +## Domain Directory Structure |
| 21 | + |
| 22 | +``` |
| 23 | +domains/git-history/ |
| 24 | +├── README.md |
| 25 | +├── PREREQUISITES.md |
| 26 | +├── COPIED_FILES.md |
| 27 | +├── gitHistoryCsv.sh # Entry point: CSV reports (*Csv.sh) |
| 28 | +├── gitHistoryPython.sh # Entry point: Python charts (*Python.sh) |
| 29 | +├── gitHistoryMarkdown.sh # Entry point: Markdown summary (*Markdown.sh) |
| 30 | +├── gitHistoryCharts.py # Chart generation: treemap, bar, histogram → SVG |
| 31 | +├── explore/ |
| 32 | +│ ├── GitHistoryGeneralExploration.ipynb |
| 33 | +│ └── GitHistoryCorrelationExploration.ipynb |
| 34 | +├── import/ |
| 35 | +│ ├── importGit.sh # Git data import orchestrator |
| 36 | +│ ├── createGitLogCsv.sh # Full git log → CSV |
| 37 | +│ └── createAggregatedGitLogCsv.sh # Aggregated git log → CSV |
| 38 | +├── queries/ |
| 39 | +│ ├── enrichment/ # 26 files: import, indexing, relationships, property setting |
| 40 | +│ ├── statistics/ # 13 files: listing and querying for reports |
| 41 | +│ └── validation/ # 5 files: verification and validation queries |
| 42 | +└── summary/ |
| 43 | + ├── gitHistorySummary.sh # Markdown assembly logic |
| 44 | + ├── report.template.md # Main report template |
| 45 | + └── report_no_git_data.template.md # Fallback: no git data |
| 46 | +``` |
| 47 | + |
| 48 | +## Steps |
| 49 | + |
| 50 | +### Phase 1: Scaffolding & Documentation |
| 51 | + |
| 52 | +1.1 Create directory structure: `domains/git-history/{explore,import,queries/{enrichment,statistics,validation},summary}` |
| 53 | + |
| 54 | +1.2 Create `PREREQUISITES.md` documenting external dependencies: |
| 55 | + - Neo4j running with scanned artifacts |
| 56 | + - Git history imported (importGit.sh or plugin); IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT env var |
| 57 | + - Git:File ← RESOLVES_TO → code files (Java + TypeScript) |
| 58 | + - CHANGED_TOGETHER_WITH relationships between Git:Files and resolved code files |
| 59 | + - numberOfGitCommits property on code File nodes |
| 60 | + - updateCommitCount property on Git:File nodes |
| 61 | + - isMergeCommit, isAutomationCommit classification properties on commits |
| 62 | + - General enrichment: `name`, `extension` properties on File nodes from cypher/General_Enrichment/ |
| 63 | + - executeQueryFunctions.sh, cleanupAfterReportGeneration.sh (central pipeline scripts) |
| 64 | + |
| 65 | +1.3 Create `COPIED_FILES.md` tracking all original → copy mappings for deprecation follow-up |
| 66 | + |
| 67 | +1.4 Create `README.md` — domain overview, entry points, folder structure, prerequisites reference, output description |
| 68 | + |
| 69 | +### Phase 2: Copy Cypher Queries |
| 70 | + |
| 71 | +2.1 Copy enrichment queries (26 files) from `cypher/GitLog/` → `queries/enrichment/`: |
| 72 | + - Import: Import_git_log_csv_data, Import_aggregated_git_log_csv_data |
| 73 | + - Repository: Create_git_repository_node |
| 74 | + - Deletion: Delete_git_log_data, Delete_plain_git_directory_file_nodes |
| 75 | + - Indexes (8): Index_absolute_file_name, Index_author_name, Index_change_span_year, Index_commit_hash, Index_commit_parent, Index_commit_sha, Index_file_name, Index_file_relative_path |
| 76 | + - Relationships (5): Add_CHANGED_TOGETHER_WITH_relationships_to_code_files, Add_CHANGED_TOGETHER_WITH_relationships_to_git_files, Add_HAS_PARENT_relationships_to_commits, Add_RESOLVES_TO_relationships_to_git_files_for_Java, Add_RESOLVES_TO_relationships_to_git_files_for_Typescript |
| 77 | + - Properties (5): Set_commit_classification_properties, Set_number_of_aggregated_git_commits, Set_number_of_git_log_commits, Set_number_of_git_plugin_commits, Set_number_of_git_plugin_update_commits |
| 78 | + |
| 79 | +2.2 Copy statistics queries (13 files) from `cypher/GitLog/` → `queries/statistics/`: |
| 80 | + - List_ambiguous_git_files, List_git_file_directories_with_commit_statistics, List_git_files_by_resolved_label_and_extension, List_git_files_per_commit_distribution, List_git_files_that_were_changed_together, List_git_files_that_were_changed_together_all_in_one, List_git_files_that_were_changed_together_with_another_file, List_git_files_that_were_changed_together_with_another_file_all_in_one, List_git_files_with_commit_statistics_by_author, List_pairwise_changed_files, List_pairwise_changed_files_top_selected_metric, List_pairwise_changed_files_with_dependencies, List_unresolved_git_files |
| 81 | + - Also copy: `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` |
| 82 | + |
| 83 | +2.3 Copy validation queries (5 files: 4 from `cypher/GitLog/`, 1 from `cypher/Validation/`) → `queries/validation/`: |
| 84 | + - Verify_code_to_git_file_unambiguous, Verify_git_missing_CHANGED_TOGETHER_WITH_properties, Verify_git_missing_create_date, Verify_git_to_code_file_unambiguous |
| 85 | + - Also copy: `cypher/Validation/ValidateGitHistory.cypher` |
| 86 | + |
| 87 | +### Phase 3: Copy Import Scripts |
| 88 | + |
| 89 | +3.1 Copy `scripts/importGit.sh` → `import/importGit.sh` |
| 90 | + - Update CYPHER_DIR references to point to `../queries/enrichment/` instead of `${CYPHER_DIR}/GitLog` |
| 91 | + - Update sourced scripts references: createGitLogCsv.sh, createAggregatedGitLogCsv.sh to use domain-local paths |
| 92 | + |
| 93 | +3.2 Copy `scripts/createGitLogCsv.sh` → `import/createGitLogCsv.sh` (no changes needed) |
| 94 | + |
| 95 | +3.3 Copy `scripts/createAggregatedGitLogCsv.sh` → `import/createAggregatedGitLogCsv.sh` (no changes needed) |
| 96 | + |
| 97 | +3.4 Add TODO comment to `scripts/resetAndScan.sh` at the `source "${SCRIPTS_DIR}/importGit.sh"` line noting the core → domain dependency direction should be revisited (*depends on 3.1*) |
| 98 | + |
| 99 | +### Phase 4: Create CSV Entry Point Script (*depends on 2.2*) |
| 100 | + |
| 101 | +4.1 Create `gitHistoryCsv.sh`: |
| 102 | + - Follow boilerplate from `internalDependenciesCsv.sh`: BASH_SOURCE/CDPATH directory resolution, `set -o errexit -o pipefail` |
| 103 | + - Source `../../scripts/executeQueryFunctions.sh` |
| 104 | + - Report name: `git-history`, output to `reports/git-history/` |
| 105 | + - Execute statistics queries (adapted from `scripts/reports/GitHistoryCsv.sh`): |
| 106 | + - List_git_files_with_commit_statistics_by_author → CSV |
| 107 | + - List_git_files_that_were_changed_together_with_another_file → CSV |
| 108 | + - List_git_file_directories_with_commit_statistics → CSV |
| 109 | + - List_git_files_per_commit_distribution → CSV |
| 110 | + - List_pairwise_changed_files_with_dependencies → CSV |
| 111 | + - List_pairwise_changed_files_top_selected_metric × 4 metrics (count, min_confidence, jaccard, lift) → CSVs |
| 112 | + - Also: List_git_files_by_resolved_label_and_extension, List_ambiguous_git_files, List_unresolved_git_files (for data quality) |
| 113 | + - Also: Words_for_git_author_Wordcloud_with_frequency → CSV (for the wordcloud) |
| 114 | + - Clean up empty reports via `cleanupAfterReportGeneration.sh` |
| 115 | + - **No-data case**: if all queries return empty results, `cleanupAfterReportGeneration.sh` removes all CSVs and the report dir will not exist — this is the signal used downstream |
| 116 | + |
| 117 | +### Phase 5: Create Python Charts Script (*parallel with Phase 4*) |
| 118 | + |
| 119 | +5.1 Create `gitHistoryCharts.py`: |
| 120 | + - Follow `Parameters` class pattern from `pathFindingCharts.py` and `treemapVisualizations.py` |
| 121 | + - CLI: `--report_directory`, `--verbose` arguments |
| 122 | + - Neo4j connection via `bolt://localhost:7687` with `NEO4J_INITIAL_PASSWORD` |
| 123 | + - Load CSV data from report directory (not querying Neo4j for charts — uses CSV output from Phase 4) |
| 124 | + - **No-data case**: if the report directory does not exist or the required CSV files are absent, log a warning and exit 0 without generating any SVGs |
| 125 | + |
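The CLI and no-data behavior described in 5.1 could be sketched roughly as follows. This is a minimal illustration, not the `Parameters` pattern of the referenced scripts: the function names `parse_arguments`, `missing_inputs` and `main`, the default directory, and the CSV file names in `REQUIRED_CSV_FILES` are all assumptions, since the plan does not fix them.

```python
import argparse
import sys
from pathlib import Path

# Hypothetical file names: the plan does not state which CSV names gitHistoryCsv.sh writes.
REQUIRED_CSV_FILES = (
    "List_git_files_with_commit_statistics_by_author.csv",
    "List_git_files_per_commit_distribution.csv",
)


def parse_arguments(arguments=None):
    """Parse the two command line options named in the plan."""
    parser = argparse.ArgumentParser(description="Generate git history charts as SVG files")
    parser.add_argument("--report_directory", default="reports/git-history",
                        help="directory containing the CSV reports")
    parser.add_argument("--verbose", action="store_true", help="print progress details")
    return parser.parse_args(arguments)


def missing_inputs(report_directory, required_files=REQUIRED_CSV_FILES):
    """Return the required CSV names that are absent (all of them if the directory is missing)."""
    directory = Path(report_directory)
    return [name for name in required_files if not (directory / name).is_file()]


def main(arguments=None):
    parameters = parse_arguments(arguments)
    missing = missing_inputs(parameters.report_directory)
    if missing:
        # No-data case: warn and report success so the surrounding pipeline continues.
        print(f"Warning: no git history data, skipping charts (missing: {missing})",
              file=sys.stderr)
        return 0
    if parameters.verbose:
        print(f"Generating charts from {parameters.report_directory}")
    # Chart generation would follow here.
    return 0
```

A wrapper script would typically end with `sys.exit(main())`, so that the no-data case still yields exit code 0.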
| 126 | +5.2 Data preparation functions (extracted from GitHistoryGeneral.ipynb): |
| 127 | + - `add_quantile_limited_column(data_frame, column_name, quantile)` → DataFrame |
| 128 | + - `add_rank_column(data_frame, column_name)` → DataFrame |
| 129 | + - `add_file_extension_column(data_frame, file_path_column)` → DataFrame |
| 130 | + - `add_directory_column(data_frame, file_path_column)` → DataFrame (explodes paths into directories) |
| 131 | + - `add_directory_name_column(data_frame, directory_column)` → DataFrame |
| 132 | + - `add_parent_directory_column(data_frame, directory_column)` → DataFrame |
| 133 | + - Aggregation helpers: `get_last_entry`, `collect_as_array`, `second_entry`, `get_flattened_unique_values`, `count_unique_aggregated_values`, `get_most_frequent_entry` |
| 134 | + |
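To illustrate what "explodes paths into directories" in 5.2 means, the per-path logic behind `add_directory_column` and `add_file_extension_column` might look like the sketch below. It uses only the standard library and works on single path strings; the helper names `parent_directories` and `file_extension` are hypothetical, and the notebook code the plan extracts operates on DataFrames and may differ in detail.

```python
import posixpath


def parent_directories(file_path):
    """Return every directory that contains the file, from nearest to top-level."""
    directories = []
    directory = posixpath.dirname(file_path)
    while directory not in ("", "/"):
        directories.append(directory)
        directory = posixpath.dirname(directory)
    return directories


def file_extension(file_path):
    """Return the extension without the leading dot, or an empty string if there is none."""
    return posixpath.splitext(file_path)[1].lstrip(".")
```

For `src/main/java/App.java` this yields the directories `src/main/java`, `src/main` and `src`, so one file row becomes three directory rows and a treemap can aggregate the file at every level of the hierarchy.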
| 135 | +5.3 Directory commit statistics preparation (the multi-step grouping pipeline from notebook cell 22): |
| 136 | + - Query Neo4j for `List_git_files_with_commit_statistics_by_author.cypher` |
| 137 | + - Extract author rankings, file extension rankings |
| 138 | + - Group by directory+author → group by directory only → add names/parents → final grouping |
| 139 | + - Produces the hierarchical directory structure for treemaps |
| 140 | + |
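The grouping idea in 5.3 (directory+author first, then directory only) can be shown in plain Python. This is a simplified sketch under assumptions: the input row shape `(directory, author, commit_count)`, the function name `summarize_directories` and the output keys are invented for illustration, and the notebook pipeline carries more columns than shown here.

```python
from collections import Counter, defaultdict


def summarize_directories(rows):
    """Aggregate (directory, author, commit_count) rows into per-directory statistics."""
    # Step 1: group by directory and author.
    commits_by_directory = defaultdict(Counter)
    for directory, author, commit_count in rows:
        commits_by_directory[directory][author] += commit_count

    # Step 2: group by directory only and derive the author ranking.
    summary = {}
    for directory, commits_by_author in commits_by_directory.items():
        ranked = [author for author, _ in commits_by_author.most_common()]
        summary[directory] = {
            # Sum of the per-row counts; a commit that appears in several rows is counted per row.
            "commits": sum(commits_by_author.values()),
            "authors": len(ranked),
            "main_author": ranked[0],
            "second_author": ranked[1] if len(ranked) > 1 else None,
        }
    return summary
```

The `main_author` and `second_author` values correspond to the "Main author per directory" and "Second author per directory" treemaps listed in 5.4.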
| 141 | +5.4 Treemap chart functions (~13 charts): |
| 142 | + - Number of files per directory |
| 143 | + - Most frequent file extension per directory |
| 144 | + - Number of commits per directory |
| 145 | + - Number of distinct authors per directory |
| 146 | + - Directories with very few different authors (low focus) |
| 147 | + - Main author per directory |
| 148 | + - Second author per directory |
| 149 | + - Days since last commit per directory |
| 150 | + - Days since last commit per directory (ranked) |
| 151 | + - Days since last file creation per directory |
| 152 | + - Days since last file creation per directory (ranked) |
| 153 | + - Days since last file modification per directory |
| 154 | + - Days since last file modification per directory (ranked) |
| 155 | + |
| 156 | +5.5 Co-change treemap charts (~3 charts): |
| 157 | + - Files that likely co-change with others |
| 158 | + - Co-changing files max lift |
| 159 | + - Co-changing files average lift |
| 160 | + |
| 161 | +5.6 Bar chart: files per commit distribution (1 chart) |
| 162 | + |
| 163 | +5.7 Histogram charts (~4 charts, one per metric): |
| 164 | + - Co-changed files by commit count |
| 165 | + - Co-changed files by commit min confidence |
| 166 | + - Co-changed files by commit lift |
| 167 | + - Co-changed files by commit Jaccard similarity |
| 168 | + |
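For readers unfamiliar with the four pairwise metrics plotted in 5.7, the sketch below computes them from commit counts using the standard association-rule definitions. Treat this as an assumption: the Cypher queries in `cypher/GitLog/` may define or normalize the metrics differently, and the function name `co_change_metrics` is hypothetical.

```python
def co_change_metrics(commits_a, commits_b, commits_both, commits_total):
    """Compute pairwise co-change metrics for two files from commit counts.

    commits_a / commits_b: commits that touched file A / file B
    commits_both: commits that touched both files
    commits_total: all commits considered
    """
    if min(commits_a, commits_b, commits_total) <= 0:
        raise ValueError("commit counts must be positive")
    confidence_a = commits_both / commits_a  # share of A's commits that also changed B
    confidence_b = commits_both / commits_b  # share of B's commits that also changed A
    return {
        "count": commits_both,
        "min_confidence": min(confidence_a, confidence_b),
        # Intersection over union of the two commit sets.
        "jaccard": commits_both / (commits_a + commits_b - commits_both),
        # Observed co-change rate relative to the rate expected if changes were independent.
        "lift": (commits_both * commits_total) / (commits_a * commits_b),
    }
```

With 10 commits on file A, 5 on file B, 4 shared commits and 100 commits overall, the minimum confidence is 0.4, the Jaccard similarity is 4/11 and the lift is 8.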
| 169 | +5.8 Git author wordcloud (1 chart — using wordcloud library) |
| 170 | + |
| 171 | +5.9 All charts saved as SVG to `reports/git-history/` |
| 172 | + |
| 173 | +### Phase 6: Create Python Entry Point (*depends on 5.1*) |
| 174 | + |
| 175 | +6.1 Create `gitHistoryPython.sh`: |
| 176 | + - Follow pattern of `internalDependenciesPython.sh` |
| 177 | + - Execute `gitHistoryCharts.py` with `--report_directory` and optional `--verbose` |
| 178 | + - Clean up empty reports |
| 179 | + |
| 180 | +### Phase 7: Create Exploration Notebooks (*parallel with Phase 5*) |
| 181 | + |
| 182 | +7.1 Copy `jupyter/GitHistoryGeneral.ipynb` → `explore/GitHistoryGeneralExploration.ipynb`: |
| 183 | + - Change title from "# git log/history" to "# Git History General Exploration" |
| 184 | + - Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"` |
| 185 | + - Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/` |
| 186 | + |
| 187 | +7.2 Copy `jupyter/GitHistoryExploration.ipynb` → `explore/GitHistoryCorrelationExploration.ipynb`: |
| 188 | + - Change title from "# git log/history" to "# Git History Correlation Exploration" |
| 189 | + - Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"` |
| 190 | + - Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/` |
| 191 | + |
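Setting the validation metadata from steps 7.1 and 7.2 can be scripted instead of edited by hand. The sketch below assumes the key belongs in the notebook's top-level `metadata` object, which the plan does not state explicitly; the function names are hypothetical.

```python
import json

VALIDATION_KEY = "code_graph_analysis_pipeline_data_validation"


def disable_validation(notebook):
    """Set the validation metadata entry on an already parsed notebook dictionary."""
    notebook.setdefault("metadata", {})[VALIDATION_KEY] = "ValidateAlwaysFalse"
    return notebook


def disable_validation_in_file(path):
    """Rewrite the notebook file at the given path with validation disabled."""
    with open(path, encoding="utf-8") as notebook_file:
        notebook = json.load(notebook_file)
    disable_validation(notebook)
    with open(path, "w", encoding="utf-8") as notebook_file:
        json.dump(notebook, notebook_file, indent=1, ensure_ascii=False)
        notebook_file.write("\n")
```

Calling `disable_validation_in_file` once per copied notebook in `explore/` leaves cells and other metadata untouched, apart from JSON formatting.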
| 192 | +### Phase 8: Create Markdown Summary (*depends on Phases 4, 5*) |
| 193 | + |
| 194 | +8.1 Create `summary/report.template.md`: |
| 195 | + - Front matter (title, date, version, dataset) |
| 196 | + - Section 1: Overview — what to act on first, reading guide |
| 197 | + - Section 2: Directory Commit Statistics — treemap charts, tables |
| 198 | + - Section 3: Co-Changed Files — treemap charts, top pairwise tables |
| 199 | + - Section 4: File Change Distribution — bar chart, statistics |
| 200 | + - Section 5: Pairwise Changed Files — tables per metric (count, confidence, Jaccard, lift) |
| 201 | + - Section 6: Data Quality — ambiguous files, unresolved files, file resolution statistics |
| 202 | + - Section 7: Git Author Wordcloud |
| 203 | + - Section 8: Glossary |
| 204 | + |
| 205 | +8.2 Create `summary/report_no_git_data.template.md`: |
| 206 | + - Fallback: "⚠️ No git history data available" |
| 207 | + |
| 208 | +8.3 Create `summary/gitHistorySummary.sh`: |
| 209 | + - Follow pattern of `internalDependenciesSummary.sh` |
| 210 | + - **No-data detection**: check if the report directory (`reports/git-history/`) exists and contains data; if not, render `report_no_git_data.template.md` as the final report and exit early |
| 211 | + - Generate front matter |
| 212 | + - Execute queries for Markdown table includes (limited to 10 rows) |
| 213 | + - Include SVG chart references |
| 214 | + - Assemble final report via embedMarkdownIncludes.sh |
| 215 | + |
| 216 | +8.4 Create `gitHistoryMarkdown.sh`: |
| 217 | + - Follow pattern of `internalDependenciesMarkdown.sh` |
| 218 | + - Delegates to `summary/gitHistorySummary.sh` |
| 219 | + |
| 220 | +## Relevant Files |
| 221 | + |
| 222 | +**Reference implementations (read, not modified):** |
| 223 | +- `domains/internal-dependencies/` — primary reference for domain structure, all entry point patterns, summary assembly |
| 224 | +- `domains/anomaly-detection/treemapVisualizations.py` — reference for Python chart script with Neo4j connection |
| 225 | +- `domains/anomaly-detection/explore/AnomalyDetectionExploration.ipynb` — reference for ValidateAlwaysFalse metadata |
| 226 | +- `domains/internal-dependencies/pathFindingCharts.py` — reference for chart generation patterns |
| 227 | +- `.github/prompts/plan-internal_dependencies_domain.prompt.md` — reference plan structure |
| 228 | + |
| 229 | +**Source files to copy (not modified):** |
| 230 | +- `cypher/GitLog/` — all 42 files |
| 231 | +- `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` |
| 232 | +- `cypher/Validation/ValidateGitHistory.cypher` |
| 233 | +- `scripts/reports/GitHistoryCsv.sh` — logic adapted into gitHistoryCsv.sh |
| 234 | +- `scripts/importGit.sh` — copied with path adjustments |
| 235 | +- `scripts/createGitLogCsv.sh` — copied unchanged |
| 236 | +- `scripts/createAggregatedGitLogCsv.sh` — copied unchanged |
| 237 | +- `jupyter/GitHistoryGeneral.ipynb` — copied with metadata + title changes |
| 238 | +- `jupyter/GitHistoryExploration.ipynb` — copied with metadata + title changes |
| 239 | + |
| 240 | +**Modified (minimally):** |
| 241 | +- `scripts/resetAndScan.sh` — add TODO comment at importGit.sh reference |
| 242 | + |
| 243 | +**Central scripts sourced (not copied):** |
| 244 | +- `scripts/executeQueryFunctions.sh` — provides execute_cypher(), execute_cypher_queries_until_results() |
| 245 | +- `scripts/cleanupAfterReportGeneration.sh` — removes empty CSV files |
| 246 | +- `scripts/markdown/embedMarkdownIncludes.sh` — assembles Markdown includes into final report |
| 247 | + |
| 248 | +## Verification |
| 249 | + |
| 250 | +1. Run `shellcheck domains/git-history/*.sh domains/git-history/**/*.sh` — no errors |
| 251 | +2. Run `python -m py_compile domains/git-history/gitHistoryCharts.py` — no syntax errors |
| 252 | +3. Verify all cypher files copied match originals: `diff cypher/GitLog/<file> domains/git-history/queries/enrichment/<file>` |
| 253 | +4. Verify notebook metadata: `grep "ValidateAlwaysFalse" domains/git-history/explore/*.ipynb` returns matches |
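| 254 | +5. Verify entry point discovery: `find domains/git-history -maxdepth 1 \( -name "*Csv.sh" -o -name "*Python.sh" -o -name "*Markdown.sh" \)` returns 3 files (`-maxdepth 1` is needed because `import/createGitLogCsv.sh` and `import/createAggregatedGitLogCsv.sh` also match `*Csv.sh`) |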
| 255 | +6. Manual: Open exploration notebooks in VS Code, confirm they display correctly |
| 256 | +7. Integration test (if Neo4j available): Run `gitHistoryCsv.sh` and verify CSV files in `reports/git-history/` |
| 257 | + |
| 258 | +## Scope Boundaries |
| 259 | + |
| 260 | +**Included:** |
| 261 | +- All 42 GitLog cypher queries + 2 external (wordcloud, validation) |
| 262 | +- Import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh) |
| 263 | +- CSV reporting logic from GitHistoryCsv.sh |
| 264 | +- All ~20 charts from GitHistoryGeneral.ipynb as Python SVGs |
| 265 | +- Git author wordcloud |
| 266 | +- Both exploration notebooks |
| 267 | +- Markdown summary report with tables, charts, glossary |
| 268 | +- TODO comment on resetAndScan.sh |
| 269 | + |
| 270 | +**Excluded:** |
| 271 | +- No Visualization entry point (no GraphViz graphs in git history) |
| 272 | +- No move/deletion of originals |
| 273 | +- Correlation analysis stays exploration-only (no Python script for scatter plots) |
| 274 | +- No changes to central pipeline discovery mechanism |
| 275 | +- General_Enrichment cypher not copied (documented as prerequisite) |