Commit befb3f8

Introduce git-history domain
1 parent dd36767 commit befb3f8

1 file changed

Lines changed: 275 additions & 0 deletions

@@ -0,0 +1,275 @@
1+
# Plan: Create Git History Domain
2+
3+
## TL;DR
4+
5+
Create `domains/git-history/` as a new vertical-slice domain covering all git history analysis — directory commit statistics, co-changed files, pairwise file metrics, author analysis, and git import orchestration. Copy all 42 GitLog Cypher queries (organized as enrichment/statistics/validation), the import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh), and GitHistoryCsv.sh reporting logic. Convert GitHistoryGeneral.ipynb charts (~20 treemaps, bar charts, histograms) into a standalone Python script. Copy both notebooks into explore/ with validation disabled. Create a Markdown summary report. No moves or deletions of originals.
6+
7+
## Decisions
8+
9+
- **Domain name**: `git-history`
10+
- **Cypher organization**: Three subdirectories — `enrichment/` (import, indexing, relationship creation, property setting), `statistics/` (listing/querying for reports), `validation/` (verification queries)
11+
- **importGit.sh handling**: Copy into domain `import/` directory. Keep original. Add TODO comment to `scripts/resetAndScan.sh` reference line noting the dependency direction (core → domain) should be revisited.
12+
- **createGitLogCsv.sh + createAggregatedGitLogCsv.sh**: Copy into domain `import/`
13+
- **Report output directory**: `reports/git-history` (matches domain name, breaking change vs. old `reports/git-history-csv/`)
14+
- **GitHistoryGeneral.ipynb**: All ~20 charts converted to Python script
15+
- **GitHistoryExploration.ipynb**: Exploration notebook only (correlation analysis not in report)
16+
- **Wordcloud**: Git author wordcloud included (cypher query `Words_for_git_author_Wordcloud_with_frequency.cypher` copied)
17+
- **Entry point naming**: `gitHistoryCsv.sh`, `gitHistoryPython.sh`, `gitHistoryMarkdown.sh` (no Visualization entry point — no GraphViz graph visualizations in git history)
18+
- **No-git-data handling**: The analyzed codebase may have no git history at all. All entry points must handle this gracefully: `gitHistoryCsv.sh` produces no output files (cleanup removes empty CSVs → no report dir created); `gitHistoryCharts.py` skips chart generation if input CSVs are absent; `gitHistoryMarkdown.sh` detects absence of the report dir and renders `report_no_git_data.template.md` instead.
19+
20+
## Domain Directory Structure
21+
22+
```
domains/git-history/
├── README.md
├── PREREQUISITES.md
├── COPIED_FILES.md
├── gitHistoryCsv.sh                      # Entry point: CSV reports (*Csv.sh)
├── gitHistoryPython.sh                   # Entry point: Python charts (*Python.sh)
├── gitHistoryMarkdown.sh                 # Entry point: Markdown summary (*Markdown.sh)
├── gitHistoryCharts.py                   # Chart generation: treemap, bar, histogram → SVG
├── explore/
│   ├── GitHistoryGeneralExploration.ipynb
│   └── GitHistoryCorrelationExploration.ipynb
├── import/
│   ├── importGit.sh                      # Git data import orchestrator
│   ├── createGitLogCsv.sh                # Full git log → CSV
│   └── createAggregatedGitLogCsv.sh      # Aggregated git log → CSV
├── queries/
│   ├── enrichment/                       # 26 files: import, indexing, relationships, property setting
│   ├── statistics/                       # 13 files: listing and querying for reports
│   └── validation/                       # 5 files: verification and validation queries
└── summary/
    ├── gitHistorySummary.sh              # Markdown assembly logic
    ├── report.template.md                # Main report template
    └── report_no_git_data.template.md    # Fallback: no git data
```
47+
48+
## Steps
49+
50+
### Phase 1: Scaffolding & Documentation
51+
52+
1.1 Create directory structure: `domains/git-history/{explore,import,queries/{enrichment,statistics,validation},summary}`
53+
54+
1.2 Create `PREREQUISITES.md` documenting external dependencies:
55+
- Neo4j running with scanned artifacts
56+
- Git history imported (importGit.sh or plugin); IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT env var
57+
- Git:File ← RESOLVES_TO → code files (Java + TypeScript)
58+
- CHANGED_TOGETHER_WITH relationships between Git:Files and resolved code files
59+
- numberOfGitCommits property on code File nodes
60+
- updateCommitCount property on Git:File nodes
61+
- isMergeCommit, isAutomationCommit classification properties on commits
62+
- General enrichment: `name`, `extension` properties on File nodes from cypher/General_Enrichment/
63+
- executeQueryFunctions.sh, cleanupAfterReportGeneration.sh (central pipeline scripts)
64+
65+
1.3 Create `COPIED_FILES.md` tracking all original → copy mappings for deprecation follow-up
66+
67+
1.4 Create `README.md` — domain overview, entry points, folder structure, prerequisites reference, output description
68+
69+
### Phase 2: Copy Cypher Queries
70+
71+
2.1 Copy enrichment queries (26 files) from `cypher/GitLog/` → `queries/enrichment/`:
72+
- Import: Import_git_log_csv_data, Import_aggregated_git_log_csv_data
73+
- Repository: Create_git_repository_node
74+
- Deletion: Delete_git_log_data, Delete_plain_git_directory_file_nodes
75+
- Indexes (8): Index_absolute_file_name, Index_author_name, Index_change_span_year, Index_commit_hash, Index_commit_parent, Index_commit_sha, Index_file_name, Index_file_relative_path
76+
- Relationships (5): Add_CHANGED_TOGETHER_WITH_relationships_to_code_files, Add_CHANGED_TOGETHER_WITH_relationships_to_git_files, Add_HAS_PARENT_relationships_to_commits, Add_RESOLVES_TO_relationships_to_git_files_for_Java, Add_RESOLVES_TO_relationships_to_git_files_for_Typescript
77+
- Properties (5): Set_commit_classification_properties, Set_number_of_aggregated_git_commits, Set_number_of_git_log_commits, Set_number_of_git_plugin_commits, Set_number_of_git_plugin_update_commits
78+
79+
2.2 Copy statistics queries (13 files) from `cypher/GitLog/` → `queries/statistics/`:
80+
- List_ambiguous_git_files, List_git_file_directories_with_commit_statistics, List_git_files_by_resolved_label_and_extension, List_git_files_per_commit_distribution, List_git_files_that_were_changed_together, List_git_files_that_were_changed_together_all_in_one, List_git_files_that_were_changed_together_with_another_file, List_git_files_that_were_changed_together_with_another_file_all_in_one, List_git_files_with_commit_statistics_by_author, List_pairwise_changed_files, List_pairwise_changed_files_top_selected_metric, List_pairwise_changed_files_with_dependencies, List_unresolved_git_files
81+
- Also copy: `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher`
82+
83+
2.3 Copy validation queries (5 files) from `cypher/GitLog/` → `queries/validation/`:
84+
- Verify_code_to_git_file_unambiguous, Verify_git_missing_CHANGED_TOGETHER_WITH_properties, Verify_git_missing_create_date, Verify_git_to_code_file_unambiguous
85+
- Also copy: `cypher/Validation/ValidateGitHistory.cypher`
86+
87+
### Phase 3: Copy Import Scripts
88+
89+
3.1 Copy `scripts/importGit.sh` → `import/importGit.sh`
90+
- Update CYPHER_DIR references to point to `../queries/enrichment/` instead of `${CYPHER_DIR}/GitLog`
91+
- Update sourced scripts references: createGitLogCsv.sh, createAggregatedGitLogCsv.sh to use domain-local paths
92+
93+
3.2 Copy `scripts/createGitLogCsv.sh` → `import/createGitLogCsv.sh` (no changes needed)
94+
95+
3.3 Copy `scripts/createAggregatedGitLogCsv.sh` → `import/createAggregatedGitLogCsv.sh` (no changes needed)
96+
97+
3.4 Add TODO comment to `scripts/resetAndScan.sh` at the `source "${SCRIPTS_DIR}/importGit.sh"` line noting the core → domain dependency direction should be revisited (*depends on 3.1*)
98+
99+
### Phase 4: Create CSV Entry Point Script (*depends on 2.2*)
100+
101+
4.1 Create `gitHistoryCsv.sh`:
102+
- Follow boilerplate from `internalDependenciesCsv.sh`: BASH_SOURCE/CDPATH directory resolution, `set -o errexit -o pipefail`
103+
- Source `../../scripts/executeQueryFunctions.sh`
104+
- Report name: `git-history`, output to `reports/git-history/`
105+
- Execute statistics queries (adapted from `scripts/reports/GitHistoryCsv.sh`):
106+
- List_git_files_with_commit_statistics_by_author → CSV
107+
- List_git_files_that_were_changed_together_with_another_file → CSV
108+
- List_git_file_directories_with_commit_statistics → CSV
109+
- List_git_files_per_commit_distribution → CSV
110+
- List_pairwise_changed_files_with_dependencies → CSV
111+
- List_pairwise_changed_files_top_selected_metric × 4 metrics (count, min_confidence, jaccard, lift) → CSVs
112+
- Also: List_git_files_by_resolved_label_and_extension, List_ambiguous_git_files, List_unresolved_git_files (for data quality)
113+
- Also: Words_for_git_author_Wordcloud_with_frequency → CSV (for the wordcloud)
114+
- Clean up empty reports via `cleanupAfterReportGeneration.sh`
115+
- **No-data case**: if all queries return empty results, `cleanupAfterReportGeneration.sh` removes all CSVs and the report dir will not exist — this is the signal used downstream
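
The four pairwise metrics named above (count, min_confidence, jaccard, lift) are not defined in this plan. As a reading aid, the following is a minimal sketch that assumes the standard association-rule definitions; the Cypher queries may compute them differently. The function name and parameters are hypothetical.

```python
def co_change_metrics(commits_a, commits_b, commits_both, total_commits):
    """Co-change metrics for a file pair (A, B), assuming standard definitions.

    commits_a / commits_b: commits touching file A / file B
    commits_both:          commits touching both files
    total_commits:         all commits considered
    """
    if min(commits_a, commits_b, total_commits) <= 0:
        raise ValueError("commit counts must be positive")

    # Confidence is directional: share of A's commits that also touch B, and vice versa.
    confidence_a_to_b = commits_both / commits_a
    confidence_b_to_a = commits_both / commits_b

    # Jaccard similarity: intersection over union of the two commit sets.
    jaccard = commits_both / (commits_a + commits_b - commits_both)

    # Lift: observed co-change rate relative to what independence would predict.
    lift = (commits_both * total_commits) / (commits_a * commits_b)

    return {
        "count": commits_both,
        "min_confidence": min(confidence_a_to_b, confidence_b_to_a),
        "jaccard": jaccard,
        "lift": lift,
    }
```

Under these assumed definitions, a lift well above 1 means the pair changes together more often than chance would suggest, while min_confidence stays low if either file frequently changes without the other.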
116+
117+
### Phase 5: Create Python Charts Script (*parallel with Phase 4*)
118+
119+
5.1 Create `gitHistoryCharts.py`:
120+
- Follow `Parameters` class pattern from `pathFindingCharts.py` and `treemapVisualizations.py`
121+
- CLI: `--report_directory`, `--verbose` arguments
122+
- Neo4j connection via `bolt://localhost:7687` with `NEO4J_INITIAL_PASSWORD`
123+
- Load CSV data from report directory (not querying Neo4j for charts — uses CSV output from Phase 4)
124+
- **No-data case**: if the report directory does not exist or the required CSV files are absent, log a warning and exit 0 without generating any SVGs
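
A minimal sketch of the CLI and the no-data exit described in 5.1 could look like the following. The default report directory, the `REQUIRED_CSV_FILES` list, and all function names are assumptions for illustration; the plan does not specify the CSV file names.

```python
import argparse
import logging
from pathlib import Path

# Hypothetical input file name; the plan does not list the exact CSV names.
REQUIRED_CSV_FILES = ["List_git_files_with_commit_statistics_by_author.csv"]


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser(description="Generate git history charts as SVG files.")
    # The default value is an assumption; the entry point script would normally pass it in.
    parser.add_argument("--report_directory", default="reports/git-history",
                        help="Directory containing the CSV input and receiving the SVG output")
    parser.add_argument("--verbose", action="store_true", help="Enable debug logging")
    return parser.parse_args(argv)


def find_missing_inputs(report_directory, required_files=REQUIRED_CSV_FILES):
    """Return the required CSV names that are missing or empty.

    If the report directory itself does not exist, every required file counts as missing.
    """
    directory = Path(report_directory)
    if not directory.is_dir():
        return list(required_files)
    return [name for name in required_files
            if not (directory / name).is_file() or (directory / name).stat().st_size == 0]


def main(argv=None):
    arguments = parse_arguments(argv)
    logging.basicConfig(level=logging.DEBUG if arguments.verbose else logging.INFO)

    missing = find_missing_inputs(arguments.report_directory)
    if missing:
        # No-data case: warn and report success so the pipeline keeps going.
        logging.warning("No git history data in %s (missing: %s). Skipping chart generation.",
                        arguments.report_directory, ", ".join(missing))
        return 0

    logging.info("Input CSVs found; chart generation would start here.")
    return 0
```

A script built this way would end with `sys.exit(main())`, so that `gitHistoryPython.sh` sees exit code 0 in the no-data case.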
125+
126+
5.2 Data preparation functions (extracted from GitHistoryGeneral.ipynb):
127+
- `add_quantile_limited_column(data_frame, column_name, quantile)` → DataFrame
128+
- `add_rank_column(data_frame, column_name)` → DataFrame
129+
- `add_file_extension_column(data_frame, file_path_column)` → DataFrame
130+
- `add_directory_column(data_frame, file_path_column)` → DataFrame (explodes paths into directories)
131+
- `add_directory_name_column(data_frame, directory_column)` → DataFrame
132+
- `add_parent_directory_column(data_frame, directory_column)` → DataFrame
133+
- Aggregation helpers: `get_last_entry`, `collect_as_array`, `second_entry`, `get_flattened_unique_values`, `count_unique_aggregated_values`, `get_most_frequent_entry`
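
The column helpers listed in 5.2 take and return DataFrames. The sketch below only illustrates the per-value logic such helpers might apply to each file path (for example through pandas `Series.map` followed by `DataFrame.explode`); it is not the notebook's code, and all function names here are hypothetical.

```python
from pathlib import PurePosixPath


def file_extension(file_path):
    """Return the extension without the leading dot, or '' if there is none."""
    return PurePosixPath(file_path).suffix.lstrip(".")


def ancestor_directories(file_path):
    """Return every directory that contains the file, from the top level down.

    This is the "explode paths into directories" step: one file contributes
    to the statistics of each of its ancestor directories.
    """
    parents = [str(parent) for parent in PurePosixPath(file_path).parents if str(parent) != "."]
    return list(reversed(parents))


def directory_name(directory):
    """Return the last path segment of a directory, e.g. 'main' for 'src/main'."""
    return PurePosixPath(directory).name


def parent_directory(directory):
    """Return the parent of a directory, or '' for a top-level directory."""
    parent = str(PurePosixPath(directory).parent)
    return "" if parent == "." else parent
```

The name and parent of each directory are what a treemap needs to place a node in the hierarchy, which is presumably why the plan lists them as separate columns.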
134+
135+
5.3 Directory commit statistics preparation (the multi-step grouping pipeline from notebook cells 22):
136+
- Query Neo4j for `List_git_files_with_commit_statistics_by_author.cypher`
137+
- Extract author rankings, file extension rankings
138+
- Group by directory+author → group by directory only → add names/parents → final grouping
139+
- Produces the hierarchical directory structure for treemaps
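
The two-stage grouping in 5.3 (directory+author first, then directory only) can be sketched with plain dictionaries. This is an illustration under assumptions, not the notebook's pipeline: the input shape, the output keys, and the function name are hypothetical, and summing per-file commit counts is a simplification because one commit can touch several files in the same directory.

```python
from collections import defaultdict


def authors_by_directory(rows):
    """Summarize authorship per directory.

    rows: iterable of (directory, author, commit_count) tuples.
    """
    # Stage 1: group by directory + author.
    totals = defaultdict(lambda: defaultdict(int))
    for directory, author, commit_count in rows:
        totals[directory][author] += commit_count

    # Stage 2: group by directory only and rank the authors within it.
    summary = {}
    for directory, per_author in totals.items():
        # Most commits first; ties are broken alphabetically for stable output.
        ranked = sorted(per_author.items(), key=lambda item: (-item[1], item[0]))
        summary[directory] = {
            "commit_count": sum(per_author.values()),
            "distinct_authors": len(ranked),
            "main_author": ranked[0][0],
            "second_author": ranked[1][0] if len(ranked) > 1 else None,
        }
    return summary
```

The resulting values map onto the "main author", "second author", and "distinct authors" treemaps listed in 5.4.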
140+
141+
5.4 Treemap chart functions (~13 charts):
142+
- Number of files per directory
143+
- Most frequent file extension per directory
144+
- Number of commits per directory
145+
- Number of distinct authors per directory
146+
- Directories with very few different authors (low focus)
147+
- Main author per directory
148+
- Second author per directory
149+
- Days since last commit per directory
150+
- Days since last commit per directory (ranked)
151+
- Days since last file creation per directory
152+
- Days since last file creation per directory (ranked)
153+
- Days since last file modification per directory
154+
- Days since last file modification per directory (ranked)
155+
156+
5.5 Co-change treemap charts (~3 charts):
157+
- Files that likely co-change with others
158+
- Co-changing files max lift
159+
- Co-changing files average lift
160+
161+
5.6 Bar chart: files per commit distribution (1 chart)
162+
163+
5.7 Histogram charts (~4 charts, one per metric):
164+
- Co-changed files by commit count
165+
- Co-changed files by commit min confidence
166+
- Co-changed files by commit lift
167+
- Co-changed files by commit Jaccard similarity
168+
169+
5.8 Git author wordcloud (1 chart — using wordcloud library)
170+
171+
5.9 All charts saved as SVG to `reports/git-history/`
172+
173+
### Phase 6: Create Python Entry Point (*depends on 5.1*)
174+
175+
6.1 Create `gitHistoryPython.sh`:
176+
- Follow pattern of `internalDependenciesPython.sh`
177+
- Execute `gitHistoryCharts.py` with `--report_directory` and optional `--verbose`
178+
- Clean up empty reports
179+
180+
### Phase 7: Create Exploration Notebooks (*parallel with Phase 5*)
181+
182+
7.1 Copy `jupyter/GitHistoryGeneral.ipynb` → `explore/GitHistoryGeneralExploration.ipynb`:
183+
- Change title from "# git log/history" to "# Git History General Exploration"
184+
- Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"`
185+
- Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/`
186+
187+
7.2 Copy `jupyter/GitHistoryExploration.ipynb` → `explore/GitHistoryCorrelationExploration.ipynb`:
188+
- Change title from "# git log/history" to "# Git History Correlation Exploration"
189+
- Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"`
190+
- Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/`
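
The metadata change in 7.1 and 7.2 can be scripted, since a notebook is a JSON document. The sketch below assumes the key lives in the notebook's top-level `metadata` object; the plan does not say where exactly it goes, and the function name is hypothetical.

```python
import json
from pathlib import Path

METADATA_KEY = "code_graph_analysis_pipeline_data_validation"


def disable_validation(notebook_path):
    """Set the validation metadata of a notebook to ValidateAlwaysFalse in place."""
    path = Path(notebook_path)
    notebook = json.loads(path.read_text(encoding="utf-8"))

    # Assumption: the key belongs to the top-level notebook metadata, not to a cell.
    notebook.setdefault("metadata", {})[METADATA_KEY] = "ValidateAlwaysFalse"

    path.write_text(json.dumps(notebook, indent=1, ensure_ascii=False) + "\n", encoding="utf-8")
    return notebook["metadata"]
```

Calling `disable_validation("domains/git-history/explore/GitHistoryGeneralExploration.ipynb")` after copying would make the `grep "ValidateAlwaysFalse"` check from the verification section match.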
191+
192+
### Phase 8: Create Markdown Summary (*depends on Phases 4, 5*)
193+
194+
8.1 Create `summary/report.template.md`:
195+
- Front matter (title, date, version, dataset)
196+
- Section 1: Overview — what to act on first, reading guide
197+
- Section 2: Directory Commit Statistics — treemap charts, tables
198+
- Section 3: Co-Changed Files — treemap charts, top pairwise tables
199+
- Section 4: File Change Distribution — bar chart, statistics
200+
- Section 5: Pairwise Changed Files — tables per metric (count, confidence, Jaccard, lift)
201+
- Section 6: Data Quality — ambiguous files, unresolved files, file resolution statistics
202+
- Section 7: Git Author Wordcloud
203+
- Section 8: Glossary
204+
205+
8.2 Create `summary/report_no_git_data.template.md`:
206+
- Fallback: "⚠️ No git history data available"
207+
208+
8.3 Create `summary/gitHistorySummary.sh`:
209+
- Follow pattern of `internalDependenciesSummary.sh`
210+
- **No-data detection**: check if the report directory (`reports/git-history/`) exists and contains data; if not, render `report_no_git_data.template.md` as the final report and exit early
211+
- Generate front matter
212+
- Execute queries for Markdown table includes (limited to 10 rows)
213+
- Include SVG chart references
214+
- Assemble final report via embedMarkdownIncludes.sh
215+
216+
8.4 Create `gitHistoryMarkdown.sh`:
217+
- Follow pattern of `internalDependenciesMarkdown.sh`
218+
- Delegates to `summary/gitHistorySummary.sh`
219+
220+
## Relevant Files
221+
222+
**Reference implementations (read, not modified):**
223+
- `domains/internal-dependencies/` — primary reference for domain structure, all entry point patterns, summary assembly
224+
- `domains/anomaly-detection/treemapVisualizations.py` — reference for Python chart script with Neo4j connection
225+
- `domains/anomaly-detection/explore/AnomalyDetectionExploration.ipynb` — reference for ValidateAlwaysFalse metadata
226+
- `domains/internal-dependencies/pathFindingCharts.py` — reference for chart generation patterns
227+
- `.github/prompts/plan-internal_dependencies_domain.prompt.md` — reference plan structure
228+
229+
**Source files to copy (not modified):**
230+
- `cypher/GitLog/` — all 42 files
231+
- `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher`
232+
- `cypher/Validation/ValidateGitHistory.cypher`
233+
- `scripts/reports/GitHistoryCsv.sh` — logic adapted into gitHistoryCsv.sh
234+
- `scripts/importGit.sh` — copied with path adjustments
235+
- `scripts/createGitLogCsv.sh` — copied unchanged
236+
- `scripts/createAggregatedGitLogCsv.sh` — copied unchanged
237+
- `jupyter/GitHistoryGeneral.ipynb` — copied with metadata + title changes
238+
- `jupyter/GitHistoryExploration.ipynb` — copied with metadata + title changes
239+
240+
**Modified (minimally):**
241+
- `scripts/resetAndScan.sh` — add TODO comment at importGit.sh reference
242+
243+
**Central scripts sourced (not copied):**
244+
- `scripts/executeQueryFunctions.sh` — provides execute_cypher(), execute_cypher_queries_until_results()
245+
- `scripts/cleanupAfterReportGeneration.sh` — removes empty CSV files
246+
- `scripts/markdown/embedMarkdownIncludes.sh` — assembles Markdown includes into final report
247+
248+
## Verification
249+
250+
1. Run `shellcheck domains/git-history/*.sh domains/git-history/**/*.sh` (in bash, `**` only recurses into subdirectories after `shopt -s globstar`) — no errors
251+
2. Run `python -m py_compile domains/git-history/gitHistoryCharts.py` — no syntax errors
252+
3. Verify all cypher files copied match originals: `diff cypher/GitLog/<file> domains/git-history/queries/<enrichment|statistics|validation>/<file>`
253+
4. Verify notebook metadata: `grep "ValidateAlwaysFalse" domains/git-history/explore/*.ipynb` returns matches
254+
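5. Verify entry point discovery: `find domains/git-history -maxdepth 1 \( -name "*Csv.sh" -o -name "*Python.sh" -o -name "*Markdown.sh" \)` returns 3 files (without `-maxdepth 1`, `import/createGitLogCsv.sh` and `import/createAggregatedGitLogCsv.sh` also match `*Csv.sh`)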
255+
6. Manual: Open exploration notebooks in VS Code, confirm they display correctly
256+
7. Integration test (if Neo4j available): Run `gitHistoryCsv.sh` and verify CSV files in `reports/git-history/`
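
The entry point check in step 5 depends on search depth, because the import scripts `createGitLogCsv.sh` and `createAggregatedGitLogCsv.sh` also end in `Csv.sh`. The following sketch makes that difference visible. It is a hypothetical helper, not the pipeline's discovery mechanism, which this plan leaves unchanged and does not describe.

```python
from pathlib import Path

ENTRY_POINT_PATTERNS = ("*Csv.sh", "*Python.sh", "*Markdown.sh")


def find_entry_points(domain_directory, recursive=False):
    """Return scripts matching the entry point name patterns, relative to the domain.

    recursive=False looks only at the top level of the domain directory;
    recursive=True also descends into subdirectories such as import/.
    """
    directory = Path(domain_directory)
    search = directory.rglob if recursive else directory.glob
    matches = {path.relative_to(directory).as_posix()
               for pattern in ENTRY_POINT_PATTERNS
               for path in search(pattern)}
    return sorted(matches)
```

For the directory structure planned above, the top-level search yields the three entry points, while the recursive search would additionally report the two `import/*Csv.sh` scripts.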
257+
258+
## Scope Boundaries
259+
260+
**Included:**
261+
- All 42 GitLog cypher queries + 2 external (wordcloud, validation)
262+
- Import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh)
263+
- CSV reporting logic from GitHistoryCsv.sh
264+
- All ~20 charts from GitHistoryGeneral.ipynb as Python SVGs
265+
- Git author wordcloud
266+
- Both exploration notebooks
267+
- Markdown summary report with tables, charts, glossary
268+
- TODO comment on resetAndScan.sh
269+
270+
**Excluded:**
271+
- No Visualization entry point (no GraphViz graphs in git history)
272+
- No move/deletion of originals
273+
- Correlation analysis stays exploration-only (no Python script for scatter plots)
274+
- No changes to central pipeline discovery mechanism
275+
- General_Enrichment cypher not copied (documented as prerequisite)
