Commit 13a7464

Introduce git-history domain
1 parent dd36767 commit 13a7464

60 files changed

Lines changed: 5677 additions & 2 deletions

.github/prompts/plan-git-history-domain.prompt.md

Lines changed: 275 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
# Copied Files Tracking
2+
3+
This document maps every original file that was copied into this domain to its copy location.
4+
It exists to support a future deprecation follow-up task that will remove or migrate the originals
5+
once this domain is the canonical implementation.
6+
7+
> **Breaking change notice:** The output directory has changed from `reports/git-history-csv` to `reports/git-history`.
8+
> When the old `scripts/reports/GitHistoryCsv.sh` is eventually removed, a **major version bump** is required.
9+
10+
---
11+
12+
## Cypher Queries
13+
14+
### Enrichment Queries (23 files)
15+
16+
| Original | Copy |
17+
|----------|------|
18+
| `cypher/GitLog/Import_git_log_csv_data.cypher` | `queries/enrichment/Import_git_log_csv_data.cypher` |
19+
| `cypher/GitLog/Import_aggregated_git_log_csv_data.cypher` | `queries/enrichment/Import_aggregated_git_log_csv_data.cypher` |
20+
| `cypher/GitLog/Create_git_repository_node.cypher` | `queries/enrichment/Create_git_repository_node.cypher` |
21+
| `cypher/GitLog/Delete_git_log_data.cypher` | `queries/enrichment/Delete_git_log_data.cypher` |
22+
| `cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher` | `queries/enrichment/Delete_plain_git_directory_file_nodes.cypher` |
23+
| `cypher/GitLog/Index_absolute_file_name.cypher` | `queries/enrichment/Index_absolute_file_name.cypher` |
24+
| `cypher/GitLog/Index_author_name.cypher` | `queries/enrichment/Index_author_name.cypher` |
25+
| `cypher/GitLog/Index_change_span_year.cypher` | `queries/enrichment/Index_change_span_year.cypher` |
26+
| `cypher/GitLog/Index_commit_hash.cypher` | `queries/enrichment/Index_commit_hash.cypher` |
27+
| `cypher/GitLog/Index_commit_parent.cypher` | `queries/enrichment/Index_commit_parent.cypher` |
28+
| `cypher/GitLog/Index_commit_sha.cypher` | `queries/enrichment/Index_commit_sha.cypher` |
29+
| `cypher/GitLog/Index_file_name.cypher` | `queries/enrichment/Index_file_name.cypher` |
30+
| `cypher/GitLog/Index_file_relative_path.cypher` | `queries/enrichment/Index_file_relative_path.cypher` |
31+
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` |
32+
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` |
33+
| `cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher` | `queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher` |
34+
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` |
35+
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` |
36+
| `cypher/GitLog/Set_commit_classification_properties.cypher` | `queries/enrichment/Set_commit_classification_properties.cypher` |
37+
| `cypher/GitLog/Set_number_of_aggregated_git_commits.cypher` | `queries/enrichment/Set_number_of_aggregated_git_commits.cypher` |
38+
| `cypher/GitLog/Set_number_of_git_log_commits.cypher` | `queries/enrichment/Set_number_of_git_log_commits.cypher` |
39+
| `cypher/GitLog/Set_number_of_git_plugin_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_commits.cypher` |
40+
| `cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_update_commits.cypher` |
41+
42+
> **Note:** 23 enrichment query files are listed above. The 5 validation queries (4 `Verify_*` files and `ValidateGitHistory.cypher`) were placed in `queries/validation/`.
43+
> The enrichment file count breaks down as import (2), repository (1), deletion (2), indexes (8), relationships (5), properties (5) = 23 unique files.
44+
45+
### Statistics Queries (14 files)
46+
47+
| Original | Copy |
48+
|----------|------|
49+
| `cypher/GitLog/List_ambiguous_git_files.cypher` | `queries/statistics/List_ambiguous_git_files.cypher` |
50+
| `cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher` | `queries/statistics/List_git_file_directories_with_commit_statistics.cypher` |
51+
| `cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher` | `queries/statistics/List_git_files_by_resolved_label_and_extension.cypher` |
52+
| `cypher/GitLog/List_git_files_per_commit_distribution.cypher` | `queries/statistics/List_git_files_per_commit_distribution.cypher` |
53+
| `cypher/GitLog/List_git_files_that_were_changed_together.cypher` | `queries/statistics/List_git_files_that_were_changed_together.cypher` |
54+
| `cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher` |
55+
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher` |
56+
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` |
57+
| `cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher` | `queries/statistics/List_git_files_with_commit_statistics_by_author.cypher` |
58+
| `cypher/GitLog/List_pairwise_changed_files.cypher` | `queries/statistics/List_pairwise_changed_files.cypher` |
59+
| `cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher` | `queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher` |
60+
| `cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher` | `queries/statistics/List_pairwise_changed_files_with_dependencies.cypher` |
61+
| `cypher/GitLog/List_unresolved_git_files.cypher` | `queries/statistics/List_unresolved_git_files.cypher` |
62+
| `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` | `queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher` |
63+
64+
### Validation Queries (5 files)
65+
66+
| Original | Copy |
67+
|----------|------|
68+
| `cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher` | `queries/validation/Verify_code_to_git_file_unambiguous.cypher` |
69+
| `cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` | `queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` |
70+
| `cypher/GitLog/Verify_git_missing_create_date.cypher` | `queries/validation/Verify_git_missing_create_date.cypher` |
71+
| `cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher` | `queries/validation/Verify_git_to_code_file_unambiguous.cypher` |
72+
| `cypher/Validation/ValidateGitHistory.cypher` | `queries/validation/ValidateGitHistory.cypher` |
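For the deprecation follow-up, the original-to-copy mapping in the tables above could be checked mechanically before any original is removed. The following is a minimal sketch, not part of this commit: the function names are hypothetical, and it assumes that copy paths are relative to the domain directory and that the Cypher copies are expected to be byte-identical to their originals.

```python
import hashlib
import re
from pathlib import Path

# Matches table rows that start with: | `original/path` | `copy/path` |
ROW_PATTERN = re.compile(r"^\|\s*`([^`]+)`\s*\|\s*`([^`]+)`\s*\|")


def parse_mapping(markdown_text):
    """Return (original, copy) path pairs found in mapping table rows."""
    pairs = []
    for line in markdown_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_mismatches(pairs, repository_root, domain_root):
    """List pairs whose original or copy is missing, or whose contents differ.

    Assumption: originals are relative to the repository root and
    copies are relative to the domain directory.
    """
    mismatches = []
    for original, copy in pairs:
        original_path = Path(repository_root) / original
        copy_path = Path(domain_root) / copy
        if not original_path.is_file() or not copy_path.is_file():
            mismatches.append((original, copy, "missing"))
        elif file_digest(original_path) != file_digest(copy_path):
            mismatches.append((original, copy, "different"))
    return mismatches
```

Header and separator rows contain no backticks, so they are skipped. Rows with a third column also match, so files that were deliberately changed while copying would be reported as `"different"`, which is expected for them.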
73+
74+
---
75+
76+
## Import Scripts (3 files)
77+
78+
| Original | Copy | Changes |
79+
|----------|------|---------|
80+
| `scripts/importGit.sh` | `import/importGit.sh` | Updated `GIT_LOG_CYPHER_DIR` to `../queries/enrichment/`; updated sourced script paths |
81+
| `scripts/createGitLogCsv.sh` | `import/createGitLogCsv.sh` | No changes |
82+
| `scripts/createAggregatedGitLogCsv.sh` | `import/createAggregatedGitLogCsv.sh` | No changes |
83+
84+
---
85+
86+
## Jupyter Notebooks (2 files)
87+
88+
| Original | Copy | Metadata Change |
89+
|----------|------|-----------------|
90+
| `jupyter/GitHistoryGeneral.ipynb` | `explore/GitHistoryGeneralExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |
91+
| `jupyter/GitHistoryExploration.ipynb` | `explore/GitHistoryCorrelationExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |
92+
93+
---
94+
95+
## Scripts Referenced but NOT Copied (Central Pipeline)
96+
97+
These scripts are sourced from the central `scripts/` directory and are not duplicated:
98+
99+
| Script | Domain Usage |
100+
|--------|-------------|
101+
| `scripts/executeQueryFunctions.sh` | Sourced by all entry point scripts |
102+
| `scripts/cleanupAfterReportGeneration.sh` | Sourced by CSV entry point after report generation |
103+
| `scripts/markdown/embedMarkdownIncludes.sh` | Sourced by summary script for Markdown assembly |
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
# Git History Domain — Prerequisites
2+
3+
The following prerequisites are provided by the central pipeline and must be in place **before** this domain executes.
4+
They are not copied into this domain; they are sourced or referenced from the central pipeline locations.
5+
6+
---
7+
8+
## 1. Neo4j Running with Scanned Artifacts
9+
10+
Neo4j must be running and all artifacts must have been scanned and loaded into the graph database
11+
before any script in this domain is executed.
12+
13+
See the main [README.md](../../README.md) and [GETTING_STARTED.md](../../GETTING_STARTED.md) for setup instructions.
14+
15+
---
16+
17+
## 2. Git History Imported
18+
19+
Git history data must have been imported into the graph database. The import mode is controlled by the environment variable:
20+
21+
```
22+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="plugin" # Recommended default
23+
```
24+
25+
Options: `"none"`, `"aggregated"`, `"full"`, `"plugin"` (default).
26+
27+
- **`plugin`** (recommended): jQAssistant git plugin provides `Git:Commit`, `Git:File`, `Git:Author`, and related nodes.
28+
- **`full`**: Full git log CSV import via `createGitLogCsv.sh`.
29+
- **`aggregated`**: Aggregated git log CSV import via `createAggregatedGitLogCsv.sh`.
30+
- **`none`**: Skip git import.
31+
32+
The domain's `import/importGit.sh` script orchestrates this import.
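The option handling described above can be illustrated with a short sketch. It is not taken from `import/importGit.sh`, which is a shell script whose internals are not shown here; the function name is hypothetical, and rejecting unknown values with an error is an assumption.

```python
import os

# The four modes and the default as documented above.
ALLOWED_IMPORT_MODES = ("none", "aggregated", "full", "plugin")
DEFAULT_IMPORT_MODE = "plugin"


def resolve_import_mode(environment=None):
    """Return the git import mode, falling back to the documented default.

    Assumption: an unknown value is treated as a configuration error.
    """
    environment = os.environ if environment is None else environment
    mode = environment.get(
        "IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT", DEFAULT_IMPORT_MODE
    )
    if mode not in ALLOWED_IMPORT_MODES:
        allowed = ", ".join(ALLOWED_IMPORT_MODES)
        raise ValueError(f"Unknown import mode '{mode}', expected one of: {allowed}")
    return mode
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to try out, for example `resolve_import_mode({"IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT": "full"})`.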
33+
34+
> **Note:** The analyzed codebase may have no git history at all.
35+
> All domain entry points handle this case gracefully: `gitHistoryCsv.sh` produces no output
36+
> when all queries return empty results; `gitHistoryCharts.py` skips chart generation if CSV files are absent;
37+
> `gitHistoryMarkdown.sh` renders a fallback report if no report directory is found.
38+
39+
---
40+
41+
## 3. Git:File ↔ Code File Relationships
42+
43+
The following relationships must exist (created by `import/importGit.sh`):
44+
45+
| Relationship | Description |
46+
|---|---|
47+
| `(Git:File)-[:RESOLVES_TO]->(File)` | Links git-tracked files to scanned Java/TypeScript code files |
48+
| `(File)-[:CHANGED_TOGETHER_WITH]->(File)` | Co-change relationships between resolved code files |
49+
| `(Git:File)-[:CHANGED_TOGETHER_WITH]->(Git:File)` | Co-change relationships between raw git files |
50+
51+
---
52+
53+
## 4. Required Properties
54+
55+
| Property | Node | Set By |
56+
|---|---|---|
57+
| `numberOfGitCommits` | `File` (Java/TypeScript) | `Set_number_of_git_log_commits.cypher` or `Set_number_of_git_plugin_commits.cypher` |
58+
| `updateCommitCount` | `Git:File` | `Set_number_of_git_plugin_update_commits.cypher` |
59+
| `isMergeCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |
60+
| `isAutomatedCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |
61+
62+
---
63+
64+
## 5. General Enrichment
65+
66+
The `name` and `extension` properties on `File` nodes must be set by the general enrichment queries:
67+
68+
**Cypher source:** [`cypher/General_Enrichment/`](../../cypher/General_Enrichment/)
69+
70+
---
71+
72+
## 6. Central Pipeline Scripts (sourced, not copied)
73+
74+
| Script | Purpose |
75+
|---|---|
76+
| `scripts/executeQueryFunctions.sh` | Provides `execute_cypher()` and `execute_cypher_queries_until_results()` functions |
77+
| `scripts/cleanupAfterReportGeneration.sh` | Removes empty CSV files after report generation |
78+
| `scripts/markdown/embedMarkdownIncludes.sh` | Assembles Markdown includes into the final report |

domains/git-history/README.md

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
# Git History Domain
2+
3+
This directory contains the implementation and resources for analysing **git history** within the Code Graph Analysis Pipeline. It follows the vertical-slice domain pattern: all Cypher queries, Python chart scripts, shell scripts, and report templates needed for this analysis live here.
4+
5+
This domain covers all git history analysis areas:
6+
7+
- **Directory commit statistics**: How often files in each directory are committed, by whom, and when — as hierarchical treemap charts.
8+
- **Co-changed files**: Files that tend to be modified together in the same commit — indicating hidden coupling.
9+
- **Pairwise file metrics**: Quantified co-change metrics per file pair: commit count, confidence, Jaccard similarity, lift.
10+
- **Author analysis**: Which authors contribute most, and which directories have very few contributors (low bus-factor).
11+
- **Git data quality**: Ambiguous and unresolved git files; file resolution statistics.
12+
- **Git author wordcloud**: Visualization of all contributing authors weighted by commit frequency.
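The pairwise metrics named in the list above can be illustrated with their textbook definitions from association-rule mining. This is a sketch under that assumption: the domain's Cypher queries may define or name the metrics differently, and the function and key names are hypothetical.

```python
def pairwise_metrics(commits_a, commits_b, total_commits):
    """Compute co-change metrics for two files from their commit identifiers.

    commits_a and commits_b are the commits that touched each file;
    total_commits is the number of commits in the analysed history.
    """
    a, b = set(commits_a), set(commits_b)
    if not a or not b or total_commits <= 0:
        raise ValueError("both files need commits and total_commits must be positive")
    together = len(a & b)
    confidence_a = together / len(a)  # share of A's commits that also changed B
    confidence_b = together / len(b)  # share of B's commits that also changed A
    return {
        "commit_count": together,
        "min_confidence": min(confidence_a, confidence_b),
        "jaccard": together / len(a | b),
        # Observed co-change rate relative to the rate expected if
        # both files changed independently of each other.
        "lift": (together * total_commits) / (len(a) * len(b)),
    }
```

For a file changed in 4 commits and another changed in 3, with 2 commits in common out of 10 in total, this gives a minimum confidence of 0.5, a Jaccard similarity of 0.4, and a lift of about 1.67.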
13+
14+
## Entry Points
15+
16+
The following scripts are discovered and invoked automatically by the central compilation scripts in [scripts/reports/compilations/](../../scripts/reports/compilations/). They are found by filename pattern.
17+
18+
- [gitHistoryCsv.sh](./gitHistoryCsv.sh): Entry point for CSV reports based on Cypher queries. Discovered by `CsvReports.sh` (`*Csv.sh` pattern).
19+
- [gitHistoryPython.sh](./gitHistoryPython.sh): Entry point for Python-based SVG chart generation. Discovered by `PythonReports.sh` (`*Python.sh` pattern).
20+
- [gitHistoryMarkdown.sh](./gitHistoryMarkdown.sh): Entry point for the Markdown summary report. Discovered by `MarkdownReports.sh` (`*Markdown.sh` pattern).
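Discovery by filename pattern, as described for the entry points above, can be sketched as follows. The real discovery is done by the shell compilation scripts, whose implementation is not shown here; this sketch assumes that entry points sit directly inside each domain directory, and the function name is hypothetical.

```python
from pathlib import Path


def discover_entry_points(domains_directory, suffix):
    """Return scripts directly inside each domain directory that end with suffix.

    Example suffixes: "Csv.sh", "Python.sh", "Markdown.sh".
    Assumption: scripts in subdirectories are not entry points.
    """
    return sorted(Path(domains_directory).glob(f"*/*{suffix}"))
```

With this layout, `discover_entry_points("domains", "Csv.sh")` would find `domains/git-history/gitHistoryCsv.sh` but not scripts in subdirectories such as `summary/`.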
21+
22+
> **Note:** There is no Visualization entry point — git history analysis generates no GraphViz graph visualizations.
23+
24+
## No-Git-Data Handling
25+
26+
The analyzed codebase may have no git history at all. All entry points handle this gracefully:
27+
28+
- `gitHistoryCsv.sh`: Produces no output if all queries return empty results (cleanup removes empty CSVs). No report directory is created.
29+
- `gitHistoryCharts.py`: Skips chart generation if the report directory or required CSV files are absent. Exits with code 0.
30+
- `gitHistoryMarkdown.sh`: Detects the absence of the report directory and renders `summary/report_no_git_data.template.md` instead.
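The fallback behaviour listed above can be pictured with two small checks. This is a sketch of the described logic only, not the code of the entry point scripts; the function names and parameters are hypothetical, while the template file names come from the folder structure of this domain.

```python
from pathlib import Path


def can_generate_charts(report_directory, required_csv_names):
    """True only if the report directory and every required CSV file exist."""
    report_directory = Path(report_directory)
    return report_directory.is_dir() and all(
        (report_directory / name).is_file() for name in required_csv_names
    )


def select_report_template(report_directory, summary_directory):
    """Pick the main template, or the fallback if there is no report directory."""
    if Path(report_directory).is_dir():
        template_name = "report.template.md"
    else:
        template_name = "report_no_git_data.template.md"
    return Path(summary_directory) / template_name
```

A chart script following this pattern would return early with exit code 0 when `can_generate_charts` is false, so that a codebase without git history does not fail the pipeline.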
31+
32+
## Folder Structure
33+
34+
```
35+
domains/git-history/
36+
├── README.md # This file
37+
├── PREREQUISITES.md # Detailed prerequisite documentation
38+
├── COPIED_FILES.md # Original → copy mapping for deprecation follow-up
39+
├── gitHistoryCsv.sh # Entry point: CSV reports
40+
├── gitHistoryPython.sh # Entry point: Python charts
41+
├── gitHistoryMarkdown.sh # Entry point: Markdown summary
42+
├── gitHistoryCharts.py # Chart generator: treemap, bar, histogram → SVG
43+
├── explore/ # Jupyter notebooks for interactive exploration
44+
│ ├── GitHistoryGeneralExploration.ipynb # General exploration (treemaps, charts, wordcloud)
45+
│ └── GitHistoryCorrelationExploration.ipynb # Correlation analysis exploration
46+
├── import/ # Git data import scripts
47+
│ ├── importGit.sh # Git data import orchestrator
48+
│ ├── createGitLogCsv.sh # Full git log → CSV
49+
│ └── createAggregatedGitLogCsv.sh # Aggregated git log → CSV
50+
├── queries/
51+
│ ├── enrichment/ # 23 Cypher queries: import, indexes, relationships, properties
52+
│ ├── statistics/ # 14 Cypher queries: listing and querying for reports
53+
│ └── validation/ # 5 Cypher queries: verification and validation
54+
└── summary/
55+
├── gitHistorySummary.sh # Markdown assembly logic
56+
├── report.template.md # Main report template
57+
└── report_no_git_data.template.md # Fallback: no git data
58+
```
59+
60+
## Prerequisites
61+
62+
This domain requires the following to be in place before running. See [PREREQUISITES.md](./PREREQUISITES.md) for full details.
63+
64+
- Neo4j running with scanned artifacts loaded
65+
- Git history imported (`IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` env var controls the mode)
66+
- `RESOLVES_TO` relationships between `Git:File` nodes and code files
67+
- `CHANGED_TOGETHER_WITH` relationships among git files and among resolved code files
68+
- `numberOfGitCommits` property on code `File` nodes
69+
- `updateCommitCount` property on `Git:File` nodes
70+
- `isMergeCommit`, `isAutomatedCommit` classification properties on commits
71+
- General enrichment: `name`, `extension` properties on `File` nodes
72+
73+
## Output
74+
75+
All output is written to `reports/git-history/` relative to the working directory.
76+
77+
| File | Description |
78+
|------|-------------|
79+
| `List_git_files_with_commit_statistics_by_author.csv` | Per-file commit statistics by author |
80+
| `List_git_files_that_were_changed_together_with_another_file.csv` | Files with co-change partners |
81+
| `List_git_file_directories_with_commit_statistics.csv` | Directory-level commit statistics |
82+
| `List_git_files_per_commit_distribution.csv` | Distribution of changed file counts per commit |
83+
| `List_pairwise_changed_files_with_dependencies.csv` | Co-changed file pairs that also have code dependencies |
84+
| `List_pairwise_changed_files.csv` | All pairwise changed file pairs with co-change metrics |
85+
| `List_pairwise_changed_files_top_count.csv` | Top co-changed pairs by commit count |
86+
| `List_pairwise_changed_files_top_min_confidence.csv` | Top co-changed pairs by minimum confidence |
87+
| `List_pairwise_changed_files_top_jaccard.csv` | Top co-changed pairs by Jaccard similarity |
88+
| `List_pairwise_changed_files_top_lift.csv` | Top co-changed pairs by lift |
89+
| `List_git_files_by_resolved_label_and_extension.csv` | File resolution statistics |
90+
| `List_ambiguous_git_files.csv` | Data quality: files with ambiguous resolution |
91+
| `List_unresolved_git_files.csv` | Data quality: unresolved git files |
92+
| `Words_for_git_author_Wordcloud_with_frequency.csv` | Author words for wordcloud |
93+
| `*.svg` | SVG chart files generated by `gitHistoryCharts.py` |
94+
| `git_history_report.md` | Final assembled Markdown report |
