Commit 25a46e5

Merge pull request #559 from JohT/feature/introduce-git-history-domain
Introduce git-history domain
2 parents dd36767 + f4e074f commit 25a46e5

64 files changed

Lines changed: 2209 additions & 104 deletions

.github/prompts/plan-git-history-domain.prompt.md

Lines changed: 275 additions & 0 deletions
Large diffs are not rendered by default.

COMMANDS.md

Lines changed: 4 additions & 4 deletions
@@ -282,7 +282,7 @@ Be aware that this script deletes all previous relationships and nodes in the lo
 ### Import git data

-Use [importGit.sh](./scripts/importGit.sh) to import git data into the Graph.
+Use [importGit.sh](./domains/git-history/import/importGit.sh) to import git data into the Graph.
 It uses `git log` to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:

 ```Cypher
@@ -300,7 +300,7 @@ It uses `git log` to extract commits, their authors and the names of the files c
 Instead of importing every single commit, changes can be grouped by month including their commit count. This is in many cases sufficient and reduces data size and processing time significantly. To do this, set the environment variable `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` to `aggregated`. If you don't want to set the environment variable globally, then you can also prepend the command with it like this (inside the analysis workspace directory contained within temp):

 ```shell
-IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/importGit.sh
+IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../domains/git-history/import/importGit.sh
 ```

 Here is the resulting schema:
@@ -322,9 +322,9 @@ The optional parameter `--source directory-path-to-the-source-folder-containing-
 #### Resolving git files to code files

-After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.
+After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.

-You can use [List_unresolved_git_files.cypher](./cypher/GitLog/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./cypher/GitLog/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).
+You can use [List_unresolved_git_files.cypher](./domains/git-history/queries/statistics/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./domains/git-history/queries/statistics/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).

 ## Database Queries
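
The monthly grouping that COMMANDS.md describes for the `aggregated` mode can be illustrated with standard tools. The following is a minimal sketch, not the logic of `createAggregatedGitLogCsv.sh`: the function name `count_commits_per_month` is hypothetical, and it only counts commits per month, without the per-file and per-author detail that the real import presumably keeps.

```shell
# Hypothetical helper (not part of the repository).
# Reads one ISO date (YYYY-MM-DD) per line from stdin
# and prints one "YYYY-MM,commit count" row per month.
count_commits_per_month() {
    cut -c1-7 |                      # keep only the "YYYY-MM" prefix of each date
        sort |                       # group identical months next to each other
        uniq -c |                    # count the rows per month
        awk '{ print $2 "," $1 }'    # reorder to "month,count"
}
```

Inside a git working copy, the dates could be supplied with `git log --date=short --format=%ad | count_commits_per_month`.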

README.md

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ Curious? Explore the examples at [code-graph-analysis-examples](https://github.c
 Here is an overview of [Jupyter Notebooks](https://jupyter.org) reports from [code-graph-analysis-examples](https://github.com/JohT/code-graph-analysis-examples). For a complete list, see the [Jupyter Notebook Report Reference](#page_with_curl-jupyter-notebook-report-reference).

 - [External Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/external-dependencies-java/ExternalDependenciesJava.md) contains detailed information about external library usage ([Notebook](./domains/external-dependencies/explore/ExternalDependenciesJava.ipynb)).
-- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./jupyter/GitHistoryGeneral.ipynb)).
+- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./domains/git-history/explore/GitHistoryGeneralExploration.ipynb)).
 - [Internal Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/internal-dependencies-java/InternalDependenciesJava.md) is based on [Analyze java package metrics in a graph database](https://joht.github.io/johtizen/data/2023/04/21/java-package-metrics-analysis.html) and also includes cyclic dependencies ([Notebook](./domains/internal-dependencies/explore/InternalDependenciesJava.ipynb)).
 - [Method Metrics](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/method-metrics-java/MethodMetricsJava.md) shows how the effective number of lines of code and the cyclomatic complexity are distributed across the methods in the code ([Notebook](./jupyter/MethodMetricsJava.ipynb)).
 - [Node Embeddings](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/node-embeddings-java/NodeEmbeddingsJava.md) shows how to generate node embeddings and to further reduce their dimensionality to be able to visualize them in a 2D plot ([Notebook](./jupyter/NodeEmbeddingsJava.ipynb)).
@@ -127,7 +127,7 @@ This could be as simple as running the following command in your Typescript proj
 npx --yes @jqassistant/ts-lce
 ```

-- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./scripts/importGit.sh).
+- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./domains/git-history/import/importGit.sh).

 ## :rocket: Getting Started

cypher/GitLog/Index_commit_sha.cypher

Lines changed: 0 additions & 3 deletions
This file was deleted.

cypher/GitLog/Index_file_relative_path.cypher

Lines changed: 0 additions & 3 deletions
This file was deleted.
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# Copied Files Tracking

This document maps every original file that was copied into this domain to its copy location.
It exists to support a future deprecation follow-up task that will remove or migrate the originals
once this domain is the canonical implementation.

> **Breaking change notice:** Output directory has changed from `reports/git-history-csv` to `reports/git-history`.
> When the old `scripts/reports/GitHistoryCsv.sh` is eventually removed, a **major version bump** is required.

---

## Cypher Queries

### Enrichment Queries (23 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/Import_git_log_csv_data.cypher` | `queries/enrichment/Import_git_log_csv_data.cypher` |
| `cypher/GitLog/Import_aggregated_git_log_csv_data.cypher` | `queries/enrichment/Import_aggregated_git_log_csv_data.cypher` |
| `cypher/GitLog/Create_git_repository_node.cypher` | `queries/enrichment/Create_git_repository_node.cypher` |
| `cypher/GitLog/Delete_git_log_data.cypher` | `queries/enrichment/Delete_git_log_data.cypher` |
| `cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher` | `queries/enrichment/Delete_plain_git_directory_file_nodes.cypher` |
| `cypher/GitLog/Index_absolute_file_name.cypher` | `queries/enrichment/Index_absolute_file_name.cypher` |
| `cypher/GitLog/Index_author_name.cypher` | `queries/enrichment/Index_author_name.cypher` |
| `cypher/GitLog/Index_change_span_year.cypher` | `queries/enrichment/Index_change_span_year.cypher` |
| `cypher/GitLog/Index_commit_hash.cypher` | `queries/enrichment/Index_commit_hash.cypher` |
| `cypher/GitLog/Index_commit_parent.cypher` | `queries/enrichment/Index_commit_parent.cypher` |
| `cypher/GitLog/Index_commit_sha.cypher` | `queries/enrichment/Index_commit_sha.cypher` |
| `cypher/GitLog/Index_file_name.cypher` | `queries/enrichment/Index_file_name.cypher` |
| `cypher/GitLog/Index_file_relative_path.cypher` | `queries/enrichment/Index_file_relative_path.cypher` |
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` |
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` |
| `cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher` | `queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher` |
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` |
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` |
| `cypher/GitLog/Set_commit_classification_properties.cypher` | `queries/enrichment/Set_commit_classification_properties.cypher` |
| `cypher/GitLog/Set_number_of_aggregated_git_commits.cypher` | `queries/enrichment/Set_number_of_aggregated_git_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_log_commits.cypher` | `queries/enrichment/Set_number_of_git_log_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_plugin_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_update_commits.cypher` |

> **Note:** The table above lists 23 enrichment query files: import (2), repository (1), deletion (2), indexes (8), relationships (5), properties (5).
> The 5 validation queries (4 `Verify_*` files and `ValidateGitHistory.cypher`) were placed in `queries/validation/` and are listed separately below.

### Statistics Queries (14 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/List_ambiguous_git_files.cypher` | `queries/statistics/List_ambiguous_git_files.cypher` |
| `cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher` | `queries/statistics/List_git_file_directories_with_commit_statistics.cypher` |
| `cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher` | `queries/statistics/List_git_files_by_resolved_label_and_extension.cypher` |
| `cypher/GitLog/List_git_files_per_commit_distribution.cypher` | `queries/statistics/List_git_files_per_commit_distribution.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together.cypher` | `queries/statistics/List_git_files_that_were_changed_together.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` |
| `cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher` | `queries/statistics/List_git_files_with_commit_statistics_by_author.cypher` |
| `cypher/GitLog/List_pairwise_changed_files.cypher` | `queries/statistics/List_pairwise_changed_files.cypher` |
| `cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher` | `queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher` |
| `cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher` | `queries/statistics/List_pairwise_changed_files_with_dependencies.cypher` |
| `cypher/GitLog/List_unresolved_git_files.cypher` | `queries/statistics/List_unresolved_git_files.cypher` |
| `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` | `queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher` |

### Validation Queries (5 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher` | `queries/validation/Verify_code_to_git_file_unambiguous.cypher` |
| `cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` | `queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` |
| `cypher/GitLog/Verify_git_missing_create_date.cypher` | `queries/validation/Verify_git_missing_create_date.cypher` |
| `cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher` | `queries/validation/Verify_git_to_code_file_unambiguous.cypher` |
| `cypher/Validation/ValidateGitHistory.cypher` | `queries/validation/ValidateGitHistory.cypher` |

---

## Import Scripts (3 files)

| Original | Copy | Changes |
|----------|------|---------|
| `scripts/importGit.sh` | `import/importGit.sh` | Updated `GIT_LOG_CYPHER_DIR` to `../queries/enrichment/`; updated sourced script paths |
| `scripts/createGitLogCsv.sh` | `import/createGitLogCsv.sh` | No changes |
| `scripts/createAggregatedGitLogCsv.sh` | `import/createAggregatedGitLogCsv.sh` | No changes |

---

## Jupyter Notebooks (2 files)

| Original | Copy | Metadata Change |
|----------|------|-----------------|
| `jupyter/GitHistoryGeneral.ipynb` | `explore/GitHistoryGeneralExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |
| `jupyter/GitHistoryExploration.ipynb` | `explore/GitHistoryCorrelationExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |

---

## Scripts Referenced but NOT Copied (Central Pipeline)

These scripts are sourced from the central `scripts/` directory and are not duplicated:

| Script | Domain Usage |
|--------|-------------|
| `scripts/executeQueryFunctions.sh` | Sourced by all entry point scripts |
| `scripts/cleanupAfterReportGeneration.sh` | Sourced by CSV entry point after report generation |
| `scripts/markdown/embedMarkdownIncludes.sh` | Sourced by summary script for Markdown assembly |

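
For the deprecation follow-up that this tracking document prepares, it may help to check whether a copy still matches its original before the original is removed. The following is a minimal sketch under that assumption. The helper name `check_copied_file` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the repository).
# Arguments: path of the original file, path of its copy.
# Prints "identical", "differs" or "missing"; returns 0 only for "identical".
check_copied_file() {
    local original="$1" copy="$2"
    if [ ! -f "$original" ] || [ ! -f "$copy" ]; then
        echo "missing"      # at least one side of the mapping does not exist
        return 2
    fi
    if cmp -s "$original" "$copy"; then
        echo "identical"    # byte-for-byte the same content
    else
        echo "differs"      # expected for rows that document changes
        return 1
    fi
}
```

Run from the repository root, a call could look like `check_copied_file cypher/GitLog/List_unresolved_git_files.cypher domains/git-history/queries/statistics/List_unresolved_git_files.cypher`. Rows that document changes, such as `importGit.sh` and the two notebooks, would be expected to report `differs`.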
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Git History Domain — Prerequisites

The following are provided by the central pipeline and must run **before** this domain executes.
They are not copied into this domain; they are sourced or referenced from the central pipeline locations.

---

## 1. Neo4j Running with Scanned Artifacts

Neo4j must be running and all artifacts must have been scanned and loaded into the graph database
before any script in this domain is executed.

See the main [README.md](../../README.md) and [GETTING_STARTED.md](../../GETTING_STARTED.md) for setup instructions.

---

## 2. Git History Imported

Git history data must have been imported into the graph database. Controlled by the environment variable:

```
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="plugin" # Recommended default
```

Options: `"none"`, `"aggregated"`, `"full"`, `"plugin"` (default).

- **`plugin`** (recommended): jQAssistant git plugin provides `Git:Commit`, `Git:File`, `Git:Author`, and related nodes.
- **`full`**: Full git log CSV import via `createGitLogCsv.sh`.
- **`aggregated`**: Aggregated git log CSV import via `createAggregatedGitLogCsv.sh`.
- **`none`**: Skip git import.

The domain's `import/importGit.sh` script orchestrates this import.

> **Note:** The analyzed codebase may have no git history at all.
> All domain entry points handle this case gracefully: `gitHistoryCsv.sh` produces no output
> when all queries are empty; `gitHistoryCharts.py` skips chart generation if CSV files are absent;
> `gitHistoryMarkdown.sh` renders a fallback report if no report directory is found.

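
The four import modes listed in section 2 can be pictured as a simple dispatch on the environment variable. The following is a minimal sketch, not the actual logic of `import/importGit.sh`: the function `describe_git_import_mode` is hypothetical and only echoes the step that the list above associates with each mode. It could be called as `describe_git_import_mode "${IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT:-plugin}"`.

```shell
# Hypothetical sketch (not the actual importGit.sh logic).
# Argument: the import mode; falls back to the documented default "plugin".
describe_git_import_mode() {
    local mode="${1:-plugin}"
    case "$mode" in
        none)       echo "skip git import" ;;
        aggregated) echo "aggregated CSV import via createAggregatedGitLogCsv.sh" ;;
        full)       echo "full CSV import via createGitLogCsv.sh" ;;
        plugin)     echo "use data from the jQAssistant git plugin" ;;
        *)          echo "unknown mode: $mode" >&2; return 1 ;;   # reject anything else
    esac
}
```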

---

## 3. Git:File ↔ Code File Relationships

The following relationships must exist (created by `import/importGit.sh`):

| Relationship | Description |
|---|---|
| `(Git:File)-[:RESOLVES_TO]->(File)` | Links git-tracked files to scanned Java/TypeScript code files |
| `(File)-[:CHANGED_TOGETHER_WITH]->(File)` | Co-change relationships between resolved code files |
| `(Git:File)-[:CHANGED_TOGETHER_WITH]->(Git:File)` | Co-change relationships between raw git files |

---

## 4. Required Properties

| Property | Node | Set By |
|---|---|---|
| `numberOfGitCommits` | `File` (Java/TypeScript) | `Set_number_of_git_log_commits.cypher` or `Set_number_of_git_plugin_commits.cypher` |
| `updateCommitCount` | `Git:File` | `Set_number_of_git_plugin_update_commits.cypher` |
| `isMergeCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |
| `isAutomatedCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |

---

## 5. General Enrichment

The `name` and `extension` properties on `File` nodes must be set by the general enrichment queries:

**Cypher source:** [`cypher/General_Enrichment/`](../../cypher/General_Enrichment/)

---

## 6. Central Pipeline Scripts (sourced, not copied)

| Script | Purpose |
|---|---|
| `scripts/executeQueryFunctions.sh` | Provides `execute_cypher()` and `execute_cypher_queries_until_results()` functions |
| `scripts/cleanupAfterReportGeneration.sh` | Removes empty CSV files after report generation |
| `scripts/markdown/embedMarkdownIncludes.sh` | Assembles Markdown includes into the final report |
