Merged
275 changes: 275 additions & 0 deletions .github/prompts/plan-git-history-domain.prompt.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions COMMANDS.md
@@ -282,7 +282,7 @@ Be aware that this script deletes all previous relationships and nodes in the lo

### Import git data

-Use [importGit.sh](./scripts/importGit.sh) to import git data into the Graph.
+Use [importGit.sh](./domains/git-history/import/importGit.sh) to import git data into the Graph.
It uses `git log` to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:

```Cypher
@@ -300,7 +300,7 @@ It uses `git log` to extract commits, their authors and the names of the files c
Instead of importing every single commit, changes can be grouped by month including their commit count. This is in many cases sufficient and reduces data size and processing time significantly. To do this, set the environment variable `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` to `aggregated`. If you don't want to set the environment variable globally, then you can also prepend the command with it like this (inside the analysis workspace directory contained within temp):

```shell
-IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/importGit.sh
+IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../domains/git-history/import/importGit.sh
```

Here is the resulting schema:
@@ -322,9 +322,9 @@ The optional parameter `--source directory-path-to-the-source-folder-containing-

#### Resolving git files to code files

-After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.
+After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.

-You can use [List_unresolved_git_files.cypher](./cypher/GitLog/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./cypher/GitLog/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).
+You can use [List_unresolved_git_files.cypher](./domains/git-history/queries/statistics/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./domains/git-history/queries/statistics/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).

## Database Queries

4 changes: 2 additions & 2 deletions README.md
@@ -47,7 +47,7 @@ Curious? Explore the examples at [code-graph-analysis-examples](https://github.c
Here is an overview of [Jupyter Notebooks](https://jupyter.org) reports from [code-graph-analysis-examples](https://github.com/JohT/code-graph-analysis-examples). For a complete list, see the [Jupyter Notebook Report Reference](#page_with_curl-jupyter-notebook-report-reference).

- [External Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/external-dependencies-java/ExternalDependenciesJava.md) contains detailed information about external library usage ([Notebook](./domains/external-dependencies/explore/ExternalDependenciesJava.ipynb)).
-- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./jupyter/GitHistoryGeneral.ipynb)).
+- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./domains/git-history/explore/GitHistoryGeneralExploration.ipynb)).
- [Internal Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/internal-dependencies-java/InternalDependenciesJava.md) is based on [Analyze java package metrics in a graph database](https://joht.github.io/johtizen/data/2023/04/21/java-package-metrics-analysis.html) and also includes cyclic dependencies ([Notebook](./domains/internal-dependencies/explore/InternalDependenciesJava.ipynb)).
- [Method Metrics](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/method-metrics-java/MethodMetricsJava.md) shows how the effective number of lines of code and the cyclomatic complexity are distributed across the methods in the code ([Notebook](./jupyter/MethodMetricsJava.ipynb)).
- [Node Embeddings](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/node-embeddings-java/NodeEmbeddingsJava.md) shows how to generate node embeddings and to further reduce their dimensionality to be able to visualize them in a 2D plot ([Notebook](./jupyter/NodeEmbeddingsJava.ipynb)).
@@ -127,7 +127,7 @@ This could be as simple as running the following command in your Typescript proj
npx --yes @jqassistant/ts-lce
```

-- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./scripts/importGit.sh).
+- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./domains/git-history/import/importGit.sh).

## :rocket: Getting Started

3 changes: 0 additions & 3 deletions cypher/GitLog/Index_commit_sha.cypher

This file was deleted.

3 changes: 0 additions & 3 deletions cypher/GitLog/Index_file_relative_path.cypher

This file was deleted.

103 changes: 103 additions & 0 deletions domains/git-history/COPIED_FILES.md
@@ -0,0 +1,103 @@
# Copied Files Tracking

This document maps every original file that was copied into this domain to its copy location.
It exists to support a future deprecation follow-up task that will remove or migrate the originals
once this domain is the canonical implementation.

> **Breaking change notice:** Output directory has changed from `reports/git-history-csv` to `reports/git-history`.
> When the old `scripts/reports/GitHistoryCsv.sh` is eventually removed, a **major version bump** is required.
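
Consumers that read the report directory may need to cope with both names during the transition. The following is a minimal sketch of such a lookup, not part of the pipeline: the function name `find_git_history_report_dir` is hypothetical, and it assumes both directories live directly under one reports root.

```shell
# Sketch only (hypothetical helper, not part of the pipeline):
# prints the git history report directory, preferring the new name
# and falling back to the old one. Returns 1 if neither exists.
find_git_history_report_dir() {
  # $1 = reports root directory (assumed to contain the report folders)
  if [ -d "$1/git-history" ]; then
    printf '%s\n' "$1/git-history"
  elif [ -d "$1/git-history-csv" ]; then
    printf '%s\n' "$1/git-history-csv"
  else
    return 1
  fi
}
```

A caller could use `report_dir=$(find_git_history_report_dir reports) || echo "no git history report found"`.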

---

## Cypher Queries

### Enrichment Queries (23 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/Import_git_log_csv_data.cypher` | `queries/enrichment/Import_git_log_csv_data.cypher` |
| `cypher/GitLog/Import_aggregated_git_log_csv_data.cypher` | `queries/enrichment/Import_aggregated_git_log_csv_data.cypher` |
| `cypher/GitLog/Create_git_repository_node.cypher` | `queries/enrichment/Create_git_repository_node.cypher` |
| `cypher/GitLog/Delete_git_log_data.cypher` | `queries/enrichment/Delete_git_log_data.cypher` |
| `cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher` | `queries/enrichment/Delete_plain_git_directory_file_nodes.cypher` |
| `cypher/GitLog/Index_absolute_file_name.cypher` | `queries/enrichment/Index_absolute_file_name.cypher` |
| `cypher/GitLog/Index_author_name.cypher` | `queries/enrichment/Index_author_name.cypher` |
| `cypher/GitLog/Index_change_span_year.cypher` | `queries/enrichment/Index_change_span_year.cypher` |
| `cypher/GitLog/Index_commit_hash.cypher` | `queries/enrichment/Index_commit_hash.cypher` |
| `cypher/GitLog/Index_commit_parent.cypher` | `queries/enrichment/Index_commit_parent.cypher` |
| `cypher/GitLog/Index_commit_sha.cypher` | `queries/enrichment/Index_commit_sha.cypher` |
| `cypher/GitLog/Index_file_name.cypher` | `queries/enrichment/Index_file_name.cypher` |
| `cypher/GitLog/Index_file_relative_path.cypher` | `queries/enrichment/Index_file_relative_path.cypher` |
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` |
| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` |
| `cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher` | `queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher` |
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` |
| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` |
| `cypher/GitLog/Set_commit_classification_properties.cypher` | `queries/enrichment/Set_commit_classification_properties.cypher` |
| `cypher/GitLog/Set_number_of_aggregated_git_commits.cypher` | `queries/enrichment/Set_number_of_aggregated_git_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_log_commits.cypher` | `queries/enrichment/Set_number_of_git_log_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_plugin_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_commits.cypher` |
| `cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_update_commits.cypher` |

> **Note:** The 23 enrichment query files listed above break down as follows: import (2), repository (1), deletion (2), indexes (8), relationships (5), properties (5).
> The `Verify_*` queries and `ValidateGitHistory.cypher` are not counted here. They were placed in `queries/validation/` and are listed under Validation Queries.

### Statistics Queries (14 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/List_ambiguous_git_files.cypher` | `queries/statistics/List_ambiguous_git_files.cypher` |
| `cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher` | `queries/statistics/List_git_file_directories_with_commit_statistics.cypher` |
| `cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher` | `queries/statistics/List_git_files_by_resolved_label_and_extension.cypher` |
| `cypher/GitLog/List_git_files_per_commit_distribution.cypher` | `queries/statistics/List_git_files_per_commit_distribution.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together.cypher` | `queries/statistics/List_git_files_that_were_changed_together.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher` |
| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` |
| `cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher` | `queries/statistics/List_git_files_with_commit_statistics_by_author.cypher` |
| `cypher/GitLog/List_pairwise_changed_files.cypher` | `queries/statistics/List_pairwise_changed_files.cypher` |
| `cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher` | `queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher` |
| `cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher` | `queries/statistics/List_pairwise_changed_files_with_dependencies.cypher` |
| `cypher/GitLog/List_unresolved_git_files.cypher` | `queries/statistics/List_unresolved_git_files.cypher` |
| `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` | `queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher` |

### Validation Queries (5 files)

| Original | Copy |
|----------|------|
| `cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher` | `queries/validation/Verify_code_to_git_file_unambiguous.cypher` |
| `cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` | `queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` |
| `cypher/GitLog/Verify_git_missing_create_date.cypher` | `queries/validation/Verify_git_missing_create_date.cypher` |
| `cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher` | `queries/validation/Verify_git_to_code_file_unambiguous.cypher` |
| `cypher/Validation/ValidateGitHistory.cypher` | `queries/validation/ValidateGitHistory.cypher` |
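
Until the originals are removed, it can help to check that the copies have not drifted. The sketch below assumes the query copies are meant to stay byte-identical to their originals (the tables above list no changes for them, but that is an assumption). The function name `report_copy_drift` and the tab-separated input format are hypothetical.

```shell
# Sketch only (hypothetical helper, not part of the pipeline):
# reads "original<TAB>copy" pairs from stdin and prints one line
# for every pair that is missing a file or whose contents differ.
report_copy_drift() {
  while IFS="$(printf '\t')" read -r drift_original drift_copy; do
    if [ ! -f "${drift_original}" ] || [ ! -f "${drift_copy}" ]; then
      printf 'MISSING %s -> %s\n' "${drift_original}" "${drift_copy}"
    elif ! cmp -s "${drift_original}" "${drift_copy}"; then
      printf 'DIFFERS %s -> %s\n' "${drift_original}" "${drift_copy}"
    fi
  done
}

# Example, run from the repository root:
#   printf '%s\t%s\n' \
#     'cypher/GitLog/List_unresolved_git_files.cypher' \
#     'domains/git-history/queries/statistics/List_unresolved_git_files.cypher' \
#     | report_copy_drift
```

Pairs that match produce no output, so an empty result means no drift was detected for the pairs supplied.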

---

## Import Scripts (3 files)

| Original | Copy | Changes |
|----------|------|---------|
| `scripts/importGit.sh` | `import/importGit.sh` | Updated `GIT_LOG_CYPHER_DIR` to `../queries/enrichment/`; updated sourced script paths |
| `scripts/createGitLogCsv.sh` | `import/createGitLogCsv.sh` | No changes |
| `scripts/createAggregatedGitLogCsv.sh` | `import/createAggregatedGitLogCsv.sh` | No changes |
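
The `GIT_LOG_CYPHER_DIR` change amounts to resolving the query directory relative to the script's own location. The following is a minimal sketch of that idea, not the actual code of `importGit.sh`: the function name `resolve_enrichment_query_dir` is hypothetical.

```shell
# Sketch only (hypothetical helper, not taken from importGit.sh):
# prints the absolute path of ../queries/enrichment relative to the
# given script directory. Fails if that directory does not exist.
resolve_enrichment_query_dir() {
  # $1 = directory that contains the import script
  ( CDPATH='' cd -- "$1/../queries/enrichment" && pwd )
}

# Example inside a script (assumes the script is executed, not sourced):
#   GIT_LOG_CYPHER_DIR=$(resolve_enrichment_query_dir "$(dirname -- "$0")")
```

Running `cd` in a subshell keeps the caller's working directory unchanged.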

---

## Jupyter Notebooks (2 files)

| Original | Copy | Metadata Change |
|----------|------|-----------------|
| `jupyter/GitHistoryGeneral.ipynb` | `explore/GitHistoryGeneralExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |
| `jupyter/GitHistoryExploration.ipynb` | `explore/GitHistoryCorrelationExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title |

---

## Scripts Referenced but NOT Copied (Central Pipeline)

These scripts are sourced from the central `scripts/` directory and are not duplicated:

| Script | Domain Usage |
|--------|-------------|
| `scripts/executeQueryFunctions.sh` | Sourced by all entry point scripts |
| `scripts/cleanupAfterReportGeneration.sh` | Sourced by CSV entry point after report generation |
| `scripts/markdown/embedMarkdownIncludes.sh` | Sourced by summary script for Markdown assembly |
78 changes: 78 additions & 0 deletions domains/git-history/PREREQUISITES.md
@@ -0,0 +1,78 @@
# Git History Domain — Prerequisites

The following prerequisites are provided by the central pipeline and must be in place **before** this domain executes.
They are not copied into this domain; they are sourced or referenced from the central pipeline locations.

---

## 1. Neo4j Running with Scanned Artifacts

Neo4j must be running and all artifacts must have been scanned and loaded into the graph database
before any script in this domain is executed.

See the main [README.md](../../README.md) and [GETTING_STARTED.md](../../GETTING_STARTED.md) for setup instructions.

---

## 2. Git History Imported

Git history data must have been imported into the graph database. The import mode is controlled by the following environment variable:

```shell
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="plugin" # Recommended default
```

Options: `"none"`, `"aggregated"`, `"full"`, `"plugin"` (default).

- **`plugin`** (recommended): The jQAssistant git plugin provides `Git:Commit`, `Git:File`, `Git:Author`, and related nodes.
- **`full`**: Full git log CSV import via `createGitLogCsv.sh`.
- **`aggregated`**: Aggregated git log CSV import via `createAggregatedGitLogCsv.sh`.
- **`none`**: Skip git import.

The domain's `import/importGit.sh` script orchestrates this import.
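
The option handling described above can be sketched as a small validation step. This is an illustration only, not the code of `importGit.sh`: the function name `resolve_git_import_mode` is hypothetical, and treating an empty value as `plugin` is an assumption based on the documented default.

```shell
# Sketch only (hypothetical helper, not taken from importGit.sh):
# prints the effective git import mode, or fails for unknown values.
resolve_git_import_mode() {
  # $1 = value of IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT (may be empty)
  import_mode="${1:-plugin}" # assumed fallback: the documented default
  case "${import_mode}" in
    none | aggregated | full | plugin)
      printf '%s\n' "${import_mode}"
      ;;
    *)
      printf 'Unsupported git import mode: %s\n' "${import_mode}" >&2
      return 1
      ;;
  esac
}

# Example:
#   mode=$(resolve_git_import_mode "${IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT:-}")
```

Rejecting unknown values early avoids silently skipping the import because of a typo in the variable.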

> **Note:** The analyzed codebase may have no git history at all.
> All domain entry points handle this case gracefully: `gitHistoryCsv.sh` produces no output
> when all queries return no results; `gitHistoryCharts.py` skips chart generation if CSV files are absent;
> `gitHistoryMarkdown.sh` renders a fallback report if no report directory is found.

---

## 3. Git:File ↔ Code File Relationships

The following relationships must exist (created by `import/importGit.sh`):

| Relationship | Description |
|---|---|
| `(Git:File)-[:RESOLVES_TO]->(File)` | Links git-tracked files to scanned Java/TypeScript code files |
| `(File)-[:CHANGED_TOGETHER_WITH]->(File)` | Co-change relationships between resolved code files |
| `(Git:File)-[:CHANGED_TOGETHER_WITH]->(Git:File)` | Co-change relationships between raw git files |
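
One way to confirm that these relationships exist is to count them per type. The sketch below only builds the Cypher text. The function name `relationship_count_query` is hypothetical, and piping the result into `cypher-shell` is an assumption about the local setup (the pipeline itself sources `scripts/executeQueryFunctions.sh` to run queries).

```shell
# Sketch only (hypothetical helper, not part of the pipeline):
# prints a Cypher query that counts all relationships of one type.
relationship_count_query() {
  # $1 = relationship type, e.g. RESOLVES_TO
  printf 'MATCH ()-[r:%s]->() RETURN count(r) AS relationshipCount;\n' "$1"
}

# Example (assumes cypher-shell is installed and NEO4J_PASSWORD is set):
#   relationship_count_query RESOLVES_TO \
#     | cypher-shell -u neo4j -p "${NEO4J_PASSWORD}" --format plain
```

A count of zero for `RESOLVES_TO` or `CHANGED_TOGETHER_WITH` suggests that the import has not run or that the codebase has no git history.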

---

## 4. Required Properties

| Property | Node | Set By |
|---|---|---|
| `numberOfGitCommits` | `File` (Java/TypeScript) | `Set_number_of_git_log_commits.cypher` or `Set_number_of_git_plugin_commits.cypher` |
| `updateCommitCount` | `Git:File` | `Set_number_of_git_plugin_update_commits.cypher` |
| `isMergeCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |
| `isAutomatedCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` |
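
A similar check can be sketched for the properties in the table. The function names below are hypothetical. `numberOfGitCommits` is left out because the exact labels of the Java/TypeScript `File` nodes are not stated here. Depending on the import mode, a non-zero count is only a hint and not necessarily an error.

```shell
# Sketch only (hypothetical helpers, not part of the pipeline):
# prints a Cypher query that counts nodes lacking a given property.
missing_property_query() {
  # $1 = label expression (e.g. Git:Commit), $2 = property name
  printf 'MATCH (n:%s) WHERE n.%s IS NULL RETURN count(n) AS missingCount;\n' "$1" "$2"
}

# Prints one query per label/property pair taken from the table above.
print_required_property_checks() {
  missing_property_query 'Git:File' 'updateCommitCount'
  missing_property_query 'Git:Commit' 'isMergeCommit'
  missing_property_query 'Git:Commit' 'isAutomatedCommit'
}
```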

---

## 5. General Enrichment

The `name` and `extension` properties on `File` nodes must be set by the general enrichment queries:

**Cypher source:** [`cypher/General_Enrichment/`](../../cypher/General_Enrichment/)

---

## 6. Central Pipeline Scripts (sourced, not copied)

| Script | Purpose |
|---|---|
| `scripts/executeQueryFunctions.sh` | Provides `execute_cypher()` and `execute_cypher_queries_until_results()` functions |
| `scripts/cleanupAfterReportGeneration.sh` | Removes empty CSV files after report generation |
| `scripts/markdown/embedMarkdownIncludes.sh` | Assembles Markdown includes into the final report |
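
The cleanup step can be pictured with the following sketch. It is not the code of `cleanupAfterReportGeneration.sh`: the function name `remove_empty_csv_files` is hypothetical, and it assumes that "empty" means a file size of zero bytes.

```shell
# Sketch only (hypothetical helper, not taken from the central script):
# deletes all zero-byte CSV files below the given report directory.
remove_empty_csv_files() {
  # $1 = report directory to clean up
  find "$1" -type f -name '*.csv' -size 0 -exec rm -f {} +
}

# Example:
#   remove_empty_csv_files reports/git-history
```

CSV files that contain at least a header line are kept under this assumption.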