diff --git a/.github/prompts/plan-git-history-domain.prompt.md b/.github/prompts/plan-git-history-domain.prompt.md new file mode 100644 index 000000000..4acfbf190 --- /dev/null +++ b/.github/prompts/plan-git-history-domain.prompt.md @@ -0,0 +1,275 @@ +# Plan: Create Git History Domain + +## TL;DR + +Create `domains/git-history/` as a new vertical-slice domain covering all git history analysis — directory commit statistics, co-changed files, pairwise file metrics, author analysis, and git import orchestration. Copy all 42 GitLog Cypher queries (organized as enrichment/statistics/validation), the import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh), and GitHistoryCsv.sh reporting logic. Convert GitHistoryGeneral.ipynb charts (~20 treemaps, bar charts, histograms) into a standalone Python script. Copy both notebooks into explore/ with validation disabled. Create a Markdown summary report. No moves or deletions of originals. + +## Decisions + +- **Domain name**: `git-history` +- **Cypher organization**: Three subdirectories — `enrichment/` (import, indexing, relationship creation, property setting), `statistics/` (listing/querying for reports), `validation/` (verification queries) +- **importGit.sh handling**: Copy into domain `import/` directory. Keep original. Add TODO comment to `scripts/resetAndScan.sh` reference line noting the dependency direction (core → domain) should be revisited. +- **createGitLogCsv.sh + createAggregatedGitLogCsv.sh**: Copy into domain `import/` +- **Report output directory**: `reports/git-history` (matches domain name, breaking change vs. 
old `reports/git-history-csv/`) +- **GitHistoryGeneral.ipynb**: All ~20 charts converted to Python script +- **GitHistoryExploration.ipynb**: Exploration notebook only (correlation analysis not in report) +- **Wordcloud**: Git author wordcloud included (cypher query `Words_for_git_author_Wordcloud_with_frequency.cypher` copied) +- **Entry point naming**: `gitHistoryCsv.sh`, `gitHistoryPython.sh`, `gitHistoryMarkdown.sh` (no Visualization entry point — no GraphViz graph visualizations in git history) +- **No-git-data handling**: The analyzed codebase may have no git history at all. All entry points must handle this gracefully: `gitHistoryCsv.sh` produces no output files (cleanup removes empty CSVs → no report dir created); `gitHistoryCharts.py` skips chart generation if input CSVs are absent; `gitHistoryMarkdown.sh` detects absence of the report dir and renders `report_no_git_data.template.md` instead. + +## Domain Directory Structure + +``` +domains/git-history/ +├── README.md +├── PREREQUISITES.md +├── COPIED_FILES.md +├── gitHistoryCsv.sh # Entry point: CSV reports (*Csv.sh) +├── gitHistoryPython.sh # Entry point: Python charts (*Python.sh) +├── gitHistoryMarkdown.sh # Entry point: Markdown summary (*Markdown.sh) +├── gitHistoryCharts.py # Chart generation: treemap, bar, histogram → SVG +├── explore/ +│ ├── GitHistoryGeneralExploration.ipynb +│ └── GitHistoryCorrelationExploration.ipynb +├── import/ +│ ├── importGit.sh # Git data import orchestrator +│ ├── createGitLogCsv.sh # Full git log → CSV +│ └── createAggregatedGitLogCsv.sh # Aggregated git log → CSV +├── queries/ +│ ├── enrichment/ # 26 files: import, indexing, relationships, property setting +│ ├── statistics/ # 13 files: listing and querying for reports +│ └── validation/ # 5 files: verification and validation queries +└── summary/ + ├── gitHistorySummary.sh # Markdown assembly logic + ├── report.template.md # Main report template + └── report_no_git_data.template.md # Fallback: no git data +``` + +## 
Steps + +### Phase 1: Scaffolding & Documentation + +1.1 Create directory structure: `domains/git-history/{explore,import,queries/{enrichment,statistics,validation},summary}` + +1.2 Create `PREREQUISITES.md` documenting external dependencies: + - Neo4j running with scanned artifacts + - Git history imported (importGit.sh or plugin); IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT env var + - Git:File ← RESOLVES_TO → code files (Java + TypeScript) + - CHANGED_TOGETHER_WITH relationships between Git:Files and resolved code files + - numberOfGitCommits property on code File nodes + - updateCommitCount property on Git:File nodes + - isMergeCommit, isAutomatedCommit classification properties on commits + - General enrichment: `name`, `extension` properties on File nodes from cypher/General_Enrichment/ + - executeQueryFunctions.sh, cleanupAfterReportGeneration.sh (central pipeline scripts) + +1.3 Create `COPIED_FILES.md` tracking all original → copy mappings for deprecation follow-up + +1.4 Create `README.md` — domain overview, entry points, folder structure, prerequisites reference, output description + +### Phase 2: Copy Cypher Queries + +2.1 Copy enrichment queries (23 files) from `cypher/GitLog/` → `queries/enrichment/`: + - Import: Import_git_log_csv_data, Import_aggregated_git_log_csv_data + - Repository: Create_git_repository_node + - Deletion: Delete_git_log_data, Delete_plain_git_directory_file_nodes + - Indexes (8): Index_absolute_file_name, Index_author_name, Index_change_span_year, Index_commit_hash, Index_commit_parent, Index_commit_sha, Index_file_name, Index_file_relative_path + - Relationships (5): Add_CHANGED_TOGETHER_WITH_relationships_to_code_files, Add_CHANGED_TOGETHER_WITH_relationships_to_git_files, Add_HAS_PARENT_relationships_to_commits, Add_RESOLVES_TO_relationships_to_git_files_for_Java, Add_RESOLVES_TO_relationships_to_git_files_for_Typescript + - Properties (5): Set_commit_classification_properties, Set_number_of_aggregated_git_commits,
Set_number_of_git_log_commits, Set_number_of_git_plugin_commits, Set_number_of_git_plugin_update_commits + +2.2 Copy statistics queries (13 files) from `cypher/GitLog/` → `queries/statistics/`: + - List_ambiguous_git_files, List_git_file_directories_with_commit_statistics, List_git_files_by_resolved_label_and_extension, List_git_files_per_commit_distribution, List_git_files_that_were_changed_together, List_git_files_that_were_changed_together_all_in_one, List_git_files_that_were_changed_together_with_another_file, List_git_files_that_were_changed_together_with_another_file_all_in_one, List_git_files_with_commit_statistics_by_author, List_pairwise_changed_files, List_pairwise_changed_files_top_selected_metric, List_pairwise_changed_files_with_dependencies, List_unresolved_git_files + - Also copy: `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` + +2.3 Copy validation queries (5 files) from `cypher/GitLog/` → `queries/validation/`: + - Verify_code_to_git_file_unambiguous, Verify_git_missing_CHANGED_TOGETHER_WITH_properties, Verify_git_missing_create_date, Verify_git_to_code_file_unambiguous + - Also copy: `cypher/Validation/ValidateGitHistory.cypher` + +### Phase 3: Copy Import Scripts + +3.1 Copy `scripts/importGit.sh` → `import/importGit.sh` + - Update CYPHER_DIR references to point to `../queries/enrichment/` instead of `${CYPHER_DIR}/GitLog` + - Update sourced scripts references: createGitLogCsv.sh, createAggregatedGitLogCsv.sh to use domain-local paths + +3.2 Copy `scripts/createGitLogCsv.sh` → `import/createGitLogCsv.sh` (no changes needed) + +3.3 Copy `scripts/createAggregatedGitLogCsv.sh` → `import/createAggregatedGitLogCsv.sh` (no changes needed) + +3.4 Add TODO comment to `scripts/resetAndScan.sh` at the `source "${SCRIPTS_DIR}/importGit.sh"` line noting the core → domain dependency direction should be revisited (*depends on 3.1*) + +### Phase 4: Create CSV Entry Point Script (*depends on 2.2*) + +4.1 Create `gitHistoryCsv.sh`: + - 
Follow boilerplate from `internalDependenciesCsv.sh`: BASH_SOURCE/CDPATH directory resolution, `set -o errexit -o pipefail` + - Source `../../scripts/executeQueryFunctions.sh` + - Report name: `git-history`, output to `reports/git-history/` + - Execute statistics queries (adapted from `scripts/reports/GitHistoryCsv.sh`): + - List_git_files_with_commit_statistics_by_author → CSV + - List_git_files_that_were_changed_together_with_another_file → CSV + - List_git_file_directories_with_commit_statistics → CSV + - List_git_files_per_commit_distribution → CSV + - List_pairwise_changed_files_with_dependencies → CSV + - List_pairwise_changed_files_top_selected_metric × 4 metrics (count, min_confidence, jaccard, lift) → CSVs + - Also: List_git_files_by_resolved_label_and_extension, List_ambiguous_git_files, List_unresolved_git_files (for data quality) + - Also: Words_for_git_author_Wordcloud_with_frequency → CSV (for the wordcloud) + - Clean up empty reports via `cleanupAfterReportGeneration.sh` + - **No-data case**: if all queries return empty results, `cleanupAfterReportGeneration.sh` removes all CSVs and the report dir will not exist — this is the signal used downstream + +### Phase 5: Create Python Charts Script (*parallel with Phase 4*) + +5.1 Create `gitHistoryCharts.py`: + - Follow `Parameters` class pattern from `pathFindingCharts.py` and `treemapVisualizations.py` + - CLI: `--report_directory`, `--verbose` arguments + - Neo4j connection via `bolt://localhost:7687` with `NEO4J_INITIAL_PASSWORD` + - Load CSV data from report directory (not querying Neo4j for charts — uses CSV output from Phase 4) + - **No-data case**: if the report directory does not exist or the required CSV files are absent, log a warning and exit 0 without generating any SVGs + +5.2 Data preparation functions (extracted from GitHistoryGeneral.ipynb): + - `add_quantile_limited_column(data_frame, column_name, quantile)` → DataFrame + - `add_rank_column(data_frame, column_name)` → DataFrame + - 
`add_file_extension_column(data_frame, file_path_column)` → DataFrame + - `add_directory_column(data_frame, file_path_column)` → DataFrame (explodes paths into directories) + - `add_directory_name_column(data_frame, directory_column)` → DataFrame + - `add_parent_directory_column(data_frame, directory_column)` → DataFrame + - Aggregation helpers: `get_last_entry`, `collect_as_array`, `second_entry`, `get_flattened_unique_values`, `count_unique_aggregated_values`, `get_most_frequent_entry` + +5.3 Directory commit statistics preparation (the multi-step grouping pipeline from notebook cells 22): + - Query Neo4j for `List_git_files_with_commit_statistics_by_author.cypher` + - Extract author rankings, file extension rankings + - Group by directory+author → group by directory only → add names/parents → final grouping + - Produces the hierarchical directory structure for treemaps + +5.4 Treemap chart functions (~13 charts): + - Number of files per directory + - Most frequent file extension per directory + - Number of commits per directory + - Number of distinct authors per directory + - Directories with very few different authors (low focus) + - Main author per directory + - Second author per directory + - Days since last commit per directory + - Days since last commit per directory (ranked) + - Days since last file creation per directory + - Days since last file creation per directory (ranked) + - Days since last file modification per directory + - Days since last file modification per directory (ranked) + +5.5 Co-change treemap charts (~3 charts): + - Files that likely co-change with others + - Co-changing files max lift + - Co-changing files average lift + +5.6 Bar chart: files per commit distribution (1 chart) + +5.7 Histogram charts (~4 charts, one per metric): + - Co-changed files by commit count + - Co-changed files by commit min confidence + - Co-changed files by commit lift + - Co-changed files by commit Jaccard similarity + +5.8 Git author wordcloud (1 chart — 
using wordcloud library) + +5.9 All charts saved as SVG to `reports/git-history/` + +### Phase 6: Create Python Entry Point (*depends on 5.1*) + +6.1 Create `gitHistoryPython.sh`: + - Follow pattern of `internalDependenciesPython.sh` + - Execute `gitHistoryCharts.py` with `--report_directory` and optional `--verbose` + - Clean up empty reports + +### Phase 7: Create Exploration Notebooks (*parallel with Phase 5*) + +7.1 Copy `jupyter/GitHistoryGeneral.ipynb` → `explore/GitHistoryGeneralExploration.ipynb`: + - Change title from "# git log/history" to "# Git History General Exploration" + - Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"` + - Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/` + +7.2 Copy `jupyter/GitHistoryExploration.ipynb` → `explore/GitHistoryCorrelationExploration.ipynb`: + - Change title from "# git log/history" to "# Git History Correlation Exploration" + - Set metadata: `"code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse"` + - Update cypher file path references from `../cypher/GitLog/` to `../queries/statistics/` + +### Phase 8: Create Markdown Summary (*depends on Phases 4, 5*) + +8.1 Create `summary/report.template.md`: + - Front matter (title, date, version, dataset) + - Section 1: Overview — what to act on first, reading guide + - Section 2: Directory Commit Statistics — treemap charts, tables + - Section 3: Co-Changed Files — treemap charts, top pairwise tables + - Section 4: File Change Distribution — bar chart, statistics + - Section 5: Pairwise Changed Files — tables per metric (count, confidence, Jaccard, lift) + - Section 6: Data Quality — ambiguous files, unresolved files, file resolution statistics + - Section 7: Git Author Wordcloud + - Section 8: Glossary + +8.2 Create `summary/report_no_git_data.template.md`: + - Fallback: "⚠️ No git history data available" + +8.3 Create `summary/gitHistorySummary.sh`: + - Follow pattern of 
`internalDependenciesSummary.sh` + - **No-data detection**: check if the report directory (`reports/git-history/`) exists and contains data; if not, render `report_no_git_data.template.md` as the final report and exit early + - Generate front matter + - Execute queries for Markdown table includes (limited to 10 rows) + - Include SVG chart references + - Assemble final report via embedMarkdownIncludes.sh + +8.4 Create `gitHistoryMarkdown.sh`: + - Follow pattern of `internalDependenciesMarkdown.sh` + - Delegates to `summary/gitHistorySummary.sh` + +## Relevant Files + +**Reference implementations (read, not modified):** +- `domains/internal-dependencies/` — primary reference for domain structure, all entry point patterns, summary assembly +- `domains/anomaly-detection/treemapVisualizations.py` — reference for Python chart script with Neo4j connection +- `domains/anomaly-detection/explore/AnomalyDetectionExploration.ipynb` — reference for ValidateAlwaysFalse metadata +- `domains/internal-dependencies/pathFindingCharts.py` — reference for chart generation patterns +- `.github/prompts/plan-internal_dependencies_domain.prompt.md` — reference plan structure + +**Source files to copy (not modified):** +- `cypher/GitLog/` — all 42 files +- `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` +- `cypher/Validation/ValidateGitHistory.cypher` +- `scripts/reports/GitHistoryCsv.sh` — logic adapted into gitHistoryCsv.sh +- `scripts/importGit.sh` — copied with path adjustments +- `scripts/createGitLogCsv.sh` — copied unchanged +- `scripts/createAggregatedGitLogCsv.sh` — copied unchanged +- `jupyter/GitHistoryGeneral.ipynb` — copied with metadata + title changes +- `jupyter/GitHistoryExploration.ipynb` — copied with metadata + title changes + +**Modified (minimally):** +- `scripts/resetAndScan.sh` — add TODO comment at importGit.sh reference + +**Central scripts sourced (not copied):** +- `scripts/executeQueryFunctions.sh` — provides execute_cypher(), 
execute_cypher_queries_until_results() +- `scripts/cleanupAfterReportGeneration.sh` — removes empty CSV files +- `scripts/markdown/embedMarkdownIncludes.sh` — assembles Markdown includes into final report + +## Verification + +1. Run `shellcheck domains/git-history/*.sh domains/git-history/**/*.sh` — no errors +2. Run `python -m py_compile domains/git-history/gitHistoryCharts.py` — no syntax errors +3. Verify all cypher files copied match originals: `diff cypher/GitLog/ domains/git-history/queries/enrichment/` +4. Verify notebook metadata: `grep "ValidateAlwaysFalse" domains/git-history/explore/*.ipynb` returns matches +5. Verify entry point discovery: `find domains/git-history -name "*Csv.sh" -o -name "*Python.sh" -o -name "*Markdown.sh"` returns 3 files +6. Manual: Open exploration notebooks in VS Code, confirm they display correctly +7. Integration test (if Neo4j available): Run `gitHistoryCsv.sh` and verify CSV files in `reports/git-history/` + +## Scope Boundaries + +**Included:** +- All 42 GitLog cypher queries + 2 external (wordcloud, validation) +- Import scripts (importGit.sh, createGitLogCsv.sh, createAggregatedGitLogCsv.sh) +- CSV reporting logic from GitHistoryCsv.sh +- All ~20 charts from GitHistoryGeneral.ipynb as Python SVGs +- Git author wordcloud +- Both exploration notebooks +- Markdown summary report with tables, charts, glossary +- TODO comment on resetAndScan.sh + +**Excluded:** +- No Visualization entry point (no GraphViz graphs in git history) +- No move/deletion of originals +- Correlation analysis stays exploration-only (no Python script for scatter plots) +- No changes to central pipeline discovery mechanism +- General_Enrichment cypher not copied (documented as prerequisite) diff --git a/COMMANDS.md b/COMMANDS.md index 877699eb9..9a80f6b9d 100644 --- a/COMMANDS.md +++ b/COMMANDS.md @@ -282,7 +282,7 @@ Be aware that this script deletes all previous relationships and nodes in the lo ### Import git data -Use 
[importGit.sh](./scripts/importGit.sh) to import git data into the Graph. +Use [importGit.sh](./domains/git-history/import/importGit.sh) to import git data into the Graph. It uses `git log` to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema: ```Cypher @@ -300,7 +300,7 @@ It uses `git log` to extract commits, their authors and the names of the files c Instead of importing every single commit, changes can be grouped by month including their commit count. This is in many cases sufficient and reduces data size and processing time significantly. To do this, set the environment variable `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` to `aggregated`. If you don't want to set the environment variable globally, then you can also prepend the command with it like this (inside the analysis workspace directory contained within temp): ```shell -IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/importGit.sh +IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../domains/git-history/import/importGit.sh ``` Here is the resulting schema: @@ -322,9 +322,9 @@ The optional parameter `--source directory-path-to-the-source-folder-containing- #### Resolving git files to code files -After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo. 
+After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo. -You can use [List_unresolved_git_files.cypher](./cypher/GitLog/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./cypher/GitLog/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new). +You can use [List_unresolved_git_files.cypher](./domains/git-history/queries/statistics/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./domains/git-history/queries/statistics/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new). ## Database Queries diff --git a/README.md b/README.md index 161ef8471..f1cd9abc2 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ Curious? Explore the examples at [code-graph-analysis-examples](https://github.c Here is an overview of [Jupyter Notebooks](https://jupyter.org) reports from [code-graph-analysis-examples](https://github.com/JohT/code-graph-analysis-examples). For a complete list, see the [Jupyter Notebook Report Reference](#page_with_curl-jupyter-notebook-report-reference). 
- [External Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/external-dependencies-java/ExternalDependenciesJava.md) contains detailed information about external library usage ([Notebook](./domains/external-dependencies/explore/ExternalDependenciesJava.ipynb)). -- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./jupyter/GitHistoryGeneral.ipynb)). +- [Git History](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/git-history-general/GitHistoryGeneral.md) contains information about the git history of the analyzed code ([Notebook](./domains/git-history/explore/GitHistoryGeneralExploration.ipynb)). - [Internal Dependencies](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/internal-dependencies-java/InternalDependenciesJava.md) is based on [Analyze java package metrics in a graph database](https://joht.github.io/johtizen/data/2023/04/21/java-package-metrics-analysis.html) and also includes cyclic dependencies ([Notebook](./domains/internal-dependencies/explore/InternalDependenciesJava.ipynb)). - [Method Metrics](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/method-metrics-java/MethodMetricsJava.md) shows how the effective number of lines of code and the cyclomatic complexity are distributed across the methods in the code ([Notebook](./jupyter/MethodMetricsJava.ipynb)). 
- [Node Embeddings](https://github.com/JohT/code-graph-analysis-examples/blob/main/analysis-results/AxonFramework/latest/node-embeddings-java/NodeEmbeddingsJava.md) shows how to generate node embeddings and to further reduce their dimensionality to be able to visualize them in a 2D plot ([Notebook](./jupyter/NodeEmbeddingsJava.ipynb)). @@ -127,7 +127,7 @@ This could be as simple as running the following command in your Typescript proj npx --yes @jqassistant/ts-lce ``` -- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./scripts/importGit.sh). +- The cloned repository or source project needs to be copied into the directory called `source` within the analysis workspace, so that it will also be picked up during scan by [resetAndScan.sh](./scripts/resetAndScan.sh) and optional [importGit.sh](./domains/git-history/import/importGit.sh). 
## :rocket: Getting Started diff --git a/cypher/GitLog/Index_commit_sha.cypher b/cypher/GitLog/Index_commit_sha.cypher deleted file mode 100644 index 4a923e007..000000000 --- a/cypher/GitLog/Index_commit_sha.cypher +++ /dev/null @@ -1,3 +0,0 @@ -// Create index for git commit sha - -CREATE INDEX INDEX_COMMIT_HASH IF NOT EXISTS FOR (commit:Commit) ON (commit.sha) \ No newline at end of file diff --git a/cypher/GitLog/Index_file_relative_path.cypher b/cypher/GitLog/Index_file_relative_path.cypher deleted file mode 100644 index 418d424d4..000000000 --- a/cypher/GitLog/Index_file_relative_path.cypher +++ /dev/null @@ -1,3 +0,0 @@ -// Create index for the relative file path - -CREATE INDEX INDEX_FILE_NAME IF NOT EXISTS FOR (file:File) ON (file.relativePath) \ No newline at end of file diff --git a/domains/git-history/COPIED_FILES.md b/domains/git-history/COPIED_FILES.md new file mode 100644 index 000000000..341954b83 --- /dev/null +++ b/domains/git-history/COPIED_FILES.md @@ -0,0 +1,103 @@ +# Copied Files Tracking + +This document maps every original file that was copied into this domain to its copy location. +It exists to support a future deprecation follow-up task that will remove or migrate the originals +once this domain is the canonical implementation. + +> **Breaking change notice:** Output directory has changed from `reports/git-history-csv` to `reports/git-history`. +> When the old `scripts/reports/GitHistoryCsv.sh` is eventually removed, a **major version bump** is required. 
+ +--- + +## Cypher Queries + +### Enrichment Queries (23 files) + +| Original | Copy | +|----------|------| +| `cypher/GitLog/Import_git_log_csv_data.cypher` | `queries/enrichment/Import_git_log_csv_data.cypher` | +| `cypher/GitLog/Import_aggregated_git_log_csv_data.cypher` | `queries/enrichment/Import_aggregated_git_log_csv_data.cypher` | +| `cypher/GitLog/Create_git_repository_node.cypher` | `queries/enrichment/Create_git_repository_node.cypher` | +| `cypher/GitLog/Delete_git_log_data.cypher` | `queries/enrichment/Delete_git_log_data.cypher` | +| `cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher` | `queries/enrichment/Delete_plain_git_directory_file_nodes.cypher` | +| `cypher/GitLog/Index_absolute_file_name.cypher` | `queries/enrichment/Index_absolute_file_name.cypher` | +| `cypher/GitLog/Index_author_name.cypher` | `queries/enrichment/Index_author_name.cypher` | +| `cypher/GitLog/Index_change_span_year.cypher` | `queries/enrichment/Index_change_span_year.cypher` | +| `cypher/GitLog/Index_commit_hash.cypher` | `queries/enrichment/Index_commit_hash.cypher` | +| `cypher/GitLog/Index_commit_parent.cypher` | `queries/enrichment/Index_commit_parent.cypher` | +| `cypher/GitLog/Index_commit_sha.cypher` | `queries/enrichment/Index_commit_sha.cypher` | +| `cypher/GitLog/Index_file_name.cypher` | `queries/enrichment/Index_file_name.cypher` | +| `cypher/GitLog/Index_file_relative_path.cypher` | `queries/enrichment/Index_file_relative_path.cypher` | +| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher` | +| `cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` | `queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher` | +| `cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher` | `queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher` | +| 
`cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher` | +| `cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` | `queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher` | +| `cypher/GitLog/Set_commit_classification_properties.cypher` | `queries/enrichment/Set_commit_classification_properties.cypher` | +| `cypher/GitLog/Set_number_of_aggregated_git_commits.cypher` | `queries/enrichment/Set_number_of_aggregated_git_commits.cypher` | +| `cypher/GitLog/Set_number_of_git_log_commits.cypher` | `queries/enrichment/Set_number_of_git_log_commits.cypher` | +| `cypher/GitLog/Set_number_of_git_plugin_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_commits.cypher` | +| `cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher` | `queries/enrichment/Set_number_of_git_plugin_update_commits.cypher` | + +> **Note:** 23 enrichment query files are listed above: import (2), repository (1), deletion (2), indexes (8), relationships (5), properties (5). +> The four `Verify_*` queries from `cypher/GitLog/` are listed under Validation Queries below, together with `cypher/Validation/ValidateGitHistory.cypher`.
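The original → copy mappings above can be spot-checked mechanically, in the spirit of the plan's verification step that diffs copies against originals. The following is a sketch only; the helper name `verify_copies` is illustrative (not part of the plan's scripts) and it assumes every copy keeps its original file name:

```shell
# Hypothetical helper: report copied query files that differ from their originals.
# Usage: verify_copies <original_directory> <copy_directory>
verify_copies() {
    original_directory="$1"
    copy_directory="$2"
    for copy in "${copy_directory}"/*.cypher; do
        [ -e "${copy}" ] || continue # empty directory: nothing to verify
        original="${original_directory}/$(basename "${copy}")"
        if ! diff -q "${original}" "${copy}" >/dev/null 2>&1; then
            echo "MISMATCH: ${copy} differs from ${original}"
        fi
    done
}
```

For example, `verify_copies cypher/GitLog domains/git-history/queries/enrichment` prints one line per diverging file and nothing when all copies match their originals.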
+ +### Statistics Queries (14 files) + +| Original | Copy | +|----------|------| +| `cypher/GitLog/List_ambiguous_git_files.cypher` | `queries/statistics/List_ambiguous_git_files.cypher` | +| `cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher` | `queries/statistics/List_git_file_directories_with_commit_statistics.cypher` | +| `cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher` | `queries/statistics/List_git_files_by_resolved_label_and_extension.cypher` | +| `cypher/GitLog/List_git_files_per_commit_distribution.cypher` | `queries/statistics/List_git_files_per_commit_distribution.cypher` | +| `cypher/GitLog/List_git_files_that_were_changed_together.cypher` | `queries/statistics/List_git_files_that_were_changed_together.cypher` | +| `cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher` | +| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher` | +| `cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` | `queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher` | +| `cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher` | `queries/statistics/List_git_files_with_commit_statistics_by_author.cypher` | +| `cypher/GitLog/List_pairwise_changed_files.cypher` | `queries/statistics/List_pairwise_changed_files.cypher` | +| `cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher` | `queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher` | +| `cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher` | `queries/statistics/List_pairwise_changed_files_with_dependencies.cypher` | +| `cypher/GitLog/List_unresolved_git_files.cypher` | `queries/statistics/List_unresolved_git_files.cypher` | 
+| `cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher` | `queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher` | + +### Validation Queries (5 files) + +| Original | Copy | +|----------|------| +| `cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher` | `queries/validation/Verify_code_to_git_file_unambiguous.cypher` | +| `cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` | `queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher` | +| `cypher/GitLog/Verify_git_missing_create_date.cypher` | `queries/validation/Verify_git_missing_create_date.cypher` | +| `cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher` | `queries/validation/Verify_git_to_code_file_unambiguous.cypher` | +| `cypher/Validation/ValidateGitHistory.cypher` | `queries/validation/ValidateGitHistory.cypher` | + +--- + +## Import Scripts (3 files) + +| Original | Copy | Changes | +|----------|------|---------| +| `scripts/importGit.sh` | `import/importGit.sh` | Updated `GIT_LOG_CYPHER_DIR` to `../queries/enrichment/`; updated sourced script paths | +| `scripts/createGitLogCsv.sh` | `import/createGitLogCsv.sh` | No changes | +| `scripts/createAggregatedGitLogCsv.sh` | `import/createAggregatedGitLogCsv.sh` | No changes | + +--- + +## Jupyter Notebooks (2 files) + +| Original | Copy | Metadata Change | +|----------|------|-----------------| +| `jupyter/GitHistoryGeneral.ipynb` | `explore/GitHistoryGeneralExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title | +| `jupyter/GitHistoryExploration.ipynb` | `explore/GitHistoryCorrelationExploration.ipynb` | Added `"ValidateAlwaysFalse"` metadata; updated cypher paths; changed title | + +--- + +## Scripts Referenced but NOT Copied (Central Pipeline) + +These scripts are sourced from the central `scripts/` directory and are not duplicated: + +| Script | Domain Usage | +|--------|-------------| +| `scripts/executeQueryFunctions.sh` | 
Sourced by all entry point scripts | +| `scripts/cleanupAfterReportGeneration.sh` | Sourced by CSV entry point after report generation | +| `scripts/markdown/embedMarkdownIncludes.sh` | Sourced by summary script for Markdown assembly | diff --git a/domains/git-history/PREREQUISITES.md b/domains/git-history/PREREQUISITES.md new file mode 100644 index 000000000..66ee70f8b --- /dev/null +++ b/domains/git-history/PREREQUISITES.md @@ -0,0 +1,78 @@ +# Git History Domain — Prerequisites + +The following are provided by the central pipeline and must run **before** this domain executes. +They are not copied into this domain; they are sourced or referenced from the central pipeline locations. + +--- + +## 1. Neo4j Running with Scanned Artifacts + +Neo4j must be running and all artifacts must have been scanned and loaded into the graph database +before any script in this domain is executed. + +See the main [README.md](../../README.md) and [GETTING_STARTED.md](../../GETTING_STARTED.md) for setup instructions. + +--- + +## 2. Git History Imported + +Git history data must have been imported into the graph database. Controlled by the environment variable: + +``` +IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="plugin" # Recommended default +``` + +Options: `"none"`, `"aggregated"`, `"full"`, `"plugin"` (default). + +- **`plugin`** (recommended): jQAssistant git plugin provides `Git:Commit`, `Git:File`, `Git:Author`, and related nodes. +- **`full`**: Full git log CSV import via `createGitLogCsv.sh`. +- **`aggregated`**: Aggregated git log CSV import via `createAggregatedGitLogCsv.sh`. +- **`none`**: Skip git import. + +The domain's `import/importGit.sh` script orchestrates this import. + +> **Note:** The analyzed codebase may have no git history at all. 
+> All domain entry points handle this case gracefully: `gitHistoryCsv.sh` produces no output +> when all queries are empty; `gitHistoryCharts.py` skips chart generation if CSV files are absent; +> `gitHistoryMarkdown.sh` renders a fallback report if no report directory is found. + +--- + +## 3. Git:File ↔ Code File Relationships + +The following relationships must exist (created by `import/importGit.sh`): + +| Relationship | Description | +|---|---| +| `(Git:File)-[:RESOLVES_TO]->(File)` | Links git-tracked files to scanned Java/TypeScript code files | +| `(File)-[:CHANGED_TOGETHER_WITH]->(File)` | Co-change relationships between resolved code files | +| `(Git:File)-[:CHANGED_TOGETHER_WITH]->(Git:File)` | Co-change relationships between raw git files | + +--- + +## 4. Required Properties + +| Property | Node | Set By | +|---|---|---| +| `numberOfGitCommits` | `File` (Java/TypeScript) | `Set_number_of_git_log_commits.cypher` or `Set_number_of_git_plugin_commits.cypher` | +| `updateCommitCount` | `Git:File` | `Set_number_of_git_plugin_update_commits.cypher` | +| `isMergeCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` | +| `isAutomatedCommit` | `Git:Commit` | `Set_commit_classification_properties.cypher` | + +--- + +## 5. General Enrichment + +The `name` and `extension` properties on `File` nodes must be set by the general enrichment queries: + +**Cypher source:** [`cypher/General_Enrichment/`](../../cypher/General_Enrichment/) + +--- + +## 6. 
Central Pipeline Scripts (sourced, not copied) + +| Script | Purpose | +|---|---| +| `scripts/executeQueryFunctions.sh` | Provides `execute_cypher()` and `execute_cypher_queries_until_results()` functions | +| `scripts/cleanupAfterReportGeneration.sh` | Removes empty CSV files after report generation | +| `scripts/markdown/embedMarkdownIncludes.sh` | Assembles Markdown includes into the final report | diff --git a/domains/git-history/README.md b/domains/git-history/README.md new file mode 100644 index 000000000..6ebb21f29 --- /dev/null +++ b/domains/git-history/README.md @@ -0,0 +1,94 @@ +# Git History Domain + +This directory contains the implementation and resources for analysing **git history** within the Code Graph Analysis Pipeline. It follows the vertical-slice domain pattern: all Cypher queries, Python chart scripts, shell scripts, and report templates needed for this analysis live here. + +This domain covers all git history analysis areas: + +- **Directory commit statistics**: How often files in each directory are committed, by whom, and when — as hierarchical treemap charts. +- **Co-changed files**: Files that tend to be modified together in the same commit — indicating hidden coupling. +- **Pairwise file metrics**: Quantified co-change metrics per file pair: commit count, confidence, Jaccard similarity, lift. +- **Author analysis**: Which authors contribute most, which directories have very few contributors (low bus-factor). +- **Git data quality**: Ambiguous and unresolved git files; file resolution statistics. +- **Git author wordcloud**: Visualization of all contributing authors weighted by commit frequency. + +## Entry Points + +The following scripts are discovered and invoked automatically by the central compilation scripts in [scripts/reports/compilations/](../../scripts/reports/compilations/). They are found by filename pattern. + +- [gitHistoryCsv.sh](./gitHistoryCsv.sh): Entry point for CSV reports based on Cypher queries. 
Discovered by `CsvReports.sh` (`*Csv.sh` pattern). +- [gitHistoryPython.sh](./gitHistoryPython.sh): Entry point for Python-based SVG chart generation. Discovered by `PythonReports.sh` (`*Python.sh` pattern). +- [gitHistoryMarkdown.sh](./gitHistoryMarkdown.sh): Entry point for the Markdown summary report. Discovered by `MarkdownReports.sh` (`*Markdown.sh` pattern). + +> **Note:** There is no Visualization entry point — git history analysis generates no GraphViz graph visualizations. + +## No-Git-Data Handling + +The analyzed codebase may have no git history at all. All entry points handle this gracefully: + +- `gitHistoryCsv.sh`: Produces no output if all queries return empty results (cleanup removes empty CSVs). No report directory is created. +- `gitHistoryCharts.py`: Skips chart generation if the report directory or required CSV files are absent. Exits with code 0. +- `gitHistoryMarkdown.sh`: Detects the absence of the report directory and renders `summary/report_no_git_data.template.md` instead. 
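
In Python terms, the guard used by `gitHistoryCharts.py` can be sketched as follows (a minimal illustration; the `should_generate_charts` helper, the `main` wrapper, and the choice of probe file are simplified stand-ins, not the literal implementation):

```python
import os

# Sketch of the no-git-data guard: if the report directory or a CSV produced by
# gitHistoryCsv.sh is missing, skip chart generation and exit successfully.
def should_generate_charts(report_directory: str) -> bool:
    # One of the CSV files written by gitHistoryCsv.sh serves as a probe:
    # when gitHistoryCsv.sh found no git data, this file does not exist.
    probe_csv = os.path.join(report_directory, "List_git_files_with_commit_statistics_by_author.csv")
    return os.path.isdir(report_directory) and os.path.isfile(probe_csv)

def main(report_directory: str) -> int:
    if not should_generate_charts(report_directory):
        print("gitHistoryCharts: No git history data found. Skipping chart generation.")
        return 0  # Exit code 0: missing git data is expected, not an error
    # ... chart generation would run here ...
    return 0
```

Exiting with code 0 matters because the central compilation scripts invoke every discovered entry point; a codebase without git history should not fail the overall report run.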
+ +## Folder Structure + +``` +domains/git-history/ +├── README.md # This file +├── PREREQUISITES.md # Detailed prerequisite documentation +├── COPIED_FILES.md # Original → copy mapping for deprecation follow-up +├── gitHistoryCsv.sh # Entry point: CSV reports +├── gitHistoryPython.sh # Entry point: Python charts +├── gitHistoryMarkdown.sh # Entry point: Markdown summary +├── gitHistoryCharts.py # Chart generator: treemap, bar, histogram → SVG +├── explore/ # Jupyter notebooks for interactive exploration +│ ├── GitHistoryGeneralExploration.ipynb # General exploration (treemaps, charts, wordcloud) +│ └── GitHistoryCorrelationExploration.ipynb # Correlation analysis exploration +├── import/ # Git data import scripts +│ ├── importGit.sh # Git data import orchestrator +│ ├── createGitLogCsv.sh # Full git log → CSV +│ └── createAggregatedGitLogCsv.sh # Aggregated git log → CSV +├── queries/ +│ ├── enrichment/ # 23 Cypher queries: import, indexes, relationships, properties +│ ├── statistics/ # 14 Cypher queries: listing and querying for reports +│ └── validation/ # 5 Cypher queries: verification and validation +└── summary/ + ├── gitHistorySummary.sh # Markdown assembly logic + ├── report.template.md # Main report template + └── report_no_git_data.template.md # Fallback: no git data +``` + +## Prerequisites + +This domain requires the following to be in place before running. See [PREREQUISITES.md](./PREREQUISITES.md) for full details. 
+ +- Neo4j running with scanned artifacts loaded +- Git history imported (`IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` env var controls the mode) +- `Git:File` ↔ `RESOLVES_TO` ↔ code file relationships +- `CHANGED_TOGETHER_WITH` relationships between git files and resolved code files +- `numberOfGitCommits` property on code `File` nodes +- `updateCommitCount` property on `Git:File` nodes +- `isMergeCommit`, `isAutomatedCommit` classification properties on commits +- General enrichment: `name`, `extension` properties on `File` nodes + +## Output + +All output is written to `reports/git-history/` relative to the working directory. + +| File | Description | +|------|-------------| +| `List_git_files_with_commit_statistics_by_author.csv` | Per-file commit statistics by author | +| `List_git_files_that_were_changed_together_with_another_file.csv` | Files with co-change partners | +| `List_git_file_directories_with_commit_statistics.csv` | Directory-level commit statistics | +| `List_git_files_per_commit_distribution.csv` | Distribution of changed file counts per commit | +| `List_pairwise_changed_files_with_dependencies.csv` | Co-changed file pairs that also have code dependencies | +| `List_pairwise_changed_files.csv` | All pairwise changed file pairs with co-change metrics | +| `List_pairwise_changed_files_top_count.csv` | Top co-changed pairs by commit count | +| `List_pairwise_changed_files_top_min_confidence.csv` | Top co-changed pairs by min confidence | +| `List_pairwise_changed_files_top_jaccard.csv` | Top co-changed pairs by Jaccard similarity | +| `List_pairwise_changed_files_top_lift.csv` | Top co-changed pairs by lift | +| `List_git_files_by_resolved_label_and_extension.csv` | File resolution statistics | +| `List_ambiguous_git_files.csv` | Data quality: files with ambiguous resolution | +| `List_unresolved_git_files.csv` | Data quality: unresolved git files | +| `Words_for_git_author_Wordcloud_with_frequency.csv` | Author words for wordcloud | +| `*.svg` | 
SVG chart files generated by `gitHistoryCharts.py` | +| `git_history_report.md` | Final assembled Markdown report | diff --git a/jupyter/GitHistoryExploration.ipynb b/domains/git-history/explore/GitHistoryCorrelationExploration.ipynb similarity index 98% rename from jupyter/GitHistoryExploration.ipynb rename to domains/git-history/explore/GitHistoryCorrelationExploration.ipynb index b57b96949..806e7c6d0 100644 --- a/jupyter/GitHistoryExploration.ipynb +++ b/domains/git-history/explore/GitHistoryCorrelationExploration.ipynb @@ -6,7 +6,7 @@ "id": "2f0eabc4", "metadata": {}, "source": [ - "# git log/history\n", + "# Git History Correlation Exploration\n", "
\n", "\n", "### References\n", @@ -169,7 +169,7 @@ "metadata": {}, "outputs": [], "source": [ - "pairwise_changed_git_files_with_dependencies = query_cypher_to_data_frame(\"../cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher\")\n", + "pairwise_changed_git_files_with_dependencies = query_cypher_to_data_frame(\"../queries/statistics/List_pairwise_changed_files_with_dependencies.cypher\")\n", "pairwise_changed_git_files_with_dependencies.head(10)" ] }, @@ -420,7 +420,7 @@ "pygments_lexer": "ipython3", "version": "3.12.9" }, - "title": "Git History Charts with Neo4j (Additional Manual Exploration)" + "title": "Git History Correlation Exploration" }, "nbformat": 4, "nbformat_minor": 5 diff --git a/jupyter/GitHistoryGeneral.ipynb b/domains/git-history/explore/GitHistoryGeneralExploration.ipynb similarity index 99% rename from jupyter/GitHistoryGeneral.ipynb rename to domains/git-history/explore/GitHistoryGeneralExploration.ipynb index f4b7fd451..5f482e6d6 100644 --- a/jupyter/GitHistoryGeneral.ipynb +++ b/domains/git-history/explore/GitHistoryGeneralExploration.ipynb @@ -6,7 +6,7 @@ "id": "2f0eabc4", "metadata": {}, "source": [ - "# git log/history\n", + "# Git History General Exploration\n", "
\n", "\n", "### References\n", @@ -545,7 +545,7 @@ "metadata": {}, "outputs": [], "source": [ - "git_files_with_commit_statistics = query_cypher_to_data_frame(\"../cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher\")\n", + "git_files_with_commit_statistics = query_cypher_to_data_frame(\"../queries/statistics/List_git_files_with_commit_statistics_by_author.cypher\")\n", "\n", "# Get all authors, their commit count and based on it their rank in a separate dataframe.\n", "# This will then be needed to visualize the (main) author for each directory.\n", @@ -1194,7 +1194,7 @@ "metadata": {}, "outputs": [], "source": [ - "git_file_count_per_commit = query_cypher_to_data_frame(\"../cypher/GitLog/List_git_files_per_commit_distribution.cypher\")\n", + "git_file_count_per_commit = query_cypher_to_data_frame(\"../queries/statistics/List_git_files_per_commit_distribution.cypher\")\n", "\n", "print(\"Sum of commits that changed more than 30 files (each) = \" + str(git_file_count_per_commit[git_file_count_per_commit['filesPerCommit'] > 30]['commitCount'].sum()))\n", "print(\"Max changed files with one commit = \" + str(git_file_count_per_commit['filesPerCommit'].max()))\n", @@ -1267,7 +1267,7 @@ "metadata": {}, "outputs": [], "source": [ - "data_to_display = query_cypher_to_data_frame(\"../cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher\")\n", + "data_to_display = query_cypher_to_data_frame(\"../queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher\")\n", "\n", "# Debug\n", "# display(\"1. 
pairwise changed files --------------\")\n", @@ -1420,7 +1420,7 @@ "metadata": {}, "outputs": [], "source": [ - "pairwise_changed_git_files = query_cypher_to_data_frame(\"../cypher/GitLog/List_pairwise_changed_files.cypher\")" + "pairwise_changed_git_files = query_cypher_to_data_frame(\"../queries/statistics/List_pairwise_changed_files.cypher\")" ] }, { @@ -1922,7 +1922,7 @@ "outputs": [], "source": [ "# Query data from graph database\n", - "git_author_words_with_frequency = query_cypher_to_data_frame(\"../cypher/Overview/Words_for_git_author_Wordcloud_with_frequency.cypher\")\n", + "git_author_words_with_frequency = query_cypher_to_data_frame(\"../queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher\")\n", "\n", "git_author_words_with_frequency.sort_values(by='frequency', ascending=False).reset_index(drop=True).head(10)" ] @@ -1964,7 +1964,7 @@ "name": "JohT" } ], - "code_graph_analysis_pipeline_data_validation": "ValidateGitHistory", + "code_graph_analysis_pipeline_data_validation": "ValidateAlwaysFalse", "kernelspec": { "display_name": "codegraph", "language": "python", @@ -1982,7 +1982,7 @@ "pygments_lexer": "ipython3", "version": "3.12.9" }, - "title": "Git History Charts with Neo4j" + "title": "Git History General Exploration" }, "nbformat": 4, "nbformat_minor": 5 diff --git a/domains/git-history/gitHistoryCharts.py b/domains/git-history/gitHistoryCharts.py new file mode 100644 index 000000000..d26876e6b --- /dev/null +++ b/domains/git-history/gitHistoryCharts.py @@ -0,0 +1,960 @@ +#!/usr/bin/env python + +# Generates git history charts as SVG files from CSV data produced by gitHistoryCsv.sh. +# Charts are saved to the report directory and referenced by the Markdown summary report. 
+# +# Charts produced: +# Treemaps (directory commit statistics): 13 charts +# Co-change treemaps: 3 charts +# Bar chart (files per commit distribution): 1 chart +# Histograms (co-changed files by metric): 4 charts +# Wordcloud (git authors): 1 chart +# +# Input Parameters: +# --report_directory path to the report directory (contains CSV files from gitHistoryCsv.sh) +# --verbose optional finer-grained logging +# +# Prerequisites: +# - gitHistoryCsv.sh must have run first to produce the required CSV files. +# - If the report directory does not exist or CSVs are absent, exits with 0 without generating anything. + +import os +import sys +import argparse +from typing import Any, cast + +import pandas as pd +import numpy as np + +import matplotlib +matplotlib.use('Agg') # Non-interactive backend — required for headless script execution +import matplotlib.pyplot as plot + +from plotly import graph_objects as plotly_graph_objects +from plotly.express import colors as plotly_colors +from plotly.subplots import make_subplots + +SCRIPT_NAME = "gitHistoryCharts" + +# ── Plotly layout constants ─────────────────────────────────────────────────── + +PLOTLY_MAIN_LAYOUT_BASE_SETTINGS: dict[str, Any] = dict( + margin=dict(t=80, l=15, r=15, b=15), +) +PLOTLY_TREEMAP_FIGURE_SHOW_SETTINGS = dict( + width=1080, + height=1080, + config={"scrollZoom": False, "displaylogo": False, "displayModeBar": False}, +) +PLOTLY_TREEMAP_MARKER_BASE_STYLE = dict( + cornerradius=5, +) +PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE = dict( + **PLOTLY_TREEMAP_MARKER_BASE_STYLE, + colorscale="Hot_r", +) + + +# ── Parameters ──────────────────────────────────────────────────────────────── + +class Parameters: + def __init__(self, report_directory: str, verbose: bool) -> None: + self.report_directory = report_directory + self.verbose = verbose + + def __repr__(self) -> str: + return ( + f"Parameters(" + f"report_directory={self.report_directory!r}, " + f"verbose={self.verbose})" + ) + + @staticmethod + def 
log_dependency_versions() -> None:
+        print("---------------------------------------")
+        print(f"Python version: {sys.version}")
+        from pandas import __version__ as pandas_version
+        print(f"pandas version: {pandas_version}")
+        from numpy import __version__ as numpy_version
+        print(f"numpy version: {numpy_version}")
+        from plotly import __version__ as plotly_version
+        print(f"plotly version: {plotly_version}")
+        from matplotlib import __version__ as matplotlib_version
+        print(f"matplotlib version: {matplotlib_version}")
+        print("---------------------------------------")
+
+
+def parse_parameters() -> Parameters:
+    parser = argparse.ArgumentParser(
+        description="Generates git history charts as SVG files from CSV data."
+    )
+    parser.add_argument(
+        "--report_directory",
+        type=str,
+        default="",
+        help="Path to the report directory containing CSV files from gitHistoryCsv.sh",
+    )
+    parser.add_argument(
+        "--verbose",
+        action="store_true",
+        default=False,
+        help="Enable verbose mode for detailed logging",
+    )
+    args = parser.parse_args()
+    return Parameters(
+        report_directory=args.report_directory,
+        verbose=args.verbose,
+    )
+
+
+# ── CSV loading helpers ──────────────────────────────────────────────────────
+
+def load_csv(report_directory: str, filename: str, verbose: bool) -> pd.DataFrame:
+    """Load a CSV file from the report directory. 
Returns empty DataFrame if file is missing.""" + path = os.path.join(report_directory, filename) + if not os.path.isfile(path): + if verbose: + print(f"{SCRIPT_NAME}: Skipping — CSV not found: {path}") + return pd.DataFrame() + data_frame = pd.read_csv(path) + if verbose: + print(f"{SCRIPT_NAME}: Loaded {len(data_frame)} rows from {path}") + return data_frame + + +def get_plotly_figure_write_image_settings(report_directory: str, name: str) -> dict: + """Returns the settings for the plotly figure write_image method.""" + return dict( + file=os.path.join(report_directory, name + ".svg"), + format="svg", + width=1080, + height=1080, + ) + + +# ── Data preparation functions ──────────────────────────────────────────────── + +def add_quantile_limited_column(input_data_frame: pd.DataFrame, column_name: str, quantile: float = 0.95) -> pd.DataFrame: + """Limits the values of the given column to the given quantile (caps rather than filters).""" + data_frame = input_data_frame.copy() + column_values = data_frame[column_name] + column_limit = column_values.quantile(quantile) + data_frame[column_name + "_limited"] = np.where(column_values > column_limit, column_limit, column_values) + return data_frame + + +def add_rank_column(input_data_frame: pd.DataFrame, column_name: str) -> pd.DataFrame: + """Adds a dense rank column based on the given column.""" + data_frame = input_data_frame.copy() + data_frame[column_name + "_rank"] = data_frame[column_name].rank(ascending=True, method="dense") + return data_frame + + +def get_last_entry(values: list): + """Returns the last element of a list.""" + return values[-1] + + +def add_file_extension_column(input_dataframe: pd.DataFrame, file_path_column: str, file_extension_column: str = "fileExtension") -> pd.DataFrame: + """Adds a file extension column derived from the file path.""" + if file_extension_column in input_dataframe.columns: + return input_dataframe + file_path_col_pos = cast(int, 
input_dataframe.columns.get_loc(file_path_column)) + file_extensions = input_dataframe[file_path_column].str.split("/").map(get_last_entry) + file_extensions = file_extensions.str.split(".").map(get_last_entry) + input_dataframe.insert(file_path_col_pos + 1, file_extension_column, file_extensions) + return input_dataframe + + +def remove_last_file_path_element(file_path_elements: list) -> list: + return file_path_elements[:-1] if len(file_path_elements) > 1 else [""] + + +def convert_path_elements_to_directories(file_path_elements: list) -> list: + directories = remove_last_file_path_element(file_path_elements) + return ["/".join(directories[: i + 1]) for i in range(len(directories))] + + +def add_directory_column(input_dataframe: pd.DataFrame, file_path_column: str, directory_column: str = "directoryPath") -> pd.DataFrame: + """Explodes file paths into all ancestor directory paths.""" + if directory_column in input_dataframe.columns: + return input_dataframe + input_dataframe.insert( + 0, + directory_column, + input_dataframe[file_path_column].str.split("/").apply(convert_path_elements_to_directories), + ) + input_dataframe = input_dataframe.explode(directory_column) + return input_dataframe + + +def add_directory_name_column(input_dataframe: pd.DataFrame, directory_column: str = "directoryPath", directory_name_column: str = "directoryName") -> pd.DataFrame: + """Adds the final path component as a name column.""" + if directory_name_column in input_dataframe.columns: + return input_dataframe + splitted = input_dataframe[directory_column].str.rsplit("/", n=1) + input_dataframe.insert(1, directory_name_column, splitted.apply(lambda x: x[-1])) + return input_dataframe + + +def add_parent_directory_column(input_dataframe: pd.DataFrame, directory_column: str = "directoryPath", directory_parent_column: str = "directoryParentPath") -> pd.DataFrame: + """Adds the parent directory path column.""" + if directory_parent_column in input_dataframe.columns: + return 
input_dataframe + splitted = input_dataframe[directory_column].str.rsplit("/", n=1) + input_dataframe.insert(1, directory_parent_column, splitted.apply(lambda x: x[0])) + input_dataframe.loc[ + input_dataframe[directory_parent_column] == input_dataframe[directory_column], + directory_parent_column, + ] = "" + return input_dataframe + + +def collect_as_array(values: pd.Series): + return np.asanyarray(values.to_list()) + + +def second_entry(values: pd.Series): + return values.iloc[1] if len(values) > 1 else None + + +def _to_array(value) -> list: + """Converts a value to a list, splitting comma-separated strings if needed.""" + if isinstance(value, str): + return value.split(",") + return list(value) + + +def get_flattened_unique_values(values: pd.Series): + return np.unique(np.concatenate([_to_array(v) for v in values.to_list()])) + + +def count_unique_aggregated_values(values: pd.Series): + return len(np.unique(np.concatenate([_to_array(v) for v in values.to_list()]))) + + +def get_most_frequent_entry(input_values: pd.Series): + all_values = np.concatenate([_to_array(v) for v in input_values.to_list()]) + unique_vals, counts = np.unique(all_values, return_counts=True) + return unique_vals[counts.argmax()] + + +# ── Directory commit statistics preparation ─────────────────────────────────── + +def prepare_directory_commit_statistics(commit_statistics_data: pd.DataFrame) -> tuple: + """ + Multi-step grouping pipeline that transforms per-file commit statistics + into a hierarchical directory structure suitable for treemaps. 
+ + Returns: (git_files_with_commit_statistics, git_file_authors, git_file_extensions) + """ + # Derive author rankings + git_file_authors = ( + commit_statistics_data[["author", "commitCount"]] + .groupby("author") + .aggregate(authorCommitCount=pd.NamedAgg(column="commitCount", aggfunc="sum")) + .sort_values(by="authorCommitCount", ascending=False) + .reset_index() + ) + git_file_authors["authorCommitCountRank"] = ( + git_file_authors["authorCommitCount"] + .rank(ascending=False, method="dense") + .astype(int) + ) + + # Add file extension column + git_files_with_commit_statistics = add_file_extension_column(commit_statistics_data.copy(), "filePath", "fileExtension") + + # Derive extension rankings + git_file_extensions = ( + git_files_with_commit_statistics["fileExtension"] + .value_counts() + .rename_axis("fileExtension") + .reset_index(name="fileExtensionCount") + ) + git_file_extensions["fileExtensionCountRank"] = ( + git_file_extensions["fileExtensionCount"] + .rank(ascending=False, method="dense") + .astype(int) + ) + + # Explode directories + git_files_with_commit_statistics = add_directory_column(git_files_with_commit_statistics, "filePath", "directoryPath") + + common_named_aggregation = dict( + daysSinceLastCommit=pd.NamedAgg(column="daysSinceLastCommit", aggfunc="min"), + daysSinceLastCreation=pd.NamedAgg(column="daysSinceLastCreation", aggfunc="min"), + daysSinceLastModification=pd.NamedAgg(column="daysSinceLastModification", aggfunc="min"), + lastCommitDate=pd.NamedAgg(column="lastCommitDate", aggfunc="max"), + lastCreationDate=pd.NamedAgg(column="lastCreationDate", aggfunc="max"), + lastModificationDate=pd.NamedAgg(column="lastModificationDate", aggfunc="max"), + maxCommitSha=pd.NamedAgg(column="maxCommitSha", aggfunc="max"), + ) + + # Group by directory + author + git_files_with_commit_statistics = git_files_with_commit_statistics.groupby(["directoryPath", "author"]).aggregate( + filePaths=pd.NamedAgg(column="filePath", aggfunc=np.unique), + 
firstFile=pd.NamedAgg(column="filePath", aggfunc="first"), + fileExtensions=pd.NamedAgg(column="fileExtension", aggfunc=collect_as_array), + commitHashes=pd.NamedAgg(column="commitHashes", aggfunc=get_flattened_unique_values), + intermediateCommitCount=pd.NamedAgg(column="commitHashes", aggfunc="count"), + **common_named_aggregation, + ) + git_files_with_commit_statistics = git_files_with_commit_statistics.sort_values(by=["directoryPath", "intermediateCommitCount"], ascending=[True, False]) + git_files_with_commit_statistics = git_files_with_commit_statistics.reset_index() + + # Group by directory only + git_files_with_commit_statistics = git_files_with_commit_statistics.groupby("directoryPath").aggregate( + fileCount=pd.NamedAgg(column="filePaths", aggfunc=count_unique_aggregated_values), + firstFile=pd.NamedAgg(column="firstFile", aggfunc="first"), + mostFrequentFileExtension=pd.NamedAgg(column="fileExtensions", aggfunc=get_most_frequent_entry), + authorCount=pd.NamedAgg(column="author", aggfunc="nunique"), + mainAuthor=pd.NamedAgg(column="author", aggfunc="first"), + secondAuthor=pd.NamedAgg(column="author", aggfunc=second_entry), + commitCount=pd.NamedAgg(column="commitHashes", aggfunc=count_unique_aggregated_values), + **common_named_aggregation, + ) + git_files_with_commit_statistics = git_files_with_commit_statistics.reset_index() + + # Add directory name and parent + git_files_with_commit_statistics = add_directory_name_column(git_files_with_commit_statistics, "directoryPath", "directoryName") + git_files_with_commit_statistics = add_parent_directory_column(git_files_with_commit_statistics, "directoryPath", "directoryParentPath") + + # Final grouping: consolidate duplicate entries + all_columns_except_directory = git_files_with_commit_statistics.columns.to_list()[3:] + git_files_with_commit_statistics = git_files_with_commit_statistics.groupby(all_columns_except_directory).aggregate( + directoryName=pd.NamedAgg(column="directoryName", aggfunc=lambda names: 
"/".join(names)), + directoryParentPath=pd.NamedAgg(column="directoryParentPath", aggfunc="first"), + directoryPath=pd.NamedAgg(column="directoryPath", aggfunc="last"), + ) + final_column_order = ["directoryPath", "directoryParentPath", "directoryName"] + all_columns_except_directory + git_files_with_commit_statistics = git_files_with_commit_statistics.reset_index()[final_column_order] + + return git_files_with_commit_statistics, git_file_authors, git_file_extensions + + +# ── Treemap creation helpers ────────────────────────────────────────────────── + +def create_treemap_commit_statistics_settings(data_frame: pd.DataFrame) -> plotly_graph_objects.Treemap: + """Creates a Plotly Treemap with the given settings and data frame.""" + return plotly_graph_objects.Treemap( + labels=data_frame["directoryName"], + parents=data_frame["directoryParentPath"], + ids=data_frame["directoryPath"], + customdata=data_frame[ + [ + "fileCount", + "mostFrequentFileExtension", + "commitCount", + "authorCount", + "mainAuthor", + "secondAuthor", + "lastCommitDate", + "daysSinceLastCommit", + "lastCreationDate", + "daysSinceLastCreation", + "lastModificationDate", + "daysSinceLastModification", + "directoryPath", + ] + ], + hovertemplate=( + "%{label}
" + "Files: %{customdata[0]} (%{customdata[1]})
" + "Commits: %{customdata[2]}
" + "Authors: %{customdata[4]}, %{customdata[5]},.. (%{customdata[3]})
" + "Last Commit: %{customdata[6]} (%{customdata[7]} days ago)
" + "Last Created: %{customdata[8]} (%{customdata[9]} days ago)
" + "Last Modified: %{customdata[10]} (%{customdata[11]} days ago)
" + "Path: %{customdata[12]}" + ), + maxdepth=-1, + root_color="lightgrey", + marker=dict(**PLOTLY_TREEMAP_MARKER_BASE_STYLE), + ) + + +def create_rank_colorbar_for_graph_objects_treemap_marker(data_frame: pd.DataFrame, name_column: str, rank_column: str) -> dict: + """Creates a plotly graph_objects.Treemap marker object for a colorbar representing ranked names.""" + inverse_ranked = data_frame[rank_column].max() + 1 - data_frame[rank_column] + return dict( + cornerradius=5, + colors=inverse_ranked, + colorscale=plotly_colors.qualitative.G10, + colorbar=dict( + title="Rank", + tickmode="array", + ticktext=data_frame[name_column], + tickvals=inverse_ranked, + tickfont_size=10, + ), + ) + + +def write_image_and_log(figure: plotly_graph_objects.Figure, report_directory: str, name: str, verbose: bool) -> None: + """Writes the figure as SVG and optionally logs the output path.""" + figure.write_image(**get_plotly_figure_write_image_settings(report_directory, name)) + if verbose: + print(f"{SCRIPT_NAME}: Chart saved: {os.path.join(report_directory, name + '.svg')}") + + +# ── Treemap charts: directory commit statistics ─────────────────────────────── + +def generate_directory_commit_statistic_treemaps( + git_files_with_commit_statistics: pd.DataFrame, + git_file_authors: pd.DataFrame, + git_file_extensions: pd.DataFrame, + report_directory: str, + verbose: bool, +) -> None: + # 1. Number of files per directory + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_files_with_commit_statistics), + values=git_files_with_commit_statistics["fileCount"], + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Directories and their file count") + write_image_and_log(figure, report_directory, "NumberOfFilesPerDirectory", verbose) + + # 2. 
Most frequent file extension per directory + git_files_with_commit_statistics_and_file_extension_rank = pd.merge( + git_files_with_commit_statistics, + git_file_extensions, + left_on="mostFrequentFileExtension", + right_on="fileExtension", + how="left", + validate="m:1", + ) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_files_with_commit_statistics), + marker=create_rank_colorbar_for_graph_objects_treemap_marker(git_files_with_commit_statistics_and_file_extension_rank, "fileExtension", "fileExtensionCountRank"), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Most frequent file extension per directory") + write_image_and_log(figure, report_directory, "MostFrequentFileExtensionPerDirectory", verbose) + + # 3. Number of commits per directory + git_commit_count_per_directory = add_quantile_limited_column(git_files_with_commit_statistics, "commitCount", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_count_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_count_per_directory["commitCount_limited"], + colorbar=dict(title="Commits"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Number of git commits") + write_image_and_log(figure, report_directory, "NumberOfGitCommits", verbose) + + # 4. 
Number of distinct authors per directory + git_commit_authors_per_directory = add_quantile_limited_column(git_files_with_commit_statistics, "authorCount", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_authors_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_authors_per_directory["authorCount_limited"], + colorbar=dict(title="Authors"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Number of distinct commit authors") + write_image_and_log(figure, report_directory, "NumberOfDistinctCommitAuthors", verbose) + + # 5. Directories with very few different authors (low bus-factor, focus = few authors) + git_commit_authors_per_directory_low_focus = add_quantile_limited_column(git_files_with_commit_statistics, "authorCount", 0.33) + author_count_top_limit = git_commit_authors_per_directory_low_focus["authorCount_limited"].max().astype(int).astype(str) + author_count_top_limit_label_alias = {author_count_top_limit: author_count_top_limit + " or more"} + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_authors_per_directory_low_focus), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_authors_per_directory_low_focus["authorCount_limited"], + colorbar=dict( + title="Authors", + tickmode="auto", + labelalias=author_count_top_limit_label_alias, + ), + reversescale=True, + ), + )) + figure.update_layout( + **PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, + title="Number of distinct commit authors (red/black = only one or very few authors)", + ) + write_image_and_log(figure, report_directory, "NumberOfDistinctCommitAuthorsLowFocus", verbose) + + # 6. 
Main author per directory + git_files_with_commit_statistics_and_main_author_rank = pd.merge( + git_files_with_commit_statistics, + git_file_authors, + left_on="mainAuthor", + right_on="author", + how="left", + validate="m:1", + ) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_files_with_commit_statistics), + marker=create_rank_colorbar_for_graph_objects_treemap_marker(git_files_with_commit_statistics_and_main_author_rank, "mainAuthor", "authorCommitCountRank"), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Main authors with highest number of commits") + write_image_and_log(figure, report_directory, "MainAuthorsWithHighestNumberOfCommits", verbose) + + # 7. Second author per directory + git_files_with_commit_statistics_and_second_author_rank = pd.merge( + git_files_with_commit_statistics, + git_file_authors, + left_on="secondAuthor", + right_on="author", + how="left", + validate="m:1", + ) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_files_with_commit_statistics), + marker=create_rank_colorbar_for_graph_objects_treemap_marker(git_files_with_commit_statistics_and_second_author_rank, "secondAuthor", "authorCommitCountRank"), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Second author with the second highest number of commits") + write_image_and_log(figure, report_directory, "SecondAuthorWithTheSecondHighestNumberOfCommits", verbose) + + # 8. 
Days since last commit per directory + git_commit_days_since_last_commit_per_directory = add_quantile_limited_column(git_files_with_commit_statistics, "daysSinceLastCommit", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_commit_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_commit_per_directory["daysSinceLastCommit_limited"], + colorbar=dict(title="Days"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Days since last commit") + write_image_and_log(figure, report_directory, "DaysSinceLastCommit", verbose) + + # 9. Days since last commit per directory (ranked) + git_commit_days_since_last_commit_per_directory = add_rank_column(git_files_with_commit_statistics, "daysSinceLastCommit") + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_commit_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_commit_per_directory["daysSinceLastCommit_rank"], + colorbar=dict(title="Rank"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Rank of days since last commit") + write_image_and_log(figure, report_directory, "DaysSinceLastCommitRanked", verbose) + + # 10. 
Days since last file creation per directory + git_commit_days_since_last_file_creation_per_directory = add_quantile_limited_column(git_files_with_commit_statistics, "daysSinceLastCreation", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_file_creation_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_file_creation_per_directory["daysSinceLastCreation_limited"], + colorbar=dict(title="Days"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Days since last file creation") + write_image_and_log(figure, report_directory, "DaysSinceLastFileCreation", verbose) + + # 11. Days since last file creation per directory (ranked) + git_commit_days_since_last_file_creation_per_directory = add_rank_column(git_files_with_commit_statistics, "daysSinceLastCreation") + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_file_creation_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_file_creation_per_directory["daysSinceLastCreation_rank"], + colorbar=dict(title="Rank"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Rank of days since last file creation") + write_image_and_log(figure, report_directory, "DaysSinceLastFileCreationRanked", verbose) + + # 12. 
Days since last file modification per directory + git_commit_days_since_last_file_modification_per_directory = add_quantile_limited_column(git_files_with_commit_statistics, "daysSinceLastModification", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_file_modification_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_file_modification_per_directory["daysSinceLastModification_limited"], + colorbar=dict(title="Days"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Days since last file modification") + write_image_and_log(figure, report_directory, "DaysSinceLastFileModification", verbose) + + # 13. Days since last file modification per directory (ranked) + git_commit_days_since_last_file_modification_per_directory = add_rank_column(git_files_with_commit_statistics, "daysSinceLastModification") + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(git_commit_days_since_last_file_modification_per_directory), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=git_commit_days_since_last_file_modification_per_directory["daysSinceLastModification_rank"], + colorbar=dict(title="Rank"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Rank of days since last file modification") + write_image_and_log(figure, report_directory, "DaysSinceLastFileModificationRanked", verbose) + + +# ── Co-change treemap charts ────────────────────────────────────────────────── + +def generate_cochange_treemaps( + git_files_with_commit_statistics: pd.DataFrame, + cochange_data: pd.DataFrame, + report_directory: str, + verbose: bool, +) -> None: + # Prepare co-change data + data_to_display = add_directory_column(cochange_data.copy(), "filePath", "directoryPath") + data_to_display = ( + 
data_to_display.groupby(["directoryPath"]) + .aggregate( + pairwiseChangeCommitCount=pd.NamedAgg(column="commitCount", aggfunc="sum"), + pairwiseChangeFileCount=pd.NamedAgg(column="filePath", aggfunc="count"), + pairwiseChangeAverageRate=pd.NamedAgg(column="coChangeRate", aggfunc="mean"), + pairwiseChangeMaxLift=pd.NamedAgg(column="maxLift", aggfunc="max"), + pairwiseChangeAverageLift=pd.NamedAgg(column="avgLift", aggfunc="mean"), + ) + .reset_index() + ) + data_to_display = pd.merge( + git_files_with_commit_statistics, + data_to_display, + left_on="directoryPath", + right_on="directoryPath", + how="left", + validate="m:1", + ) + data_to_display["pairwiseChangeCommitCount"] = data_to_display["pairwiseChangeCommitCount"].fillna(0).astype(int) + data_to_display["pairwiseChangeFileCount"] = data_to_display["pairwiseChangeFileCount"].fillna(0).astype(int) + data_to_display["pairwiseChangeAverageRate"] = data_to_display["pairwiseChangeAverageRate"].fillna(0).astype(float) + data_to_display["pairwiseChangeMaxLift"] = data_to_display["pairwiseChangeMaxLift"].fillna(0).astype(float) + data_to_display["pairwiseChangeAverageLift"] = data_to_display["pairwiseChangeAverageLift"].fillna(0).astype(float) + data_to_display = data_to_display.reset_index() + + # 14. Files that likely co-change with others + data_to_display = add_quantile_limited_column(data_to_display, "pairwiseChangeCommitCount", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(data_to_display), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=data_to_display["pairwiseChangeCommitCount_limited"], + colorbar=dict(title="Co-Changes"), + ), + )) + figure.update_layout(**PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, title="Files that likely co-change with others in update commits") + write_image_and_log(figure, report_directory, "CoChangingFiles", verbose) + + # 15. 
Co-changing files max lift + data_to_display = add_quantile_limited_column(data_to_display, "pairwiseChangeMaxLift", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(data_to_display), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=data_to_display["pairwiseChangeMaxLift_limited"], + colorbar=dict(title="Co-Change Lift"), + ), + )) + figure.update_layout( + **PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, + title="Co-Changing files in update commits max lift (1=random, >1=more than random, <1=less than random)", + ) + write_image_and_log(figure, report_directory, "CoChangingFilesMaxLift", verbose) + + # 16. Co-changing files average lift + data_to_display = add_quantile_limited_column(data_to_display, "pairwiseChangeAverageLift", 0.98) + figure = plotly_graph_objects.Figure(plotly_graph_objects.Treemap( + create_treemap_commit_statistics_settings(data_to_display), + marker=dict( + **PLOTLY_TREEMAP_MARKER_BASE_COLORSCALE, + colors=data_to_display["pairwiseChangeAverageLift_limited"], + colorbar=dict(title="Co-Change Lift"), + ), + )) + figure.update_layout( + **PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, + title="Co-Changing files in update commits average lift (1=random, >1=more than random, <1=less than random)", + ) + write_image_and_log(figure, report_directory, "CoChangingFilesAverageLift", verbose) + + +# ── Bar chart: files per commit distribution ────────────────────────────────── + +def generate_files_per_commit_bar_chart( + git_file_count_per_commit: pd.DataFrame, + report_directory: str, + verbose: bool, +) -> None: + if git_file_count_per_commit.empty: + if verbose: + print(f"{SCRIPT_NAME}: Skipping files-per-commit bar chart — no data") + return + figure = plotly_graph_objects.Figure(plotly_graph_objects.Bar( + x=git_file_count_per_commit["filesPerCommit"].head(30), + y=git_file_count_per_commit["commitCount"].head(30), + )) + figure.update_layout( + **PLOTLY_MAIN_LAYOUT_BASE_SETTINGS, + 
title="Changed files per commit",
+        xaxis_title="file count",
+        yaxis_title="commit count",
+    )
+    write_image_and_log(figure, report_directory, "ChangedFilesPerCommit", verbose)
+
+
+# ── Histogram charts: pairwise co-changed files ───────────────────────────────
+
+def add_file_extension_rank_column(data_frame: pd.DataFrame, column_name: str) -> pd.DataFrame:
+    """Adds a dense rank column per file extension pair. Returns the frame unchanged if the column already exists."""
+    if f"{column_name}ExtensionRank" in data_frame.columns:
+        return data_frame
+    data_frame[f"{column_name}ExtensionRank"] = (
+        data_frame.groupby("fileExtensionPair", observed=False)[column_name]
+        .rank(ascending=False, method="dense")
+        .astype(int)
+    )
+    return data_frame
+
+
+def plot_histogram_of_pairwise_changed_files(
+    data_to_plot: pd.DataFrame,
+    top_pairwise_changed_file_extensions: pd.Series,
+    x_axis_column: str,
+    x_axis_label: str,
+    output_file_name: str,
+    report_directory: str,
+    verbose: bool,
+    sub_plot_rows: int = 4,
+    sub_plot_columns: int = 1,
+) -> None:
+    if data_to_plot.empty:
+        print(f"{SCRIPT_NAME}: No data to plot — skipping {output_file_name}")
+        return
+    if top_pairwise_changed_file_extensions.size != sub_plot_rows * sub_plot_columns:
+        raise ValueError(
+            f"Number of top pairwise changed file extensions ({top_pairwise_changed_file_extensions.size}) "
+            f"does not match the number of subplots ({sub_plot_rows * sub_plot_columns})."
+ ) + + figure = make_subplots( + rows=sub_plot_rows, + cols=sub_plot_columns, + subplot_titles=top_pairwise_changed_file_extensions, + vertical_spacing=0.06, + horizontal_spacing=0.04, + ) + for index, extension in enumerate(top_pairwise_changed_file_extensions, start=1): + row = (index - 1) // sub_plot_columns + 1 + column = (index - 1) % sub_plot_columns + 1 + data_for_subplot = data_to_plot[data_to_plot["fileExtensionPair"] == extension] + figure.add_trace( + plotly_graph_objects.Histogram( + x=data_for_subplot[x_axis_column], + text=data_for_subplot["filePairLineBreak"], + textposition="inside", + hovertext=data_for_subplot["filePairWithRelativePath"], + nbinsx=40, + textfont=dict(size=12, color="white"), + name=extension, + ), + row=row, + col=column, + ) + figure.update_xaxes(title_text=x_axis_label, row=row, col=column) + figure.update_yaxes(title_text="File Pair Count (log)", type="log", row=row, col=column) + + figure.update_annotations(font=dict(size=18)) + figure.update_layout( + margin=dict(t=100, l=10, r=10, b=10), + title="Co-Changed Files by their " + x_axis_label.lower(), + title_font_size=20, + title_y=0.99, + bargap=0.05, + height=2000, + width=1000, + showlegend=False, + ) + figure.write_image(**get_plotly_figure_write_image_settings(report_directory, output_file_name)) + if verbose: + print(f"{SCRIPT_NAME}: Chart saved: {os.path.join(report_directory, output_file_name + '.svg')}") + + +def find_top_pairwise_changed_file_extensions(input_data: pd.DataFrame, top_n: int = 10) -> pd.DataFrame: + """Finds the top N pairwise changed file extensions based on pair count.""" + top_extensions = ( + input_data.groupby("fileExtensionPair", observed=False) + .aggregate(fileExtensionPairCount=pd.NamedAgg(column="filePairWithRelativePath", aggfunc="count")) + .reset_index() + ) + return top_extensions.sort_values(by="fileExtensionPairCount", ascending=False).reset_index(drop=True).head(top_n) + + +def generate_pairwise_histograms( + 
pairwise_changed_git_files: pd.DataFrame, + report_directory: str, + verbose: bool, +) -> None: + if pairwise_changed_git_files.empty: + if verbose: + print(f"{SCRIPT_NAME}: Skipping pairwise histograms — no data") + return + + # Find top 4 file extension pairs (matching the original notebook behavior) + top_pairwise_changed_file_extensions_data = find_top_pairwise_changed_file_extensions(pairwise_changed_git_files, top_n=4) + + if len(top_pairwise_changed_file_extensions_data) < 4: + if verbose: + print(f"{SCRIPT_NAME}: Skipping pairwise histograms — fewer than 4 extension pairs " + f"(found {len(top_pairwise_changed_file_extensions_data)})") + return + + pairwise_changed_git_files = pairwise_changed_git_files.merge(top_pairwise_changed_file_extensions_data, on="fileExtensionPair") + top_pairwise_changed_file_extensions = top_pairwise_changed_file_extensions_data["fileExtensionPair"] + pairwise_changed_git_files = pairwise_changed_git_files[pairwise_changed_git_files["fileExtensionPair"].isin(top_pairwise_changed_file_extensions)] + + pairwise_changed_git_files = add_file_extension_rank_column(pairwise_changed_git_files, "updateCommitCount") + pairwise_changed_git_files = add_file_extension_rank_column(pairwise_changed_git_files, "updateCommitMinConfidence") + pairwise_changed_git_files = add_file_extension_rank_column(pairwise_changed_git_files, "updateCommitJaccardSimilarity") + pairwise_changed_git_files = add_file_extension_rank_column(pairwise_changed_git_files, "updateCommitLift") + + plot_histogram_of_pairwise_changed_files( + data_to_plot=pairwise_changed_git_files, + top_pairwise_changed_file_extensions=top_pairwise_changed_file_extensions, + x_axis_column="updateCommitCount", + x_axis_label="Commit Count", + output_file_name="CoChangedFilesByCommitCount", + report_directory=report_directory, + verbose=verbose, + ) + + plot_histogram_of_pairwise_changed_files( + data_to_plot=pairwise_changed_git_files, + 
top_pairwise_changed_file_extensions=top_pairwise_changed_file_extensions, + x_axis_column="updateCommitMinConfidence", + x_axis_label="Commit Min Confidence", + output_file_name="CoChangedFilesByCommitMinConfidence", + report_directory=report_directory, + verbose=verbose, + ) + + plot_histogram_of_pairwise_changed_files( + data_to_plot=pairwise_changed_git_files, + top_pairwise_changed_file_extensions=top_pairwise_changed_file_extensions, + x_axis_column="updateCommitLift", + x_axis_label="Commit Lift", + output_file_name="CoChangedFilesByCommitLift", + report_directory=report_directory, + verbose=verbose, + ) + + plot_histogram_of_pairwise_changed_files( + data_to_plot=pairwise_changed_git_files, + top_pairwise_changed_file_extensions=top_pairwise_changed_file_extensions, + x_axis_column="updateCommitJaccardSimilarity", + x_axis_label="Commit Jaccard Similarity", + output_file_name="CoChangedFilesByCommitJaccardSimilarity", + report_directory=report_directory, + verbose=verbose, + ) + + +# ── Git author wordcloud ────────────────────────────────────────────────────── + +def generate_git_author_wordcloud( + git_author_words_with_frequency: pd.DataFrame, + report_directory: str, + verbose: bool, +) -> None: + if git_author_words_with_frequency.empty: + if verbose: + print(f"{SCRIPT_NAME}: Skipping wordcloud — no data") + return + try: + from wordcloud import WordCloud # type: ignore + except ImportError: + print(f"{SCRIPT_NAME}: Warning: wordcloud library not installed — skipping wordcloud chart") + return + + words_with_frequency_dict = git_author_words_with_frequency.set_index(git_author_words_with_frequency.columns[0]).to_dict()[git_author_words_with_frequency.columns[1]] + wordcloud = WordCloud( + width=800, + height=800, + max_words=600, + collocations=False, + background_color="white", + colormap="viridis", + ).generate_from_frequencies(words_with_frequency_dict) + + plot.figure(figsize=(15, 15)) + plot.imshow(wordcloud, interpolation="bilinear") + 
plot.axis("off") + plot.title("Wordcloud of git authors") + path = os.path.join(report_directory, "GitAuthorWordcloud.svg") + plot.savefig(path, format="svg", bbox_inches="tight") + plot.close() + if verbose: + print(f"{SCRIPT_NAME}: Chart saved: {path}") + + +# ── Main ────────────────────────────────────────────────────────────────────── + +def main() -> None: + params = parse_parameters() + + if params.verbose: + params.log_dependency_versions() + print(params) + + report_directory = params.report_directory + verbose = params.verbose + + if not report_directory: + print(f"{SCRIPT_NAME}: No report directory specified. Use --report_directory.") + sys.exit(1) + + if not os.path.isdir(report_directory): + print(f"{SCRIPT_NAME}: Report directory does not exist: {report_directory} — no git data, skipping chart generation.") + sys.exit(0) + + print(f"{SCRIPT_NAME}: Generating charts in {report_directory}") + + # Load commit statistics CSV (primary data source) + commit_statistics = load_csv(report_directory, "List_git_files_with_commit_statistics_by_author.csv", verbose) + if commit_statistics.empty: + print(f"{SCRIPT_NAME}: Primary CSV not found or empty — no git data, skipping chart generation.") + sys.exit(0) + + # Prepare hierarchical directory data + git_files_with_commit_statistics, git_file_authors, git_file_extensions = prepare_directory_commit_statistics(commit_statistics) + + if git_files_with_commit_statistics.empty: + print(f"{SCRIPT_NAME}: No directory statistics available — skipping treemap charts.") + else: + print(f"{SCRIPT_NAME}: Generating directory commit statistic treemaps...") + generate_directory_commit_statistic_treemaps( + git_files_with_commit_statistics, git_file_authors, git_file_extensions, report_directory, verbose + ) + + # Co-change treemaps require the files-changed-together CSV + cochange_data = load_csv(report_directory, "List_git_files_that_were_changed_together_with_another_file.csv", verbose) + if not cochange_data.empty: + 
print(f"{SCRIPT_NAME}: Generating co-change treemaps...") + generate_cochange_treemaps(git_files_with_commit_statistics, cochange_data, report_directory, verbose) + + # Files per commit bar chart + git_file_count_per_commit = load_csv(report_directory, "List_git_files_per_commit_distribution.csv", verbose) + if not git_file_count_per_commit.empty: + print(f"{SCRIPT_NAME}: Generating files-per-commit bar chart...") + generate_files_per_commit_bar_chart(git_file_count_per_commit, report_directory, verbose) + + # Pairwise histograms use the full pairwise changed files list (same as the original notebook) + pairwise_changed_git_files = load_csv(report_directory, "List_pairwise_changed_files.csv", verbose) + if not pairwise_changed_git_files.empty: + print(f"{SCRIPT_NAME}: Generating pairwise changed files histograms...") + generate_pairwise_histograms(pairwise_changed_git_files, report_directory, verbose) + + # Wordcloud + git_author_words_with_frequency = load_csv(report_directory, "Words_for_git_author_Wordcloud_with_frequency.csv", verbose) + if not git_author_words_with_frequency.empty: + print(f"{SCRIPT_NAME}: Generating git author wordcloud...") + generate_git_author_wordcloud(git_author_words_with_frequency, report_directory, verbose) + + print(f"{SCRIPT_NAME}: Chart generation complete.") + + +if __name__ == "__main__": + main() diff --git a/domains/git-history/gitHistoryCsv.sh b/domains/git-history/gitHistoryCsv.sh new file mode 100755 index 000000000..4e3146280 --- /dev/null +++ b/domains/git-history/gitHistoryCsv.sh @@ -0,0 +1,107 @@ +#!/usr/bin/env bash + +# Executes GitLog Cypher statistics queries to produce git history CSV reports. +# It covers directory commit statistics, co-changed files, pairwise changed file metrics, and data quality. +# It requires an already running Neo4j graph database with already imported git history data. +# The results will be written into the sub directory reports/git-history. +# Dynamically triggered by "CsvReports.sh". 
+
+# Note that "scripts/prepareAnalysis.sh" is required to run prior to this script.
+# Note that git history data must be imported before this script runs (see import/importGit.sh or the
+# jQAssistant git plugin). If no git history is present, all queries return empty results and
+# cleanupAfterReportGeneration.sh will remove the empty CSV files — no report directory is created.
+
+# Requires executeQueryFunctions.sh, cleanupAfterReportGeneration.sh
+
+# Fail on any error ("-e" = exit on first error, "-o pipefail" = exit on errors within piped commands)
+set -o errexit -o pipefail
+
+# Overrideable Constants (defaults also defined in sub scripts)
+REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"}
+
+## Get this "domains/git-history" directory if not already set
+# Even if $BASH_SOURCE is made for Bourne-like shells it is also supported by others and therefore here the preferred solution.
+# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes.
+# This way non-standard tools like readlink aren't needed.
+GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR:-$(CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)}
+echo "gitHistoryCsv: GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR}"
+
+# Get the "scripts" directory by navigating two levels up from this domain directory.
+SCRIPTS_DIR=${SCRIPTS_DIR:-"${GIT_HISTORY_SCRIPT_DIR}/../../scripts"}
+
+# Cypher query directories within this domain
+STATISTICS_CYPHER_DIR="${GIT_HISTORY_SCRIPT_DIR}/queries/statistics"
+
+# Define functions to execute a cypher query from within a given file like "execute_cypher"
+source "${SCRIPTS_DIR}/executeQueryFunctions.sh"
+
+# Create main report directory
+REPORT_NAME="git-history"
+FULL_REPORT_DIRECTORY="${REPORTS_DIRECTORY}/${REPORT_NAME}"
+mkdir -p "${FULL_REPORT_DIRECTORY}"
+
+echo "gitHistoryCsv: $(date +'%Y-%m-%dT%H:%M:%S%z') Processing git history..."
+ +# ── Detailed file commit statistics ────────────────────────────────────────── + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_git_files_with_commit_statistics_by_author.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_git_files_with_commit_statistics_by_author.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_git_files_that_were_changed_together_with_another_file.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_git_files_that_were_changed_together_with_another_file.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_git_file_directories_with_commit_statistics.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_git_file_directories_with_commit_statistics.csv" + +# ── Files per commit distribution ──────────────────────────────────────────── + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_git_files_per_commit_distribution.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_git_files_per_commit_distribution.csv" + +# ── Pairwise changed files ──────────────────────────────────────────────────── + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_with_dependencies.cypher" \ + > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_with_dependencies.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" \ + "selected_pair_metric=updateCommitCount" \ + > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_count.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" \ + "selected_pair_metric=updateCommitMinConfidence" \ + > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_min_confidence.csv" + +execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" \ + "selected_pair_metric=updateCommitJaccardSimilarity" \ + > 
"${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_jaccard.csv"
+
+execute_cypher "${STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" \
+    "selected_pair_metric=updateCommitLift" \
+    > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_lift.csv"
+
+# ── Data quality ──────────────────────────────────────────────────────────────
+
+execute_cypher "${STATISTICS_CYPHER_DIR}/List_git_files_by_resolved_label_and_extension.cypher" \
+    > "${FULL_REPORT_DIRECTORY}/List_git_files_by_resolved_label_and_extension.csv"
+
+execute_cypher "${STATISTICS_CYPHER_DIR}/List_ambiguous_git_files.cypher" \
+    > "${FULL_REPORT_DIRECTORY}/List_ambiguous_git_files.csv"
+
+execute_cypher "${STATISTICS_CYPHER_DIR}/List_unresolved_git_files.cypher" \
+    > "${FULL_REPORT_DIRECTORY}/List_unresolved_git_files.csv"
+
+# ── Wordcloud data ────────────────────────────────────────────────────────────
+
+execute_cypher "${STATISTICS_CYPHER_DIR}/Words_for_git_author_Wordcloud_with_frequency.cypher" \
+    > "${FULL_REPORT_DIRECTORY}/Words_for_git_author_Wordcloud_with_frequency.csv"
+
+# ── Cleanup ───────────────────────────────────────────────────────────────────
+
+# Clean-up after report generation. Empty reports will be deleted.
+# If all queries returned empty results (no git data), the report directory will be removed entirely.
+source "${SCRIPTS_DIR}/cleanupAfterReportGeneration.sh" "${FULL_REPORT_DIRECTORY}"
+
+echo "gitHistoryCsv: $(date +'%Y-%m-%dT%H:%M:%S%z') Successfully finished."
diff --git a/domains/git-history/gitHistoryMarkdown.sh b/domains/git-history/gitHistoryMarkdown.sh
new file mode 100755
index 000000000..e65a37c80
--- /dev/null
+++ b/domains/git-history/gitHistoryMarkdown.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+
+# This script is dynamically triggered by "MarkdownReports.sh" when the "All" or "Markdown" report is enabled.
+# It is designed as an entry point and delegates the execution to the dedicated "gitHistorySummary.sh" script that does the "heavy lifting".
+
+# Note that "scripts/prepareAnalysis.sh" is required to run prior to this script.
+
+# Requires gitHistorySummary.sh
+
+# Fail on any error ("-e" = exit on first error, "-o pipefail" = exit on errors within piped commands)
+set -o errexit -o pipefail
+
+# Overrideable Constants (defaults also defined in sub scripts)
+REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"}
+
+## Get this "domains/git-history" directory if not already set
+# Even if $BASH_SOURCE is made for Bourne-like shells it is also supported by others and therefore here the preferred solution.
+# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes.
+# This way non-standard tools like readlink aren't needed.
+GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR:-$(CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)}
+# echo "gitHistoryMarkdown: GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR}"
+
+# Get the "summary" directory by taking the path of this script and selecting "summary".
+GIT_HISTORY_SUMMARY_DIR=${GIT_HISTORY_SUMMARY_DIR:-"${GIT_HISTORY_SCRIPT_DIR}/summary"} # Contains everything (scripts, templates) to create the Markdown summary report
+
+# Delegate the execution to the responsible script.
+source "${GIT_HISTORY_SUMMARY_DIR}/gitHistorySummary.sh"
diff --git a/domains/git-history/gitHistoryPython.sh b/domains/git-history/gitHistoryPython.sh
new file mode 100755
index 000000000..18c5d4a12
--- /dev/null
+++ b/domains/git-history/gitHistoryPython.sh
@@ -0,0 +1,81 @@
+#!/usr/bin/env bash
+
+# Generates git history charts as SVG files using Python.
+# If the required CSV data files are not yet present, it automatically runs "gitHistoryCsv.sh" first.
+# The results will be written into the sub directory reports/git-history.
+# Dynamically triggered by "PythonReports.sh".
+
+# Note that "scripts/prepareAnalysis.sh" is required to run prior to this script.
+# Note that "gitHistoryCsv.sh" is called automatically if the CSV files are not yet present
+# (e.g. when running '--report Python' standalone). When running '--report All', the CSV step
+# has already run via CsvReports.sh, so no duplicate work is done.
+# If no git history data is present (report directory missing), this script exits cleanly.
+
+# Requires gitHistoryCharts.py
+
+# Fail on any error ("-e" = exit on first error, "-o pipefail" = exit on errors within piped commands)
+set -o errexit -o pipefail
+
+# Overrideable Constants (defaults also defined in sub scripts)
+REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"}
+
+## Get this "domains/git-history" directory if not already set
+# Even if $BASH_SOURCE is made for Bourne-like shells it is also supported by others and therefore here the preferred solution.
+# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes.
+# This way non-standard tools like readlink aren't needed.
+GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR:-$(CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)}
+echo "gitHistoryPython: GIT_HISTORY_SCRIPT_DIR=${GIT_HISTORY_SCRIPT_DIR}"
+
+# Get the "scripts" directory by navigating two levels up from this domain directory.
+SCRIPTS_DIR=${SCRIPTS_DIR:-"${GIT_HISTORY_SCRIPT_DIR}/../../scripts"} + +# Function to display script usage +usage() { + echo -e "${COLOR_ERROR}" >&2 + echo "Usage: $0 [--verbose]" >&2 + echo -e "${COLOR_DEFAULT}" >&2 + exit 1 +} + +# Default values +verboseMode="" # either "" or "--verbose" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + key="$1" + + case ${key} in + --verbose) + verboseMode="--verbose" + ;; + *) + echo -e "${COLOR_ERROR}gitHistoryPython: Error: Unknown option: ${key}${COLOR_DEFAULT}" >&2 + usage + ;; + esac + shift || true # ignore error when there are no more arguments +done + +# Report directory +REPORT_NAME="git-history" +FULL_REPORT_DIRECTORY="${REPORTS_DIRECTORY}/${REPORT_NAME}" +mkdir -p "${FULL_REPORT_DIRECTORY}" + +echo "gitHistoryPython: $(date +'%Y-%m-%dT%H:%M:%S%z') Starting git history chart generation..." + +# If the primary CSV is missing, generate CSVs now so this report type is self-contained. +# When running '--report All', CsvReports.sh already ran, so this is a no-op in that case. +PRIMARY_CSV="${FULL_REPORT_DIRECTORY}/List_git_files_with_commit_statistics_by_author.csv" +if [ ! -f "${PRIMARY_CSV}" ]; then + echo "gitHistoryPython: Primary CSV not found — running gitHistoryCsv.sh first to generate CSV data." + source "${GIT_HISTORY_SCRIPT_DIR}/gitHistoryCsv.sh" +fi + +time python "${GIT_HISTORY_SCRIPT_DIR}/gitHistoryCharts.py" \ + --report_directory "${FULL_REPORT_DIRECTORY}" \ + ${verboseMode} + +# Clean-up after report generation. Empty reports will be deleted. +source "${SCRIPTS_DIR}/cleanupAfterReportGeneration.sh" "${FULL_REPORT_DIRECTORY}" + +echo "gitHistoryPython: $(date +'%Y-%m-%dT%H:%M:%S%z') Successfully finished." 
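The "no git data" behavior described in the script header (exit cleanly when the report directory or CSVs are missing) can be sketched as a small guard. This is a minimal illustration under stated assumptions: the helper name `skip_if_no_git_data` and the demo path are hypothetical stand-ins, and the actual skip logic lives in `gitHistoryCharts.py` rather than in the shell entry point.

```shell
#!/usr/bin/env bash
# Sketch of a "no git data" guard (assumed helper name and demo path;
# not part of the actual domain scripts).
skip_if_no_git_data() {
  local primary_csv="$1"
  if [ ! -f "${primary_csv}" ]; then
    # No git history was imported for this codebase: skip, do not fail.
    echo "gitHistoryPython: No git history data found - skipping chart generation."
    return 0 # skip
  fi
  return 1 # data present, proceed with chart generation
}

# Usage example with a path that intentionally does not exist:
if skip_if_no_git_data "/nonexistent/reports/git-history/List_git_files_with_commit_statistics_by_author.csv"; then
  charts_generated="no"
else
  charts_generated="yes"
fi
echo "charts_generated=${charts_generated}"
```

Keeping the guard a plain function (return code instead of `exit`) makes it composable with `set -o errexit`, since return values tested in an `if` condition do not trigger an abort.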
diff --git a/scripts/createAggregatedGitLogCsv.sh b/domains/git-history/import/createAggregatedGitLogCsv.sh similarity index 100% rename from scripts/createAggregatedGitLogCsv.sh rename to domains/git-history/import/createAggregatedGitLogCsv.sh diff --git a/scripts/createGitLogCsv.sh b/domains/git-history/import/createGitLogCsv.sh similarity index 100% rename from scripts/createGitLogCsv.sh rename to domains/git-history/import/createGitLogCsv.sh diff --git a/scripts/importGit.sh b/domains/git-history/import/importGit.sh similarity index 87% rename from scripts/importGit.sh rename to domains/git-history/import/importGit.sh index 09342689e..489c9d818 100755 --- a/scripts/importGit.sh +++ b/domains/git-history/import/importGit.sh @@ -53,14 +53,20 @@ echo "importGit: source directory to look for git repositories=${source}" # Even if $BASH_SOURCE is made for Bourne-like shells it is also supported by others and therefore here the preferred solution. # CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes. # This way non-standard tools like readlink aren't needed. -SCRIPTS_DIR=${SCRIPTS_DIR:-$( CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )} # Repository directory containing the shell scripts +GIT_HISTORY_IMPORT_DIR=${GIT_HISTORY_IMPORT_DIR:-$( CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )} # This "import" directory +echo "importGit: GIT_HISTORY_IMPORT_DIR=${GIT_HISTORY_IMPORT_DIR}" + +# Get the central "scripts" directory by navigating three levels up from this domain's import directory. +SCRIPTS_DIR=${SCRIPTS_DIR:-"${GIT_HISTORY_IMPORT_DIR}/../../../scripts"} echo "importGit: SCRIPTS_DIR=${SCRIPTS_DIR}" -# Get the "cypher" directory by taking the path of this script and going two directory up and then to "cypher". -CYPHER_DIR=${CYPHER_DIR:-"${SCRIPTS_DIR}/../cypher"} -echo "importGit: CYPHER_DIR=${CYPHER_DIR}" +# Cypher enrichment queries are in this domain's queries/enrichment directory. 
+GIT_LOG_CYPHER_DIR="${GIT_HISTORY_IMPORT_DIR}/../queries/enrichment" +echo "importGit: GIT_LOG_CYPHER_DIR=${GIT_LOG_CYPHER_DIR}" -GIT_LOG_CYPHER_DIR="${CYPHER_DIR}/GitLog" +# Cypher validation queries are in this domain's queries/validation directory. +GIT_LOG_VALIDATION_CYPHER_DIR="${GIT_HISTORY_IMPORT_DIR}/../queries/validation" +echo "importGit: GIT_LOG_VALIDATION_CYPHER_DIR=${GIT_LOG_VALIDATION_CYPHER_DIR}" # Define functions (like execute_cypher and execute_cypher_summarized) to execute cypher queries from within a given file source "${SCRIPTS_DIR}/executeQueryFunctions.sh" @@ -137,12 +143,12 @@ commonPostGitImport() { # Since it's currently not possible to rule out ambiguity in git<->code file matching, # the following verifications are only an additional info in the log rather than an error. echo "importGit: Running verification queries for troubleshooting (non failing)..." - execute_cypher "${GIT_LOG_CYPHER_DIR}/Verify_git_to_code_file_unambiguous.cypher" - execute_cypher "${GIT_LOG_CYPHER_DIR}/Verify_code_to_git_file_unambiguous.cypher" - execute_cypher "${GIT_LOG_CYPHER_DIR}/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher" + execute_cypher "${GIT_LOG_VALIDATION_CYPHER_DIR}/Verify_git_to_code_file_unambiguous.cypher" + execute_cypher "${GIT_LOG_VALIDATION_CYPHER_DIR}/Verify_code_to_git_file_unambiguous.cypher" + execute_cypher "${GIT_LOG_VALIDATION_CYPHER_DIR}/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher" - dataVerificationResult=$( execute_cypher "${GIT_LOG_CYPHER_DIR}/Verify_git_missing_create_date.cypher") - if ! is_csv_column_greater_zero "${dataVerificationResult}" "numberOfMissingCreateDateEntires"; then + dataVerificationResult=$( execute_cypher "${GIT_LOG_VALIDATION_CYPHER_DIR}/Verify_git_missing_create_date.cypher") + if ! is_csv_column_greater_zero "${dataVerificationResult}" "numberOfMissingCreateDateEntries"; then # Warning: The git file creation date must not be missing. 
However, this is not important enough to stop the analysis. # Therefore, it will only be a warning and subsequent queries will use a default date in these cases. echo -e "${COLOR_YELLOW}importGit: Data verification warning: Git:File nodes with missing createdAtEpoch property detected! Affected number of nodes:${COLOR_DEFAULT}" @@ -216,12 +222,12 @@ if [ ! "${IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT}" = "none" ] && [ ! "${IMPORT if [ "${IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT}" = "aggregated" ]; then # Import pre-aggregated git log data (no single commits) when IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT = "aggregated" - (cd "${repository}" && source "${SCRIPTS_DIR}/createAggregatedGitLogCsv.sh" "${NEO4J_FULL_IMPORT_DIRECTORY}/aggregatedGitLog.csv") + (cd "${repository}" && source "${GIT_HISTORY_IMPORT_DIR}/createAggregatedGitLogCsv.sh" "${NEO4J_FULL_IMPORT_DIRECTORY}/aggregatedGitLog.csv") importAggregatedGitLog "git_repository_absolute_directory_name=${full_repository_path}" postAggregatedGitLogImport else # Import git log data with every commit when IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT = "full" (default) - (cd "${repository}" && source "${SCRIPTS_DIR}/createGitLogCsv.sh" "${NEO4J_FULL_IMPORT_DIRECTORY}/gitLog.csv") + (cd "${repository}" && source "${GIT_HISTORY_IMPORT_DIR}/createGitLogCsv.sh" "${NEO4J_FULL_IMPORT_DIRECTORY}/gitLog.csv") importGitLog "git_repository_absolute_directory_name=${full_repository_path}" postGitLogImport fi diff --git a/cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher b/domains/git-history/queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher similarity index 100% rename from cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher rename to domains/git-history/queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_code_files.cypher diff --git a/cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher 
b/domains/git-history/queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher similarity index 100% rename from cypher/GitLog/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher rename to domains/git-history/queries/enrichment/Add_CHANGED_TOGETHER_WITH_relationships_to_git_files.cypher diff --git a/cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher b/domains/git-history/queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher similarity index 100% rename from cypher/GitLog/Add_HAS_PARENT_relationships_to_commits.cypher rename to domains/git-history/queries/enrichment/Add_HAS_PARENT_relationships_to_commits.cypher diff --git a/cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher b/domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher similarity index 100% rename from cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher rename to domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher diff --git a/cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher b/domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher similarity index 100% rename from cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher rename to domains/git-history/queries/enrichment/Add_RESOLVES_TO_relationships_to_git_files_for_Typescript.cypher diff --git a/cypher/GitLog/Create_git_repository_node.cypher b/domains/git-history/queries/enrichment/Create_git_repository_node.cypher similarity index 100% rename from cypher/GitLog/Create_git_repository_node.cypher rename to domains/git-history/queries/enrichment/Create_git_repository_node.cypher diff --git a/cypher/GitLog/Delete_git_log_data.cypher b/domains/git-history/queries/enrichment/Delete_git_log_data.cypher similarity index 100% rename from cypher/GitLog/Delete_git_log_data.cypher 
rename to domains/git-history/queries/enrichment/Delete_git_log_data.cypher diff --git a/cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher b/domains/git-history/queries/enrichment/Delete_plain_git_directory_file_nodes.cypher similarity index 100% rename from cypher/GitLog/Delete_plain_git_directory_file_nodes.cypher rename to domains/git-history/queries/enrichment/Delete_plain_git_directory_file_nodes.cypher diff --git a/cypher/GitLog/Import_aggregated_git_log_csv_data.cypher b/domains/git-history/queries/enrichment/Import_aggregated_git_log_csv_data.cypher similarity index 87% rename from cypher/GitLog/Import_aggregated_git_log_csv_data.cypher rename to domains/git-history/queries/enrichment/Import_aggregated_git_log_csv_data.cypher index 2db4a4956..c96236e7f 100644 --- a/cypher/GitLog/Import_aggregated_git_log_csv_data.cypher +++ b/domains/git-history/queries/enrichment/Import_aggregated_git_log_csv_data.cypher @@ -1,4 +1,4 @@ -// Import aggregated git log CSV data with the following schema: (Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS]->(Git:Log:File) , (Git:Repository)-[:HAS_CHANGE_SPAN]->(Git:Log:ChangeSpan) , (Git:Repository)-[:HAS_AUTHER]->(Git:Log:Auther) , (Git:Repository)-[:HAS_FILE]->(Git:Log:File). Variables: git_repository_absolute_directory_name +// Import aggregated git log CSV data with the following schema: (Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS]->(Git:Log:File) , (Git:Repository)-[:HAS_CHANGE_SPAN]->(Git:Log:ChangeSpan) , (Git:Repository)-[:HAS_AUTHOR]->(Git:Log:Author) , (Git:Repository)-[:HAS_FILE]->(Git:Log:File). 
Variables: git_repository_absolute_directory_name LOAD CSV WITH HEADERS FROM "file:///aggregatedGitLog.csv" AS row CALL { WITH row @@ -13,8 +13,8 @@ CALL { WITH row MERGE (git_author)-[:AUTHORED]->(git_change_span) MERGE (git_change_span)-[:CONTAINS_CHANGED]->(git_file) MERGE (git_repository)-[:HAS_CHANGE_SPAN]->(git_change_span) - MERGE (git_repository)-[:HAS_AUTHOR]->(git_file) - MERGE (git_repository)-[:HAS_FILE]->(git_author) + MERGE (git_repository)-[:HAS_AUTHOR]->(git_author) + MERGE (git_repository)-[:HAS_FILE]->(git_file) } IN TRANSACTIONS OF 1000 ROWS RETURN count(DISTINCT row.author) AS numberOfAuthors ,count(DISTINCT row.filename) AS numberOfFiles diff --git a/cypher/GitLog/Import_git_log_csv_data.cypher b/domains/git-history/queries/enrichment/Import_git_log_csv_data.cypher similarity index 100% rename from cypher/GitLog/Import_git_log_csv_data.cypher rename to domains/git-history/queries/enrichment/Import_git_log_csv_data.cypher diff --git a/cypher/GitLog/Index_absolute_file_name.cypher b/domains/git-history/queries/enrichment/Index_absolute_file_name.cypher similarity index 100% rename from cypher/GitLog/Index_absolute_file_name.cypher rename to domains/git-history/queries/enrichment/Index_absolute_file_name.cypher diff --git a/cypher/GitLog/Index_author_name.cypher b/domains/git-history/queries/enrichment/Index_author_name.cypher similarity index 100% rename from cypher/GitLog/Index_author_name.cypher rename to domains/git-history/queries/enrichment/Index_author_name.cypher diff --git a/cypher/GitLog/Index_change_span_year.cypher b/domains/git-history/queries/enrichment/Index_change_span_year.cypher similarity index 100% rename from cypher/GitLog/Index_change_span_year.cypher rename to domains/git-history/queries/enrichment/Index_change_span_year.cypher diff --git a/cypher/GitLog/Index_commit_hash.cypher b/domains/git-history/queries/enrichment/Index_commit_hash.cypher similarity index 100% rename from cypher/GitLog/Index_commit_hash.cypher rename to 
domains/git-history/queries/enrichment/Index_commit_hash.cypher diff --git a/cypher/GitLog/Index_commit_parent.cypher b/domains/git-history/queries/enrichment/Index_commit_parent.cypher similarity index 100% rename from cypher/GitLog/Index_commit_parent.cypher rename to domains/git-history/queries/enrichment/Index_commit_parent.cypher diff --git a/domains/git-history/queries/enrichment/Index_commit_sha.cypher b/domains/git-history/queries/enrichment/Index_commit_sha.cypher new file mode 100644 index 000000000..3a1540e4c --- /dev/null +++ b/domains/git-history/queries/enrichment/Index_commit_sha.cypher @@ -0,0 +1,3 @@ +// Create index for git commit sha + +CREATE INDEX INDEX_COMMIT_SHA IF NOT EXISTS FOR (commit:Commit) ON (commit.sha) \ No newline at end of file diff --git a/cypher/GitLog/Index_file_name.cypher b/domains/git-history/queries/enrichment/Index_file_name.cypher similarity index 100% rename from cypher/GitLog/Index_file_name.cypher rename to domains/git-history/queries/enrichment/Index_file_name.cypher diff --git a/domains/git-history/queries/enrichment/Index_file_relative_path.cypher b/domains/git-history/queries/enrichment/Index_file_relative_path.cypher new file mode 100644 index 000000000..971335295 --- /dev/null +++ b/domains/git-history/queries/enrichment/Index_file_relative_path.cypher @@ -0,0 +1,3 @@ +// Create index for the relative file path + +CREATE INDEX INDEX_FILE_RELATIVE_PATH IF NOT EXISTS FOR (file:File) ON (file.relativePath) \ No newline at end of file diff --git a/cypher/GitLog/Set_commit_classification_properties.cypher b/domains/git-history/queries/enrichment/Set_commit_classification_properties.cypher similarity index 95% rename from cypher/GitLog/Set_commit_classification_properties.cypher rename to domains/git-history/queries/enrichment/Set_commit_classification_properties.cypher index 988eac727..2fb87ba89 100644 --- a/cypher/GitLog/Set_commit_classification_properties.cypher +++ 
b/domains/git-history/queries/enrichment/Set_commit_classification_properties.cypher @@ -1,4 +1,4 @@ -// Classify git commits and set properties like isMergeCommit, isAutomationCommit (=isBotCommit or isMavenCommit). +// Classify git commits and set properties like isMergeCommit, isAutomatedCommit (=isBotAuthor or isMavenCommit). MATCH (git_commit:Git:Commit) WITH git_commit, diff --git a/cypher/GitLog/Set_number_of_aggregated_git_commits.cypher b/domains/git-history/queries/enrichment/Set_number_of_aggregated_git_commits.cypher similarity index 100% rename from cypher/GitLog/Set_number_of_aggregated_git_commits.cypher rename to domains/git-history/queries/enrichment/Set_number_of_aggregated_git_commits.cypher diff --git a/cypher/GitLog/Set_number_of_git_log_commits.cypher b/domains/git-history/queries/enrichment/Set_number_of_git_log_commits.cypher similarity index 100% rename from cypher/GitLog/Set_number_of_git_log_commits.cypher rename to domains/git-history/queries/enrichment/Set_number_of_git_log_commits.cypher diff --git a/cypher/GitLog/Set_number_of_git_plugin_commits.cypher b/domains/git-history/queries/enrichment/Set_number_of_git_plugin_commits.cypher similarity index 100% rename from cypher/GitLog/Set_number_of_git_plugin_commits.cypher rename to domains/git-history/queries/enrichment/Set_number_of_git_plugin_commits.cypher diff --git a/cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher b/domains/git-history/queries/enrichment/Set_number_of_git_plugin_update_commits.cypher similarity index 100% rename from cypher/GitLog/Set_number_of_git_plugin_update_commits.cypher rename to domains/git-history/queries/enrichment/Set_number_of_git_plugin_update_commits.cypher diff --git a/cypher/GitLog/List_ambiguous_git_files.cypher b/domains/git-history/queries/statistics/List_ambiguous_git_files.cypher similarity index 93% rename from cypher/GitLog/List_ambiguous_git_files.cypher rename to 
domains/git-history/queries/statistics/List_ambiguous_git_files.cypher index a394b3fc2..d6d7bd7e4 100644 --- a/cypher/GitLog/List_ambiguous_git_files.cypher +++ b/domains/git-history/queries/statistics/List_ambiguous_git_files.cypher @@ -1,4 +1,4 @@ -// List ambigiously resolved git files where a single git file is attached to more than one code file for troubleshooting/testing. +// List ambiguously resolved git files where a single git file is attached to more than one code file for troubleshooting/testing. MATCH (file:File&!Git)<-[:RESOLVES_TO]-(git_file:File&Git) OPTIONAL MATCH (artifact:Artifact:Archive)-[:CONTAINS_CHANGED]->(file) diff --git a/cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher b/domains/git-history/queries/statistics/List_git_file_directories_with_commit_statistics.cypher similarity index 100% rename from cypher/GitLog/List_git_file_directories_with_commit_statistics.cypher rename to domains/git-history/queries/statistics/List_git_file_directories_with_commit_statistics.cypher diff --git a/cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher b/domains/git-history/queries/statistics/List_git_files_by_resolved_label_and_extension.cypher similarity index 100% rename from cypher/GitLog/List_git_files_by_resolved_label_and_extension.cypher rename to domains/git-history/queries/statistics/List_git_files_by_resolved_label_and_extension.cypher diff --git a/cypher/GitLog/List_git_files_per_commit_distribution.cypher b/domains/git-history/queries/statistics/List_git_files_per_commit_distribution.cypher similarity index 100% rename from cypher/GitLog/List_git_files_per_commit_distribution.cypher rename to domains/git-history/queries/statistics/List_git_files_per_commit_distribution.cypher diff --git a/cypher/GitLog/List_git_files_that_were_changed_together.cypher b/domains/git-history/queries/statistics/List_git_files_that_were_changed_together.cypher similarity index 100% rename from 
cypher/GitLog/List_git_files_that_were_changed_together.cypher rename to domains/git-history/queries/statistics/List_git_files_that_were_changed_together.cypher diff --git a/cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher b/domains/git-history/queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher similarity index 100% rename from cypher/GitLog/List_git_files_that_were_changed_together_all_in_one.cypher rename to domains/git-history/queries/statistics/List_git_files_that_were_changed_together_all_in_one.cypher diff --git a/cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher b/domains/git-history/queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher similarity index 100% rename from cypher/GitLog/List_git_files_that_were_changed_together_with_another_file.cypher rename to domains/git-history/queries/statistics/List_git_files_that_were_changed_together_with_another_file.cypher diff --git a/cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher b/domains/git-history/queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher similarity index 100% rename from cypher/GitLog/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher rename to domains/git-history/queries/statistics/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher diff --git a/cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher b/domains/git-history/queries/statistics/List_git_files_with_commit_statistics_by_author.cypher similarity index 100% rename from cypher/GitLog/List_git_files_with_commit_statistics_by_author.cypher rename to domains/git-history/queries/statistics/List_git_files_with_commit_statistics_by_author.cypher diff --git a/cypher/GitLog/List_pairwise_changed_files.cypher b/domains/git-history/queries/statistics/List_pairwise_changed_files.cypher 
similarity index 100% rename from cypher/GitLog/List_pairwise_changed_files.cypher rename to domains/git-history/queries/statistics/List_pairwise_changed_files.cypher diff --git a/cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher b/domains/git-history/queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher similarity index 100% rename from cypher/GitLog/List_pairwise_changed_files_top_selected_metric.cypher rename to domains/git-history/queries/statistics/List_pairwise_changed_files_top_selected_metric.cypher diff --git a/cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher b/domains/git-history/queries/statistics/List_pairwise_changed_files_with_dependencies.cypher similarity index 100% rename from cypher/GitLog/List_pairwise_changed_files_with_dependencies.cypher rename to domains/git-history/queries/statistics/List_pairwise_changed_files_with_dependencies.cypher diff --git a/cypher/GitLog/List_unresolved_git_files.cypher b/domains/git-history/queries/statistics/List_unresolved_git_files.cypher similarity index 100% rename from cypher/GitLog/List_unresolved_git_files.cypher rename to domains/git-history/queries/statistics/List_unresolved_git_files.cypher diff --git a/domains/git-history/queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher b/domains/git-history/queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher new file mode 100644 index 000000000..691f328c1 --- /dev/null +++ b/domains/git-history/queries/statistics/Words_for_git_author_Wordcloud_with_frequency.cypher @@ -0,0 +1,6 @@ +// Wordcloud of git authors and their commit count + + MATCH (author:Git:Author)-[:COMMITTED]-(commit:Git:Commit) + WHERE NOT author.name CONTAINS '[bot]' + AND size(author.name) > 1 +RETURN author.name AS word, count(commit) AS frequency \ No newline at end of file diff --git a/cypher/Validation/ValidateGitHistory.cypher b/domains/git-history/queries/validation/ValidateGitHistory.cypher 
similarity index 100% rename from cypher/Validation/ValidateGitHistory.cypher rename to domains/git-history/queries/validation/ValidateGitHistory.cypher diff --git a/cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher b/domains/git-history/queries/validation/Verify_code_to_git_file_unambiguous.cypher similarity index 100% rename from cypher/GitLog/Verify_code_to_git_file_unambiguous.cypher rename to domains/git-history/queries/validation/Verify_code_to_git_file_unambiguous.cypher diff --git a/cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher b/domains/git-history/queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher similarity index 100% rename from cypher/GitLog/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher rename to domains/git-history/queries/validation/Verify_git_missing_CHANGED_TOGETHER_WITH_properties.cypher diff --git a/cypher/GitLog/Verify_git_missing_create_date.cypher b/domains/git-history/queries/validation/Verify_git_missing_create_date.cypher similarity index 79% rename from cypher/GitLog/Verify_git_missing_create_date.cypher rename to domains/git-history/queries/validation/Verify_git_missing_create_date.cypher index 699993378..b74d7807a 100644 --- a/cypher/GitLog/Verify_git_missing_create_date.cypher +++ b/domains/git-history/queries/validation/Verify_git_missing_create_date.cypher @@ -3,4 +3,4 @@ MATCH (git_repository:Git&Repository)-[:HAS_FILE]->(git_file:Git&File&!Repository) WHERE git_file.deletedAt IS NULL // Ignore deleted git files AND git_file.createdAtEpoch IS NULL -RETURN count(DISTINCT git_file) AS numberOfMissingCreateDateEntires \ No newline at end of file +RETURN count(DISTINCT git_file) AS numberOfMissingCreateDateEntries \ No newline at end of file diff --git a/cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher b/domains/git-history/queries/validation/Verify_git_to_code_file_unambiguous.cypher similarity index 100% rename from 
cypher/GitLog/Verify_git_to_code_file_unambiguous.cypher rename to domains/git-history/queries/validation/Verify_git_to_code_file_unambiguous.cypher diff --git a/domains/git-history/summary/gitHistorySummary.sh b/domains/git-history/summary/gitHistorySummary.sh new file mode 100644 index 000000000..a3ab42689 --- /dev/null +++ b/domains/git-history/summary/gitHistorySummary.sh @@ -0,0 +1,251 @@ +#!/usr/bin/env bash + +# Creates a Markdown report summarising all git history analysis results. +# It requires a running Neo4j graph database with git log data already imported. +# The results will be written into the subdirectory reports/git-history. +# Dynamically triggered by "MarkdownReports.sh" via "gitHistoryMarkdown.sh". + +# Note that "scripts/prepareAnalysis.sh" is required to run prior to this script. +# Note that "gitHistoryCsv.sh" is required to run prior to this script. + +# Requires executeQueryFunctions.sh, cleanupAfterReportGeneration.sh + +# Fail on any error ("-e" = exit on first error, "-o pipefail" exits on errors within piped commands) +set -o errexit -o pipefail + +# Overrideable Constants (defaults also defined in sub scripts) +REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"} +MARKDOWN_INCLUDES_DIRECTORY=${MARKDOWN_INCLUDES_DIRECTORY:-"includes"} # Subdirectory that contains Markdown files to be included by the Markdown template for the report. + +## Get this "domains/git-history/summary" directory if not already set
+# Even though $BASH_SOURCE is made for Bourne-like shells, it is also supported by others and is therefore the preferred solution here. +# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes. +# This way non-standard tools like readlink aren't needed. +GIT_HISTORY_SUMMARY_DIR=${GIT_HISTORY_SUMMARY_DIR:-$(CDPATH=. 
cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)} +# echo "gitHistorySummary: GIT_HISTORY_SUMMARY_DIR=${GIT_HISTORY_SUMMARY_DIR}" + +# Get the "scripts" directory by navigating three levels up from this summary directory. +SCRIPTS_DIR=${SCRIPTS_DIR:-"${GIT_HISTORY_SUMMARY_DIR}/../../../scripts"} +MARKDOWN_SCRIPTS_DIR=${MARKDOWN_SCRIPTS_DIR:-"${SCRIPTS_DIR}/markdown"} + +# Cypher query directory for git history statistics queries within this domain +GIT_HISTORY_STATISTICS_CYPHER_DIR="${GIT_HISTORY_SUMMARY_DIR}/../queries/statistics" + +# Define functions to execute a cypher query from within a given file (first and only argument) like "execute_cypher" +source "${SCRIPTS_DIR}/executeQueryFunctions.sh" + +# ── Front matter ────────────────────────────────────────────────────────────── + +git_history_front_matter() { + local current_date + current_date="$(date +'%Y-%m-%d')" + + local latest_tag + latest_tag="$(git for-each-ref --sort=-creatordate --count=1 --format '%(refname:short)' refs/tags)" + + local analysis_directory + analysis_directory="${PWD##*/}" + + echo "---" + echo "title: \"Git History Report\"" + echo "generated: \"${current_date}\"" + echo "model_version: \"${latest_tag}\"" + echo "dataset: \"${analysis_directory}\"" + echo "authors: [\"JohT/code-graph-analysis-pipeline\"]" + echo "---" +} + +# ── SVG chart reference helpers ─────────────────────────────────────────────── + +# Emits a Markdown image reference for a chart SVG if the file exists, otherwise nothing. +include_svg_if_exists() { + local svg_file="${FULL_REPORT_DIRECTORY}/${1}" + local alt_text="${2}" + if [ -f "${svg_file}" ]; then + echo "" + echo "![${alt_text}](./${1})" + echo "" + fi +} + +# Emits Markdown image references for every SVG matching the given glob pattern, sorted. 
+include_svgs_matching() { + local base_dir="${1}" + local pattern="${2}" + [ -d "${base_dir}" ] || return 0 # if the base directory doesn't exist, just return without emitting anything + find "${base_dir}" -maxdepth 1 -type f -name "${pattern}" | sort | while read -r svg_file; do + local chart_filename + chart_filename=$(basename -- "${svg_file}") + local rel_path="${base_dir#"${FULL_REPORT_DIRECTORY}/"}/${chart_filename}" + local chart_label="${chart_filename%.*}" + echo "" + echo "![${chart_label}](./${rel_path})" + done +} + +# ── Report assembly helpers ─────────────────────────────────────────────────── + +# Limits a piped Markdown table to at most 10 data rows (header + separator kept in full). +limit_markdown_table() { + awk '/^\|[| :-]*-[| :-]*\|/ { sep=1; print; next } !sep { print } sep && ++rows <= 10 { print }' +} + +# Runs a Cypher query and outputs a limited Markdown table to stdout. +# Arguments: [cypher_params...] +cypher_table() { + execute_cypher "$@" --output-markdown-table | limit_markdown_table +} + +# Appends a CSV download link to stdout if the CSV file exists. +# Arguments: +csv_link() { + local full_csv="${FULL_REPORT_DIRECTORY}/${1}" + if [ -f "${full_csv}" ]; then + echo "" + echo "[Full data](./${1})" + fi +} + +# ── Report assembly ─────────────────────────────────────────────────────────── + +assemble_git_history_report() { + echo "gitHistorySummary: $(date +'%Y-%m-%dT%H:%M:%S%z') Assembling Markdown report..." 
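The awk one-liner inside `limit_markdown_table` is dense: the first pattern recognizes the `| --- |` separator row and sets a flag, rows before the flag (the header) pass through, and data rows after the flag are capped at 10. The following self-contained sketch demonstrates that behavior on a hypothetical 12-row table (the sample data is illustrative, not from the report):

```shell
#!/usr/bin/env bash
# Demonstrates the awk filter used by limit_markdown_table:
# header and separator rows pass through, data rows are capped at 10.
limit_markdown_table() {
  awk '/^\|[| :-]*-[| :-]*\|/ { sep=1; print; next } !sep { print } sep && ++rows <= 10 { print }'
}

# Hypothetical sample: a Markdown table with 12 data rows.
sample_table() {
  echo "| file | commits |"
  echo "| ----- | ------- |"
  for i in $(seq 1 12); do
    echo "| file${i}.java | ${i} |"
  done
}

# Expect header + separator + 10 data rows after limiting.
row_count=$(sample_table | limit_markdown_table | wc -l | tr -d '[:space:]')
echo "Lines after limiting: ${row_count}"
```

Note the character class `[| :-]` deliberately ends with `-` so the hyphen is literal; a plain table header like `| file | commits |` cannot match because letters are outside the class.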
+ + local report_include_directory="${FULL_REPORT_DIRECTORY}/${MARKDOWN_INCLUDES_DIRECTORY}" + mkdir -p "${report_include_directory}" + + # -- Write front matter ------------------------------------------------ + git_history_front_matter > "${report_include_directory}/GitHistoryReportFrontMatter.md" + + # ── Overview ────────────────────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_by_resolved_label_and_extension.cypher" + csv_link "List_git_files_by_resolved_label_and_extension.csv" + } > "${report_include_directory}/List_git_files_by_resolved_label_and_extension.md" + + # ── Directory commit statistics ──────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_file_directories_with_commit_statistics.cypher" + csv_link "List_git_file_directories_with_commit_statistics.csv" + } > "${report_include_directory}/List_git_file_directories_with_commit_statistics.md" + + { + include_svgs_matching "${FULL_REPORT_DIRECTORY}" "DirectoryCommitStatistics_*.svg" + } > "${report_include_directory}/DirectoryCommitStatisticsCharts.md" + + # ── Co-changed files ────────────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_that_were_changed_together.cypher" + csv_link "List_git_files_that_were_changed_together.csv" + } > "${report_include_directory}/List_git_files_that_were_changed_together.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_that_were_changed_together_all_in_one.cypher" + csv_link "List_git_files_that_were_changed_together_all_in_one.csv" + } > "${report_include_directory}/List_git_files_that_were_changed_together_all_in_one.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_that_were_changed_together_with_another_file.cypher" + csv_link "List_git_files_that_were_changed_together_with_another_file.csv" + } > 
"${report_include_directory}/List_git_files_that_were_changed_together_with_another_file.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_that_were_changed_together_with_another_file_all_in_one.cypher" + csv_link "List_git_files_that_were_changed_together_with_another_file_all_in_one.csv" + } > "${report_include_directory}/List_git_files_that_were_changed_together_with_another_file_all_in_one.md" + + { + include_svgs_matching "${FULL_REPORT_DIRECTORY}" "CoChangedFiles_*.svg" + } > "${report_include_directory}/CoChangedFilesCharts.md" + + # ── File change distribution ─────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_per_commit_distribution.cypher" + csv_link "List_git_files_per_commit_distribution.csv" + } > "${report_include_directory}/List_git_files_per_commit_distribution.md" + + { + include_svg_if_exists "FilesPerCommit_Bar.svg" "Files per Commit Distribution" + } > "${report_include_directory}/FilesPerCommitChart.md" + + # ── Pairwise changed files ───────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_pairwise_changed_files.cypher" + csv_link "List_pairwise_changed_files.csv" + } > "${report_include_directory}/List_pairwise_changed_files.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" \ + "selected_pair_metric=updateCommitLift" + csv_link "List_pairwise_changed_files_top_lift.csv" + } > "${report_include_directory}/List_pairwise_changed_files_top_selected_metric.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_pairwise_changed_files_with_dependencies.cypher" + csv_link "List_pairwise_changed_files_with_dependencies.csv" + } > "${report_include_directory}/List_pairwise_changed_files_with_dependencies.md" + + { + include_svgs_matching "${FULL_REPORT_DIRECTORY}" "PairwiseChangedFiles_*.svg" + } > 
"${report_include_directory}/PairwiseChangedFilesCharts.md" + + # ── Files by author ──────────────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_git_files_with_commit_statistics_by_author.cypher" + csv_link "List_git_files_with_commit_statistics_by_author.csv" + } > "${report_include_directory}/List_git_files_with_commit_statistics_by_author.md" + + # ── Data quality ────────────────────────────────────────────────────── + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_ambiguous_git_files.cypher" + csv_link "List_ambiguous_git_files.csv" + } > "${report_include_directory}/List_ambiguous_git_files.md" + + { + cypher_table "${GIT_HISTORY_STATISTICS_CYPHER_DIR}/List_unresolved_git_files.cypher" + csv_link "List_unresolved_git_files.csv" + } > "${report_include_directory}/List_unresolved_git_files.md" + + # ── Git author wordcloud ─────────────────────────────────────────────── + + { + include_svg_if_exists "GitAuthorWordcloud.svg" "Git Author Wordcloud" + } > "${report_include_directory}/GitAuthorWordcloudChart.md" + + # -- Remove empty Markdown includes ------------------------------------ + source "${SCRIPTS_DIR}/cleanupAfterReportGeneration.sh" "${report_include_directory}" + + # -- Create fallback empty file for optional includes ------------------ + echo "" > "${report_include_directory}/empty.md" + + # -- Copy no-git-data fallback template -------------------------------- + cp -f "${GIT_HISTORY_SUMMARY_DIR}/report_no_git_data.template.md" \ + "${report_include_directory}/report_no_git_data.template.md" + + # -- Assemble final report from template -------------------------------- + cp -f "${GIT_HISTORY_SUMMARY_DIR}/report.template.md" "${FULL_REPORT_DIRECTORY}/report.template.md" + cat "${FULL_REPORT_DIRECTORY}/report.template.md" \ + | "${MARKDOWN_SCRIPTS_DIR}/embedMarkdownIncludes.sh" "${report_include_directory}" \ + > "${FULL_REPORT_DIRECTORY}/git_history_report.md" + + rm -rf 
"${FULL_REPORT_DIRECTORY}/report.template.md" + rm -rf "${report_include_directory}" + + echo "gitHistorySummary: $(date +'%Y-%m-%dT%H:%M:%S%z') Successfully finished." +} + +# ── Main ────────────────────────────────────────────────────────────────────── + +# Create report directory +REPORT_NAME="git-history" +FULL_REPORT_DIRECTORY="${REPORTS_DIRECTORY}/${REPORT_NAME}" +mkdir -p "${FULL_REPORT_DIRECTORY}" + +assemble_git_history_report diff --git a/domains/git-history/summary/report.template.md b/domains/git-history/summary/report.template.md new file mode 100644 index 000000000..9f67808e1 --- /dev/null +++ b/domains/git-history/summary/report.template.md @@ -0,0 +1,172 @@ + + +# 📜 Git History Report + +## 1. Overview + +This report analyses the **git commit history** of the codebase. It covers: + +- **Directory commit statistics** — which directories change most frequently and by how many authors +- **Co-changed files** — files that are frequently committed together (coupling signals) +- **File change distribution** — how many files are changed per commit +- **Pairwise changed files** — direct co-change relationships between specific file pairs +- **Data quality** — ambiguous or unresolved file references in the git log +- **Git author wordcloud** — visual overview of contributor activity + +> **Reading the tables**: Rows are sorted by priority — the **first rows are the most critical**. +> High commit frequency in a directory can indicate hotspots that benefit from refactoring attention. +> Files that always change together may be candidates for co-location or module consolidation. + +## 📚 Table of Contents + +1. [Overview](#1-overview) +1. [Directory Commit Statistics](#2-directory-commit-statistics) +1. [Co-Changed Files](#3-co-changed-files) +1. [File Change Distribution](#4-file-change-distribution) +1. [Pairwise Changed Files](#5-pairwise-changed-files) +1. [Files by Author](#6-files-by-author) +1. [Data Quality](#7-data-quality) +1. 
[Git Author Wordcloud](#8-git-author-wordcloud)
+1. [Glossary and Column Definitions](#9-glossary-and-column-definitions)
+
+---
+
+## 2. Directory Commit Statistics
+
+Shows how often each directory has been changed in git history and how many distinct authors contributed to it. High values in `commits` and `authors` point to active, potentially complex directories.
+
+### 2.1 Directory Commit Statistics (Table)
+
+
+
+### 2.2 Directory Commit Statistics (Charts)
+
+
+
+---
+
+## 3. Co-Changed Files
+
+Files that are frequently committed together are said to be _co-changed_. High co-change frequency between two files is a signal of logical coupling — they may belong to the same conceptual unit or have a shared concern that could be refactored.
+
+### 3.1 Co-Changed File Pairs
+
+
+
+### 3.2 Co-Changed File Pairs (All in One Commit)
+
+Files changed together in a single large commit.
+
+
+
+### 3.3 Co-Changed With a Specific File
+
+Shows all files that were changed together with a specific other file.
+
+
+
+### 3.4 Co-Changed With a Specific File (All in One)
+
+
+
+### 3.5 Co-Changed Files (Charts)
+
+
+
+---
+
+## 4. File Change Distribution
+
+Shows the distribution of how many files are changed per commit. A high proportion of large commits (many files changed at once) can indicate low commit granularity.
+
+### 4.1 Files per Commit Distribution
+
+
+
+### 4.2 Files per Commit Chart
+
+
+
+---
+
+## 5. Pairwise Changed Files
+
+Direct pairwise co-change analysis between individual files, showing commit overlap counts and related dependency information.
+
+### 5.1 Pairwise Changed Files
+
+
+
+### 5.2 Pairwise Changed Files (Top by Lift)
+
+File pairs with the highest _commit lift_ — pairs that co-change more often than random chance would predict (lift > 1).
+
+
+
+### 5.3 Pairwise Changed Files With Dependencies
+
+Files that are co-changed and also have a structural dependency relationship between them.
+ + + +### 5.4 Pairwise Changed Files (Charts) + + + +--- + +## 6. Files by Author + +Shows the files each author has contributed to, including per-author commit statistics per file. Useful for identifying knowledge boundaries and bus-factor risks. + +### 6.1 Files with Commit Statistics by Author + + + +--- + +## 7. Data Quality + +Identifies potential issues in the git log data: files referenced in commits that cannot be resolved to a known codebase file (unresolved), or that match more than one candidate (ambiguous). These affect the reliability of all co-change metrics. + +### 7.1 File Resolution Summary + +Overview of file resolution by extension: how many files are resolved vs. ambiguous vs. unresolved per file type. + + + +### 7.2 Ambiguous Git Files + +Files in the git log that match multiple candidates in the scanned codebase. These are excluded from co-change analysis. + + + +### 7.3 Unresolved Git Files + +Files referenced in git commits but not found in the scanned codebase. May indicate deleted files, renames, or files outside the scan scope. + + + +--- + +## 8. Git Author Wordcloud + +Visual overview of contributor names by commit frequency. Larger text = more commits. + + + +--- + +## 9. Glossary and Column Definitions + +| Term | Definition | +|------|------------| +| `commits` | Number of git commits touching this file or directory. | +| `authors` | Number of distinct author identities contributing to this file or directory. | +| `coChanges` | Number of commits in which two files were changed together. | +| `coupling` | Ratio of co-changes to total commits (0–1). Higher = tighter logical coupling. | +| `coChangedWith` | The other file in a co-change pair. | +| `ambiguous` | A git file path that matches more than one node in the scanned codebase. | +| `unresolved` | A git file path that matches no node in the scanned codebase. | +| `filesPerCommit` | How many files were changed in a single commit. 
| +| `frequency` | Relative share of commits at a specific `filesPerCommit` count. | diff --git a/domains/git-history/summary/report_no_git_data.template.md b/domains/git-history/summary/report_no_git_data.template.md new file mode 100644 index 000000000..6bde7e992 --- /dev/null +++ b/domains/git-history/summary/report_no_git_data.template.md @@ -0,0 +1 @@ +⚠️ _No data available — git history not imported for this codebase._ diff --git a/scripts/reports/GitHistoryCsv.sh b/scripts/reports/GitHistoryCsv.sh deleted file mode 100755 index 9f588cdd1..000000000 --- a/scripts/reports/GitHistoryCsv.sh +++ /dev/null @@ -1,62 +0,0 @@ -#!/usr/bin/env bash - -# Executes "GitLog" Cypher queries to get the "git-history-csv" CSV reports. -# It contains lists of files with only one author, last changed or created files, pairwise changed files,... - -# Requires executeQueryFunctions.sh, cleanupAfterReportGeneration.sh - -# Fail on any error ("-e" = exit on first error, "-o pipefail" exist on errors within piped commands) -set -o errexit -o pipefail - -# Overrideable Constants (defaults also defined in sub scripts) -REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"} - -## Get this "scripts/reports" directory if not already set -# Even if $BASH_SOURCE is made for Bourne-like shells it is also supported by others and therefore here the preferred solution. -# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes. -# This way non-standard tools like readlink aren't needed. -REPORTS_SCRIPT_DIR=${REPORTS_SCRIPT_DIR:-$( CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )} -echo "GitHistoryCsv: REPORTS_SCRIPT_DIR=${REPORTS_SCRIPT_DIR}" - -# Get the "scripts" directory by taking the path of this script and going one directory up. 
-SCRIPTS_DIR=${SCRIPTS_DIR:-"${REPORTS_SCRIPT_DIR}/.."} # Repository directory containing the shell scripts -echo "GitHistoryCsv: SCRIPTS_DIR=${SCRIPTS_DIR}" - -# Get the "cypher" directory by taking the path of this script and going two directory up and then to "cypher". -CYPHER_DIR=${CYPHER_DIR:-"${REPORTS_SCRIPT_DIR}/../../cypher"} -echo "GitHistoryCsv: CYPHER_DIR=${CYPHER_DIR}" - -# Define functions to execute cypher queries from within a given file -source "${SCRIPTS_DIR}/executeQueryFunctions.sh" - -# Create report directory -REPORT_NAME="git-history-csv" -FULL_REPORT_DIRECTORY="${REPORTS_DIRECTORY}/${REPORT_NAME}" -mkdir -p "${FULL_REPORT_DIRECTORY}" - -# Local Constants -GIT_LOG_CYPHER_DIR="${CYPHER_DIR}/GitLog" - -echo "GitHistoryCsv: $(date +'%Y-%m-%dT%H:%M:%S%z') Processing git history..." - -# Detailed git file statistics -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_git_files_with_commit_statistics_by_author.cypher" > "${FULL_REPORT_DIRECTORY}/List_git_files_with_commit_statistics_by_author.csv" -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_git_files_that_were_changed_together_with_another_file.cypher" > "${FULL_REPORT_DIRECTORY}/List_git_files_that_were_changed_together_with_another_file.csv" -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_git_file_directories_with_commit_statistics.cypher" > "${FULL_REPORT_DIRECTORY}/List_git_file_directories_with_commit_statistics.csv" - -# Overall distribution of how many files were changed with one git commit, how many were changed with two, etc. 
-execute_cypher "${GIT_LOG_CYPHER_DIR}/List_git_files_per_commit_distribution.cypher" > "${FULL_REPORT_DIRECTORY}/List_git_files_per_commit_distribution.csv" - -# Find pairwise changed files that depend on each other -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_pairwise_changed_files_with_dependencies.cypher" > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_with_dependencies.csv" - -# List pairwise changed files with various metrics -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" "selected_pair_metric=updateCommitCount" > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_count.csv" -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" "selected_pair_metric=updateCommitMinConfidence" > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_min_confidence.csv" -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" "selected_pair_metric=updateCommitJaccardSimilarity" > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_jaccard.csv" -execute_cypher "${GIT_LOG_CYPHER_DIR}/List_pairwise_changed_files_top_selected_metric.cypher" "selected_pair_metric=updateCommitLift" > "${FULL_REPORT_DIRECTORY}/List_pairwise_changed_files_top_lift.csv" - -# Clean-up after report generation. Empty reports will be deleted. -source "${SCRIPTS_DIR}/cleanupAfterReportGeneration.sh" "${FULL_REPORT_DIRECTORY}" - -echo "GitHistoryCsv: $(date +'%Y-%m-%dT%H:%M:%S%z') Successfully finished." \ No newline at end of file diff --git a/scripts/resetAndScan.sh b/scripts/resetAndScan.sh index 28f78541e..4e60c0278 100755 --- a/scripts/resetAndScan.sh +++ b/scripts/resetAndScan.sh @@ -28,6 +28,9 @@ TOOLS_DIRECTORY=${TOOLS_DIRECTORY:-"tools"} # Get the tools directory (defaults SCRIPTS_DIR=${SCRIPTS_DIR:-$( CDPATH=. 
cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )} # Repository directory containing the shell scripts echo "resetAndScan: SCRIPTS_DIR=${SCRIPTS_DIR}" +DOMAINS_DIRECTORY=${DOMAINS_DIRECTORY:-"${SCRIPTS_DIR}/../domains"} # Domains directory containing domain-specific analysis scripts +echo "resetAndScan: DOMAINS_DIRECTORY=${DOMAINS_DIRECTORY}" + # Internal constants JQASSISTANT_DIRECTORY="${TOOLS_DIRECTORY}/${JQASSISTANT_CLI_ARTIFACT}-${JQASSISTANT_CLI_VERSION}" JQASSISTANT_BIN="${JQASSISTANT_DIRECTORY}/bin" @@ -86,4 +89,7 @@ echo "resetAndScan: Analyzing using jQAssistant CLI version ${JQASSISTANT_CLI_VE "${JQASSISTANT_BIN}"/jqassistant.sh analyze # Scan all git repositories within the "source" (default) folder and import their git log (history) if configured. -time source "${SCRIPTS_DIR}/importGit.sh" \ No newline at end of file +# Uses domain-local importGit.sh which resolves Cypher queries from domains/git-history/queries/enrichment/ +# TODO: This sources the git-history domain (domains/git-history/import/importGit.sh). The dependency direction (core → domain) should be revisited +# in a future cleanup task to determine the canonical location for importGit.sh. +time source "${DOMAINS_DIRECTORY}/git-history/import/importGit.sh" \ No newline at end of file
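The `limit_markdown_table` helper added in `gitHistorySummary.sh` can be exercised in isolation. The following standalone sketch verifies that the `awk` filter keeps the header and separator rows and caps the output at 10 data rows; only the `awk` program is taken from the diff above, while the sample table and the helper `sample_table` are hypothetical test data:

```shell
#!/usr/bin/env bash
# Standalone check of the limit_markdown_table awk filter from
# domains/git-history/summary/gitHistorySummary.sh.
set -o errexit -o pipefail

# The awk program as added in the diff: print everything up to and including
# the separator row, then at most 10 further (data) rows.
limit_markdown_table() {
    awk '/^\|[| :-]*-[| :-]*\|/ { sep=1; print; next } !sep { print } sep && ++rows <= 10 { print }'
}

# Hypothetical sample: a Markdown table with 15 data rows.
sample_table() {
    echo "| file | commits |"
    echo "|------|---------|"
    for i in $(seq 1 15); do
        echo "| file${i}.java | ${i} |"
    done
}

# Header (1) + separator (1) + capped data rows (10) = 12 output lines;
# rows 11 through 15 are dropped.
sample_table | limit_markdown_table
```

Note that the separator regex `^\|[| :-]*-[| :-]*\|` only matches rows built from pipes, colons, dashes, and spaces, so ordinary header or data cells containing letters are never mistaken for the separator.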