
Commit f7791d7

Merge branch 'main' into smithsonian
2 parents 542eca2 + ac22fb5 commit f7791d7

14 files changed: +212 -103 lines changed


README.md

Lines changed: 96 additions & 23 deletions
@@ -1,29 +1,16 @@
-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase involves collecting data from a particular source
+   using its API. Before writing any code, we plan the analyses we want to
+   perform by asking meaningful questions about the data. We also consider API
+   limitations (such as query limits) and design a query strategy to work
+   within these limitations. Then we write a Python script that fetches the
+   data. It is important to follow the format of the existing scripts in the
+   project and to use the shared modules and functions where applicable. This
+   ensures consistency across scripts and makes it easier to debug any issues
+   that might arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that helps inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources provide APIs with query limits (daily or hourly,
+       depending on what is given in the documentation). This restricts how
+       many requests can be made in the specified period of time. It is
+       important to plan a query strategy and schedule fetch jobs to stay
+       within the allowed limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch,
+process, and report data. GitHub Actions is used to automatically run these
+scripts on a defined schedule and on code updates. It handles script
+execution, manages dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month, 1st half of second month
+    - 2-Process: 2nd half of second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Have options spelled out
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there is a quick way to see what is happening when a script
+        runs (confirm execution, check for errors, etc.); later, in
+        production, it can be run with longer options
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are:
+      - API specific (google-api-python-client, internetarchive, etc.)
+    - File formats
+      - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+        use, and the data used by the project is simple enough to avoid any
+        shortcomings.
+      - YAML: prioritizes human readability, which addresses the primary
+        costs and risks associated with configuration files.
+
+
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 
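The fetch-phase guidance in the README above (plan a query strategy around API query limits and stay within the allowed request budget) can be sketched as a small per-run budget. This is an illustrative sketch only, not project code: `QueryBudget`, its limits, and the loop body are all hypothetical.

```python
import time


class QueryBudget:
    """Track a per-run request budget so a fetch script stays within
    an API's documented query limits (hypothetical limit values)."""

    def __init__(self, max_requests, min_interval_seconds=1.0):
        self.max_requests = max_requests
        self.min_interval = min_interval_seconds
        self.used = 0
        self.last_request = 0.0

    def acquire(self):
        """Return True if another request may be made, sleeping as
        needed to respect the minimum interval between requests."""
        if self.used >= self.max_requests:
            return False
        wait = self.min_interval - (time.monotonic() - self.last_request)
        if wait > 0:
            time.sleep(wait)
        self.used += 1
        self.last_request = time.monotonic()
        return True


# Tiny demonstration run; a real script would call the API inside the loop.
budget = QueryBudget(max_requests=3, min_interval_seconds=0.0)
pages_fetched = 0
while budget.acquire():
    pages_fetched += 1
```

In production the budget would be sized from the API documentation (e.g. a daily cap divided across scheduled GitHub Actions runs), which is the "query strategy" the README asks contributors to plan before writing code.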
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNT"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","LANGUAGE","COUNT"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNTRY","COUNT"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
+"BSD Zero Clause License","0BSD","64052"
+"CC0 1.0","CC0-1.0","350419"
+"CC BY 4.0","CC-BY-4.0","102675"
+"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
+"MIT No Attribution","MIT-0","35103"
+"Unlicense","Unlicense","406459"
+"Total public repositories","N/A","289935546"

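Because the fetched data files are plain CSV (a format the README chooses deliberately for its simplicity), they can be sanity-checked with the standard library alone. A sketch using an inline copy of the rows above; note that the final "Total public repositories" row is a denominator, not a per-tool count:

```python
import csv
import io

# Inline copy of the fetched license counts shown in the diff above.
DATA = '''"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
"BSD Zero Clause License","0BSD","64052"
"CC0 1.0","CC0-1.0","350419"
"CC BY 4.0","CC-BY-4.0","102675"
"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
"MIT No Attribution","MIT-0","35103"
"Unlicense","Unlicense","406459"
"Total public repositories","N/A","289935546"
'''

rows = list(csv.DictReader(io.StringIO(DATA), dialect="unix"))
# Exclude the totals row (SPDX identifier "N/A") from the per-tool sum.
tool_rows = [r for r in rows if r["SPDX_IDENTIFIER"] != "N/A"]
total_tools = sum(int(r["COUNT"]) for r in tool_rows)
```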
scripts/1-fetch/wikipedia_fetch.py

Lines changed: 13 additions & 0 deletions
@@ -63,6 +63,18 @@ def parse_arguments():
     return args
 
 
+def check_for_completion():
+    try:
+        with open(FILE_LANGUAGES, "r", newline="") as file_obj:
+            reader = csv.DictReader(file_obj, dialect="unix")
+            if len(list(reader)) > 300:
+                raise shared.QuantifyingException(
+                    f"Data fetch completed for {QUARTER}", 0
+                )
+    except FileNotFoundError:
+        pass  # File may not be found without --enable-save, etc.
+
+
 def write_data(args, tool_data):
     if not args.enable_save:
         return args
@@ -157,6 +169,7 @@ def query_wikipedia_languages(session):
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
+    check_for_completion()
     shared.git_fetch_and_merge(args, PATHS["repo"])
     session = shared.get_session()
     tool_data = query_wikipedia_languages(session)
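The `check_for_completion` pattern added above (exit early when the fetched data already looks complete, tolerate a missing file on the first run) is what keeps the fetch script idempotent and safe to re-run. A standalone sketch of the same pattern, using a plain exception and a temporary file in place of the project's `shared.QuantifyingException` and `FILE_LANGUAGES`:

```python
import csv
import os
import tempfile


class FetchComplete(Exception):
    """Stand-in for the project's shared.QuantifyingException."""


def check_for_completion(path, threshold=300):
    """Raise FetchComplete if the data file already holds enough rows;
    continue silently if the file does not exist yet."""
    try:
        with open(path, "r", newline="") as file_obj:
            reader = csv.DictReader(file_obj, dialect="unix")
            if len(list(reader)) > threshold:
                raise FetchComplete("data fetch already completed")
    except FileNotFoundError:
        pass  # first run: nothing fetched yet


# Demonstration with a small threshold and a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "languages.csv")
    check_for_completion(path)  # missing file: no exception raised
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="unix")
        writer.writerow(["LANGUAGE", "COUNT"])
        for i in range(5):
            writer.writerow([f"lang{i}", i])
    completed = False
    try:
        check_for_completion(path, threshold=3)
    except FetchComplete:
        completed = True
```

Raising an exception with exit status 0 (as the real script does) lets a scheduled GitHub Actions run finish cleanly when a quarter's data is already fetched, rather than fetching it again.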

scripts/2-process/gcs_process.py

Lines changed: 9 additions & 5 deletions
@@ -311,7 +311,9 @@ def main():
 
     # Count data
     file1_count = shared.path_join(PATHS["data_1-fetch"], "gcs_1_count.csv")
-    count_data = pd.read_csv(file1_count, usecols=["TOOL_IDENTIFIER", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file1_count, usecols=["TOOL_IDENTIFIER", "COUNT"]
+    )
     process_product_totals(args, count_data)
     process_latest_prior_retired_totals(args, count_data)
     process_totals_by_free_cultural(args, count_data)
@@ -321,17 +323,19 @@ def main():
     file2_language = shared.path_join(
         PATHS["data_1-fetch"], "gcs_2_count_by_language.csv"
     )
-    language_data = pd.read_csv(
-        file2_language, usecols=["TOOL_IDENTIFIER", "LANGUAGE", "COUNT"]
+    language_data = shared.open_data_file(
+        LOGGER,
+        file2_language,
+        usecols=["TOOL_IDENTIFIER", "LANGUAGE", "COUNT"],
     )
     process_totals_by_language(args, language_data)
 
     # Country data
     file3_country = shared.path_join(
         PATHS["data_1-fetch"], "gcs_3_count_by_country.csv"
     )
-    country_data = pd.read_csv(
-        file3_country, usecols=["TOOL_IDENTIFIER", "COUNTRY", "COUNT"]
+    country_data = shared.open_data_file(
+        LOGGER, file3_country, usecols=["TOOL_IDENTIFIER", "COUNTRY", "COUNT"]
     )
     process_totals_by_country(args, country_data)
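The change above replaces bare `pd.read_csv` calls with a `shared.open_data_file` helper that also receives the `LOGGER`. The shared module itself is not part of this diff, so purely as an illustration of the shape of such a helper (log the file access, fail loudly when the file is missing), here is a hypothetical stdlib stand-in; the real helper presumably wraps `pd.read_csv` and may differ:

```python
import csv
import logging
import os
import tempfile


def open_data_file(logger, file_path, usecols=None):
    """Hypothetical stand-in for shared.open_data_file: read a CSV,
    log the access, and return rows as dicts restricted to usecols."""
    logger.info(f"Opening data file: {file_path}")
    try:
        with open(file_path, "r", newline="") as file_obj:
            rows = []
            for row in csv.DictReader(file_obj, dialect="unix"):
                if usecols is not None:
                    row = {key: row[key] for key in usecols}
                rows.append(row)
            return rows
    except FileNotFoundError:
        logger.error(f"Missing data file: {file_path}")
        raise


# Demonstration with a temporary file shaped like gcs_1_count.csv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gcs_1_count.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="unix")
        writer.writerow(["TOOL_IDENTIFIER", "UNUSED", "COUNT"])
        writer.writerow(["CC BY 4.0", "x", "100"])
    data = open_data_file(
        logging.getLogger("demo"), path, usecols=["TOOL_IDENTIFIER", "COUNT"]
    )
```

Centralizing file reads behind one helper means every process script logs and errors the same way, which matches the README's requirement that scripts stay consistent and easy to debug.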

scripts/2-process/github_process.py

Lines changed: 12 additions & 45 deletions
@@ -11,7 +11,6 @@
 import traceback
 
 # Third-party
-# import pandas as pd
 import pandas as pd
 
 # Add parent directory so shared can be imported
@@ -60,6 +59,13 @@ def parse_arguments():
     return args
 
 
+def check_for_data_file(file_path):
+    if os.path.exists(file_path):
+        raise shared.QuantifyingException(
+            f"Processed data already exists for {QUARTER}", 0
+        )
+
+
 def data_to_csv(args, data, file_path):
     if not args.enable_save:
         return
@@ -92,6 +98,7 @@ def process_totals_by_license(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "github_totals_by_license.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, data, file_path)
 
 
@@ -126,59 +133,19 @@ def process_totals_by_restriction(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "github_totals_by_restriction.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, data, file_path)
 
 
-# def load_quarter_data(quarter):
-#     """
-#     Load data for a specific quarter.
-#     """
-#     file_path = os.path.join(PATHS["data"], f"{quarter}",
-#                              "1-fetch", "github_fetched")
-#     if not os.path.exists(file_path):
-#         LOGGER.error(f"Data file for quarter {quarter} not found.")
-#         return None
-#     return pd.read_csv(file_path)
-
-
-# def compare_data(current_quarter, previous_quarter):
-#     """
-#     Compare data between two quarters.
-#     """
-#     current_data = load_quarter_data(current_quarter)
-#     previous_data = load_quarter_data(previous_quarter)
-
-#     if current_data is None or previous_data is None:
-#         return
-
-#     Process data to compare totals
-
-
-# def parse_arguments():
-#     """
-#     Parses command-line arguments, returns parsed arguments.
-#     """
-#     LOGGER.info("Parsing command-line arguments")
-#     parser = argparse.ArgumentParser(
-#         description="Google Custom Search Comparison Report")
-#     parser.add_argument(
-#         "--current_quarter", type=str, required=True,
-#         help="Current quarter for comparison (e.g., 2024Q3)"
-#     )
-#     parser.add_argument(
-#         "--previous_quarter", type=str, required=True,
-#         help="Previous quarter for comparison (e.g., 2024Q2)"
-#     )
-#     return parser.parse_args()
-
-
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     shared.git_fetch_and_merge(args, PATHS["repo"])
 
     file_count = shared.path_join(PATHS["data_1-fetch"], "github_1_count.csv")
-    count_data = pd.read_csv(file_count, usecols=["TOOL_IDENTIFIER", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file_count, usecols=["TOOL_IDENTIFIER", "COUNT"]
+    )
     process_totals_by_license(args, count_data)
     process_totals_by_restriction(args, count_data)

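Process-phase functions like `process_totals_by_license` and `process_totals_by_restriction` above aggregate the fetched per-tool counts into category totals before writing them out. As a rough illustration of that kind of grouping (the tool-to-category mapping below is hypothetical; the project's actual restriction criteria are not shown in this diff):

```python
from collections import Counter

# Hypothetical fetched rows: (tool identifier, count).
count_data = [
    ("CC0 1.0", 350419),
    ("CC BY 4.0", 102675),
    ("CC BY-SA 4.0", 30783),
]

# Illustrative mapping from tool to a coarse restriction category.
RESTRICTION = {
    "CC0 1.0": "public domain",
    "CC BY 4.0": "attribution",
    "CC BY-SA 4.0": "attribution",
}

# Sum counts per category, the core of a totals-by-X process step.
totals = Counter()
for tool, count in count_data:
    totals[RESTRICTION[tool]] += count
```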
scripts/2-process/wikipedia_process.py

Lines changed: 13 additions & 2 deletions
@@ -63,6 +63,13 @@ def parse_arguments():
     return args
 
 
+def check_for_data_file(file_path):
+    if os.path.exists(file_path):
+        raise shared.QuantifyingException(
+            f"Processed data already exists for {QUARTER}", 0
+        )
+
+
 def data_to_csv(args, data, file_path):
     if not args.enable_save:
         return
@@ -91,6 +98,7 @@ def process_highest_language_usage(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_highest_language_usage.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, top_10, file_path)
 
 
@@ -114,6 +122,7 @@ def process_least_language_usage(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_least_language_usage.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, bottom_10, file_path)
 
 
@@ -140,18 +149,20 @@ def process_language_representation(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_language_representation.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, language_counts, file_path)
 
 
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     shared.git_fetch_and_merge(args, PATHS["repo"])
-
     file_count = shared.path_join(
         PATHS["data_1-fetch"], "wikipedia_count_by_languages.csv"
     )
-    count_data = pd.read_csv(file_count, usecols=["LANGUAGE_NAME_EN", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file_count, usecols=["LANGUAGE_NAME_EN", "COUNT"]
+    )
     process_language_representation(args, count_data)
     process_highest_language_usage(args, count_data)
    process_least_language_usage(args, count_data)

0 commit comments
