
Commit f7791d7

Merge branch 'main' into smithsonian
2 parents 542eca2 + ac22fb5 commit f7791d7

14 files changed: +212 -103 lines changed


README.md

Lines changed: 96 additions & 23 deletions
@@ -1,29 +1,16 @@
-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-  region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase involves collecting data from a particular source
+   using its API. Before writing any code, we plan the analyses we want to
+   perform by asking meaningful questions about the data. We also consider API
+   limitations (such as query limits) and design a query strategy to work
+   within these limitations. Then we write a Python script that fetches the
+   data. It is important to follow the format of the existing scripts in the
+   project and to use the shared modules and functions where applicable. This
+   ensures consistency across scripts and makes it easier to debug any issues
+   that might arise.
+   - **Meaningful questions**
+     - The reports generated by this project (and the data fetched and
+       processed to support it) seek to be meaningful. We hope this project
+       will provide data and analysis that helps inform discussions about the
+       commons. The goal of this project is to help answer questions like:
+       - How has the world's use of the commons changed over time?
+       - How is the knowledge and culture of the commons distributed?
+       - Who has access (and how much) to the commons?
+       - What significant trends can be observed in the commons?
+       - Which public domain dedication or licenses are the most popular?
+       - What are the correlations between public domain dedication or
+         licenses and region, language, domain/endeavor, etc.?
+   - **Limitations of an API**
+     - Some data sources provide APIs with query limits (daily or hourly,
+       depending on what is given in the documentation). This restricts how
+       many requests can be made in the specified period of time. It is
+       important to plan a query strategy and schedule fetch jobs to stay
+       within the allowed limits.
+   - **Headings of data in 1-fetch**
+     - [Tool identifier][tool-identifier]: A unique identifier used to
+       distinguish each Creative Commons legal tool within the dataset. This
+       helps ensure consistency when tracking tools across different data
+       sources.
+     - [SPDX identifier][spdx-identifier]: A standardized identifier
+       maintained by the Software Package Data Exchange (SPDX) project. It
+       provides a consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+   structured and standardized format for analysis. The data is then analyzed
+   and categorized based on defined criteria to extract insights that answer
+   the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+   We generate graphs and summaries that clearly show trends, patterns, and
+   distributions in the data. These reports help communicate key insights
+   about the size, diversity, and characteristics of openly licensed and
+   public domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch,
+process, and report data. GitHub Actions is used to automatically run these
+scripts on a defined schedule and on code updates. It handles script
+execution, manages dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+  - Execution schedule for each quarter:
+    - 1-Fetch: first month, 1st half of second month
+    - 2-Process: 2nd half of second month
+    - 3-Report: third month
+- **Script requirements**
+  - *Must be safe*
+    - Scripts must not make any changes with default options
+    - The easiest way to run a script should also be the safest
+    - Have options spelled out
+  - *Must be timely*
+    - Scripts should complete within a maximum of 45 minutes
+    - Scripts shouldn't take longer than 3 minutes with default options
+      - That way there is a quick way to see what is happening when a script
+        runs (confirm execution, check for errors, etc.); later, in
+        production, it can be run with longer options
+  - *Must be idempotent*
+    - [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+    - This applies to both the data fetched and the data stored. If the data
+      changes randomly, we can't draw meaningful conclusions.
+  - *Balanced use of third-party libraries*
+    - Third-party libraries should be leveraged when they are:
+      - API specific (google-api-python-client, internetarchive, etc.)
+    - File formats
+      - CSV: the format is well supported (rendered on GitHub, etc.), easy to
+        use, and the data used by the project is simple enough to avoid any
+        shortcomings.
+      - YAML: prioritizes human readability, which addresses the primary
+        costs and risks associated with configuration files.
+
+
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
@@ -91,8 +165,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 
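The fetch-phase guidance in the README above (plan a query strategy around API query limits and stay within the allowed request budget) can be sketched as a small per-run budget. This is an illustrative sketch only, not project code: `QueryBudget`, its limits, and the loop body are all hypothetical.

```python
import time


class QueryBudget:
    """Track a per-run request budget so a fetch script stays within
    an API's documented query limits (hypothetical limit values)."""

    def __init__(self, max_requests, min_interval_seconds=1.0):
        self.max_requests = max_requests
        self.min_interval = min_interval_seconds
        self.used = 0
        self.last_request = 0.0

    def acquire(self):
        """Return True if another request may be made, sleeping as
        needed to respect the minimum interval between requests."""
        if self.used >= self.max_requests:
            return False
        wait = self.min_interval - (time.monotonic() - self.last_request)
        if wait > 0:
            time.sleep(wait)
        self.used += 1
        self.last_request = time.monotonic()
        return True


# Tiny demonstration run; a real script would call the API inside the loop.
budget = QueryBudget(max_requests=3, min_interval_seconds=0.0)
pages_fetched = 0
while budget.acquire():
    pages_fetched += 1
```

In production the budget would be sized from the API documentation (e.g. a daily cap divided across scheduled GitHub Actions runs), which is the "query strategy" the README asks contributors to plan before writing code.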
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNT"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","LANGUAGE","COUNT"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"PLAN_INDEX","TOOL_IDENTIFIER","COUNTRY","COUNT"
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
+"BSD Zero Clause License","0BSD","64052"
+"CC0 1.0","CC0-1.0","350419"
+"CC BY 4.0","CC-BY-4.0","102675"
+"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
+"MIT No Attribution","MIT-0","35103"
+"Unlicense","Unlicense","406459"
+"Total public repositories","N/A","289935546"

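Because the fetched data files are plain CSV (a format the README chooses deliberately for its simplicity), they can be sanity-checked with the standard library alone. A sketch using an inline copy of the rows above; note that the final "Total public repositories" row is a denominator, not a per-tool count:

```python
import csv
import io

# Inline copy of the fetched license counts shown in the diff above.
DATA = '''"TOOL_IDENTIFIER","SPDX_IDENTIFIER","COUNT"
"BSD Zero Clause License","0BSD","64052"
"CC0 1.0","CC0-1.0","350419"
"CC BY 4.0","CC-BY-4.0","102675"
"CC BY-SA 4.0","CC-BY-SA-4.0","30783"
"MIT No Attribution","MIT-0","35103"
"Unlicense","Unlicense","406459"
"Total public repositories","N/A","289935546"
'''

rows = list(csv.DictReader(io.StringIO(DATA), dialect="unix"))
# Exclude the totals row (SPDX identifier "N/A") from the per-tool sum.
tool_rows = [r for r in rows if r["SPDX_IDENTIFIER"] != "N/A"]
total_tools = sum(int(r["COUNT"]) for r in tool_rows)
```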
scripts/1-fetch/wikipedia_fetch.py

Lines changed: 13 additions & 0 deletions
@@ -63,6 +63,18 @@ def parse_arguments():
     return args
 
 
+def check_for_completion():
+    try:
+        with open(FILE_LANGUAGES, "r", newline="") as file_obj:
+            reader = csv.DictReader(file_obj, dialect="unix")
+            if len(list(reader)) > 300:
+                raise shared.QuantifyingException(
+                    f"Data fetch completed for {QUARTER}", 0
+                )
+    except FileNotFoundError:
+        pass  # File may not be found without --enable-save, etc.
+
+
 def write_data(args, tool_data):
     if not args.enable_save:
         return args
@@ -157,6 +169,7 @@ def query_wikipedia_languages(session):
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
+    check_for_completion()
     shared.git_fetch_and_merge(args, PATHS["repo"])
     session = shared.get_session()
     tool_data = query_wikipedia_languages(session)
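The `check_for_completion` pattern added above (exit early when the fetched data already looks complete, tolerate a missing file on the first run) is what keeps the fetch script idempotent and safe to re-run. A standalone sketch of the same pattern, using a plain exception and a temporary file in place of the project's `shared.QuantifyingException` and `FILE_LANGUAGES`:

```python
import csv
import os
import tempfile


class FetchComplete(Exception):
    """Stand-in for the project's shared.QuantifyingException."""


def check_for_completion(path, threshold=300):
    """Raise FetchComplete if the data file already holds enough rows;
    continue silently if the file does not exist yet."""
    try:
        with open(path, "r", newline="") as file_obj:
            reader = csv.DictReader(file_obj, dialect="unix")
            if len(list(reader)) > threshold:
                raise FetchComplete("data fetch already completed")
    except FileNotFoundError:
        pass  # first run: nothing fetched yet


# Demonstration with a small threshold and a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "languages.csv")
    check_for_completion(path)  # missing file: no exception raised
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="unix")
        writer.writerow(["LANGUAGE", "COUNT"])
        for i in range(5):
            writer.writerow([f"lang{i}", i])
    completed = False
    try:
        check_for_completion(path, threshold=3)
    except FetchComplete:
        completed = True
```

Raising an exception with exit status 0 (as the real script does) lets a scheduled GitHub Actions run finish cleanly when a quarter's data is already fetched, rather than fetching it again.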

scripts/2-process/gcs_process.py

Lines changed: 9 additions & 5 deletions
@@ -311,7 +311,9 @@ def main():
 
     # Count data
     file1_count = shared.path_join(PATHS["data_1-fetch"], "gcs_1_count.csv")
-    count_data = pd.read_csv(file1_count, usecols=["TOOL_IDENTIFIER", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file1_count, usecols=["TOOL_IDENTIFIER", "COUNT"]
+    )
     process_product_totals(args, count_data)
     process_latest_prior_retired_totals(args, count_data)
     process_totals_by_free_cultural(args, count_data)
@@ -321,17 +323,19 @@ def main():
     file2_language = shared.path_join(
         PATHS["data_1-fetch"], "gcs_2_count_by_language.csv"
     )
-    language_data = pd.read_csv(
-        file2_language, usecols=["TOOL_IDENTIFIER", "LANGUAGE", "COUNT"]
+    language_data = shared.open_data_file(
+        LOGGER,
+        file2_language,
+        usecols=["TOOL_IDENTIFIER", "LANGUAGE", "COUNT"],
     )
     process_totals_by_language(args, language_data)
 
     # Country data
     file3_country = shared.path_join(
         PATHS["data_1-fetch"], "gcs_3_count_by_country.csv"
     )
-    country_data = pd.read_csv(
-        file3_country, usecols=["TOOL_IDENTIFIER", "COUNTRY", "COUNT"]
+    country_data = shared.open_data_file(
+        LOGGER, file3_country, usecols=["TOOL_IDENTIFIER", "COUNTRY", "COUNT"]
     )
     process_totals_by_country(args, country_data)
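The change above replaces bare `pd.read_csv` calls with a `shared.open_data_file` helper that also receives the `LOGGER`. The shared module itself is not part of this diff, so purely as an illustration of the shape of such a helper (log the file access, fail loudly when the file is missing), here is a hypothetical stdlib stand-in; the real helper presumably wraps `pd.read_csv` and may differ:

```python
import csv
import logging
import os
import tempfile


def open_data_file(logger, file_path, usecols=None):
    """Hypothetical stand-in for shared.open_data_file: read a CSV,
    log the access, and return rows as dicts restricted to usecols."""
    logger.info(f"Opening data file: {file_path}")
    try:
        with open(file_path, "r", newline="") as file_obj:
            rows = []
            for row in csv.DictReader(file_obj, dialect="unix"):
                if usecols is not None:
                    row = {key: row[key] for key in usecols}
                rows.append(row)
            return rows
    except FileNotFoundError:
        logger.error(f"Missing data file: {file_path}")
        raise


# Demonstration with a temporary file shaped like gcs_1_count.csv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gcs_1_count.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="unix")
        writer.writerow(["TOOL_IDENTIFIER", "UNUSED", "COUNT"])
        writer.writerow(["CC BY 4.0", "x", "100"])
    data = open_data_file(
        logging.getLogger("demo"), path, usecols=["TOOL_IDENTIFIER", "COUNT"]
    )
```

Centralizing file reads behind one helper means every process script logs and errors the same way, which matches the README's requirement that scripts stay consistent and easy to debug.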

scripts/2-process/github_process.py

Lines changed: 12 additions & 45 deletions
@@ -11,7 +11,6 @@
 import traceback
 
 # Third-party
-# import pandas as pd
 import pandas as pd
 
 # Add parent directory so shared can be imported
@@ -60,6 +59,13 @@ def parse_arguments():
     return args
 
 
+def check_for_data_file(file_path):
+    if os.path.exists(file_path):
+        raise shared.QuantifyingException(
+            f"Processed data already exists for {QUARTER}", 0
+        )
+
+
 def data_to_csv(args, data, file_path):
     if not args.enable_save:
         return
@@ -92,6 +98,7 @@ def process_totals_by_license(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "github_totals_by_license.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, data, file_path)
 
 
@@ -126,59 +133,19 @@ def process_totals_by_restriction(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "github_totals_by_restriction.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, data, file_path)
 
 
-# def load_quarter_data(quarter):
-#     """
-#     Load data for a specific quarter.
-#     """
-#     file_path = os.path.join(PATHS["data"], f"{quarter}",
-#                              "1-fetch", "github_fetched")
-#     if not os.path.exists(file_path):
-#         LOGGER.error(f"Data file for quarter {quarter} not found.")
-#         return None
-#     return pd.read_csv(file_path)
-
-
-# def compare_data(current_quarter, previous_quarter):
-#     """
-#     Compare data between two quarters.
-#     """
-#     current_data = load_quarter_data(current_quarter)
-#     previous_data = load_quarter_data(previous_quarter)
-
-#     if current_data is None or previous_data is None:
-#         return
-
-#     Process data to compare totals
-
-
-# def parse_arguments():
-#     """
-#     Parses command-line arguments, returns parsed arguments.
-#     """
-#     LOGGER.info("Parsing command-line arguments")
-#     parser = argparse.ArgumentParser(
-#         description="Google Custom Search Comparison Report")
-#     parser.add_argument(
-#         "--current_quarter", type=str, required=True,
-#         help="Current quarter for comparison (e.g., 2024Q3)"
-#     )
-#     parser.add_argument(
-#         "--previous_quarter", type=str, required=True,
-#         help="Previous quarter for comparison (e.g., 2024Q2)"
-#     )
-#     return parser.parse_args()
-
-
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     shared.git_fetch_and_merge(args, PATHS["repo"])
 
     file_count = shared.path_join(PATHS["data_1-fetch"], "github_1_count.csv")
-    count_data = pd.read_csv(file_count, usecols=["TOOL_IDENTIFIER", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file_count, usecols=["TOOL_IDENTIFIER", "COUNT"]
+    )
     process_totals_by_license(args, count_data)
     process_totals_by_restriction(args, count_data)

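Process-phase functions like `process_totals_by_license` and `process_totals_by_restriction` above aggregate the fetched per-tool counts into category totals before writing them out. As a rough illustration of that kind of grouping (the tool-to-category mapping below is hypothetical; the project's actual restriction criteria are not shown in this diff):

```python
from collections import Counter

# Hypothetical fetched rows: (tool identifier, count).
count_data = [
    ("CC0 1.0", 350419),
    ("CC BY 4.0", 102675),
    ("CC BY-SA 4.0", 30783),
]

# Illustrative mapping from tool to a coarse restriction category.
RESTRICTION = {
    "CC0 1.0": "public domain",
    "CC BY 4.0": "attribution",
    "CC BY-SA 4.0": "attribution",
}

# Sum counts per category, the core of a totals-by-X process step.
totals = Counter()
for tool, count in count_data:
    totals[RESTRICTION[tool]] += count
```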
scripts/2-process/wikipedia_process.py

Lines changed: 13 additions & 2 deletions
@@ -63,6 +63,13 @@ def parse_arguments():
     return args
 
 
+def check_for_data_file(file_path):
+    if os.path.exists(file_path):
+        raise shared.QuantifyingException(
+            f"Processed data already exists for {QUARTER}", 0
+        )
+
+
 def data_to_csv(args, data, file_path):
     if not args.enable_save:
         return
@@ -91,6 +98,7 @@ def process_highest_language_usage(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_highest_language_usage.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, top_10, file_path)
 
 
@@ -114,6 +122,7 @@ def process_least_language_usage(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_least_language_usage.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, bottom_10, file_path)
 
 
@@ -140,18 +149,20 @@ def process_language_representation(args, count_data):
     file_path = shared.path_join(
         PATHS["data_phase"], "wikipedia_language_representation.csv"
     )
+    check_for_data_file(file_path)
     data_to_csv(args, language_counts, file_path)
 
 
 def main():
     args = parse_arguments()
     shared.paths_log(LOGGER, PATHS)
     shared.git_fetch_and_merge(args, PATHS["repo"])
-
     file_count = shared.path_join(
         PATHS["data_1-fetch"], "wikipedia_count_by_languages.csv"
     )
-    count_data = pd.read_csv(file_count, usecols=["LANGUAGE_NAME_EN", "COUNT"])
+    count_data = shared.open_data_file(
+        LOGGER, file_count, usecols=["LANGUAGE_NAME_EN", "COUNT"]
+    )
     process_language_representation(args, count_data)
     process_highest_language_usage(args, count_data)
    process_least_language_usage(args, count_data)

0 commit comments
