Commit 014755c

Merge branch 'main' into museums-victoria
2 parents 1dd931e + 34c1caa commit 014755c

150 files changed: 4850 additions and 193585 deletions

Pipfile

Lines changed: 5 additions & 19 deletions
@@ -4,39 +4,25 @@ verify_ssl = true
 name = "pypi"
 
 [packages]
+cachetools = "*" # Required by google-api-python-client
 feedparser = "*"
-flickrapi = "*"
 GitPython = "*"
 google-api-python-client = "*"
-h11 = ">=0.16.0" # Ensure dependency is secure
-internetarchive = ">=5.5.1"
-jupyterlab = ">=3.6.7"
+lxml = "*"
 matplotlib = "*"
-numpy = "*"
 pandas = "*"
-plotly = "*"
-pillow = ">=11.3.0" # Ensure dependency is secure
-Pyarrow = "*"
+protobuf = ">=6.33.5" # Ensure dependency is secure
+pyasn1 = ">=0.6.2" # Ensure dependency is secure
 Pygments = "*"
-python-dotenv = "*"
 PyYAML = "*"
 requests = ">=2.31.0"
-seaborn = "*"
-urllib3 = ">=2.5.0"
-wordcloud = "*"
+urllib3 = ">=2.6.3" # Ensure dependency is secure
 
 [dev-packages]
 black = "*"
-"black[jupyter]" = "*"
 flake8 = "*"
 isort = "*"
 pre-commit = "*"
 
 [requires]
 python_version = "3.11"
-
-[scripts]
-gcs_fetched = "./scripts/1-fetch/gcs_fetched.py"
-flickr_fetched = "./scripts/1-fetch/flickr_fetched.py"
-gcs_processed = "./scripts/2-process/gcs_processed.py"
-gcs_reports = "./scripts/3-report/gcs_reports.py"

Pipfile.lock

Lines changed: 692 additions & 1833 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 103 additions & 24 deletions
@@ -1,29 +1,16 @@
-# quantifying
+# Quantifying
 
-Quantifying the Commons
+Quantifying the Commons: measure the size and diversity of the commons--the
+collection of works that are openly licensed or in the public domain
 
 
 ## Overview
 
-This project seeks to quantify the size and diversity of the commons--the
-collection of works that are openly licensed or in the public domain.
-
-
-### Meaningful
-
-The reports generated by this project (and the data fetched and processed to
-support it) seeks to be meaningful. We hope this project will provide data and
-analysis that helps inform discussions about the commons--the collection of
-works that are openly licensed or in the public domain.
-
-The goal of this project is to help answer questions like:
-- How has the world's use of the commons changed over time?
-- How is the knowledge and culture of the commons distributed?
-- Who has access (and how much) to the commons?
-- What significant trends can be observed in the commons?
-- Which public domain dedication or licenses are the most popular?
-- What are the correlations between public domain dedication or licenses and
-region, language, domain/endeavor, etc.?
+This project seeks to quantify the size and diversity of the Creative Commons
+legal tools. We aim to track the collection of works (articles, images,
+publications, etc.) that are openly licensed or in the public domain. The
+project automates data collection from multiple data sources, processes the
+data, and generates meaningful reports.
 
 
 ## Code of conduct
@@ -47,6 +34,93 @@ See [`CONTRIBUTING.md`][org-contrib].
 [org-contrib]: https://github.com/creativecommons/.github/blob/main/CONTRIBUTING.md
 
 
+### The three phases of generating a report
+
+1. **Fetch**: This phase involves collecting data from a particular source
+using its API. Before writing any code, we plan the analyses we want to
+perform by asking meaningful questions about the data. We also consider API
+limitations (such as query limits) and design a query strategy to work
+within these limitations. Then we write a Python script that gets the data.
+It is important to follow the format of the scripts existing in the
+project and use the modules and functions where applicable. This ensures
+consistency in the scripts and lets us easily debug issues that might arise.
+- **Meaningful questions**
+- The reports generated by this project (and the data fetched and
+processed to support them) seek to be meaningful. We hope this project
+will provide data and analysis that help inform discussions about the
+commons. The goal of this project is to help answer questions like:
+- How has the world's use of the commons changed over time?
+- How is the knowledge and culture of the commons distributed?
+- Who has access (and how much) to the commons?
+- What significant trends can be observed in the commons?
+- Which public domain dedication or licenses are the most popular?
+- What are the correlations between public domain dedication or licenses
+and region, language, domain/endeavor, etc.?
+- **Limitations of an API**
+- Some data sources provide APIs with query limits (daily or hourly, as
+stated in their documentation). These limits restrict how many requests
+can be made in the specified period of time. It is important to plan a
+query strategy and schedule fetch jobs to stay within the allowed
+limits.
+- **Headings of data in 1-fetch**
+- [Tool identifier][tool-identifier]: A unique identifier used to
+distinguish each Creative Commons legal tool within the dataset. This
+helps ensure consistency when tracking tools across different data
+sources.
+- [SPDX identifier][spdx-identifier]: A standardized identifier maintained
+by the Software Package Data Exchange (SPDX) project. It provides a
+consistent way to reference licenses in applications.
+2. **Process**: In this phase, the fetched data is transformed into a
+structured and standardized format for analysis. The data is then analyzed
+and categorized based on defined criteria to extract insights that answer
+the meaningful questions identified during the 1-fetch phase.
+3. **Report**: This phase focuses on presenting the results of the analysis.
+We generate graphs and summaries that clearly show trends, patterns, and
+distributions in the data. These reports help communicate key insights about
+the size, diversity, and characteristics of openly licensed and public
+domain works.
+
+[tool-identifier]: https://creativecommons.org/share-your-work/cclicenses/
+[spdx-identifier]: https://spdx.org/licenses/
+
+
+### Automation phases
+
+For automating these phases, the project uses Python scripts to fetch, process,
+and report data. GitHub Actions is used to automatically run these scripts on a
+defined schedule and on code updates. It handles script execution, manages
+dependencies, and ensures the workflow runs consistently.
+- **Script assumptions**
+- Execution schedule for each quarter:
+- 1-Fetch: first month, 1st half of second month
+- 2-Process: 2nd half of second month
+- 3-Report: third month
+- **Script requirements**
+- *Must be safe*
+- Scripts must not make any changes with default options
+- Easiest way to run script should also be the safest
+- Have options spelled out
+- *Must be timely*
+- Scripts should complete within a maximum of 45 minutes
+- Scripts shouldn't take longer than 3 minutes with default options
+- That way there is a quick way to see what happens when a script runs
+(that it executes without errors, etc.). Later, in production, it can
+be run with longer options.
+- *Must be idempotent*
+- [Idempotence - Wikipedia](https://en.wikipedia.org/wiki/Idempotence)
+- This applies to both the data fetched and the data stored. If the data
+changes randomly, we can't draw meaningful conclusions.
+- *Balanced use of third-party libraries*
+- Third-party libraries should be leveraged when they are:
+- API specific (google-api-python-client, internetarchive, etc.)
+- File formats
+- CSV: the format is well supported (rendered on GitHub, etc.), easy to use,
+and the data used by the project is simple enough to avoid any
+shortcomings.
+- YAML: prioritizes human readability which addresses the primary costs and
+risks associated with configuration files.
+
+
 ### Project structure
 
 Please note that in the directory tree below, all instances of `fetch`,
@@ -69,7 +143,6 @@ Quantifying/
 │ │ │ └── README.md # All generated reports are displayed in the README
 │ └── ...
 ├── dev/
-├── pre-automation/ # All Quantifying work prior to adding automation system
 ├── scripts/ # Run scripts for all phases
 │ ├── 1-fetch/
 │ ├── 2-process/
@@ -91,8 +164,7 @@ Quantifying/
 ```
 
 
-## Development
-
+## How to set up
 
 ### Prerequisites
 
@@ -155,6 +227,13 @@ When run this way, the shared library (`scripts/shared.py`) provides easy access
 to all of the necessary paths and all of the modules managed by pipenv are
 available.
 
+In order for scripts to be run directly (as shown above), the script must be
+executable. For more information on making files executable, please see:
+[File Permissions - Foundational technologies — Creative Commons Open
+Source][file-perms].
+
+[file-perms]: https://opensource.creativecommons.org/contributing-code/foundational-tech/#file-permissions
+
 
 ### Static analysis
 
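
The script requirements added to the README (safe with default options, idempotent output) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the project's scripts: the `--enable-save`, `--limit`, and `--output` options, the `write_csv_idempotent` helper, and the stand-in rows.

```python
import argparse
import csv
import os
import tempfile

# Stand-in for fetched data (hypothetical values, not real results).
STAND_IN_ROWS = [("CC0 1.0", 5), ("CC BY 4.0", 3), ("CC BY-SA 4.0", 2)]


def write_csv_idempotent(path, header, rows):
    """Write rows in sorted order via a temporary file.

    Sorting makes the output independent of fetch order, so rerunning
    the script on the same data yields a byte-identical file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(
            handle, quoting=csv.QUOTE_ALL, lineterminator="\n"
        )
        writer.writerow(header)
        writer.writerows(sorted(rows))
    # Replace the target in one step so a failed run leaves no partial file.
    os.replace(tmp_path, path)


def parse_arguments(argv=None):
    """Hypothetical options: the default invocation changes nothing."""
    parser = argparse.ArgumentParser(description="Illustrative fetch script")
    parser.add_argument("--limit", type=int, default=2,
                        help="number of rows to handle (small by default)")
    parser.add_argument("--output", default="example.csv",
                        help="CSV file to write when saving is enabled")
    parser.add_argument("--enable-save", action="store_true",
                        help="write results to disk (off by default)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_arguments(argv)
    rows = STAND_IN_ROWS[: args.limit]
    if not args.enable_save:
        # Safe default: report what would happen, write nothing.
        print(f"Dry run: {len(rows)} rows would be written to {args.output}")
        return rows
    write_csv_idempotent(args.output, ["TOOL_IDENTIFIER", "COUNT"], rows)
    return rows
```

Calling `main([])` only prints a summary, while `main(["--enable-save"])` writes the file. The small default for `--limit` mirrors the idea that a default run should finish quickly.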
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+"CC0_RECORDS","CC0_RECORDS_WITH_CC0_MEDIA","CC0_MEDIA","CC0_MEDIA_PERCENTAGE","TOTAL_OBJECTS"
+"14273329","5199915","4503016","36","15616799"
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+"UNIT","CC0_RECORDS","CC0_RECORDS_WITH_CC0_MEDIA","TOTAL_OBJECTS"
+"AAA","0","0","29735"
+"AAG","0","0","344"
+"ACM","251","247","2977"
+"ACMA","0","0","57"
+"CFCHFOLKLIFE","17544","0","18517"
+"CHNDM","58158","54590","201545"
+"FBR","1517","37","11248"
+"FSG","4720","4720","45588"
+"HAC","430","430","1437"
+"HMSG","449","448","13898"
+"HSFA","0","0","299"
+"NASM","1010","989","32325"
+"NMAAHC","22224","4465","22577"
+"NMAH","1316502","10548","1317248"
+"NMAI","237637","180","239307"
+"NMAfA","111","111","12477"
+"NMNHANTHRO","497734","0","497734"
+"NMNHBIRDS","635217","559038","635217"
+"NMNHBOTANY","4562256","3572487","4562256"
+"NMNHEDUCATION","6473","4090","6473"
+"NMNHENTO","731838","197223","731838"
+"NMNHFISHES","502585","10806","502585"
+"NMNHHERPS","615308","2345","615308"
+"NMNHINV","2003972","70094","2003972"
+"NMNHMAMMALS","626133","542046","626133"
+"NMNHMINSCI","465275","11311","465275"
+"NMNHPALEO","743533","94487","743533"
+"NPG","15446","14540","123566"
+"NPM","10814","8005","83710"
+"NZP","1061","1061","2086"
+"OCIO_DPO3D","108","17","146"
+"OFEO-SG","5509","3665","7295"
+"SAAM","13626","12891","188157"
+"SIA","35498","5477","48169"
+"SIL","1035579","13567","1039087"
+"SILAF","63416","0","63416"
+"SILNMAHTL","34577","0","34577"
+"SLA_SRO","104811","0","104811"
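
As a rough illustration of how a per-unit file like the one above might be read, the following Python sketch computes the share of CC0 records per unit. It assumes that `CC0_RECORDS` is meant to be compared against `TOTAL_OBJECTS`; that reading of the columns and the `cc0_share_by_unit` helper are assumptions, not something the commit states. The sample embeds three rows copied from the diff.

```python
import csv
import io

# Three rows copied from the per-unit CSV added in this commit.
SAMPLE = '''"UNIT","CC0_RECORDS","CC0_RECORDS_WITH_CC0_MEDIA","TOTAL_OBJECTS"
"AAA","0","0","29735"
"FSG","4720","4720","45588"
"NMNHBIRDS","635217","559038","635217"
'''


def cc0_share_by_unit(text):
    """Return {unit: percentage of TOTAL_OBJECTS counted as CC0_RECORDS}."""
    shares = {}
    for row in csv.DictReader(io.StringIO(text)):
        total = int(row["TOTAL_OBJECTS"])
        records = int(row["CC0_RECORDS"])
        # Guard against a unit with no objects to avoid dividing by zero.
        shares[row["UNIT"]] = 100 * records / total if total else 0.0
    return shares


for unit, share in cc0_share_by_unit(SAMPLE).items():
    print(f"{unit}: {share:.2f}%")
```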
