Commit af3ed14

Merge branch 'main' into glam
2 parents 98c4d63 + 835f57f commit af3ed14

18 files changed

Lines changed: 547 additions & 299 deletions

.github/CODEOWNERS

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
# https://help.github.com/en/articles/about-code-owners
2-
* @creativecommons/sre
3-
* @creativecommons/ct-quantifying-core-committers
2+
# If you want to match two or more code owners with the same pattern, all the
3+
# code owners must be on the same line. If the code owners are not on the same
4+
# line, the pattern matches only the last mentioned code owner.
5+
* @creativecommons/technology @creativecommons/ct-quantifying-core-committers

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -142,4 +142,5 @@ ehthumbs.db
142142
Thumbs.db
143143

144144
# secrets
145+
.env
145146
query_secrets.py

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ verify_ssl = true
44
name = "pypi"
55

66
[packages]
7+
python-dotenv = "*"
78
flickrapi = "*"
89
internetarchive = "*"
910
jupyterlab = "*"

Pipfile.lock

Lines changed: 249 additions & 121 deletions
Generated file; diff not rendered by default.

README.md

Lines changed: 68 additions & 128 deletions
@@ -9,7 +9,7 @@ This project seeks to quantify the size and diversity of the commons--the
99
collection of works that are openly licensed or in the public domain.
1010

1111

12-
## Code of Conduct
12+
## Code of conduct
1313

1414
[`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md):
1515
> The Creative Commons team is committed to fostering a welcoming community.
@@ -27,174 +27,114 @@ collection of works that are openly licensed or in the public domain.
2727
See [`CONTRIBUTING.md`](CONTRIBUTING.md).
2828

2929

30-
3130
## Development
3231

3332

3433
### Prerequisites
3534

3635
This repository uses [pipenv][pipenvdocs] to manage the required Python
3736
modules:
38-
- Linux: [Installing Pipenv][pipenvinstall]
39-
- macOS:
40-
1. Install [Homebrew][homebrew]
41-
2. Install pipenv:
42-
```
37+
1. Install `pipenv`:
38+
- Linux: [Installing Pipenv][pipenvinstall]
39+
- macOS:
40+
1. Install [Homebrew][homebrew]
41+
2. Install pipenv:
42+
```shell
4343
brew install pipenv
4444
```
45+
- Windows: [Installing Pipenv][pipenvinstall]
46+
2. Create the Python virtual environment and install prerequisites using
47+
`pipenv`:
48+
```shell
49+
pipenv sync --dev
50+
```
4551

4652
[pipenvdocs]: https://pipenv.pypa.io/en/latest/
53+
[pipenvinstall]: https://pipenv.pypa.io/en/latest/installation/
4754
[homebrew]: https://brew.sh/
48-
[pipenvinstall]: https://pipenv.pypa.io/en/latest/install/#installing-pipenv
49-
50-
51-
### Tooling
52-
53-
- **[Python Guidelines — Creative Commons Open Source][ccospyguide]**
54-
- [Black][black]: the uncompromising Python code formatter
55-
- [flake8][flake8]: a python tool that glues together pep8, pyflakes, mccabe,
56-
and third-party plugins to check the style and quality of some python code.
57-
- [isort][isort]: A Python utility / library to sort imports.
58-
59-
[ccospyguide]: https://opensource.creativecommons.org/contributing-code/python-guidelines/
60-
[black]: https://github.com/psf/black
61-
[flake8]: https://gitlab.com/pycqa/flake8
62-
[isort]: https://pycqa.github.io/isort/
63-
64-
65-
## Data Sources
66-
67-
68-
### CC Legal Tools
69-
70-
- [`legal-tool-paths.txt`](google_custom_search/legal-tool-paths.txt)
71-
- A `.txt` provided by Timid Robot containing all legal tool paths. The data
72-
from Google Custom Search will only cover 50+ general, most significant
73-
categories of CC License for data collection quota constraint. As an
74-
additional note, the order of precedence of license the collected data's
75-
first column is sorted due to intermediate data analysis progress.
76-
- [add list of all current CC legal tool paths by TimidRobot · Pull Request
77-
#7 · creativecommons/quantifying][pr7]
7855

79-
[pr7]: https://github.com/creativecommons/quantifying/pull/7
8056

57+
### Running scripts that require client credentials
8158

82-
### Flickr
59+
To successfully run scripts that require client credentials, you will need to
60+
follow these steps:
61+
1. Copy the `env.example` file at the repository root to `.env` in the same
62+
directory:
63+
```shell
64+
cp env.example .env
65+
```
66+
2. Uncomment the variables in the `.env` file and assign values as needed. See
67+
[`sources.md`](sources.md) on how to get credentials:
68+
```
69+
GOOGLE_API_KEYS=your_api_key
70+
PSE_KEY=your_pse_key
71+
```
72+
3. Save the changes to the `.env` file.
73+
4. You should now be able to run scripts that require client credentials
74+
without any issues.
8375
84-
- The Flickr API exposes identifiers for users, photos, photosets and other
85-
uniquely identifiable objects.
86-
- The Flickr API consists of a set of callable methods, and some API endpoints.
87-
- For more detailed description, visit: [API documentation - Flickr
88-
Services](https://www.flickr.com/services/api/).
89-
- The `hs.csv` file is a sample CSV of pulled data. Ideally the script will
90-
generate final data CSVs.
91-
- Each license will have a CSV to save the data.
92-
- Due to memory limit, the license CSVs are not pushed into github.
9376
77+
### Static analysis
9478
95-
### Google Custom Search JSON API
79+
The [`dev/tools.sh`][tools-sh] helper script runs the static analysis tools
80+
(`black`, `flake8`, and `isort`):
81+
```shell
82+
./dev/tools.sh
83+
```
9684
97-
- The Custom Search JSON API allows user-defined detailed query and access
98-
towards related query data using a programmable search engine.
99-
- [Custom Search JSON API Reference | Programmable Search Engine | Google
100-
Developers][googlejsonapi]
101-
- [Method: cse.list | Custom Search JSON API | Google Developers][cselist]
102-
- [`google_countries.tsv`](google_custom_search/google_countries.txt)
103-
- Created by directly copy and pasting the `cr` parameter list from the
104-
following link into a `.tsv` file as there were no reliable algorithmic way
105-
for retrieving such data found in the process so far. The script itself
106-
will take care of the formatting and country-selection process.
107-
- [Country Collection Values | JSON API reference | Programmable Search
108-
Engine | Google Developers][googlecountry]
109-
- [`google_lang.txt`](google_custom_search/google_lang.txt)
110-
- Created by directly copy and pasting the `lr` parameter list from the
111-
following link into a `.txt` file as there were no reliable algorithmic way
112-
for retrieving such data found in the process so far. The script itself
113-
will take care of the data formatting and language-selection process.
114-
- [Parameter: lr | Method: cse.list | Custom Search JSON API | Google
115-
Developers][googlelang]
85+
It also accepts command-line arguments that name the files or
86+
directories to check:
87+
```shell
88+
./dev/tools.sh PATH/TO/MY/FILE.PY
89+
```
11690
117-
[googlejsonapi]: https://developers.google.com/custom-search/v1
118-
[cselist]: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list
119-
[googlecountry]: https://developers.google.com/custom-search/docs/json_api_reference#countryCollections
120-
[googlelang]: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list#body.QUERY_PARAMETERS.lr
91+
[tools-sh]: /dev/tools.sh
12192
12293
123-
### Internet Archive Python Interface
94+
### Resources
12495
125-
A python interface to archive.org to achieve API requests towards internet
126-
archive.
127-
- [`internetarchive.Search` - Internetarchive: A Python Interface to
128-
archive.org][iasearch]
129-
130-
[iasearch]: https://internetarchive.readthedocs.io/en/stable/internetarchive.html#internetarchive.Search
131-
132-
133-
### The Metropolitan Museum of Art Collection API
134-
135-
An API endpoint for receiving Metropolitan Muesum of Art Collection's
136-
CC-Licensed works.
137-
138-
[Latest Updates | The Metropolitan Museum of Art Collection API][metapi]:
139-
> The Metropolitan Museum of Art provides select datasets of information on
140-
> more than 470,000 artworks in its Collection for unrestricted commercial and
141-
> noncommercial use. To the extent possible under law, The Metropolitan Museum
142-
> of Art has waived all copyright and related or neighboring rights to this
143-
> dataset using the [Creative Commons Zero][cc-zero] license.
144-
145-
[metapi]: https://metmuseum.github.io/
146-
[cc-zero]: https://creativecommons.org/publicdomain/zero/1.0/
147-
148-
149-
### Vimeo API
150-
151-
The Vimeo API allows users to perform filtered, advanced search on Vimeo
152-
videos.
153-
- [Getting Started with the Vimeo API][vimeostart]
154-
- [Search for videos - Vimeo API Reference: Videos][vimeoapisearch]
96+
- **[Python Guidelines — Creative Commons Open Source][ccospyguide]**
97+
- [Black][black]: _the uncompromising Python code formatter_
98+
- [flake8][flake8]: _a python tool that glues together pep8, pyflakes, mccabe,
99+
and third-party plugins to check the style and quality of some python code._
100+
- [isort][isort]: _A Python utility / library to sort imports_
101+
- (It doesn't import any libraries; it only sorts and formats import statements.)
102+
- [pypa/pipenv][pipenv]: _Python Development Workflow for Humans._
155103

156-
[vimeostart]: https://developer.vimeo.com/api/guides/start
157-
[vimeoapisearch]: https://developer.vimeo.com/api/reference/videos#search_videos
104+
[ccospyguide]: https://opensource.creativecommons.org/contributing-code/python-guidelines/
105+
[black]: https://github.com/psf/black
106+
[flake8]: https://gitlab.com/pycqa/flake8
107+
[isort]: https://pycqa.github.io/isort/
108+
[pipenv]: https://github.com/pypa/pipenv
158109

159110

160-
### MediaWiki API
111+
### GitHub Actions
161112

162-
- The MediaWiki Action API is a web service that allows access to some wiki
163-
features like authentication, page operations, and search. It can provide
164-
meta information about the wiki and the logged-in user.
165-
- Example query: https://commons.wikimedia.org/w/api.php?action=query&cmtitle=Category:CC-BY&list=categorymembers
166-
- [`language-codes_csv.csv`](wikipedia/language-codes_csv.csv)
167-
- A list of language codes in ISO 639-1 Format to access statistics of each
168-
wikipedia main page across different languages. In the script, this file is
169-
named as `language-codes_csv` to minimize the amount of manual work
170-
required for running the script provided the same language encoding file.
171-
The user would have to rename the header and file name of their `.csv` ISO
172-
code list according to the concurrent file on Github if they would like to
173-
use some list other than the concurrent one.
174-
- This file that this script uses can be downloaded from:
175-
https://datahub.io/core/language-codes
113+
The [`.github/workflows/python_static_analysis.yml`][workflow-static-analysis]
114+
GitHub Actions workflow performs static analysis (`black`, `flake8`, and
115+
`isort`) on committed changes. The workflow is triggered automatically when you
116+
push changes to the main branch or open a pull request.
176117

118+
[workflow-static-analysis]: .github/workflows/python_static_analysis.yml
177119

178-
### Youtube Data API
179120

180-
An API from YouTube for platform users to upload videos, adjust video
181-
parameters, and obtain search results.
182-
- [Search: list | YouTube Data API | Google Developers][youtubeapi]
121+
## Data sources
183122

184-
[youtubeapi]: https://developers.google.com/youtube/v3/docs/search/list
123+
See [`sources.md`](sources.md) for details on each data source.
185124

186125

187126
## History
188127

189128
For information on past efforts, see [`history.md`](history.md).
190129

191130

192-
## Copying & License
131+
## Copying & license
193132

194133

195134
### Code
196135

197-
[`LICENSE`](LICENSE): the code within this repository is licensed under the Expat/[MIT][mit] license.
136+
[`LICENSE`](LICENSE): the code within this repository is licensed under the
137+
Expat/[MIT][mit] license.
198138

199139
[mit]: http://www.opensource.org/licenses/MIT "The MIT License | Open Source Initiative"
200140

@@ -219,4 +159,4 @@ The documentation within the project is licensed under a [Creative Commons
219159
Attribution 4.0 International License][cc-by].
220160

221161
[cc-by-png]: https://licensebuttons.net/l/by/4.0/88x31.png#floatleft "CC BY 4.0 license button"
222-
[cc-by]: https://creativecommons.org/licenses/by/4.0/ "Creative Commons Attribution 4.0 International License"
162+
[cc-by]: https://creativecommons.org/licenses/by/4.0/ "Creative Commons Attribution 4.0 International License"
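
The scripts changed in this commit build the `.env` path from the parent of the script's own directory, which suggests the file is expected one level above the script (for example, at the repository root). A minimal sketch of that resolution follows; the function name `env_path_for` and the example path are hypothetical and not part of the commit:

```python
import os


def env_path_for(script_file):
    """Return the .env path a script would load, mirroring the diff's logic.

    Hypothetical helper for illustration only.
    """
    # Directory that contains the script (called CWD in the commit).
    cwd = os.path.dirname(os.path.abspath(script_file))
    # The .env file is looked up in the parent of that directory.
    return os.path.join(os.path.dirname(cwd), ".env")


# Example with a made-up script location.
print(env_path_for("/repo/flickr/photos.py"))
```

If this reading is right, a `.env` file placed next to an individual script would not be picked up by these scripts.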

deviantart/deviantart_scratcher.py

Lines changed: 7 additions & 4 deletions
@@ -12,19 +12,22 @@
1212

1313
# Third-party
1414
import pandas as pd
15-
import query_secrets
1615
import requests
16+
from dotenv import load_dotenv
1717
from requests.adapters import HTTPAdapter
1818
from urllib3.util.retry import Retry
1919

20+
CWD = os.path.dirname(os.path.abspath(__file__))
21+
dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
22+
load_dotenv(dotenv_path)
23+
2024
today = dt.datetime.today()
21-
API_KEYS = query_secrets.API_KEYS
25+
API_KEYS = os.getenv("GOOGLE_API_KEYS").split(",")
2226
API_KEYS_IND = 0
23-
CWD = os.path.dirname(os.path.abspath(__file__))
2427
DATA_WRITE_FILE = (
2528
f"{CWD}" f"/data_deviantart_{today.year}_{today.month}_{today.day}.csv"
2629
)
27-
PSE_KEY = query_secrets.PSE_KEY
30+
PSE_KEY = os.getenv("PSE_KEY")
2831

2932

3033
def get_license_list():
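
The added `os.getenv("GOOGLE_API_KEYS").split(",")` line raises `AttributeError` when the variable is unset, and it may keep surrounding whitespace for a value written as `key1, key2` (the form shown in `env.example`). A minimal sketch of a more defensive parse, assuming a hypothetical helper named `parse_api_keys` that is not part of this commit:

```python
def parse_api_keys(raw):
    """Split a comma-separated string of keys into a clean list.

    Hypothetical helper (not part of this commit).
    """
    if raw is None:
        # os.getenv() returns None when the variable is not defined.
        raise RuntimeError("GOOGLE_API_KEYS is not set; see env.example")
    # Strip whitespace around each key and drop empty entries.
    keys = [key.strip() for key in raw.split(",") if key.strip()]
    if not keys:
        raise RuntimeError("GOOGLE_API_KEYS is set but contains no keys")
    return keys


# Example with a literal value in the same shape as env.example.
API_KEYS = parse_api_keys("key1, key2")
print(API_KEYS)
```

In the script this could be called as `parse_api_keys(os.getenv("GOOGLE_API_KEYS"))` after `load_dotenv(dotenv_path)`.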

deviantart/query_secrets.example.py

Lines changed: 0 additions & 8 deletions
This file was deleted.

env.example

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
## photos.py & photos_detail.py
2+
# "The flickr developer guide: https://www.flickr.com/services/developer/"
3+
4+
# FLICKR_API_KEY =
5+
# FLICKR_API_SECRET =
6+
7+
8+
## deviantart_scratcher.py & google_scratcher.py
9+
# "Custom Search JSON API requires the use of an API key. An API key is a way
10+
# to identify your client to Google."
11+
# https://developers.google.com/custom-search/v1/introduction
12+
13+
# GOOGLE_API_KEYS = key1, key2
14+
15+
# "The identifier of an engine created using the Programmable Search Engine
16+
# Control Panel [https://programmablesearchengine.google.com/about/]"
17+
# https://developers.google.com/custom-search/v1/reference/rest/v1/Search
18+
19+
# PSE_KEY =
20+
21+
22+
## vimeo_scratcher.py
23+
# "Before we set you loose on the API, we ask that you provide a little
24+
# information about your app. An app in this sense can be a full-featured
25+
# mobile application, a dynamic web page, or a three-line script. If it's
26+
# making API calls, it's an app."
27+
# https://developer.vimeo.com/api/guides/start#register-your-app
28+
29+
# VIMEO_ACCESS_TOKEN =
30+
# VIMEO_CLIENT_ID =
31+
32+
33+
## youtube_scratcher.py
34+
# "Every request must either specify an API key (with the key parameter) [...].
35+
# Your API key is available in the Developer Console's API Access pane
36+
# [https://console.developers.google.com/] for your project."
37+
# https://developers.google.com/youtube/v3/docs
38+
39+
# YOUTUBE_API_KEY =
40+
41+

flickr/data_cleaning.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ def save_new_data(
5757

5858

5959
def main():
60-
drop_empty_column("final.csv", "dataset/cleaned_license10.csv")
60+
drop_empty_column("hs.csv", "dataset/cleaned_license10.csv")
6161
drop_duplicate_id(
6262
"dataset/cleaned_license10.csv", "dataset/cleaned_license10.csv"
6363
)

flickr/photos.py

Lines changed: 7 additions & 2 deletions
@@ -1,19 +1,24 @@
11
# Standard library
22
import json
3+
import os
34
import os.path
45
import sys
56
import traceback
67

78
# Third-party
89
import flickrapi
9-
import query_secrets
10+
from dotenv import load_dotenv
1011

1112
CWD = os.path.dirname(os.path.abspath(__file__))
13+
dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
14+
load_dotenv(dotenv_path)
1215

1316

1417
def main():
1518
flickr = flickrapi.FlickrAPI(
16-
query_secrets.api_key, query_secrets.api_secret, format="json"
19+
os.getenv("FLICKR_API_KEY"),
20+
os.getenv("FLICKR_API_SECRET"),
21+
format="json",
1722
)
1823

1924
# use search method to pull general photo info under each cc license data
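
Because `os.getenv` returns `None` for an unset variable, a missing `.env` entry would only surface once the Flickr client is used. One possible way to fail earlier with a clearer message is sketched below; `require_env` is a hypothetical helper, not something this commit adds, and the values are placeholders:

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, in order.

    Hypothetical helper (not part of this commit): raises a single error
    that lists every missing or empty variable.
    """
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return [os.environ[name] for name in names]


# Placeholder values standing in for a loaded .env file.
os.environ.setdefault("FLICKR_API_KEY", "example-key")
os.environ.setdefault("FLICKR_API_SECRET", "example-secret")
api_key, api_secret = require_env("FLICKR_API_KEY", "FLICKR_API_SECRET")
```

The returned values could then be passed to `flickrapi.FlickrAPI(api_key, api_secret, format="json")` as in the diff above.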
