Commit 7f102d7

Merge branch 'main' into license-classify-modeling
2 parents 5c315fb + 3488540 commit 7f102d7

16 files changed

Lines changed: 452 additions & 286 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -142,4 +142,5 @@ ehthumbs.db
 Thumbs.db

 # secrets
+.env
 query_secrets.py

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ verify_ssl = true
 name = "pypi"

 [packages]
+python-dotenv = "*"
 flickrapi = "*"
 internetarchive = "*"
 jupyterlab = "*"

Pipfile.lock

Lines changed: 249 additions & 121 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 18 additions & 118 deletions
@@ -48,6 +48,23 @@ modules:
 [pipenvinstall]: https://pipenv.pypa.io/en/latest/install/#installing-pipenv


+### Running Scripts that Require Client Credentials
+
+To successfully run scripts that require client credentials, you will need to follow these steps:
+1. Copy the contents of the `env.example` file in the script's directory to `.env`:
+```
+cp env.example .env
+```
+2. Uncomment the variables in the `.env` file and assign values as needed. See [`sources.md`](sources.md) on how to get credentials:
+```
+GOOGLE_API_KEYS=your_api_key
+PSE_KEY=your_pse_key
+```
+3. Save the changes to the `.env` file.
+
+4. You should now be able to run scripts that require client credentials without any issues.
+
+
 ### Tooling

 - **[Python Guidelines — Creative Commons Open Source][ccospyguide]**
@@ -64,124 +81,7 @@ modules:

 ## Data Sources

-
-### CC Legal Tools
-
-- [`legal-tool-paths.txt`](google_custom_search/legal-tool-paths.txt)
-- A `.txt` provided by Timid Robot containing all legal tool paths. The data
-from Google Custom Search will only cover 50+ general, most significant
-categories of CC License for data collection quota constraint. As an
-additional note, the order of precedence of license the collected data's
-first column is sorted due to intermediate data analysis progress.
-- [add list of all current CC legal tool paths by TimidRobot · Pull Request
-#7 · creativecommons/quantifying][pr7]
-
-[pr7]: https://github.com/creativecommons/quantifying/pull/7
-
-
-### Flickr
-
-- The Flickr API exposes identifiers for users, photos, photosets and other
-uniquely identifiable objects.
-- The Flickr API consists of a set of callable methods, and some API endpoints.
-- For more detailed description, visit: [API documentation - Flickr
-Services](https://www.flickr.com/services/api/).
-- The `hs.csv` file is a sample CSV of pulled data. Ideally the script will
-generate final data CSVs.
-- Each license will have a CSV to save the data.
-- Due to memory limit, the license CSVs are not pushed into github.
-
-
-### Google Custom Search JSON API
-
-- The Custom Search JSON API allows user-defined detailed query and access
-towards related query data using a programmable search engine.
-- [Custom Search JSON API Reference | Programmable Search Engine | Google
-Developers][googlejsonapi]
-- [Method: cse.list | Custom Search JSON API | Google Developers][cselist]
-- [`google_countries.tsv`](google_custom_search/google_countries.txt)
-- Created by directly copy and pasting the `cr` parameter list from the
-following link into a `.tsv` file as there were no reliable algorithmic way
-for retrieving such data found in the process so far. The script itself
-will take care of the formatting and country-selection process.
-- [Country Collection Values | JSON API reference | Programmable Search
-Engine | Google Developers][googlecountry]
-- [`google_lang.txt`](google_custom_search/google_lang.txt)
-- Created by directly copy and pasting the `lr` parameter list from the
-following link into a `.txt` file as there were no reliable algorithmic way
-for retrieving such data found in the process so far. The script itself
-will take care of the data formatting and language-selection process.
-- [Parameter: lr | Method: cse.list | Custom Search JSON API | Google
-Developers][googlelang]
-
-[googlejsonapi]: https://developers.google.com/custom-search/v1
-[cselist]: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list
-[googlecountry]: https://developers.google.com/custom-search/docs/json_api_reference#countryCollections
-[googlelang]: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list#body.QUERY_PARAMETERS.lr
-
-
-### Internet Archive Python Interface
-
-A python interface to archive.org to achieve API requests towards internet
-archive.
-- [`internetarchive.Search` - Internetarchive: A Python Interface to
-archive.org][iasearch]
-
-[iasearch]: https://internetarchive.readthedocs.io/en/stable/internetarchive.html#internetarchive.Search
-
-
-### The Metropolitan Museum of Art Collection API
-
-An API endpoint for receiving Metropolitan Muesum of Art Collection's
-CC-Licensed works.
-
-[Latest Updates | The Metropolitan Museum of Art Collection API][metapi]:
-> The Metropolitan Museum of Art provides select datasets of information on
-> more than 470,000 artworks in its Collection for unrestricted commercial and
-> noncommercial use. To the extent possible under law, The Metropolitan Museum
-> of Art has waived all copyright and related or neighboring rights to this
-> dataset using the [Creative Commons Zero][cc-zero] license.
-
-[metapi]: https://metmuseum.github.io/
-[cc-zero]: https://creativecommons.org/publicdomain/zero/1.0/
-
-
-### Vimeo API
-
-The Vimeo API allows users to perform filtered, advanced search on Vimeo
-videos.
-- [Getting Started with the Vimeo API][vimeostart]
-- [Search for videos - Vimeo API Reference: Videos][vimeoapisearch]
-
-[vimeostart]: https://developer.vimeo.com/api/guides/start
-[vimeoapisearch]: https://developer.vimeo.com/api/reference/videos#search_videos
-
-
-### MediaWiki API
-
-- The MediaWiki Action API is a web service that allows access to some wiki
-features like authentication, page operations, and search. It can provide
-meta information about the wiki and the logged-in user.
-- Example query: https://commons.wikimedia.org/w/api.php?action=query&cmtitle=Category:CC-BY&list=categorymembers
-- [`language-codes_csv.csv`](wikipedia/language-codes_csv.csv)
-- A list of language codes in ISO 639-1 Format to access statistics of each
-wikipedia main page across different languages. In the script, this file is
-named as `language-codes_csv` to minimize the amount of manual work
-required for running the script provided the same language encoding file.
-The user would have to rename the header and file name of their `.csv` ISO
-code list according to the concurrent file on Github if they would like to
-use some list other than the concurrent one.
-- This file that this script uses can be downloaded from:
-https://datahub.io/core/language-codes
-
-
-### Youtube Data API
-
-An API from YouTube for platform users to upload videos, adjust video
-parameters, and obtain search results.
-- [Search: list | YouTube Data API | Google Developers][youtubeapi]
-
-[youtubeapi]: https://developers.google.com/youtube/v3/docs/search/list
+Kindly visit the [sources.md](sources.md) file for it.


 ## History

deviantart/deviantart_scratcher.py

Lines changed: 7 additions & 4 deletions
@@ -12,19 +12,22 @@

 # Third-party
 import pandas as pd
-import query_secrets
 import requests
+from dotenv import load_dotenv
 from requests.adapters import HTTPAdapter
 from urllib3.util.retry import Retry

+CWD = os.path.dirname(os.path.abspath(__file__))
+dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
+load_dotenv(dotenv_path)
+
 today = dt.datetime.today()
-API_KEYS = query_secrets.API_KEYS
+API_KEYS = os.getenv("GOOGLE_API_KEYS").split(",")
 API_KEYS_IND = 0
-CWD = os.path.dirname(os.path.abspath(__file__))
 DATA_WRITE_FILE = (
     f"{CWD}" f"/data_deviantart_{today.year}_{today.month}_{today.day}.csv"
 )
-PSE_KEY = query_secrets.PSE_KEY
+PSE_KEY = os.getenv("PSE_KEY")


 def get_license_list():
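
A note on the added `os.getenv("GOOGLE_API_KEYS").split(",")` line: `os.getenv` returns `None` when the variable is unset, so the call fails with an `AttributeError`, and a value written like the `key1, key2` sample in `env.example` may leave a leading space on the second key. The following is a minimal sketch of a more defensive parse. `parse_api_keys` is a hypothetical helper introduced here for illustration and is not part of this commit.

```python
import os


def parse_api_keys(raw):
    """Split a comma-separated string into a list of stripped, non-empty keys.

    Hypothetical helper for illustration; not part of the commit.
    """
    if raw is None:
        raise RuntimeError("GOOGLE_API_KEYS is not set; check the .env file")
    keys = [key.strip() for key in raw.split(",") if key.strip()]
    if not keys:
        raise RuntimeError("GOOGLE_API_KEYS is set but contains no keys")
    return keys


# Possible use in place of the constant defined in the script (assumption):
# API_KEYS = parse_api_keys(os.getenv("GOOGLE_API_KEYS"))
print(parse_api_keys("key1, key2"))  # prints ['key1', 'key2']
```

Failing early with a descriptive message is one option; the script could instead fall back to an empty key list, depending on how the maintainers want missing credentials handled.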

deviantart/query_secrets.example.py

Lines changed: 0 additions & 8 deletions
This file was deleted.

env.example

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+## photos.py & photos_detail.py
+# "The flickr developer guide: https://www.flickr.com/services/developer/"
+
+# FLICKR_API_KEY =
+# FLICKR_API_SECRET =
+
+
+## deviantart_scratcher.py & google_scratcher.py
+# "Custom Search JSON API requires the use of an API key. An API key is a way
+# to identify your client to Google."
+# https://developers.google.com/custom-search/v1/introduction
+
+# GOOGLE_API_KEYS = key1, key2
+
+# "The identifier of an engine created using the Programmable Search Engine
+# Control Panel [https://programmablesearchengine.google.com/about/]"
+# https://developers.google.com/custom-search/v1/reference/rest/v1/Search
+
+# PSE_KEY =
+
+
+## vimeo_scratcher.py
+# "Before we set you loose on the API, we ask that you provide a little
+# information about your app. An app in this sense can be a full-featured
+# mobile application, a dynamic web page, or a three-line script. If it's
+# making API calls, it's an app."
+# https://developer.vimeo.com/api/guides/start#register-your-app
+
+# VIMEO_ACCESS_TOKEN =
+# VIMEO_CLIENT_ID =
+
+
+## youtube_scratcher.py
+# "Every request must either specify an API key (with the key parameter) [...].
+# Your API key is available in the Developer Console's API Access pane
+# [https://console.developers.google.com/] for your project."
+# https://developers.google.com/youtube/v3/docs
+
+# YOUTUBE_API_KEY =
+
+
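
Every variable in `env.example` is commented out, so a script run before the file is edited sees none of them. As a rough sketch, a script could report which variables are still missing before making any API call. The variable names come from `env.example`; the `REQUIRED` grouping and the `missing_variables` helper are assumptions made for this illustration and do not exist in the repository.

```python
import os

# Variable names taken from env.example; the grouping by service is an
# assumption made for this illustration.
REQUIRED = {
    "flickr": ["FLICKR_API_KEY", "FLICKR_API_SECRET"],
    "google": ["GOOGLE_API_KEYS", "PSE_KEY"],
}


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
sample_env = {"FLICKR_API_KEY": "abc", "FLICKR_API_SECRET": ""}
print(missing_variables(REQUIRED["flickr"], sample_env))
# prints ['FLICKR_API_SECRET']
```

Passing the environment as a parameter keeps the check easy to test without touching `os.environ`.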

flickr/photos.py

Lines changed: 7 additions & 2 deletions
@@ -1,19 +1,24 @@
 # Standard library
 import json
+import os
 import os.path
 import sys
 import traceback

 # Third-party
 import flickrapi
-import query_secrets
+from dotenv import load_dotenv

 CWD = os.path.dirname(os.path.abspath(__file__))
+dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
+load_dotenv(dotenv_path)


 def main():
     flickr = flickrapi.FlickrAPI(
-        query_secrets.api_key, query_secrets.api_secret, format="json"
+        os.getenv("FLICKR_API_KEY"),
+        os.getenv("FLICKR_API_SECRET"),
+        format="json",
     )

     # use search method to pull general photo info under each cc license data
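
The `dotenv_path` expression added in this commit points one directory above the script's own directory, which appears to be the repository root where `env.example` lives, rather than the script's directory mentioned in the README text. A small sketch of that path arithmetic follows. `dotenv_path_for` is a hypothetical helper and `/repo` is a made-up checkout location, both used only for illustration.

```python
import os


def dotenv_path_for(script_path):
    """Return the .env path one directory above the script's own directory.

    Mirrors the CWD / dotenv_path expressions in the diff; hypothetical helper.
    """
    cwd = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(os.path.dirname(cwd), ".env")


# Made-up checkout location, used only for illustration.
path = dotenv_path_for("/repo/flickr/photos.py")
print(os.path.basename(os.path.dirname(path)), os.path.basename(path))
# prints: repo .env
```

If that reading is right, the `cp env.example .env` step would be run from the repository root.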

flickr/photos_detail.py

Lines changed: 8 additions & 2 deletions
@@ -9,6 +9,7 @@

 # Standard library
 import json
+import os
 import os.path
 import sys
 import time
@@ -17,9 +18,12 @@
 # Third-party
 import flickrapi
 import pandas as pd
-import query_secrets
+from dotenv import load_dotenv

 CWD = os.path.dirname(os.path.abspath(__file__))
+dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
+load_dotenv(dotenv_path)
+
 RETRIES = 0


@@ -149,7 +153,9 @@ def main():
     hs_csv_path = os.path.join(CWD, "hs.csv")

     flickr = flickrapi.FlickrAPI(
-        query_secrets.api_key, query_secrets.api_secret, format="json"
+        os.getenv("FLICKR_API_KEY"),
+        os.getenv("FLICKR_API_SECRET"),
+        format="json",
     )
     # below is the cc licenses list
     license_list = [1, 2, 3, 4, 5, 6, 9, 10]

google_custom_search/google_scratcher.py

Lines changed: 7 additions & 4 deletions
@@ -12,15 +12,18 @@

 # Third-party
 import pandas as pd
-import query_secrets
 import requests
+from dotenv import load_dotenv
 from requests.adapters import HTTPAdapter
 from urllib3.util.retry import Retry

+CWD = os.path.dirname(os.path.abspath(__file__))
+dotenv_path = os.path.join(os.path.dirname(CWD), ".env")
+load_dotenv(dotenv_path)
+
 today = dt.datetime.today()
-API_KEYS = query_secrets.API_KEYS
+API_KEYS = os.getenv("GOOGLE_API_KEYS").split(",")
 API_KEYS_IND = 0
-CWD = os.path.dirname(os.path.abspath(__file__))
 DATA_WRITE_FILE = (
     f"{CWD}"
     f"/data_google_custom_search_{today.year}_{today.month}_{today.day}.csv"
@@ -36,7 +39,7 @@
     f"{today.year}_{today.month}_{today.day}.csv"
 )
 SEARCH_HALFYEAR_SPAN = 20
-PSE_KEY = query_secrets.PSE_KEY
+PSE_KEY = os.getenv("PSE_KEY")


 def get_license_list():
