Commit ce648dc

Fixes #13
1 parent fa0ce8e commit ce648dc

12 files changed

Lines changed: 42 additions & 42 deletions

docs/search/search_index.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/section-2-HTML-based-scraping/index.html

Lines changed: 1 addition & 1 deletion
@@ -620,7 +620,7 @@ <h3 id="get-and-post-calls-to-retrieve-response">GET and POST calls to retrieve
 <p><strong>POST call</strong> - POST is used to send data in the URL request to either update details or request specific content from the web server. In a POST call, data is sent and then a response can be expected from the web server. An example would be to request content from a web server based on a particular selection from a drop-down menu. The selection option is upadted while also respective content is sent back.</p>
 <h3 id="scraping-a-webpage">Scraping a webpage<a class="headerlink" href="#scraping-a-webpage" title="Permanent link">&para;</a></h3>
 <hr />
-<p>Let us now scrape a <strong>list of the fotune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <code>https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018</code>.</p>
+<p>Let us now scrape a <strong>list of the fotune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <a href="https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018">this</a>.</p>
 <p><img alt="fortune 500" src="../images/fortune_500.png" /></p>
 <p>It can be seen on this website that the list contains the rank, company name and the website of the company. The whole content of this website can be received as a response when requested with the request library in Python</p>
 <table class="codehilitetable"><tr><td class="linenos"><div class="linenodiv"><pre> 1

docs/section-5-legal-and-ethical-considerations/index.html

Lines changed: 2 additions & 2 deletions
@@ -522,7 +522,7 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
 <p><strong>Don't download copies of documents that are clearly not public.</strong> For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.</p>
 </li>
 <li>
-<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).</p>
+<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly available web sites, can be illegal (e.g. in Australia).</p>
 </li>
 <li>
 <p><strong>Don't share downloaded content illegally.</strong> Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.</p>
@@ -541,7 +541,7 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
 <h3 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h3>
 <ul>
 <li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
-<li><a href="http://naelshiab.com/members-parliament-web-scraping/">This case study</a> is a great example of what can be done using web scraping and how to achieve it.</li>
+<li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced python library <code>scrapy</code>.</li>
 <li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about Linkedin data is a good read.</li>
 <li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
 <li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed indepentently. This lesson is heavily inspired by Software Carpentry.</li>

docs/sitemap.xml

Lines changed: 8 additions & 8 deletions
@@ -2,42 +2,42 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-0-brief-python-refresher/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-1-intro-to-web-scraping/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-2-HTML-based-scraping/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-3-API-based-scraping/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-4-wrangling-and-analysis/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-5-legal-and-ethical-considerations/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 <url>
 <loc>https://monashdatafluency.github.io/python-web-scraping/section-7-references/</loc>
-<lastmod>2020-07-28</lastmod>
+<lastmod>2020-08-15</lastmod>
 <changefreq>daily</changefreq>
 </url>
 </urlset>

docs/sitemap.xml.gz

0 Bytes
Binary file not shown.

markdowns/section-0-brief-python-refresher.md

Lines changed: 11 additions & 11 deletions
@@ -16,7 +16,7 @@ print("Hello world!!")
 ```
 
 Hello world!!
-
+
 
 4. **Shift-Enter** to run the contents of the cell.
 
@@ -37,7 +37,7 @@ print('Pandas version : {}'.format(pd.__version__))
 ```
 
 Pandas version : 1.0.1
-
+
 
 Suppose we wanted to create a dataframe as follows,
 
@@ -274,7 +274,7 @@ print(y["age"])
 ```
 
 30
-
+
 
 Lets take a look at `y` as follows,
 
@@ -288,7 +288,7 @@ print(y)
 "age": 30,
 "city": "New York"
 }
-
+
 
 We can obtain the exact same JSON string we defined earlier from a Python dictionary as follows,
 
@@ -311,7 +311,7 @@ print(y)
 ```
 
 {"name": "John", "age": 30, "city": "New York"}
-
+
 
 For better formatting we can indent the same as,
 
@@ -328,7 +328,7 @@ print(y)
 "age": 30,
 "city": "New York"
 }
-
+
 
 ### Regex
 ---
@@ -369,7 +369,7 @@ else:
 ```
 
 Found
-
+
 
 We can use `[0-9]` in the regular expression to identify any one number in the string.
 
@@ -417,7 +417,7 @@ print(re.search('[0-9][0-9][0-9]','hello world')) # matches nothing
 <re.Match object; span=(0, 3), match='012'>
 <re.Match object; span=(5, 8), match='567'>
 None
-
+
 
 As seen above, it matches the first occurance of three digits occuring together.
 
@@ -429,7 +429,7 @@ print(re.search('[a-z]*[0-9]*','hello123@@')) # matches hello123
 ```
 
 <re.Match object; span=(0, 8), match='hello123'>
-
+
 
 What if we just want to capture only the numbers? `Capture group` is the answer.
 
@@ -556,7 +556,7 @@ print(html)
 </body>
 </html>
 
-
+
 
 Now, if we are only interested in :
 - names i.e. the data inside the `<h1></h1>` tags, and
@@ -588,7 +588,7 @@ print(names, titles)
 ```
 
 ['Sam', 'Rob'] ['Physicist', 'Economist']
-
+
 
 ### From a web scraping perspective
 - `JSON` and `XML` are the most widely used formats to carry data all over the internet.
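
For readers following the regex hunks in this file, here is a minimal sketch of the capture-group idea the refresher introduces. It is illustrative only and not part of the commit. It reuses the `hello123@@` sample string from the hunk context; the variable names are assumptions.

```python
import re

text = "hello123@@"  # same sample string as in the refresher's context lines

# Without a group, group(0) is the whole match: letters followed by digits.
whole = re.search("[a-z]*[0-9]*", text).group(0)

# Parentheses mark a capture group; group(1) returns only the digits part.
digits = re.search("[a-z]*([0-9]*)", text).group(1)

print(whole)   # hello123
print(digits)  # 123
```

The pattern is unchanged apart from the parentheses, so the overall match stays the same and only the extracted part differs.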

markdowns/section-2-HTML-based-scraping.md

Lines changed: 3 additions & 3 deletions
@@ -121,7 +121,7 @@ There are mainly two types of requests which can be made to the web server. A GE
 ### Scraping a webpage
 ---
 
-Let us now scrape a **list of the fotune 500 companies for the year 2018**. The website from which the data is to be scraped is `https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018`.
+Let us now scrape a **list of the fotune 500 companies for the year 2018**. The website from which the data is to be scraped is [this](https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018).
 
 ![fortune 500](../images/fortune_500.png)
 
@@ -148,7 +148,7 @@ print('Content of the website\n', response.content[:2000])
 
 Content of the website
b'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema# ">\n <head>\n <meta charset="utf-8" />\n<script>dataLayer = [];dataLayer.push({"tag": "5914"});</script>\n<script>window.dataLayer = window.dataLayer || []; window.dataLayer.push({"drupalLanguage":"en","drupalCountry":"IN","siteName":"Zyxware Technologies","entityCreated":"1562300185","entityLangcode":"en","entityStatus":"1","entityUid":"1","entityUuid":"6fdfb477-ce5d-4081-9010-3afd9260cdf7","entityVid":"15541","entityName":"webmaster","entityType":"node","entityBundle":"story","entityId":"5914","entityTitle":"List of Fortune 500 companies and their websites (2018)","entityTaxonomy":{"vocabulary_2":"Business Insight, Fortune 500, Drupal Insight, Marketing Resources"},"userUid":0});</script>\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1488254-2"></script>\n<script>window.google_analytics_uacct = "UA-1488254-2";window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments)};gtag("js", new Date());window[\'GoogleAnalyticsObject\'] = \'ga\';\r\n window[\'ga\'] = window[\'ga\'] || function() {\r\n (window[\'ga\'].q = window[\'ga\'].q || []).push(arguments)\r\n };\r\nga("set", "dimension2", window.analytics_manager_node_age);\r\nga("set", "dimension3", window.analytics_manager_node_author);gtag("config", "UA-1488254-2", {"groups":"default","anonymize_ip":true,"page_path":location.pathname + location.search + 
location.hash,"link_attribution":true,"allow_ad_personalization_signals":false});</script>\n<script>(function(w,d,t,u,n,a,m){w[\'MauticTrackingObject\']=n;w[n]=w[n]||function(){(w[n].q=w[n].q||[]).push(arguments)},a=d.createElement(t),m=d.ge'
-
+
 
 This text when formatted looks like this,
 
@@ -306,7 +306,7 @@ print(all_values[2])
 <td>Exxon Mobil</td>
 <td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td>
 </tr>
-
+
 
 #### Challenge
 ---
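
The last hunk in this file shows part of one table row as printed by `print(all_values[2])`. Here is a minimal sketch of pulling the company name and website out of such a row. It uses only the standard-library `re` module, in the spirit of the refresher section, and the lesson itself may well use an HTML parser instead. The `row` string is assembled from the two cells visible in the hunk, and all variable names are assumptions. It is not part of the commit.

```python
import re

# Assumed row string, built from the two cells visible in the hunk context.
row = (
    "<tr>"
    "<td>Exxon Mobil</td>"
    '<td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td>'
    "</tr>"
)

# Non-greedy capture of everything between each <td> ... </td> pair.
cells = re.findall(r"<td>(.*?)</td>", row)

name = cells[0]
# The second cell wraps the URL in an anchor tag; capture the href value.
website = re.search(r'href="([^"]+)"', cells[1]).group(1)

print(name, website)
```

Regular expressions are brittle on real-world HTML, so this is only reasonable for small, regular fragments like the row above.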

markdowns/section-3-API-based-scraping.md

Lines changed: 7 additions & 7 deletions
@@ -75,7 +75,7 @@ print('wptools version : {}'.format(wptools.__version__)) # checking the install
 ```
 
 wptools version : 0.4.17
-
+
 
 Now let's load the data which we scrapped in the previous section as follows,
 
@@ -147,7 +147,7 @@ for i, j in enumerate(companies): # looping through the list of 20 company
 18. General Electric
 19. Walgreens Boots Alliance
 20. JPMorgan Chase
-
+
 
 ### Getting article names from wiki
 
@@ -250,7 +250,7 @@ for idx, company in enumerate(wiki_search):
 JPMorgan Chase, Chase Bank, 2012 JPMorgan Chase trading loss, JPMorgan Chase Tower (Houston), 270 Park Avenue, Chase Paymentech, 2014 JPMorgan Chase data breach, Bear Stearns, Jamie Dimon, JPMorgan Chase Building (Houston)
 
 
-
+
 
 Now let's get the most probable ones (the first suggestion) for each of the first 20 companies on the Fortune 500 list,
 
@@ -263,7 +263,7 @@ print(most_probable)
 ```
 
 [('Walmart', 'Walmart'), ('Exxon Mobil', 'ExxonMobil'), ('Berkshire Hathaway', 'Berkshire Hathaway'), ('Apple', 'Apple'), ('UnitedHealth Group', 'UnitedHealth Group'), ('McKesson', 'McKesson Corporation'), ('CVS Health', 'CVS Health'), ('Amazon.com', 'Amazon (company)'), ('AT&T', 'AT&T'), ('General Motors', 'General Motors'), ('Ford Motor', 'Ford Motor Company'), ('AmerisourceBergen', 'AmerisourceBergen'), ('Chevron', 'Chevron Corporation'), ('Cardinal Health', 'Cardinal Health'), ('Costco', 'Costco'), ('Verizon', 'Verizon Communications'), ('Kroger', 'Kroger'), ('General Electric', 'General Electric'), ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'), ('JPMorgan Chase', 'JPMorgan Chase')]
-
+
 
 We can notice that most of the wiki article titles make sense. However, **Apple** is quite ambiguous in this regard as it can indicate the fruit as well as the company. However we can see that the second suggestion returned by was **Apple Inc.**. Hence, we can manually replace it with **Apple Inc.** as follows,
 
@@ -274,7 +274,7 @@ print(companies) # final list of wikipedia article titles
 ```
 
 ['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'AmerisourceBergen', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon Communications', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase']
-
+
 
 ### Retrieving the infoboxes
 
@@ -303,7 +303,7 @@ page.get_parse() # parses the wikipedia article
 wikidata_url: https://www.wikidata.org/wiki/Q483551
 wikitext: <str(277438)> {{about|the retail chain|other uses}}{{p...
 }
-
+
 
 
 
@@ -681,7 +681,7 @@ for company in companies:
 wikidata_url: https://www.wikidata.org/wiki/Q192314
 wikitext: <str(117507)> {{About|JPMorgan Chase & Co|its main sub...
 }
-
+
 
 Let's take a look at the first instance in `wiki_data` i.e. **Walmart**,
 