MonashDataFluency
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
Lines changed: 1 addition & 1 deletion
diff --git a/docs/section-2-HTML-based-scraping/index.html b/docs/section-2-HTML-based-scraping/index.html
Lines changed: 1 addition & 1 deletion
diff --git a/docs/section-5-legal-and-ethical-considerations/index.html b/docs/section-5-legal-and-ethical-considerations/index.html
Lines changed: 2 additions & 2 deletions
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
Lines changed: 8 additions & 8 deletions
diff --git a/docs/sitemap.xml.gz b/docs/sitemap.xml.gz
0 Bytes
diff --git a/markdowns/section-0-brief-python-refresher.md b/markdowns/section-0-brief-python-refresher.md
Lines changed: 11 additions & 11 deletions
diff --git a/markdowns/section-2-HTML-based-scraping.md b/markdowns/section-2-HTML-based-scraping.md
Lines changed: 3 additions & 3 deletions
diff --git a/markdowns/section-3-API-based-scraping.md b/markdowns/section-3-API-based-scraping.md
Lines changed: 7 additions & 7 deletions
@@ -620,7 +620,7 @@ <h3 id="get-and-post-calls-to-retrieve-response">GET and POST calls to retrieve
 <p><strong>POST call</strong> - POST is used to send data in the request body to either update details or request specific content from the web server. In a POST call, data is sent and a response can then be expected from the web server. An example would be requesting content from a web server based on a particular selection from a drop-down menu. The selected option is updated and the corresponding content is sent back.</p>
 <h3 id="scraping-a-webpage">Scraping a webpage<a class="headerlink" href="#scraping-a-webpage" title="Permanent link">&para;</a></h3>
 <hr />
-<p>Let us now scrape a <strong>list of the fotune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <code>https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018</code>.</p>
+<p>Let us now scrape a <strong>list of the Fortune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <a href="https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018">this</a>.</p>
 <p><img alt="fortune 500" src="../images/fortune_500.png" /></p>
 <p>It can be seen on this website that the list contains the rank, company name and the website of the company. The whole content of this website can be received as a response when requested with the <code>requests</code> library in Python.</p>
 <table class="codehilitetable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
 
@@ -522,7 +522,7 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
 <p><strong>Don't download copies of documents that are clearly not public.</strong> For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.</p>
 </li>
 <li>
-<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).</p>
+<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly available web sites, can be illegal (e.g. in Australia).</p>
 </li>
 <li>
 <p><strong>Don't share downloaded content illegally.</strong> Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.</p>
@@ -541,7 +541,7 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
 <h3 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h3>
 <ul>
 <li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
-<li><a href="http://naelshiab.com/members-parliament-web-scraping/">This case study</a> is a great example of what can be done using web scraping and how to achieve it.</li>
+<li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, <code>scrapy</code>.</li>
 <li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about LinkedIn data is a good read.</li>
 <li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
 <li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.</li>
 
@@ -2,42 +2,42 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-0-brief-python-refresher/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-1-intro-to-web-scraping/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-2-HTML-based-scraping/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-3-API-based-scraping/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-4-wrangling-and-analysis/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-5-legal-and-ethical-considerations/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-7-references/</loc>
-     <lastmod>2020-07-28</lastmod>
+     <lastmod>2020-08-15</lastmod>
      <changefreq>daily</changefreq>
     </url>
 </urlset>
@@ -16,7 +16,7 @@ print("Hello world!!")
 ```
 
     Hello world!!
-
+    
 
 4. **Shift-Enter** to run the contents of the cell.
 
@@ -37,7 +37,7 @@ print('Pandas version : {}'.format(pd.__version__))
 ```
 
     Pandas version : 1.0.1
-
+    
 
 Suppose we wanted to create a dataframe as follows,
 
@@ -274,7 +274,7 @@ print(y["age"])
 ```
 
     30
-
+    
 
 Let's take a look at `y` as follows,
 
@@ -288,7 +288,7 @@ print(y)
         "age": 30,
         "city": "New York"
     }
-
+    
 
 We can obtain the exact same JSON string we defined earlier from a Python dictionary as follows,
 
@@ -311,7 +311,7 @@ print(y)
 ```
 
     {"name": "John", "age": 30, "city": "New York"}
-
+    
 
 For better formatting we can indent the same as,
 
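The conversion described in the lines above can be sketched with the standard `json` module. This is a minimal illustration, not the tutorial's own cell: the variable names `person`, `compact` and `pretty` are assumptions, and the record simply mirrors the example output shown in this hunk.

```python
import json

# Hypothetical record mirroring the example output shown in this hunk.
person = {"name": "John", "age": 30, "city": "New York"}

compact = json.dumps(person)            # single-line JSON string
pretty = json.dumps(person, indent=4)   # same data, indented by 4 spaces

print(compact)
print(pretty)

# Round trip: parsing either string gives back the original dictionary.
assert json.loads(compact) == json.loads(pretty) == person
```

Both strings carry the same data; `indent` only changes the whitespace, which is why the round trip through `json.loads` yields equal dictionaries.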
@@ -328,7 +328,7 @@ print(y)
         "age": 30,
         "city": "New York"
     }
-
+    
 
 ### Regex
 ---
@@ -369,7 +369,7 @@ else:
 ```
 
     Found
-
+    
 
 We can use `[0-9]` in the regular expression to identify any one number in the string.
 
@@ -417,7 +417,7 @@ print(re.search('[0-9][0-9][0-9]','hello world')) # matches nothing
     <re.Match object; span=(0, 3), match='012'>
     <re.Match object; span=(5, 8), match='567'>
     None
-
+    
 
 As seen above, it matches the first occurrence of three digits occurring together.
 
@@ -429,7 +429,7 @@ print(re.search('[a-z]*[0-9]*','hello123@@')) # matches hello123
 ```
 
     <re.Match object; span=(0, 8), match='hello123'>
-
+    
 
 What if we want to capture only the numbers? A `capture group` is the answer.
 
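As a minimal sketch of the capture-group idea, the pattern from the previous example can be given parentheses around the digit part. The variable names here are assumptions chosen for illustration; the tutorial's own example may differ.

```python
import re

# Same pattern as above, with parentheses around the digits to form a capture group.
match = re.search(r'[a-z]*([0-9]*)', 'hello123@@')

whole = match.group(0)   # the full match, letters and digits together
digits = match.group(1)  # only the text captured by the first group

print(whole)   # hello123
print(digits)  # 123
```

`group(0)` always refers to the whole match, while `group(1)` refers to the first parenthesised part of the pattern.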
@@ -556,7 +556,7 @@ print(html)
             </body>
             </html> 
             
-
+    
 
 Now, if we are only interested in : 
 - names i.e. the data inside the `<h1></h1>` tags, and
@@ -588,7 +588,7 @@ print(names, titles)
 ```
 
     ['Sam', 'Rob'] ['Physicist', 'Economist']
-
+    
 
 ### From a web scraping perspective
 - `JSON` and `XML` are the most widely used formats to carry data all over the internet.
 
@@ -121,7 +121,7 @@ There are mainly two types of requests which can be made to the web server. A GE
 ### Scraping a webpage
 ---
 
-Let us now scrape a **list of the fotune 500 companies for the year 2018**. The website from which the data is to be scraped is `https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018`.
+Let us now scrape a **list of the Fortune 500 companies for the year 2018**. The website from which the data is to be scraped is [this](https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018).
 
 ![fortune 500](../images/fortune_500.png)
 
@@ -148,7 +148,7 @@ print('Content of the website\n', response.content[:2000])
     
     Content of the website
      b'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n  <head>\n    <meta charset="utf-8" />\n<script>dataLayer = [];dataLayer.push({"tag": "5914"});</script>\n<script>window.dataLayer = window.dataLayer || []; window.dataLayer.push({"drupalLanguage":"en","drupalCountry":"IN","siteName":"Zyxware Technologies","entityCreated":"1562300185","entityLangcode":"en","entityStatus":"1","entityUid":"1","entityUuid":"6fdfb477-ce5d-4081-9010-3afd9260cdf7","entityVid":"15541","entityName":"webmaster","entityType":"node","entityBundle":"story","entityId":"5914","entityTitle":"List of Fortune 500 companies and their websites (2018)","entityTaxonomy":{"vocabulary_2":"Business Insight, Fortune 500, Drupal Insight, Marketing Resources"},"userUid":0});</script>\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1488254-2"></script>\n<script>window.google_analytics_uacct = "UA-1488254-2";window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments)};gtag("js", new Date());window[\'GoogleAnalyticsObject\'] = \'ga\';\r\n  window[\'ga\'] = window[\'ga\'] || function() {\r\n    (window[\'ga\'].q = window[\'ga\'].q || []).push(arguments)\r\n  };\r\nga("set", "dimension2", window.analytics_manager_node_age);\r\nga("set", "dimension3", window.analytics_manager_node_author);gtag("config", "UA-1488254-2", {"groups":"default","anonymize_ip":true,"page_path":location.pathname + location.search + 
location.hash,"link_attribution":true,"allow_ad_personalization_signals":false});</script>\n<script>(function(w,d,t,u,n,a,m){w[\'MauticTrackingObject\']=n;w[n]=w[n]||function(){(w[n].q=w[n].q||[]).push(arguments)},a=d.createElement(t),m=d.ge'
-
+    
 
 This text when formatted looks like this,
 
@@ -306,7 +306,7 @@ print(all_values[2])
     <td>Exxon Mobil</td>
     <td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td>
     </tr>
-
+    
 
 #### Challenge
 ---
 
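A table row like the one printed above can be split into its cells in several ways. The sketch below uses the standard library's `html.parser` rather than whatever parser the section itself uses, so that it runs without extra packages. The `RowParser` class and the rank cell in `row_html` are assumptions for illustration; only the company name and website cells appear in the output shown in this hunk.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the text of each <td> cell in a single table row."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()


# The rank cell is assumed; the other two cells are taken from the output above.
row_html = """<tr>
<td>2</td>
<td>Exxon Mobil</td>
<td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td>
</tr>"""

parser = RowParser()
parser.feed(row_html)
rank, name, website = parser.cells
print(rank, name, website)
```

Text nested inside the `<a>` tag is still collected, because the parser only tracks whether it is currently inside a `<td>`.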
@@ -75,7 +75,7 @@ print('wptools version : {}'.format(wptools.__version__)) # checking the install
 ```
 
     wptools version : 0.4.17
-
+    
 
 Now let's load the data which we scraped in the previous section as follows,
 
@@ -147,7 +147,7 @@ for i, j in enumerate(companies):   # looping through the list of 20 company
     18. General Electric
     19. Walgreens Boots Alliance
     20. JPMorgan Chase
-
+    
 
 ### Getting article names from wiki
 
@@ -250,7 +250,7 @@ for idx, company in enumerate(wiki_search):
     JPMorgan Chase, Chase Bank, 2012 JPMorgan Chase trading loss, JPMorgan Chase Tower (Houston), 270 Park Avenue, Chase Paymentech, 2014 JPMorgan Chase data breach, Bear Stearns, Jamie Dimon, JPMorgan Chase Building (Houston)
 
 
-
+    
 
 Now let's get the most probable ones (the first suggestion) for each of the first 20 companies on the Fortune 500 list,
 
@@ -263,7 +263,7 @@ print(most_probable)
 ```
 
     [('Walmart', 'Walmart'), ('Exxon Mobil', 'ExxonMobil'), ('Berkshire Hathaway', 'Berkshire Hathaway'), ('Apple', 'Apple'), ('UnitedHealth Group', 'UnitedHealth Group'), ('McKesson', 'McKesson Corporation'), ('CVS Health', 'CVS Health'), ('Amazon.com', 'Amazon (company)'), ('AT&T', 'AT&T'), ('General Motors', 'General Motors'), ('Ford Motor', 'Ford Motor Company'), ('AmerisourceBergen', 'AmerisourceBergen'), ('Chevron', 'Chevron Corporation'), ('Cardinal Health', 'Cardinal Health'), ('Costco', 'Costco'), ('Verizon', 'Verizon Communications'), ('Kroger', 'Kroger'), ('General Electric', 'General Electric'), ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'), ('JPMorgan Chase', 'JPMorgan Chase')]
-
+    
 
 We can notice that most of the wiki article titles make sense. However, **Apple** is ambiguous, as it can refer to the fruit as well as the company. The second suggestion returned was **Apple Inc.**, so we can manually replace it with **Apple Inc.** as follows,
 
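The two steps described above (take the first suggestion, then override the ambiguous one) can be sketched as follows. The shape of `wiki_search` here, a list of `(company, suggestions)` pairs, is an assumption made for illustration and may not match the structure used in the section; the suggestion lists are shortened from the output shown earlier.

```python
# Assumed structure: one (company, [suggested article titles]) pair per company.
wiki_search = [
    ("JPMorgan Chase", ["JPMorgan Chase", "Chase Bank"]),
    ("Apple", ["Apple", "Apple Inc."]),
]

# Step 1: keep the first (most probable) suggestion for each company.
most_probable = [(company, suggestions[0]) for company, suggestions in wiki_search]

# Step 2: manually override the ambiguous title with the company article.
titles = [title for _, title in most_probable]
titles[titles.index("Apple")] = "Apple Inc."

print(most_probable)
print(titles)
```

A manual override is reasonable for a list of 20 companies; for a longer list, each first suggestion would need some automated check instead.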
@@ -274,7 +274,7 @@ print(companies) # final list of wikipedia article titles
 ```
 
     ['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'AmerisourceBergen', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon Communications', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase']
-
+    
 
 ### Retrieving the infoboxes
 
@@ -303,7 +303,7 @@ page.get_parse()    # parses the wikipedia article
       wikidata_url: https://www.wikidata.org/wiki/Q483551
       wikitext: <str(277438)> {{about|the retail chain|other uses}}{{p...
     }
-
+    
 
 
 
@@ -681,7 +681,7 @@ for company in companies:
       wikidata_url: https://www.wikidata.org/wiki/Q192314
       wikitext: <str(117507)> {{About|JPMorgan Chase & Co|its main sub...
     }
-
+    
 
 Let's take a look at the first instance in `wiki_data` i.e. **Walmart**,