docs/section-2-HTML-based-scraping/index.html
Lines changed: 1 addition & 1 deletion
@@ -620,7 +620,7 @@ <h3 id="get-and-post-calls-to-retrieve-response">GET and POST calls to retrieve
 <p><strong>POST call</strong> - POST is used to send data in the URL request to either update details or request specific content from the web server. In a POST call, data is sent and then a response can be expected from the web server. An example would be to request content from a web server based on a particular selection from a drop-down menu. The selection option is upadted while also respective content is sent back.</p>
 <h3 id="scraping-a-webpage">Scraping a webpage<a class="headerlink" href="#scraping-a-webpage" title="Permanent link">¶</a></h3>
 <hr/>
-<p>Let us now scrape a <strong>list of the fotune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <code>https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018</code>.</p>
+<p>Let us now scrape a <strong>list of the fotune 500 companies for the year 2018</strong>. The website from which the data is to be scraped is <a href="https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018">this</a>.</p>
 <p>It can be seen on this website that the list contains the rank, company name and the website of the company. The whole content of this website can be received as a response when requested with the request library in Python</p>
 <p><strong>Don't download copies of documents that are clearly not public.</strong> For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.</p>
 </li>
 <li>
-<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).</p>
+<p><strong>Check your local legislation.</strong> For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly available web sites, can be illegal (e.g. in Australia).</p>
 </li>
 <li>
 <p><strong>Don't share downloaded content illegally.</strong> Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.</p>
 <li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
-<li><a href="http://naelshiab.com/members-parliament-web-scraping/">This case study</a> is a great example of what can be done using web scraping and how to achieve it.</li>
+<li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced python library <code>scrapy</code>.</li>
 <li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about Linkedin data is a good read.</li>
 <li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
 <li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed indepentently. This lesson is heavily inspired by Software Carpentry.</li>
markdowns/section-2-HTML-based-scraping.md
Lines changed: 3 additions & 3 deletions
@@ -121,7 +121,7 @@ There are mainly two types of requests which can be made to the web server. A GE
 ### Scraping a webpage
 ---

-Let us now scrape a **list of the fotune 500 companies for the year 2018**. The website from which the data is to be scraped is `https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018`.
+Let us now scrape a **list of the fotune 500 companies for the year 2018**. The website from which the data is to be scraped is [this](https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018).

 

@@ -148,7 +148,7 @@ print('Content of the website\n', response.content[:2000])
 We can notice that most of the wiki article titles make sense. However, **Apple** is quite ambiguous in this regard as it can indicate the fruit as well as the company. However we can see that the second suggestion returned by was **Apple Inc.**. Hence, we can manually replace it with **Apple Inc.** as follows,

@@ -274,7 +274,7 @@ print(companies) # final list of wikipedia article titles