Commit 75d3062

Fixes #14
1 parent ce648dc commit 75d3062

5 files changed

Lines changed: 15 additions & 9 deletions

File tree

docs/search/search_index.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/section-5-legal-and-ethical-considerations/index.html

Lines changed: 4 additions & 0 deletions
@@ -536,13 +536,17 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
536536
<li>
537537
<p><strong>Publish your own data in a reusable way.</strong> Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.</p>
538538
</li>
539+
<li>
540+
<p><strong>Check the <code>robots.txt</code> file.</strong> <code>robots.txt</code> is a file that websites use to tell ‘bots’ whether and how the site should be crawled and indexed. When you extract data from the web, it is important to understand what a site's <code>robots.txt</code> says and to respect it, to reduce the risk of legal ramifications. You can view the file for any domain at <code>&lt;domain_url&gt;/robots.txt</code>, for example <a href="https://www.monash.edu/robots.txt"><code>monash.edu/robots.txt</code></a>, <a href="https://www.facebook.com/robots.txt"><code>facebook.com/robots.txt</code></a>, <a href="https://www.linkedin.com/robots.txt"><code>linkedin.com/robots.txt</code></a>.</p>
541+
</li>
539542
</ol>
540543
<p>Happy scraping!</p>
541544
<h3 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h3>
542545
<ul>
543546
<li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
544547
<li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, <code>scrapy</code>.</li>
545548
<li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about LinkedIn data is a good read.</li>
549+
<li>A crisp and simple explanation of <code>robots.txt</code> can be found <a href="https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/">here</a>.</li>
546550
<li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
547551
<li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.</li>
548552
<li><a href="http://www.datacarpentry.org/">Data Carpentry</a> is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.</li>

docs/sitemap.xml

Lines changed: 8 additions & 8 deletions
@@ -2,42 +2,42 @@
22
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
33
<url>
44
<loc>https://monashdatafluency.github.io/python-web-scraping/</loc>
5-
<lastmod>2020-08-15</lastmod>
5+
<lastmod>2020-09-04</lastmod>
66
<changefreq>daily</changefreq>
77
</url>
88
<url>
99
<loc>https://monashdatafluency.github.io/python-web-scraping/section-0-brief-python-refresher/</loc>
10-
<lastmod>2020-08-15</lastmod>
10+
<lastmod>2020-09-04</lastmod>
1111
<changefreq>daily</changefreq>
1212
</url>
1313
<url>
1414
<loc>https://monashdatafluency.github.io/python-web-scraping/section-1-intro-to-web-scraping/</loc>
15-
<lastmod>2020-08-15</lastmod>
15+
<lastmod>2020-09-04</lastmod>
1616
<changefreq>daily</changefreq>
1717
</url>
1818
<url>
1919
<loc>https://monashdatafluency.github.io/python-web-scraping/section-2-HTML-based-scraping/</loc>
20-
<lastmod>2020-08-15</lastmod>
20+
<lastmod>2020-09-04</lastmod>
2121
<changefreq>daily</changefreq>
2222
</url>
2323
<url>
2424
<loc>https://monashdatafluency.github.io/python-web-scraping/section-3-API-based-scraping/</loc>
25-
<lastmod>2020-08-15</lastmod>
25+
<lastmod>2020-09-04</lastmod>
2626
<changefreq>daily</changefreq>
2727
</url>
2828
<url>
2929
<loc>https://monashdatafluency.github.io/python-web-scraping/section-4-wrangling-and-analysis/</loc>
30-
<lastmod>2020-08-15</lastmod>
30+
<lastmod>2020-09-04</lastmod>
3131
<changefreq>daily</changefreq>
3232
</url>
3333
<url>
3434
<loc>https://monashdatafluency.github.io/python-web-scraping/section-5-legal-and-ethical-considerations/</loc>
35-
<lastmod>2020-08-15</lastmod>
35+
<lastmod>2020-09-04</lastmod>
3636
<changefreq>daily</changefreq>
3737
</url>
3838
<url>
3939
<loc>https://monashdatafluency.github.io/python-web-scraping/section-7-references/</loc>
40-
<lastmod>2020-08-15</lastmod>
40+
<lastmod>2020-09-04</lastmod>
4141
<changefreq>daily</changefreq>
4242
</url>
4343
</urlset>

docs/sitemap.xml.gz

0 Bytes
Binary file not shown.

markdowns/section-5-legal-and-ethical-considerations.md

Lines changed: 2 additions & 0 deletions
@@ -73,13 +73,15 @@ This all being said, if you adhere to the following simple rules, you will proba
7373

7474
7. __Publish your own data in a reusable way.__ Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
7575

76+
8. __Check the `robots.txt` file.__ `robots.txt` is a file that websites use to tell ‘bots’ whether and how the site should be crawled and indexed. When you extract data from the web, it is important to understand what a site's `robots.txt` says and to respect it, to reduce the risk of legal ramifications. You can view the file for any domain at `<domain_url>/robots.txt`, for example [`monash.edu/robots.txt`](https://www.monash.edu/robots.txt), [`facebook.com/robots.txt`](https://www.facebook.com/robots.txt), [`linkedin.com/robots.txt`](https://www.linkedin.com/robots.txt).
7677

7778
Happy scraping!
7879

7980
### References
8081
* The [Web scraping Wikipedia page](https://en.wikipedia.org/wiki/Web_scraping) has a concise definition of many concepts discussed here.
8182
* [This case study](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/) is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, `scrapy`.
8283
* [This recent case](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) about LinkedIn data is a good read.
84+
* A crisp and simple explanation of `robots.txt` can be found [here](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/).
8385
* Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation ([GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)).
8486
* [Software Carpentry](https://software-carpentry.org/) is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.
8587
* [Data Carpentry](http://www.datacarpentry.org/) is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.
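
Note on the `robots.txt` rule added in this commit: one way to act on it in a scraper is to check each URL against the site's rules before requesting it. The sketch below is not part of the commit or the lesson. It uses Python's standard-library `urllib.robotparser`, and the sample rules, the `my-scraper` user agent, the `build_parser` helper and the `example.com` URLs are all made up for illustration. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name; it falls under the "*" rules.
print(parser.can_fetch("my-scraper", "https://example.com/index.html"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))

# Seconds the site asks crawlers to wait between requests, or None if unset.
print(parser.crawl_delay("my-scraper"))
```

Against a live site, the same parser can be pointed at the real file with `parser.set_url("https://<domain_url>/robots.txt")` followed by `parser.read()`, after which `can_fetch` is used in the same way.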
