MonashDataFluency
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
Lines changed: 1 addition & 1 deletion
diff --git a/docs/section-5-legal-and-ethical-considerations/index.html b/docs/section-5-legal-and-ethical-considerations/index.html
Lines changed: 4 additions & 0 deletions
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
Lines changed: 8 additions & 8 deletions
diff --git a/docs/sitemap.xml.gz b/docs/sitemap.xml.gz
0 Bytes
diff --git a/markdowns/section-5-legal-and-ethical-considerations.md b/markdowns/section-5-legal-and-ethical-considerations.md
Lines changed: 2 additions & 0 deletions
@@ -536,13 +536,17 @@ <h3 id="web-scraping-code-of-conduct">Web scraping code of conduct<a class="head
 <li>
 <p><strong>Publish your own data in a reusable way.</strong> Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.</p>
 </li>
+<li>
+<p><strong>Check the site's <code>robots.txt</code> file.</strong> <code>robots.txt</code> is a file that websites use to tell ‘bots’ whether and how the site may be crawled and indexed. Before you extract data from a website, make sure you understand what its <code>robots.txt</code> says and respect it, to reduce the risk of legal ramifications. The file for any domain is available at <code>&lt;domain_url&gt;/robots.txt</code>, for example: <a href="https://www.monash.edu/robots.txt"><code>monash.edu/robots.txt</code></a>, <a href="https://www.facebook.com/robots.txt"><code>facebook.com/robots.txt</code></a>, <a href="https://www.linkedin.com/robots.txt"><code>linkedin.com/robots.txt</code></a>.</p>
+</li>
 </ol>
 <p>Happy scraping!</p>
 <h3 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h3>
 <ul>
 <li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
 <li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, <code>scrapy</code>.</li>
 <li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about LinkedIn data is a good read.</li>
+<li>A crisp and simple explanation of <code>robots.txt</code> can be found <a href="https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/">here</a>.</li>
 <li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
 <li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.</li>
 <li><a href="http://www.datacarpentry.org/">Data Carpentry</a> is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.</li>
 
@@ -2,42 +2,42 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-0-brief-python-refresher/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-1-intro-to-web-scraping/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-2-HTML-based-scraping/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-3-API-based-scraping/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-4-wrangling-and-analysis/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-5-legal-and-ethical-considerations/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
     <url>
      <loc>https://monashdatafluency.github.io/python-web-scraping/section-7-references/</loc>
-     <lastmod>2020-08-15</lastmod>
+     <lastmod>2020-09-04</lastmod>
      <changefreq>daily</changefreq>
     </url>
 </urlset>
@@ -73,13 +73,15 @@ This all being said, if you adhere to the following simple rules, you will proba
 
 7. __Publish your own data in a reusable way.__ Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
 
+8. __Check the site's `robots.txt` file.__ `robots.txt` is a file that websites use to tell ‘bots’ whether and how the site may be crawled and indexed. Before you extract data from a website, make sure you understand what its `robots.txt` says and respect it, to reduce the risk of legal ramifications. The file for any domain is available at `<domain_url>/robots.txt`, for example: [`monash.edu/robots.txt`](https://www.monash.edu/robots.txt), [`facebook.com/robots.txt`](https://www.facebook.com/robots.txt), [`linkedin.com/robots.txt`](https://www.linkedin.com/robots.txt).
 
 Happy scraping!
 
 ### References
 * The [Web scraping Wikipedia page](https://en.wikipedia.org/wiki/Web_scraping) has a concise definition of many concepts discussed here.
 * [This case study](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/) is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, `scrapy`.
 * [This recent case](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) about LinkedIn data is a good read.
+* A crisp and simple explanation of `robots.txt` can be found [here](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/).
 * Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation ([GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)).
 * [Software Carpentry](https://software-carpentry.org/) is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.
 * [Data Carpentry](http://www.datacarpentry.org/) is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.
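
The `robots.txt` rule added in this change can also be checked from code before a scraper requests a page. The following is a minimal sketch, assuming Python's standard-library `urllib.robotparser` is acceptable for the lesson. The helper names `build_parser` and `fetch_parser`, the user agent `my-scraper`, the `example.com` URLs and the sample rules are all hypothetical and are not part of the lesson itself.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a parser from robots.txt content given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def fetch_parser(domain_url):
    """Download and parse <domain_url>/robots.txt (needs network access)."""
    parser = RobotFileParser()
    parser.set_url(domain_url.rstrip("/") + "/robots.txt")
    parser.read()
    return parser


# Hypothetical rules, used here so the example runs without a network call.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

parser = build_parser(SAMPLE_RULES)

# Ask whether our (hypothetical) user agent may fetch each URL.
print(parser.can_fetch("my-scraper", "https://example.com/index.html"))         # True
print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))  # False

# Seconds the site asks crawlers to wait between requests, or None if unset.
print(parser.crawl_delay("my-scraper"))  # 5
```

For a real site, `fetch_parser("https://www.monash.edu")` would load the live file instead of the sample rules. A scraper could then call `can_fetch` before each request and sleep for the `crawl_delay` value when one is given.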