<p><strong>Publish your own data in a reusable way.</strong> Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.</p>
</li>
<li>
<p><strong>View the <code>robots.txt</code> file.</strong> A <code>robots.txt</code> file is used by websites to tell ‘bots’ whether and how the site may be crawled and indexed. When you extract data from the web, it is critical to understand what <code>robots.txt</code> says and to respect it, to avoid legal ramifications. The file for any domain is available at <code>&lt;domain_url&gt;/robots.txt</code>, for example: <a href="https://www.monash.edu/robots.txt"><code>monash.edu/robots.txt</code></a>, <a href="https://www.facebook.com/robots.txt"><code>facebook.com/robots.txt</code></a>, <a href="https://www.linkedin.com/robots.txt"><code>linkedin.com/robots.txt</code></a>.</p>
<li>The <a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping Wikipedia page</a> has a concise definition of many concepts discussed here.</li>
<li><a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/">This case study</a> is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, <code>scrapy</code>.</li>
<li><a href="https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data">This recent case</a> about LinkedIn data is a good read.</li>
<li>A crisp and simple explanation of <code>robots.txt</code> can be found <a href="https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/">here</a>.</li>
<li>Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation (<a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a>).</li>
<li><a href="https://software-carpentry.org/">Software Carpentry</a> is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.</li>
<li><a href="http://www.datacarpentry.org/">Data Carpentry</a> is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.</li>
markdowns/section-5-legal-and-ethical-considerations.md
2 additions & 0 deletions
@@ -73,13 +73,15 @@ This all being said, if you adhere to the following simple rules, you will proba
7. __Publish your own data in a reusable way.__ Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
+8. __View the `robots.txt` file.__ A `robots.txt` file is used by websites to tell ‘bots’ whether and how the site may be crawled and indexed. When you extract data from the web, it is critical to understand what `robots.txt` says and to respect it, to avoid legal ramifications. The file for any domain is available at `<domain_url>/robots.txt`, for example: [`monash.edu/robots.txt`](https://www.monash.edu/robots.txt), [`facebook.com/robots.txt`](https://www.facebook.com/robots.txt), [`linkedin.com/robots.txt`](https://www.linkedin.com/robots.txt).
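
Checking `robots.txt` can also be done from code before any page is requested. The following is a minimal sketch, not part of the original lesson, using `urllib.robotparser` from the Python standard library. The sample rules, the `example.org` URLs, the `my-scraper` user agent and the helper names `build_parser` and `is_allowed` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# A real file would be downloaded from <domain_url>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.org/index.html"))         # True
print(is_allowed(parser, "https://example.org/private/data.html"))  # False
print(parser.crawl_delay("my-scraper"))                             # 5
```

To check a live site instead of the sample text, `RobotFileParser` can fetch the file itself: call `parser.set_url("https://www.monash.edu/robots.txt")` followed by `parser.read()`, then use `can_fetch` as above. A passing check only says what the site's crawl rules allow; the site's terms of use and the other rules in this list still apply.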
Happy scraping!
### References
* The [Web scraping Wikipedia page](https://en.wikipedia.org/wiki/Web_scraping) has a concise definition of many concepts discussed here.
* [This case study](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/) is a great example of what can be done using web scraping and a stepping stone to a more advanced Python library, `scrapy`.
* [This recent case](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) about LinkedIn data is a good read.
+* A crisp and simple explanation of `robots.txt` can be found [here](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/).
* Commencing 25 May 2018, Monash University will also become subject to the European Union’s General Data Protection Regulation ([GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)).
* [Software Carpentry](https://software-carpentry.org/) is a non-profit organisation that runs learn-to-code workshops worldwide. All lessons are publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.
* [Data Carpentry](http://www.datacarpentry.org/) is a sister organisation of Software Carpentry focused on the fundamental data management skills required to conduct research.