Robots.txt Experiments and Metrics

How is the Robots Exclusion Protocol (robots.txt, RFC 9309) used on the WWW? This project tries to gain insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.

Top-K Sampling of Web Sites

Three Tranco top-1M lists have been combined into a single ranked list, see top-k-sites. The resulting list of 2 million web sites is used to draw samples at multiple strata (top 1k, 5k, 10k, 100k, 1M, 2M).
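
A minimal sketch of how such a combination and stratified cut could look, assuming the standard Tranco CSV format (`rank,domain`), placeholder file names, and an average-rank aggregation (the project's actual merging procedure may differ):

```python
# Sketch: merge several Tranco top-1M lists into one ranked list and cut strata.
# File names and the aggregation by mean rank are assumptions, not the project's exact method.
import csv
from collections import defaultdict

TRANCO_FILES = ["tranco-1.csv", "tranco-2.csv", "tranco-3.csv"]  # placeholder names
STRATA = [1_000, 5_000, 10_000, 100_000, 1_000_000, 2_000_000]

ranks = defaultdict(list)
for path in TRANCO_FILES:
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            ranks[domain].append(int(rank))

# Order domains by their mean rank across the lists (one possible aggregation);
# domains missing from a list simply contribute fewer rank values.
combined = sorted(ranks, key=lambda d: sum(ranks[d]) / len(ranks[d]))

samples = {k: combined[:k] for k in STRATA if k <= len(combined)}
for k, sites in samples.items():
    print(f"top-{k}: {len(sites)} sites")
```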

Locating and Downloading Robots.txt Captures in Common Crawl's Web Archives

Since 2016, Common Crawl's web archives include a robots.txt data set from which the robots.txt captures are extracted. The captures are located with the help of the columnar URL index. The necessary steps are described in the data preparation notebook.
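
As an illustration (not the notebook's actual code): the columnar URL index can be queried for records in the `robotstxt` subset, and each record can then be fetched from the public WARC files via an HTTP range request. The SQL assumes the standard `ccindex` table schema as set up for Athena; crawl ID and domain are placeholders.

```python
# Sketch: locate robots.txt captures in the columnar URL index and fetch one record.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

# Query to run with Athena (or any SQL engine over the cc-index Parquet files);
# the "robotstxt" subset partition holds the robots.txt captures.
CCINDEX_QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-33'          -- placeholder crawl ID
  AND subset = 'robotstxt'
  AND url_host_registered_domain = 'example.com'
"""

def fetch_robots_txt(warc_filename, offset, length):
    """Fetch a single robots.txt WARC record by byte range and return its payload."""
    url = f"https://data.commoncrawl.org/{warc_filename}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    # The range covers one gzipped WARC record; warcio iterates over it directly.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read().decode("utf-8", errors="replace")
    return None
```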

Metrics and Findings

Poster at IIPC Web Archiving Conference 2025

Condensed results of this project were presented as a poster at the IIPC Web Archiving Conference 2025. A copy of the poster is available here.

Notes and Credits

This project is an extension of work done for a presentation at #ossym2022: "The robots.txt standard – Implementations and Usage". The corresponding code can be found at ossym2022-robotstxt-experiments.

The idea to look at multiple strata (top-k) is inspired by the work of Longpre et al., "Consent in Crisis" (https://arxiv.org/abs/2407.14933), and Liu et al., "Somesite I Used to Crawl" (https://arxiv.org/pdf/2411.15091).