oc-botwatch

Classifies traffic from OpenCitations server access logs into three categories (human visitors, generic bots, LLM bots) and three services (web, API, SPARQL).

Input data

The source data is published on Zenodo (https://doi.org/10.5281/zenodo.20291127). To reproduce the results, download and extract the archive into input/.

How classification works

A request only counts as a bot when its user-agent identifies it as one of the well-known crawlers. We match the user-agent against three public lists: ai.robots.txt, crawler-user-agents, and COUNTER-Robots.

If the user-agent matches an entry in ai.robots.txt, the request is an llm_bot. The two names "Spider" and "Code" are skipped because they're too generic and would match strings that have nothing to do with LLM crawlers. If instead it matches crawler-user-agents (minus the entries already tagged ai-crawler) or COUNTER-Robots, it's a generic_bot.

A handful of crawlers turn up in our logs but aren't in any of those three lists, so we keep an extra file for them:

supplementary_bots.txt

Everything else is classified as human. That includes people browsing the site, but also anyone sending requests from a Python script, a curl command, and so on.
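
The precedence described above (LLM list first, then the generic lists, then human) can be sketched as follows. This is a minimal illustration, not the code in oc_botwatch: the name lists are short hypothetical stand-ins for the parsed files, case-insensitive substring matching is an assumption (the upstream lists may use regular-expression patterns), and supplementary_bots.txt is left out because the text does not say which category its entries receive.

```python
# Hypothetical stand-ins for the parsed lists; the real entries come from
# ai.robots.txt, crawler-user-agents and COUNTER-Robots.
LLM_BOT_NAMES = ["GPTBot", "ClaudeBot", "Spider", "Code"]
GENERIC_BOT_NAMES = ["Googlebot", "bingbot"]

# Names from the LLM list that are too generic to match safely.
SKIPPED_LLM_NAMES = {"Spider", "Code"}


def matches_any(user_agent: str, names: list[str]) -> bool:
    """Return True if any name occurs in the user-agent (assumed: case-insensitive substring)."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in names)


def classify_user_agent(user_agent: str) -> str:
    """Return 'llm_bot', 'generic_bot' or 'human' for one user-agent string."""
    llm_names = [n for n in LLM_BOT_NAMES if n not in SKIPPED_LLM_NAMES]
    if matches_any(user_agent, llm_names):
        return "llm_bot"
    if matches_any(user_agent, GENERIC_BOT_NAMES):
        return "generic_bot"
    # No list matched: scripts, curl and browsers all end up here.
    return "human"
```

With these stand-in lists, a curl or python-requests user-agent falls through to human, and a string such as "Visual Studio Code/1.90" is not tagged as an LLM bot because "Code" is skipped.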

Why these three sources

Because they have already been adopted in the literature. In particular, Liu et al. (2025) use Dark Visitors, the upstream data source of ai.robots.txt, as their primary reference for compiling LLM user agents, and rely on crawler-user-agents as a supplementary corpus of general-purpose bot signatures when testing the coverage of Cloudflare's bot-blocking feature.

COUNTER-Robots is the robot list maintained by Project COUNTER, an international initiative that sets standards for counting usage of electronic scholarly resources. Since OpenCitations is itself a scholarly infrastructure, filtering its logs with COUNTER-Robots aligns with the conventions of the domain.

Service classification

Each request is also assigned to one of three services, based on request_host, request_path, and request_method. Redirects (3xx) are dropped from the dataset entirely: they come from deprecated API paths (e.g. /index/coci/api/v1/) or from the main domain bouncing requests to a subdomain, and the actual response appears as a separate log entry on the destination host.

sparql: any non-redirect request whose path matches /sparql, or a request to sparql.opencitations.net that carries a ?query= parameter or uses the POST method.
api: any non-redirect request whose path matches a versioned API route with at least one segment after the version number: /index/v\d+/…, /index/api/v\d+/…, /meta/v\d+/…, /meta/api/v\d+/…. Bare version roots (/index/v1, /meta/v1) and non-API paths on api.opencitations.net (/robots.txt, /) fall through to web. The INDEX and META REST endpoints are grouped together under api.
web: everything else. The main site (opencitations.net/, /about, /governance, ...), subdomains such as ldd.opencitations.net, search.opencitations.net, download.opencitations.net, statistics.opencitations.net, oci.opencitations.net, and sparontologies.net, and non-service requests on sparql.opencitations.net and api.opencitations.net (static assets, robots.txt, UI pages).

Findings

The dataset covers January through April 2026. The output/ directory contains daily_traffic.csv (per-day counts by category) and daily_traffic_by_service.csv (per-day counts by category and service).

Human traffic sits between 32% and 45% of monthly requests. Generic bots take 48% to 55%. LLM bots go from 2% in January to 13% in April, or 1.06M to 4.09M monthly requests (+287%).
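
The relative growth is the usual percent change between the January and April request counts. A quick check using only the rounded monthly figures quoted above gives roughly +286%; the +287% in the text presumably comes from the unrounded counts, which are not reproduced here.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# Rounded monthly LLM bot request counts quoted in the text.
january = 1.06e6
april = 4.09e6

growth = percent_change(january, april)
print(f"{growth:+.0f}%")
```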

By service

The bulk of LLM bot traffic targets the website. Over the four months, LLM crawlers make up 31.5% of website requests: nearly one in three. On the API and SPARQL endpoints they barely register, hovering around 4%, while generic bots dominate both (58.7% and 61.3%).

Service	Human	Generic bot	LLM bot
web	56.0%	12.5%	31.5%
api	37.4%	58.7%	3.9%
sparql	34.6%	61.3%	4.1%

The growth from 2% to 13% is therefore almost entirely on the browsable site. LLM crawlers are not querying the REST API or the SPARQL endpoint in any significant volume.

Limitations

We only catch bots that openly identify themselves through the user-agent. Anything that spoofs a browser string, or uses a custom user-agent that doesn't appear in the three lists, is going to land in the human bucket. So in practice the bot counts are a lower bound and the human counts an upper bound. The numbers still work well for tracking how the relative shares move over time, since the same rules are applied across the whole dataset.

Running

Requires Python 3.10+ and uv.

uv sync
uv run python -m oc_botwatch.classify
uv run python -m oc_botwatch.visualize

Tests

uv sync --dev
uv run pytest
