A structured, LLM-maintained knowledge base covering anti-bot systems, scraping tools, browser fingerprinting, proxy infrastructure, and everything else that matters when extracting data from the web.
This wiki compiles and organizes the knowledge accumulated across 300+ articles published on The Web Scraping Club newsletter since 2022, plus selected research from outside sources (Antoine Vastel's Device and Browser Info, Castle.io research, vendor blogs from DataDome, Cloudflare, Akamai, Bright Data, Oxylabs, and others).
Instead of searching through years of articles to find what we tested on Cloudflare, how Akamai's TLS detection works, or which tool bypassed Kasada last, the wiki keeps it all in one place, cross-referenced and continuously updated.
The pattern comes from Andrej Karpathy's LLM Wiki: rather than re-deriving knowledge from raw sources on every question, an LLM incrementally builds and maintains a persistent wiki that grows richer over time.
The Obsidian-friendly authoring conventions and the use of .canvas (JSON Canvas) and .base (Obsidian Bases) files come from Steph Ango (kepano), Obsidian's CEO, and his obsidian-skills bundle. The wiki uses defuddle for clean source extraction, obsidian-markdown conventions for page authoring, obsidian-bases for live cross-cutting views, and json-canvas for visual landscape maps.
The wiki is built from publicly available articles, posts, and READMEs. The main feeds are:
- The Web Scraping Club — every TWSC article from 2022 onward.
- Device and Browser Info — Antoine Vastel's research on browser fingerprinting and bot detection.
- Hacker News — front-page submissions matching the wiki domain (web scraping, anti-bot, proxies, browsers, fingerprinting).
- Vendor research blogs (DataDome, Cloudflare, Akamai, Castle.io, Bright Data, Oxylabs, and more).
- Selected GitHub repositories and project sites for tools, libraries, and stealth browsers.
Every page lists its specific sources in YAML frontmatter under sources:, and the trail goes back to a URL or a filename inside this repository.
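As a rough illustration of how that trail can be followed programmatically, the sketch below pulls the sources: list out of a page's frontmatter using only the Python standard library. It assumes sources: is a top-level key holding a block-style list of plain strings. The sample page, its field values, and the function name extract_sources are made up for the example, and a real YAML parser would be the safer choice for anything beyond this shape.

```python
from typing import List


def extract_sources(page_text: str) -> List[str]:
    """Return the entries listed under `sources:` in a page's YAML frontmatter.

    Assumption: frontmatter is delimited by `---` lines and `sources:` is a
    top-level key followed by `- item` lines. This is not a full YAML parser.
    """
    lines = page_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the page

    sources: List[str] = []
    in_sources = False
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if in_sources:
            stripped = line.strip()
            if stripped.startswith("- "):
                sources.append(stripped[2:].strip().strip("'\""))
                continue
            if stripped == "":
                continue
            in_sources = False  # a new key ends the list
        if line.startswith("sources:"):
            in_sources = True
    return sources


# Hypothetical page used only for illustration.
sample_page = """---
title: Example Entity
sources:
  - https://example.com/article
  - raw/example-notes.md
updated: 2026-01-01
---
# Example Entity
- this bullet is body text, not a source
"""

print(extract_sources(sample_page))
```

Stopping at the closing --- keeps body bullets from being mistaken for sources.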
We do not maintain this wiki manually. An LLM pipeline reads new articles every day, decides which ones belong in the wiki, drafts new entity or concept pages, links sources to existing pages, and commits the result. Contradictions between old and new findings are resolved explicitly: the newest behavioral observation wins, but the prior version is preserved with a date so the evolution remains visible. The full ruleset is in schema.md.
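The "newest observation wins, prior version preserved with a date" rule can be pictured with a small data structure. This is only a sketch of the idea, not the pipeline's actual code: the class names Observation and Finding, their fields, and the sample claims are hypothetical, and schema.md remains the authoritative ruleset.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    claim: str          # what was observed
    observed_on: date   # when it was observed
    source: str         # URL or filename the claim traces back to


@dataclass
class Finding:
    current: Optional[Observation] = None
    # Superseded observations, kept oldest first so the evolution stays visible.
    history: List[Observation] = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        if self.current is None:
            self.current = obs
        elif obs.observed_on >= self.current.observed_on:
            # The newer observation wins; the old one is preserved, not deleted.
            self.history.append(self.current)
            self.current = obs
        else:
            # An older observation arriving late goes straight into history.
            self.history.append(obs)
        self.history.sort(key=lambda o: o.observed_on)


# Placeholder claims, not real test results.
finding = Finding()
finding.record(Observation("tool X passes vendor Y", date(2023, 5, 1), "raw/a.md"))
finding.record(Observation("tool X is blocked by vendor Y", date(2025, 2, 1), "raw/b.md"))
print(finding.current.claim)
print([o.observed_on.isoformat() for o in finding.history])
```

Keeping the observation date on every entry is what lets a reader see both the current state and the earlier one it replaced.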
These are summaries of public articles, restructured for navigation. If you spot an error, an outdated claim, or a misattribution, please open an issue on this repository pointing to the specific page and what is wrong. We will fix it on the next daily run.
The wiki holds 120 pages across six types. Each type answers a different shape of question.
An entity page covers one concrete thing in the domain. An entity is a tool, a library, a stealth browser, a commercial anti-bot product, a proxy network, or any other identifiable subject that has its own technical profile. The page describes what it is, how it works, what TWSC observed when testing it, and its known limitations.
Examples: DataDome, Cloudflare, Camoufox, Scrapling, Playwright, curl_cffi, Lightpanda, Browser Use.
A concept page is dedicated to one technique, pattern, or domain idea. Concepts explain the how and the why: how a detection technique works, what signals it relies on, where it shows up across vendors. They reference the entities that implement or exploit them.
Examples: Browser Fingerprinting, TLS Fingerprinting, CDP Detection, Bot Detection, ML-Based Bot Detection, Hybrid Scraping, Cookie and Session Reuse, Proxy Fundamentals.
Comparison pages are side-by-side analyses for cases where two or more entities or approaches occupy the same niche and the question "which one and when?" is itself the topic. Each comparison page pairs a comparison table with a narrative of the differences that actually matter.
Examples: Firefox vs Chrome Stealth, Anti-Detect Browser Benchmark 2024.
A timeline page tracks how something evolved over time across multiple sources. It is used for situations where the current state is meaningless without the trajectory that produced it.
Example: Cloudflare Bypass Evolution from 2022 to 2026.
JSON Canvas files (.canvas) are graph-shaped visualizations that Obsidian renders as an interactive whiteboard. They are used for landscapes where five or more entities and the relationships between them are the point. Each node references the matching entity page, so the canvas works as a navigable map.
Example: Agentic Browsers Landscape 2026 covering OpenAI Operator/Atlas, Anthropic Computer Use, Perplexity Comet, Browser Use, Browserbase, BrowserOS, Hyperbrowser, and the proxy companies pivoting to managed browsers.
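Because a .canvas file is plain JSON, the map can also be read outside Obsidian. The sketch below follows the JSON Canvas format (a top-level nodes array and edges array, with file nodes carrying a file path and edges carrying fromNode and toNode). The sample canvas, its node ids, and the entities/ paths are invented for illustration and do not reflect the actual layout of this repository.

```python
import json
from typing import Dict, List, Tuple


def canvas_links(canvas_json: str) -> List[Tuple[str, str, str]]:
    """Return (from_file, to_file, label) for every edge joining two file nodes."""
    canvas = json.loads(canvas_json)

    # Map node id -> referenced file, keeping only nodes that point at a page.
    files: Dict[str, str] = {
        node["id"]: node["file"]
        for node in canvas.get("nodes", [])
        if node.get("type") == "file"
    }

    links: List[Tuple[str, str, str]] = []
    for edge in canvas.get("edges", []):
        src = files.get(edge["fromNode"])
        dst = files.get(edge["toNode"])
        if src and dst:  # skip edges touching text or group nodes
            links.append((src, dst, edge.get("label", "")))
    return links


# Hypothetical three-node canvas: two page references and one free-text note.
sample_canvas = json.dumps({
    "nodes": [
        {"id": "a", "type": "file", "file": "entities/tool-a.md",
         "x": 0, "y": 0, "width": 300, "height": 120},
        {"id": "b", "type": "file", "file": "entities/tool-b.md",
         "x": 400, "y": 0, "width": 300, "height": 120},
        {"id": "n", "type": "text", "text": "a free-form note",
         "x": 0, "y": 200, "width": 300, "height": 120},
    ],
    "edges": [
        {"id": "e1", "fromNode": "a", "toNode": "b", "label": "relates to"},
        {"id": "e2", "fromNode": "n", "toNode": "a"},
    ],
})

print(canvas_links(sample_canvas))
```

Only edges between two file nodes are returned, which matches the idea of the canvas as a map over existing pages.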
Obsidian Bases files (.base) are declarative queries over the rest of the vault. A view does not store knowledge — it surfaces it. As soon as a new entity or concept is added with the matching frontmatter, the view updates automatically. Useful for cross-cutting reads like "every anti-bot vendor sorted by last update" or "every entity touched in the last 90 days".
Files: all-entities, anti-bot-vendors, recently-touched, tools-and-browsers.
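To make the "a view surfaces knowledge, it does not store it" idea concrete, here is a Python analogue of a query like "every entity touched in the last 90 days". It is not .base syntax and not how Obsidian evaluates a view. The frontmatter field names type and updated, the page records, and the function name recently_touched are assumptions made for this sketch.

```python
from datetime import date, timedelta
from typing import Any, Dict, List


def recently_touched(
    pages: List[Dict[str, Any]], today: date, days: int = 90
) -> List[Dict[str, Any]]:
    """Filter entity pages updated within `days` of `today`, newest first.

    The result is computed from page metadata on every call, so nothing
    is stored in the "view" itself.
    """
    cutoff = today - timedelta(days=days)
    hits = [
        page for page in pages
        if page["type"] == "entity" and page["updated"] >= cutoff
    ]
    return sorted(hits, key=lambda page: page["updated"], reverse=True)


# Hypothetical frontmatter records, one dict per page.
pages = [
    {"title": "Tool A", "type": "entity", "updated": date(2026, 1, 10)},
    {"title": "Tool B", "type": "entity", "updated": date(2025, 6, 1)},
    {"title": "Concept C", "type": "concept", "updated": date(2026, 1, 12)},
    {"title": "Tool D", "type": "entity", "updated": date(2025, 12, 20)},
]

for page in recently_touched(pages, today=date(2026, 1, 15)):
    print(page["title"], page["updated"].isoformat())
```

Adding a new record with matching metadata changes the result on the next call, which is the same property the live views rely on.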
Start from index.md for the full catalog grouped by category, then drill into the specific entity, concept, or comparison page you care about. Each page lists its sources under the sources: key in YAML frontmatter and ends with a ## Sources section linking back to the original URLs.
Obsidian is a free local-first markdown editor that treats a folder of .md files as a personal knowledge base. It builds a graph of cross-links, supports YAML frontmatter as queryable metadata, and ships with a Bases core plugin (since 1.9) that turns the queries above into live tables.
This repository works directly as an Obsidian vault. Clone it standalone or symlink it under your existing vault, then open the folder in Obsidian:
```bash
# standalone
git clone https://github.com/TheWebScrapingClub/scraping-wiki.git ~/Vaults/scraping-wiki
open -a Obsidian ~/Vaults/scraping-wiki

# or as a sub-vault inside an existing vault
ln -s /path/to/scraping-wiki ~/MyVault/Wiki
```

Once open, you can:
- Browse the graph view — every cross-link between entities and concepts becomes an edge, and the wiki's structure becomes visible as a network.
- Click any .base file under views/ to render a live, sortable table of matching pages.
- Click agentic-browsers-landscape-2026.canvas to open the interactive landscape map.
- Use any markdown editor or git diff to read the wiki; Obsidian is convenient but optional.
If you find something outdated, wrong, or misattributed, please open an issue on this repository pointing to the specific page and the claim that is off. Since the wiki is regenerated continuously, a simple GitHub issue is enough — no PR needed unless you also want to attach test results.
The content of this wiki is derived from articles published on The Web Scraping Club and from third-party sources cited in each page. The wiki itself is open for reading and reference. For reuse of substantial portions, please credit the source.