docs/index.md
Lines changed: 32 additions & 26 deletions
@@ -2,36 +2,48 @@
## What is SCP?

-The Site Content Protocol (SCP) is a collection-based format for efficiently serving web content to crawlers while users continue accessing regular HTML pages.
+The Site Content Protocol (SCP) is a format for serving clean, structured web content to AI training systems and search engines. Websites provide pre-generated JSON collections optimized for machine consumption, while end users continue accessing regular HTML pages.

## Problem

-Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web-pages designed for human viewing.
-With the explosion of AI crawlers, this traffic has become a significant cost for websites and strain on internet infrastructure.
+AI training systems and search engines need massive web content datasets, but current HTML scraping approaches create three critical problems:

-Sources:
-
-- [Cloudflare Year in Review 2025](https://radar.cloudflare.com/year-in-review/2025)
-- [FOSS Infrastructure Under Attack by AI Companies](https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/)
+1. **Low-quality training data** - Content extracted from HTML is contaminated with navigation menus, advertisements, boilerplate text, and formatting markup, degrading model training quality.
+2. **High infrastructure costs** - Processing complete HTML/CSS/JavaScript responses for millions of pages creates substantial bandwidth and computational overhead for both publishers and crawlers.
+3. **Legal and ethical uncertainty** - Automated scraping exists in a gray area. Websites lack a clear, voluntary mechanism to contribute high-quality content to AI training while maintaining control over their intellectual property.

## Solution

-Websites pre-generate compressed collections and host them on CDN or Cloud Object Storage:
+SCP provides a voluntary, structured alternative to HTML scraping:
+
+**For Publishers:**
+
+- Generate clean JSON collections from your CMS/database (not HTML parsing)
+- Host compressed files on CDN or object storage
+- Declare collection availability in sitemap.xml
+- Maintain full control over what content is included
+
+**For Crawlers:**
+
+- Download entire content sections in one request
+- Receive structured data optimized for training/indexing
+- Use efficient delta updates (only changed pages)
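
To make the publisher and crawler steps in the added lines more concrete, here is a minimal Python sketch of building a compressed JSON collection and a delta that holds only changed pages. It is an illustration under stated assumptions, not the SCP format itself: the field names (`pages`, `url`, `content`, `hash`), the use of gzip, and the helper names `build_collection`, `read_collection`, and `build_delta` are all hypothetical, since the excerpt above does not define a schema, a codec, or a delta mechanism.

```python
import gzip
import hashlib
import json


def _content_hash(text):
    """SHA-256 hex digest of a page's content, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_collection(pages):
    """Serialize pages into a gzip-compressed JSON collection.

    `pages` is a list of dicts with 'url' and 'content' keys, as they might
    come from a CMS or database query. The layout is an assumed one.
    """
    entries = [
        {"url": p["url"], "content": p["content"], "hash": _content_hash(p["content"])}
        for p in pages
    ]
    payload = json.dumps({"pages": entries}, sort_keys=True).encode("utf-8")
    return gzip.compress(payload)


def read_collection(blob):
    """Crawler side: decompress a collection and return its page entries."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["pages"]


def build_delta(previous_blob, current_pages):
    """Build a collection with only the pages that are new or changed
    relative to a previously published collection."""
    known = {e["url"]: e["hash"] for e in read_collection(previous_blob)}
    changed = [
        p for p in current_pages
        if _content_hash(p["content"]) != known.get(p["url"])
    ]
    return build_collection(changed)
```

A publisher would upload the returned bytes to a CDN or object storage, and a crawler would fetch the full collection once and then only the deltas. Hashing content is one simple way to decide what counts as "changed"; a real deployment might rely on modification timestamps instead.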