Commit bde2a2c

Spec draft and docs updated
1 parent 6b3adb4 commit bde2a2c

8 files changed

Lines changed: 898 additions & 372 deletions

docs/cli-tools.md

Lines changed: 2 additions & 2 deletions
@@ -77,7 +77,7 @@ scp-inspect --json collection.scp.gz > output.json
 scp-inspect --url "https://example.com/page" collection.scp.gz
 
 # Show only pages modified after date
-scp-inspect --since "2025-01-15T00:00:00Z" collection.scp.gz
+scp-inspect --since "2000-01-15T00:00:00Z" collection.scp.gz
 ```
 
 
@@ -100,7 +100,7 @@ scp-inspect --json collection.scp.gz > data.json
 
 **View recent changes (delta)**:
 ```bash
-scp-inspect --pages --since "2025-01-15T00:00:00Z" collection.scp.gz
+scp-inspect --pages --since "2000-01-15T00:00:00Z" collection.scp.gz
 ```
 
 **Debug content blocks**:

docs/implementation.md

Lines changed: 23 additions & 21 deletions
@@ -31,13 +31,13 @@ for page in pages[:5]:
 ```python
 from scp.generator import SCPGenerator
 
-gen = SCPGenerator("blog-snapshot-2025-01", "blog", "snapshot")
+gen = SCPGenerator("blog-snapshot-q1", "blog", "snapshot")
 
 gen.add_page(
     url="https://example.com/post1",
     title="First Post",
     description="My first blog post",
-    modified="2025-01-15T10:00:00Z",
+    modified="2000-01-15T10:00:00Z",
     language="en",
     content=[
         {"type": "heading", "level": 1, "text": "First Post"},
@@ -57,7 +57,7 @@ from scp.generator import SCPGenerator
 
 # Create generator for snapshot
 gen = SCPGenerator(
-    collection_id="blog-snapshot-2025-01-15",
+    collection_id="blog-snapshot-day15",
     section="blog",
     collection_type="snapshot"
 )
@@ -109,20 +109,22 @@ gen.save(f"blog-delta-{datetime.now().strftime('%Y-%m-%d')}.scp.gz", compress="g
 
 ## Hosting Collections
 
-### Cloudflare R2 (example)
+### Using Object Storage or CDN
+
+Upload collections to any S3-compatible object storage or CDN using tools like rclone, aws-cli, or provider-specific clients:
 
 ```bash
-# Upload to Cloudflare R2 using rclone
-rclone copy blog-snapshot.scp.gz r2:your-bucket/collections/
+# Example: Upload using rclone (works with AWS S3, Azure Blob, Google Cloud Storage, etc.)
+rclone copy blog-snapshot.scp.gz remote:your-bucket/collections/
 ```
 
-Configure rclone for R2:
+Configure rclone for S3-compatible storage:
 ```bash
-rclone config create r2 s3 \
-  provider Cloudflare \
-  access_key_id your-r2-access-key \
-  secret_access_key your-r2-secret-key \
-  endpoint https://your-account-id.r2.cloudflarestorage.com
+rclone config create remote s3 \
+  provider <your-provider> \
+  access_key_id your-access-key \
+  secret_access_key your-secret-key \
+  endpoint https://your-storage-endpoint.com
 ```
 
 ### Update Sitemap.xml
@@ -141,21 +143,21 @@ sitemap.add_section("blog", update_freq="daily", pages=5247)
 sitemap.add_collection(
     section="blog",
     collection_type="snapshot",
-    url="https://cdn.example.com/collections/blog-snapshot-2025-01-15.scp.gz",
-    generated="2025-01-15T00:00:00Z",
+    url="https://cdn.example.com/collections/blog-snapshot-day15.scp.gz",
+    generated="2000-01-15T00:00:00Z",
     pages=5247,
     size=52000000
 )
 
 # Add delta
 sitemap.add_delta(
     section="blog",
-    period="2025-01-15",
-    url="https://cdn.example.com/collections/blog-delta-2025-01-15.scp.gz",
-    generated="2025-01-15T23:00:00Z",
+    period="day15",
+    url="https://cdn.example.com/collections/blog-delta-day15.scp.gz",
+    generated="2000-01-15T23:00:00Z",
     pages=47,
     size=480000,
-    since="2025-01-14T00:00:00Z"
+    since="2000-01-14T00:00:00Z"
 )
 
 # Save sitemap
@@ -173,14 +175,14 @@ sitemap.save("sitemap.xml")
 # Generate snapshot
 python generate_snapshot.py
 
-# Upload to R2
-rclone copy blog-snapshot.scp.gz r2:your-bucket/collections/
+# Upload to object storage
+rclone copy blog-snapshot.scp.gz remote:your-bucket/collections/
 
 # Update sitemap
 python update_sitemap.py
 
 # Upload sitemap
-rclone copy sitemap.xml r2:your-bucket/
+rclone copy sitemap.xml remote:your-bucket/
 ```
 
 Schedule with cron:

docs/index.md

Lines changed: 32 additions & 26 deletions
@@ -2,36 +2,48 @@
 
 ## What is SCP?
 
-The Site Content Protocol (SCP) is a collection-based format for efficiently serving web content to crawlers while users continue accessing regular HTML pages.
+The Site Content Protocol (SCP) is a format for serving clean, structured web content to AI training systems and search engines. Websites provide pre-generated JSON collections optimized for machine consumption, while end users continue accessing regular HTML pages.
 
 ## Problem
 
-Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web-pages designed for human viewing.
-With the explosion of AI crawlers, this traffic has become a significant cost for websites and strain on internet infrastructure.
+AI training systems and search engines need massive web content datasets, but current HTML scraping approaches create three critical problems:
 
-Sources:
-
-- [Cloudflare Year in Review 2025](https://radar.cloudflare.com/year-in-review/2025)
-- [FOSS Infrastructure Under Attack by AI Companies](https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/)
-- [Web Scraping Market Report 2025](https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/)
+1. **Low-quality training data** - Content extracted from HTML is contaminated with navigation menus, advertisements, boilerplate text, and formatting markup, degrading model training quality.
+2. **High infrastructure costs** - Processing complete HTML/CSS/JavaScript responses for millions of pages creates substantial bandwidth and computational overhead for both publishers and crawlers.
+3. **Legal and ethical uncertainty** - Automated scraping exists in a gray area. Websites lack a clear, voluntary mechanism to contribute high-quality content to AI training while maintaining control over their intellectual property.
 
 ## Solution
 
-Websites pre-generate compressed collections and host them on CDN or Cloud Object Storage:
+SCP provides a voluntary, structured alternative to HTML scraping:
+
+**For Publishers:**
+
+- Generate clean JSON collections from your CMS/database (not HTML parsing)
+- Host compressed files on CDN or object storage
+- Declare collection availability in sitemap.xml
+- Maintain full control over what content is included
+
+**For Crawlers:**
+
+- Download entire content sections in one request
+- Receive structured data optimized for training/indexing
+- Use efficient delta updates (only changed pages)
+- Respect publisher-provided content boundaries
 
-1. Website generates `blog-snapshot-2025-01-15.scp.gz` (5,247 pages → 52 MB)
+**Example:**
+
+1. Website generates `blog-snapshot-day15.scp.gz` (5,247 pages → 52 MB)
 2. Uploads to CDN or Cloud Object Storage
-3. Declares availability of content collections in sitemap.xml
-4. Crawler downloads entire collection in one request
-5. Later: crawler downloads delta `blog-delta-2025-01-16.scp.gz` (47 pages → 480 KB)
+3. Crawler downloads entire collection in one request
+4. Later: crawler downloads delta `blog-delta-day16.scp.gz` (47 pages → 480 KB)
 
 ### Expected Impact
 
-- 50-60% bandwidth reduction for initial snapshots
-- 90-95% bandwidth reduction with delta updates
-- Faster parsing than HTML/CSS/JS
-- 90% fewer requests (one download fetches entire sections)
-- Zero impact on user experience
+- **Clean training data**: Structured content without navigation menus, ads, boilerplate, or formatting markup
+- **Voluntary contribution**: Clear mechanism for sites to contribute high-quality content to AI training with explicit consent
+- **Reduced infrastructure costs**: Lower bandwidth and processing overhead for both publishers and crawlers
+- **Efficient updates**: Delta collections deliver only changed pages, minimizing redundant transfers
+- **Zero user impact**: End users continue accessing regular HTML pages
 
 ## Documentation
 
@@ -54,16 +66,10 @@ Websites pre-generate compressed collections and host them on CDN or Cloud Objec
 
 **Next Steps**:
 
-1. Community feedback (3 months)
+1. Community feedback (1 month)
    - Post to Hacker News, Reddit, tech blogs
    - Iterate on spec based on feedback
-2. Creation of IETF Internet-Draft (2 months)
-
-**Future**:
-
-- Bot verification using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/)
-- Pay-per-crawl model similar to [Cloudflare's Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/)
-
+2. Update of IETF Internet-Draft (2 weeks)
 
 ## License
 
reference-impl/scp/parser.py

Lines changed: 65 additions & 24 deletions
@@ -2,13 +2,16 @@
 
 from __future__ import annotations
 
+import logging
 from pathlib import Path
 
 import orjson
 
 from scp import checksum, compression, schema
 from scp.exceptions import ChecksumError, DecompressionError, SizeLimitError, ValidationError
 
+logger = logging.getLogger(__name__)
+
 
 class CollectionMetadata:
     """Collection metadata from first line of SCP file."""
@@ -66,14 +69,16 @@ class SCPParser:
     Parses line-by-line for memory efficiency.
     """
 
-    def __init__(self, validate: bool = True, strict: bool = False):
+    def __init__(self, strict: bool = False):
         """Initialize parser.
 
         Args:
-            validate: Whether to validate against JSON schemas (default: True)
             strict: Whether to raise on non-fatal errors like unknown content blocks
+
+        Note:
+            Validation is mandatory per SCP v0.1 specification.
+            Parsers MUST validate collection metadata and page objects.
         """
-        self.validate = validate
         self.strict = strict
         self.metadata: CollectionMetadata | None = None
         self.pages: list[Page] = []
@@ -130,9 +135,8 @@ def parse_file(self, file_path: str | Path) -> tuple[CollectionMetadata, list[Pa
         if "collection" not in metadata_dict:
             raise ValidationError("First line must contain collection metadata")
 
-        # Validate metadata
-        if self.validate:
-            schema.validate_collection_metadata(metadata_dict)
+        # Validate metadata (mandatory per SCP v0.1 spec)
+        schema.validate_collection_metadata(metadata_dict)
 
         self.metadata = CollectionMetadata(metadata_dict)
 
@@ -168,16 +172,34 @@ def parse_file(self, file_path: str | Path) -> tuple[CollectionMetadata, list[Pa
                     f"({page_size} > {schema.MAX_PAGE_SIZE})"
                 )
 
-            # Validate page
-            if self.validate:
-                try:
-                    schema.validate_page(page_dict)
-                except ValidationError as e:
-                    error_msg = f"Page validation failed at line {line_num}: {e}"
-                    if self.strict:
-                        raise ValidationError(error_msg) from e
-                    self._errors.append(error_msg)
-                    continue
+            # Validate page (mandatory per SCP v0.1 spec)
+            try:
+                schema.validate_page(page_dict)
+            except ValidationError as e:
+                error_msg = f"Page validation failed at line {line_num}: {e}"
+                if self.strict:
+                    raise ValidationError(error_msg) from e
+                self._errors.append(error_msg)
+                continue
+
+            # Check for unknown content block types (SCP v0.1 spec: MUST log warning, MUST continue)
+            if "content" in page_dict:
+                for block_idx, block in enumerate(page_dict["content"]):
+                    if isinstance(block, dict) and "type" in block:
+                        try:
+                            is_known = schema.validate_content_block(block)
+                            if not is_known:
+                                # MUST log warning for unknown types
+                                warning_msg = (
+                                    f"Unknown content block type '{block['type']}' "
+                                    f"at line {line_num}, block {block_idx}. "
+                                    f"Skipping block and continuing (per SCP v0.1 spec)."
+                                )
+                                logger.warning(warning_msg)
+                                self._errors.append(warning_msg)
+                        except ValidationError:
+                            # Known type but validation failed - already handled by page validation
+                            pass
 
             try:
                 page = Page(page_dict)
@@ -214,10 +236,9 @@ def parse_bytes(self, data: bytes) -> tuple[CollectionMetadata, list[Page]]:
         if not lines:
             raise ValidationError("Empty data")
 
-        # Parse metadata
+        # Parse metadata (mandatory per SCP v0.1 spec)
         metadata_dict = orjson.loads(lines[0])
-        if self.validate:
-            schema.validate_collection_metadata(metadata_dict)
+        schema.validate_collection_metadata(metadata_dict)
         self.metadata = CollectionMetadata(metadata_dict)
 
         # Verify checksum
@@ -231,26 +252,46 @@ def parse_bytes(self, data: bytes) -> tuple[CollectionMetadata, list[Page]]:
                 continue
 
             page_dict = orjson.loads(line)
-            if self.validate:
-                schema.validate_page(page_dict)
+            # Validate page (mandatory per SCP v0.1 spec)
+            schema.validate_page(page_dict)
+
+            # Check for unknown content block types (SCP v0.1 spec: MUST log warning, MUST continue)
+            if "content" in page_dict:
+                for block_idx, block in enumerate(page_dict["content"]):
+                    if isinstance(block, dict) and "type" in block:
+                        try:
+                            is_known = schema.validate_content_block(block)
+                            if not is_known:
+                                # MUST log warning for unknown types
+                                warning_msg = (
+                                    f"Unknown content block type '{block['type']}' "
+                                    f"at line {line_num}, block {block_idx}. "
+                                    f"Skipping block and continuing (per SCP v0.1 spec)."
+                                )
+                                logger.warning(warning_msg)
+                        except ValidationError:
+                            # Known type but validation failed - already handled by page validation
+                            pass
 
             self.pages.append(Page(page_dict))
 
         return self.metadata, self.pages
 
 
 def parse_collection(
-    file_path: str | Path, validate: bool = True, strict: bool = False
+    file_path: str | Path, strict: bool = False
 ) -> tuple[CollectionMetadata, list[Page]]:
     """Parse SCP collection file (convenience function).
 
     Args:
         file_path: Path to .scp, .scp.gz, or .scp.zst file
-        validate: Whether to validate against JSON schemas
         strict: Whether to raise on non-fatal errors
 
     Returns:
         Tuple of (metadata, pages)
+
+    Note:
+        Validation is mandatory per SCP v0.1 specification.
     """
-    parser = SCPParser(validate=validate, strict=strict)
+    parser = SCPParser(strict=strict)
     return parser.parse_file(file_path)
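
Review note: the unknown-content-block check is added twice in this file, once in `parse_file` and once in `parse_bytes`, and the two copies differ slightly (only the `parse_file` copy appends to `self._errors`, and the `parse_bytes` copy relies on a `line_num` that the visible hunk does not define). A shared helper might keep them in step. The following is a minimal sketch under stated assumptions, not the project's code: `collect_unknown_block_warnings` and the `is_known_block` callback are hypothetical names, and the local `ValidationError` is a stand-in for `scp.exceptions.ValidationError`. The callback is assumed to behave like `schema.validate_content_block` appears to in this diff (returns a boolean, raises on a known type that fails validation).

```python
from __future__ import annotations

import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Stand-in for scp.exceptions.ValidationError (assumption for this sketch)."""


def collect_unknown_block_warnings(
    page_dict: dict[str, Any],
    line_num: int,
    is_known_block: Callable[[dict[str, Any]], bool],
) -> list[str]:
    """Log and return one warning per content block whose type is not recognized."""
    warnings: list[str] = []
    for block_idx, block in enumerate(page_dict.get("content") or []):
        # Only dict blocks that declare a type can be classified.
        if not (isinstance(block, dict) and "type" in block):
            continue
        try:
            known = is_known_block(block)
        except ValidationError:
            # Known type that failed validation: left to page-level validation.
            continue
        if not known:
            msg = (
                f"Unknown content block type '{block['type']}' "
                f"at line {line_num}, block {block_idx}."
            )
            logger.warning(msg)
            warnings.append(msg)
    return warnings
```

Both call sites could then do something like `self._errors.extend(collect_unknown_block_warnings(page_dict, line_num, schema.validate_content_block))`, which would also make the two methods record warnings the same way.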

reference-impl/tests/test_parser.py

Lines changed: 0 additions & 8 deletions
@@ -124,14 +124,6 @@ def test_parse_with_checksum(tmp_path: Path) -> None:
     assert metadata.checksum.startswith("sha256:")
 
 
-def test_parse_without_validation(example_snapshot: Path) -> None:
-    """Test parsing without schema validation."""
-    scp_parser = parser.SCPParser(validate=False)
-    metadata, pages = scp_parser.parse_file(example_snapshot)
-
-    assert len(pages) == 2
-
-
 def test_parse_delta_collection(tmp_path: Path) -> None:
     """Test parsing delta collection."""
     gen = generator.SCPGenerator("test-delta", "blog", "delta", since="2025-01-14T00:00:00Z")
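
Review note: this commit deletes `test_parse_without_validation` without adding a test for the new behavior. Because `validate` is no longer a parameter of `SCPParser.__init__`, existing callers that pass `validate=False` would be expected to fail with `TypeError`. A replacement test could pin that down. The sketch below is illustrative only: `StubParser` is a hypothetical stand-in that mirrors the new signature shown in this diff, and `rejects_validate_kwarg` is an assumed helper name. In the repository the real `parser.SCPParser` would be passed instead.

```python
from __future__ import annotations


class StubParser:
    """Hypothetical stand-in mirroring the new __init__(self, strict=False) signature."""

    def __init__(self, strict: bool = False) -> None:
        self.strict = strict


def rejects_validate_kwarg(parser_cls: type) -> bool:
    """Return True if constructing parser_cls with validate=False raises TypeError."""
    try:
        parser_cls(validate=False)
    except TypeError:
        # Python raises TypeError for an unexpected keyword argument.
        return True
    return False
```

A test body could then be as short as `assert rejects_validate_kwarg(parser.SCPParser)`.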
