Commit 513a5b0

fix: Malte's review: add test -s for logs. add fsspec to recs. grammar fix.
1 parent 32ba031 commit 513a5b0

3 files changed

Lines changed: 6 additions & 1 deletion

Makefile

Lines changed: 4 additions & 0 deletions
@@ -41,17 +41,21 @@ cdxj:
 cdxj-remote-https:
 	@echo "indexing End-of-Term-2024 Internet Archive WARC over HTTPS (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_IA_WARC_HTTPS) 2>/dev/null | head -n 10 | tee eot-ia.cdxj
+	@test -s eot-ia.cdxj || { echo "ERROR: no records indexed from $(EOT_IA_WARC_HTTPS) -- check network connectivity"; exit 1; }
 	@echo
 	@echo "indexing End-of-Term-2024 Common Crawl repackage WARC over HTTPS (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_CC_WARC_HTTPS) 2>/dev/null | head -n 10 | tee eot-cc.cdxj
+	@test -s eot-cc.cdxj || { echo "ERROR: no records indexed from $(EOT_CC_WARC_HTTPS) -- check network connectivity"; exit 1; }
 
 cdxj-remote-s3:
 	@echo "!! this step requires authentication via S3 credentials (even though it is free)"
 	@echo "indexing End-of-Term-2024 Internet Archive WARC over S3 (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_IA_WARC_S3) 2>/dev/null | head -n 10 | tee eot-ia.cdxj
+	@test -s eot-ia.cdxj || { echo "ERROR: no records indexed from $(EOT_IA_WARC_S3) -- check network connectivity and S3 credentials"; exit 1; }
 	@echo
 	@echo "indexing End-of-Term-2024 Common Crawl repackage WARC over S3 (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_CC_WARC_S3) 2>/dev/null | head -n 10 | tee eot-cc.cdxj
+	@test -s eot-cc.cdxj || { echo "ERROR: no records indexed from $(EOT_CC_WARC_S3) -- check network connectivity and S3 credentials"; exit 1; }
 
 extract:
 	@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
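
Why the added `test -s` lines matter: in a plain POSIX shell (no `pipefail`), the exit status of `cdxj-indexer ... | head -n 10 | tee file` is the status of `tee`, so a failed download still leaves the recipe "successful" with an empty file, and `2>/dev/null` hides the error text. The sketch below reproduces that behaviour with stand-in commands. `require_nonempty` is a hypothetical helper written for illustration, not something in the repository, and `false` / `printf` stand in for a failing and a working `cdxj-indexer` run.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): succeed only if the file
# exists and is non-empty, which is what `test -s` checks.
require_nonempty() {
    # $1: path to check, $2: description of the source for the error message
    if test -s "$1"; then
        return 0
    fi
    echo "ERROR: no records indexed from $2 -- check network connectivity" >&2
    return 1
}

tmpdir=$(mktemp -d)

# `false` stands in for an indexer run that fails and prints nothing.
# Without pipefail, the pipeline reports the status of tee, the last command.
pipeline_status=0
false | head -n 10 | tee "$tmpdir/empty.cdxj" || pipeline_status=$?
echo "pipeline exit status: $pipeline_status"

# `printf` stands in for an indexer run that emits records.
printf 'record-1\nrecord-2\n' | head -n 10 | tee "$tmpdir/full.cdxj" >/dev/null

require_nonempty "$tmpdir/full.cdxj" "stand-in source" && echo "full.cdxj: ok"
require_nonempty "$tmpdir/empty.cdxj" "stand-in source" || echo "empty.cdxj: guard tripped"
```

Checking the output file rather than the pipeline status also sidesteps the case where `head` closes the pipe early and the upstream command exits on a broken pipe even though ten records were written.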

README.md

Lines changed: 1 addition & 1 deletion
@@ -268,7 +268,7 @@ cdxj-indexer https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2024/segments/CC
 ```
 </details>
 
-The first command fetches and indexes these two WARC over HTTPS. Since they are both around 1GB each, so we display and save only the first 10 records.
+The first command fetches and indexes these two WARCs over HTTPS. Since they are both around 1GB each, we display and save only the first 10 records.
 
 If you have AWS credentials configured, you can also access the same files over S3, which is faster when running on AWS. Even though you will need AWS credentials for authentication purposes, this process is still free of charge since these are public buckets.
 If you do not have AWS credentials, you can access the same information over HTTPS as described above.
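
The README's "S3 if you have credentials, HTTPS otherwise" advice could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `choose_cdxj_target` is a hypothetical helper, and it looks only at the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared credentials file, so credentials supplied another way (SSO, instance roles, named profiles stored elsewhere) would not be detected. The target names `cdxj-remote-s3` and `cdxj-remote-https` come from the Makefile in this commit.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): print the Makefile target to
# run, depending on whether AWS credentials appear to be configured.
choose_cdxj_target() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        # Credentials supplied through environment variables.
        echo "cdxj-remote-s3"
    elif [ -s "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
        # A non-empty shared credentials file exists.
        echo "cdxj-remote-s3"
    else
        # No credentials found by this simplified check: fall back to HTTPS.
        echo "cdxj-remote-https"
    fi
}
```

It would be used as `make "$(choose_cdxj_target)"`. The check only says that credentials look present, not that they are valid; the `test -s` guard in the `cdxj-remote-s3` recipe is still what catches a failed run.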

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 warcio[s3]>=1.8.0
 cdx_toolkit
 duckdb
+fsspec
 pyarrow
 pandas
 polars
