Commit 5310f13

demonstrate compress-at-record access to WARC file using only JWARC
1 parent 0edfbc2 commit 5310f13

2 files changed

Lines changed: 89 additions & 36 deletions

Makefile

Lines changed: 21 additions & 8 deletions
@@ -65,26 +65,39 @@ iterate: build
 # @echo "warning! this might take 1-10 minutes"
 # python duck.py cloudfront
 #
-wreck_the_warc: build
+get_jwarc:
+	@echo "downloading JWarc JAR"
+	curl -fL -o jwarc-0.33.0.jar https://github.com/iipc/jwarc/releases/download/v0.33.0/jwarc-0.33.0.jar
+
+wreck_the_warc: build get_jwarc
 	@echo
 	@echo we will break and then fix this warc
 	cp data/whirlwind.warc.gz data/testing.warc.gz
 	rm -f data/testing.warc
 	gzip -d data/testing.warc.gz # windows gunzip no work-a
 	@echo
-	@echo iterate over this uncompressed warc: works
-	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc"
-	@echo
 	@echo compress it the wrong way
 	gzip data/testing.warc
 	@echo
-	@echo iterating over this compressed warc fails
-	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz" || /usr/bin/true
+	@echo showing the records in the compressed warc - note the offsets of request and response are
+	java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+	@echo
+	@echo access the request record - failing
+	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
+	@echo
+	@echo access the response record - failing
+	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
 	@echo
 	@echo "now let's do it the right way"
 	gzip -d data/testing.warc.gz
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
 	@echo
-	@echo and now iterating works
-	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz"
+	@echo showing the records in the compressed warc - note the skewed offsets of request and response
+	java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+	@echo
+	@echo access the request record - works
+	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 518 | head
+	@echo
+	@echo access the response record - works
+	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 1027 | head -n 20
 	@echo

README.md

Lines changed: 68 additions & 28 deletions
@@ -205,11 +205,13 @@ TBA
 As mentioned earlier, WARC/WET/WAT files look like they're gzipped, but they're actually gzipped in a particular way that allows random access. This means that you can't `gunzip` and then `gzip` a warc without wrecking random access. This example:
 
 * creates a copy of one of the warc files in the repo
+* lists the records and their offsets using JWARC
+* accesses one of the records in the middle of the archive to show that it works
 * uncompresses it
 * recompresses it the wrong way
-* runs `org.commoncrawl.whirlwind.ReadWARC` over it to show that it triggers an error (in fact in java it does not trigger an error... )
+* accesses one of the records in the middle of the wrongly compressed file to show that it fails
 * recompresses it the right way using `org.commoncrawl.whirlwind.RecompressWARC`
-* shows that this compressed file works
+* accesses one of the records in the middle of the archive again to show that it now works
 
 Run
 
@@ -226,40 +228,78 @@ cp data/whirlwind.warc.gz data/testing.warc.gz
 rm -f data/testing.warc
 gzip -d data/testing.warc.gz # windows gunzip no work-a
 
-iterate over this uncompressed warc: works
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc"
-WARC-Type: warcinfo
-WARC-Type: request
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-WARC-Type: response
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-WARC-Type: metadata
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-
 compress it the wrong way
 gzip data/testing.warc
 
-iterating over this compressed warc fails
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz" || /usr/bin/true
-This file is probably not a multi-member gzip but a single gzip file.
-To allow seek, a gzipped WARC must have each record compressed into a single gzip member and concatenated together.
-
-This file is likely still valid and can be fixed by running:
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="testing.warc testing.warc.gz"
+showing the records in the compressed warc - note the offsets of request and response are
+java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+0 warcinfo - -
+3734 request GET https://an.wikipedia.org/wiki/Escopete
+3734 response 200 https://an.wikipedia.org/wiki/Escopete
+18386 metadata - https://an.wikipedia.org/wiki/Escopete
+
+access the request record - failing
+java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
+Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
+at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
+at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
+at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
+at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
+
+access the response record - failing
+java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
+Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
+at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
+at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
+at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
+at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
 
 now let's do it the right way
 gzip -d data/testing.warc.gz
 mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
 
-and now iterating works
-mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz"
-WARC-Type: warcinfo
-WARC-Type: request
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-WARC-Type: response
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-WARC-Type: metadata
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+showing the records in the compressed warc - note the skewed offsets of request and response
+java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+0 warcinfo - -
+518 request GET https://an.wikipedia.org/wiki/Escopete
+1027 response 200 https://an.wikipedia.org/wiki/Escopete
+18383 metadata - https://an.wikipedia.org/wiki/Escopete
+
+access the request record - works
+java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 518 | head
+WARC/1.0
+Content-Length: 265
+Content-Type: application/http; msgtype=request
+WARC-Block-Digest: sha1:IE7NEN3QEJHUCYRRGVMHDDW3BEHFRQ6V
+WARC-Date: 2024-05-18T01:58:10Z
+WARC-IP-Address: 208.80.154.224
+WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
+WARC-Record-ID: <urn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f>
+WARC-Target-URI: https://an.wikipedia.org/wiki/Escopete
+WARC-Type: request
+
+access the response record - works
+java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 1027 | head -n 20
+WARC/1.0
+Content-Length: 74581
+Content-Type: application/http; msgtype=response
+WARC-Block-Digest: sha1:35FTUGFVNWRVTZQGCWIX2MQA3LMYC7X7
+WARC-Concurrent-To: <urn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f>
+WARC-Date: 2024-05-18T01:58:10Z
+WARC-Identified-Payload-Type: text/html
+WARC-IP-Address: 208.80.154.224
+WARC-Payload-Digest: sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
+WARC-Record-ID: <urn:uuid:2aabeff2-67f5-4608-8466-e87c6296e2b6>
+WARC-Target-URI: https://an.wikipedia.org/wiki/Escopete
+WARC-Type: response
+WARC-Warcinfo-ID: <urn:uuid:668d88fc-4208-41fc-b327-1aa6cb783331>
+
+HTTP/1.1 200 OK
+date: Sat, 18 May 2024 01:58:10 GMT
+server: mw-web.eqiad.canary-bb67b76b8-jtwdb
+x-content-type-options: nosniff
+content-language: an
+origin-trial: AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9
 ```
 
 </details>
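
The difference between the two compression styles can be reproduced without a WARC file or JWARC. The following is a minimal sketch, assuming a POSIX shell with `gzip`, `tail`, `wc`, `head`, and `seq` available. The file names (`rec1.txt`, `rec2.txt`, `per_record.gz`, `single_member.gz`) are hypothetical and not part of this repo, and the "records" are toy text rather than real WARC records. It compresses two records once as separate, concatenated gzip members and once as a single member, then tries to decompress starting at the byte offset where the second record begins.

```shell
#!/bin/sh
# Toy demonstration only: file names and record contents are made up.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Two stand-in "records"; the second is padded so the archives are not tiny.
printf 'WARC-Type: request\n' > rec1.txt
{ printf 'WARC-Type: response\n'; seq 1 200; } > rec2.txt

# Right way: each record is its own gzip member, and members are concatenated.
gzip -c rec1.txt > per_record.gz
offset=$(($(wc -c < per_record.gz)))   # byte offset where the second member starts
gzip -c rec2.txt >> per_record.gz

# Wrong way: both records end up inside one gzip member.
cat rec1.txt rec2.txt | gzip -c > single_member.gz

# Skipping to the member boundary and decompressing from there works.
first_line=$(tail -c +$((offset + 1)) per_record.gz | gzip -dc | head -n 1)
echo "per-record gzip, first line at offset $offset: $first_line"

# The same offset falls inside the single deflate stream, so gzip rejects it.
if tail -c +$((offset + 1)) single_member.gz | gzip -dc > /dev/null 2>&1; then
    single_status="readable"
else
    single_status="unreadable"
fi
echo "single-member gzip at offset $offset: $single_status"
```

Both archives still decompress to the same bytes when read from the start; only seeking to a record offset distinguishes them, which is what the `jwarc extract` calls in the transcript exercise.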
