Commit 8244615

Task 5 (#5)
* ignore .idea, target
* add pom.xml, Readme.md and the data files
* add makefile
* add read warc
* add CI + spotless
* add figures, editorconfig, .gitignore from the python repository brother
* remove unclear make install, remove venv info from readme
* update read class, add recompress
* cleanup, removing the rest of the python stuff for task 0,1,2
* fix missing make install
* move data under 'data' directory
* add Apache header in the code
* make sure we build before running
* update .gitignore
* Implement WARC compression validation for Task 5
* Ignore gzip validation if is uncompressed
* fix compression check, update Readme.md
* add missing apache licence
* add commons-compress library
* Fix CI script
* place Github Actions in the correct directory
* Fix cache, update build description
* Fix formatting
* Fix method signature
* remove non-implemented part - to avoid confusion
* demonstrate compress-at-record access to WARC file using only JWARC
* fix: typos
1 parent f0809a2 commit 8244615

7 files changed

Lines changed: 273 additions & 70 deletions

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -14,8 +14,8 @@ jobs:
1414
with:
1515
java-version: '11'
1616
distribution: 'temurin'
17-
cache: 'mvn'
18-
- name: Build with Gradle
17+
cache: maven
18+
- name: Build with Maven
1919
run: mvn -B clean compile
2020
- name: Check with spotless
21-
run: mvn spotless:check
21+
run: mvn spotless:check

Makefile

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -63,19 +63,28 @@ wreck_the_warc: build get_jwarc
6363
rm -f data/testing.warc
6464
gzip -d data/testing.warc.gz # windows gunzip no work-a
6565
@echo
66-
@echo iterate over this uncompressed warc: works
67-
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc"
68-
@echo
6966
@echo compress it the wrong way
7067
gzip data/testing.warc
7168
@echo
72-
@echo iterating over this compressed warc fails
73-
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz" || /usr/bin/true
69+
@echo showing the records in the compressed warc - note the offsets of request and response are identical
70+
java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
71+
@echo
72+
@echo access the request record - failing
73+
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
74+
@echo
75+
@echo access the response record - failing
76+
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
7477
@echo
7578
@echo "now let's do it the right way"
7679
gzip -d data/testing.warc.gz
7780
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
7881
@echo
79-
@echo and now iterating works
80-
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.ReadWARC -Dexec.args="data/testing.warc.gz"
82+
@echo showing the records in the compressed warc - note the distinct offsets of request and response
83+
java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
84+
@echo
85+
@echo access the request record - works
86+
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 518 | head
87+
@echo
88+
@echo access the response record - works
89+
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 1027 | head -n 20
8190
@echo

README.md

Lines changed: 103 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -422,7 +422,109 @@ TBA
422422

423423
## Task 5: Wreck the WARC by compressing it wrong
424424

425-
TBA
425+
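As mentioned earlier, WARC/WET/WAT files look like normal gzipped files, but they are gzipped in a particular way that allows random access: each record is compressed into its own gzip member, and the members are concatenated. This means that you can't `gunzip` and then `gzip` a WARC without wrecking random access, because plain `gzip` writes the whole file as a single member. This example: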
426+
427+
* creates a copy of one of the WARC files in the repo
428+
* uncompresses it
429+
* recompresses it the wrong way
430+
* lists the records and their respective offsets using JWARC
431+
* tries to access records in the middle of the wrongly compressed archive, showing that it fails
432+
* uncompresses it again
433+
* recompresses it the right way using `org.commoncrawl.whirlwind.RecompressWARC`
434+
* lists the records again and accesses records in the middle of the archive, showing that it now works
435+
436+
Run
437+
438+
```make wreck_the_warc```
439+
440+
and read through the output. You should get something like the output below:
441+
442+
<details>
443+
<summary>Click to view output</summary>
444+
445+
```
446+
we will break and then fix this warc
447+
cp data/whirlwind.warc.gz data/testing.warc.gz
448+
rm -f data/testing.warc
449+
gzip -d data/testing.warc.gz # windows gunzip no work-a
450+
451+
compress it the wrong way
452+
gzip data/testing.warc
453+
454+
showing the records in the compressed warc - note the offsets of request and response are identical
455+
java -jar jwarc.jar ls data/testing.warc.gz
456+
0 warcinfo - -
457+
3734 request GET https://an.wikipedia.org/wiki/Escopete
458+
3734 response 200 https://an.wikipedia.org/wiki/Escopete
459+
18386 metadata - https://an.wikipedia.org/wiki/Escopete
460+
461+
access the request record - failing
462+
java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
463+
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
464+
at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
465+
at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
466+
at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
467+
at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
468+
469+
access the response record - failing
470+
java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
471+
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
472+
at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
473+
at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
474+
at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
475+
at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
476+
477+
now let's do it the right way
478+
gzip -d data/testing.warc.gz
479+
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
480+
481+
showing the records in the compressed warc
482+
java -jar jwarc.jar ls data/testing.warc.gz
483+
0 warcinfo - -
484+
518 request GET https://an.wikipedia.org/wiki/Escopete
485+
1027 response 200 https://an.wikipedia.org/wiki/Escopete
486+
18383 metadata - https://an.wikipedia.org/wiki/Escopete
487+
488+
access the request record - works
489+
java -jar jwarc.jar extract data/testing.warc.gz 518 | head
490+
WARC/1.0
491+
Content-Length: 265
492+
Content-Type: application/http; msgtype=request
493+
WARC-Block-Digest: sha1:IE7NEN3QEJHUCYRRGVMHDDW3BEHFRQ6V
494+
WARC-Date: 2024-05-18T01:58:10Z
495+
WARC-IP-Address: 208.80.154.224
496+
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
497+
WARC-Record-ID: <urn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f>
498+
WARC-Target-URI: https://an.wikipedia.org/wiki/Escopete
499+
WARC-Type: request
500+
501+
access the response record - works
502+
java -jar jwarc.jar extract data/testing.warc.gz 1027 | head -n 20
503+
WARC/1.0
504+
Content-Length: 74581
505+
Content-Type: application/http; msgtype=response
506+
WARC-Block-Digest: sha1:35FTUGFVNWRVTZQGCWIX2MQA3LMYC7X7
507+
WARC-Concurrent-To: <urn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f>
508+
WARC-Date: 2024-05-18T01:58:10Z
509+
WARC-Identified-Payload-Type: text/html
510+
WARC-IP-Address: 208.80.154.224
511+
WARC-Payload-Digest: sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
512+
WARC-Record-ID: <urn:uuid:2aabeff2-67f5-4608-8466-e87c6296e2b6>
513+
WARC-Target-URI: https://an.wikipedia.org/wiki/Escopete
514+
WARC-Type: response
515+
WARC-Warcinfo-ID: <urn:uuid:668d88fc-4208-41fc-b327-1aa6cb783331>
516+
517+
HTTP/1.1 200 OK
518+
date: Sat, 18 May 2024 01:58:10 GMT
519+
server: mw-web.eqiad.canary-bb67b76b8-jtwdb
520+
x-content-type-options: nosniff
521+
content-language: an
522+
origin-trial: AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9
523+
```
524+
525+
</details>
526+
527+
Make sure you compress WARCs the right way!
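
The idea behind record-level compression can be illustrated with a minimal sketch that uses only `java.util.zip`. It is not part of this repository: the class name `GzipMemberDemo` and its helper methods are hypothetical, and the "records" are plain strings rather than real WARC records. Each string is compressed into its own gzip member, the members are concatenated, and the second one is then decompressed directly from its byte offset without touching the first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the repository.
public class GzipMemberDemo {

    /** Compresses one record into its own complete gzip member. */
    public static byte[] gzipMember(String record) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Decompresses the single gzip member that occupies file[offset, offset + length). */
    public static String readMemberAt(byte[] file, int offset, int length) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzipMember("record one\n");
        byte[] second = gzipMember("record two\n");

        // Concatenating complete members yields a valid multi-member gzip file.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(first);
        file.write(second);

        // Random access: start decompressing at the offset of the second member.
        System.out.print(readMemberAt(file.toByteArray(), first.length, second.length));
    }
}
```

If the same data had been written as one gzip member, there would be no valid gzip header at the second record's offset, which is the situation the failing `extract` calls above run into.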
426528

427529
## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
428530

pom.xml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,11 +16,17 @@
1616
</properties>
1717

1818
<dependencies>
19+
<dependency>
20+
<groupId>org.apache.commons</groupId>
21+
<artifactId>commons-compress</artifactId>
22+
<version>1.28.0</version>
23+
</dependency>
1924
<dependency>
2025
<groupId>org.netpreserve</groupId>
2126
<artifactId>jwarc</artifactId>
2227
<version>0.33.0</version>
2328
</dependency>
29+
2430
</dependencies>
2531

2632
<build>

src/main/java/org/commoncrawl/whirlwind/ReadWARC.java

Lines changed: 38 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -29,33 +29,42 @@
2929

3030
public class ReadWARC {
3131

32-
public static void main(String[] args) throws IOException {
33-
34-
if (args.length != 1) {
35-
System.err.println("Usage: java ReadWARC <input-warc-file>");
36-
System.exit(1);
37-
}
38-
39-
Path requested = Path.of(args[0]).toAbsolutePath().normalize();
40-
if (!Files.isRegularFile(requested)) {
41-
throw new SecurityException("Invalid WARC path");
42-
}
43-
44-
final List<String> RESPONSE_TYPES = Arrays.asList("request", "response", "conversion", "metadata");
45-
46-
try (
47-
InputStream in = Files.newInputStream(requested);
48-
WarcReader reader = new WarcReader(in)
49-
) {
50-
reader.records().forEach(record -> {
51-
System.out.println(" WARC-Type: " + record.type());
52-
if (RESPONSE_TYPES.contains(record.type())) {
53-
MessageHeaders headers = record.headers();
54-
for (String header : headers.all("WARC-Target-URI")) {
55-
System.out.println(" WARC-Target-URI " + header);
56-
}
57-
}
58-
});
59-
}
60-
}
32+
private static final List<String> RESPONSE_TYPES = Arrays.asList("request", "response", "conversion", "metadata");
33+
34+
public static void main(String[] args) throws IOException {
35+
36+
if (args.length != 1) {
37+
System.err.println("Usage: java ReadWARC <input-warc-file>");
38+
System.exit(1);
39+
}
40+
41+
Path requested = Path.of(args[0]).toAbsolutePath().normalize();
42+
if (!Files.isRegularFile(requested)) {
43+
throw new SecurityException("Invalid WARC path");
44+
}
45+
46+
if (requested.toString().endsWith("gz") || requested.toString().endsWith("gzip")) {
47+
try {
48+
ValidateWARC.validateRandomAccessWarcOrFail(requested);
49+
} catch (IOException e) {
50+
System.out.println("This file is probably not a multi-member gzip but a single gzip file." + "\n"
51+
+ "To allow seek, a gzipped WARC must have each record compressed into a single gzip member and concatenated together."
52+
+ "\n" + "\n" + "This file is likely still valid and can be fixed by running:" + "\n"
53+
+ "mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args=\"testing.warc testing.warc.gz\"");
54+
System.exit(-1);
55+
}
56+
}
57+
58+
try (InputStream in = Files.newInputStream(requested); WarcReader reader = new WarcReader(in)) {
59+
reader.records().forEach(record -> {
60+
System.out.println(" WARC-Type: " + record.type());
61+
if (RESPONSE_TYPES.contains(record.type())) {
62+
MessageHeaders headers = record.headers();
63+
for (String header : headers.all("WARC-Target-URI")) {
64+
System.out.println(" WARC-Target-URI " + header);
65+
}
66+
}
67+
});
68+
}
69+
}
6170
}

src/main/java/org/commoncrawl/whirlwind/RecompressWARC.java

Lines changed: 29 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -32,39 +32,38 @@
3232

3333
public class RecompressWARC {
3434

35-
public static void main(String[] args) throws IOException {
35+
public static void main(String[] args) throws IOException {
3636

37-
if (args.length != 2) {
38-
System.err.println("Usage: java RecompressWarc <input-uncompressed-warc-file> <output-compressed-warc-file>");
39-
System.exit(1);
40-
}
37+
if (args.length != 2) {
38+
System.err
39+
.println("Usage: java RecompressWARC <input-uncompressed-warc-file> <output-compressed-warc-file>");
40+
System.exit(1);
41+
}
4142

42-
Path inputPath = Path.of(args[0]).toAbsolutePath().normalize();
43-
Path outputPath = Path.of(args[1]).toAbsolutePath().normalize();
43+
Path inputPath = Path.of(args[0]).toAbsolutePath().normalize();
44+
Path outputPath = Path.of(args[1]).toAbsolutePath().normalize();
4445

45-
if (!Files.isRegularFile(inputPath)) {
46-
throw new SecurityException("Invalid input WARC path");
47-
}
46+
if (!Files.isRegularFile(inputPath)) {
47+
throw new SecurityException("Invalid input WARC path");
48+
}
4849

49-
if (inputPath.endsWith(".gz")) {
50-
System.out.println("Input WARC file is already compressed");
51-
System.exit(1);
52-
}
50+
if (inputPath.toString().endsWith(".gz")) {
51+
System.out.println("Input WARC file is already compressed");
52+
System.exit(1);
53+
}
5354

54-
try (
55-
InputStream in = Files.newInputStream(inputPath);
56-
WarcReader reader = new WarcReader(in);
57-
OutputStream out = Files.newOutputStream(outputPath);
58-
WritableByteChannel outChannel = Channels.newChannel(out);
59-
WarcWriter writer = new WarcWriter(outChannel, WarcCompression.GZIP)
60-
) {
61-
reader.forEach(record -> {
62-
try {
63-
writer.write(record);
64-
} catch (IOException e) {
65-
throw new UncheckedIOException(e);
66-
}
67-
});
68-
}
69-
}
55+
try (InputStream in = Files.newInputStream(inputPath);
56+
WarcReader reader = new WarcReader(in);
57+
OutputStream out = Files.newOutputStream(outputPath);
58+
WritableByteChannel outChannel = Channels.newChannel(out);
59+
WarcWriter writer = new WarcWriter(outChannel, WarcCompression.GZIP)) {
60+
reader.forEach(record -> {
61+
try {
62+
writer.write(record);
63+
} catch (IOException e) {
64+
throw new UncheckedIOException(e);
65+
}
66+
});
67+
}
68+
}
7069
}
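
A note on extension checks with `java.nio.file.Path`: `Path.endsWith(String)` compares whole path elements, not string suffixes, so it does not behave like `String.endsWith`. The sketch below shows the difference; the class `PathSuffixDemo` and the helper `hasGzSuffix` are hypothetical names used only for illustration.

```java
import java.nio.file.Path;

// Hypothetical demo class, not part of the repository.
public class PathSuffixDemo {

    /** Returns true if the file name ends with the string suffix ".gz". */
    public static boolean hasGzSuffix(Path path) {
        return path.toString().endsWith(".gz");
    }

    public static void main(String[] args) {
        Path warc = Path.of("data", "testing.warc.gz");

        // Path.endsWith matches trailing path elements: "testing.warc.gz" matches, ".gz" does not.
        System.out.println(warc.endsWith(".gz"));
        System.out.println(warc.endsWith("testing.warc.gz"));

        // A string comparison on the path text checks the suffix instead.
        System.out.println(hasGzSuffix(warc));
    }
}
```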
Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
package org.commoncrawl.whirlwind;
18+
19+
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
20+
21+
import java.io.BufferedInputStream;
22+
import java.io.IOException;
23+
import java.io.InputStream;
24+
import java.nio.file.Files;
25+
import java.nio.file.Path;
26+
import java.util.concurrent.atomic.AtomicInteger;
27+
28+
public class ValidateWARC {
29+
public static void main(String[] args) throws Exception {
30+
if (args.length != 1) {
31+
System.err.println("Usage: java ValidateWARC <file.gz>");
32+
System.exit(2);
33+
}
34+
35+
Path requested = Path.of(args[0]).toAbsolutePath().normalize();
36+
if (!Files.isRegularFile(requested)) {
37+
throw new SecurityException("Invalid WARC path");
38+
}
39+
40+
int n = getWarcCompressionInformation(requested);
41+
if (n <= 1) {
42+
System.out.println("Single-member gzip (likely whole-file gzip). members=" + n);
43+
} else {
44+
System.out.println("Concatenated multi-member gzip (record-compressed). members=" + n);
45+
}
46+
47+
}
48+
49+
public static int getWarcCompressionInformation(Path inputWarc) throws IllegalArgumentException {
50+
final AtomicInteger memberCount = new AtomicInteger(0);
51+
52+
try (InputStream fis = Files.newInputStream(inputWarc);
53+
BufferedInputStream bis = new BufferedInputStream(fis);
54+
GzipCompressorInputStream gz = GzipCompressorInputStream.builder().setDecompressConcatenated(true)
55+
.setOnMemberEnd(x -> memberCount.incrementAndGet()).setInputStream(bis).get()) {
56+
57+
byte[] buf = new byte[64 * 1024];
58+
while (gz.read(buf) != -1) {
59+
// Read the entire stream to trigger member processing
60+
// We might not need to read the whole stream, just enough to get an idea
61+
}
62+
} catch (IOException e) {
63+
throw new IllegalArgumentException("The file is either not a gzip file or is corrupted.", e);
64+
}
65+
66+
return memberCount.get();
67+
}
68+
69+
public static void validateRandomAccessWarcOrFail(Path inputWarc) throws IOException {
70+
int n = getWarcCompressionInformation(inputWarc);
71+
72+
if (n <= 1) {
73+
throw new IOException(
74+
"Non-chunked gzip file detected, gzip block continues\n" + " beyond single record. " + n);
75+
}
76+
77+
}
78+
}
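
For comparison, here is a rough sketch of how gzip member boundaries could be located with the JDK alone, without commons-compress. It is an illustration under assumptions, not the approach used in `ValidateWARC`: the class `GzipMemberOffsets` and its methods are hypothetical, the whole file is assumed to fit in a byte array, and truncated input is only partially guarded against. It walks the file member by member, parsing each gzip header as laid out in RFC 1952 and inflating the raw deflate stream to find where the member ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of the repository.
public class GzipMemberOffsets {

    // Header flag bits defined by the gzip format (RFC 1952).
    private static final int FHCRC = 2, FEXTRA = 4, FNAME = 8, FCOMMENT = 16;

    /** Returns the byte offset at which each gzip member in data starts. */
    public static List<Integer> memberOffsets(byte[] data) throws DataFormatException {
        List<Integer> offsets = new ArrayList<>();
        byte[] sink = new byte[8192]; // decompressed bytes are discarded
        int pos = 0;
        while (pos < data.length) {
            offsets.add(pos);
            int bodyStart = skipHeader(data, pos);
            Inflater inflater = new Inflater(true); // raw deflate, no zlib wrapper
            try {
                inflater.setInput(data, bodyStart, data.length - bodyStart);
                while (!inflater.finished()) {
                    if (inflater.inflate(sink) == 0 && inflater.needsInput()) {
                        throw new DataFormatException("truncated gzip member at offset " + pos);
                    }
                }
                // Skip the deflate stream plus the 8-byte trailer (CRC32 and ISIZE).
                pos = bodyStart + (int) inflater.getBytesRead() + 8;
            } finally {
                inflater.end();
            }
        }
        return offsets;
    }

    /** Returns the offset of the deflate data that follows the gzip header at pos. */
    private static int skipHeader(byte[] d, int pos) throws DataFormatException {
        if (pos + 10 > d.length || (d[pos] & 0xff) != 0x1f || (d[pos + 1] & 0xff) != 0x8b) {
            throw new DataFormatException("not a gzip member at offset " + pos);
        }
        int flags = d[pos + 3] & 0xff;
        int p = pos + 10; // fixed part of the header
        if ((flags & FEXTRA) != 0) {
            p += 2 + ((d[p] & 0xff) | ((d[p + 1] & 0xff) << 8)); // little-endian XLEN plus payload
        }
        if ((flags & FNAME) != 0) {
            while (d[p++] != 0) { } // zero-terminated file name
        }
        if ((flags & FCOMMENT) != 0) {
            while (d[p++] != 0) { } // zero-terminated comment
        }
        if ((flags & FHCRC) != 0) {
            p += 2; // header CRC16
        }
        return p;
    }
}
```

A file with a single offset in the returned list corresponds to the "single-member gzip" case reported above, while a record-compressed WARC should yield one offset per record.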
