README.md: 68 additions & 28 deletions
@@ -205,11 +205,13 @@ TBA
 As mentioned earlier, WARC/WET/WAT files look like they're gzipped, but they're actually gzipped in a particular way that allows random access. This means that you can't `gunzip` and then `gzip` a warc without wrecking random access. This example:

 * creates a copy of one of the warc files in the repo
+* lists the records and their respective offsets using JWARC
+* accesses one of the records in the middle of the archive to show that random access works
 * uncompresses it
 * recompresses it the wrong way
-* runs `org.commoncrawl.whirlwind.ReadWARC` over it to show that it triggers an error (in fact in java it does not trigger an error... )
+* accesses one of the records in the middle of the wrongly recompressed file to show that it fails
 * recompresses it the right way using `org.commoncrawl.whirlwind.RecompressWARC`
-* shows that this compressed file works
+* accesses one of the records in the middle of the archive to show that it works again
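The difference between the "wrong" and "right" way of compressing can be sketched with nothing but `java.util.zip`. This is a minimal illustration, not the code of `RecompressWARC`: the class name `PerRecordGzipSketch`, its helper methods, and the toy record strings are all hypothetical, and the strings are stand-ins rather than valid WARC records. It assumes Java 9+ for `readAllBytes`. The "right way" writes each record as its own gzip member, so the slice given by an offset and a length can be decompressed on its own; the "wrong way" writes one gzip stream for the whole file, so the same offset lands in the middle of compressed data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: per-record gzip members vs. a single gzip stream.
class PerRecordGzipSketch {

    // Compress one piece of text as a complete, standalone gzip member.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Right way": one gzip member per record, concatenated back to back.
    static byte[] compressPerRecord(String[] records) {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] member = gzip(record);
            archive.write(member, 0, member.length);
        }
        return archive.toByteArray();
    }

    // "Wrong way": the whole file as a single gzip stream.
    static byte[] compressWholeFile(String[] records) {
        return gzip(String.join("", records));
    }

    // Decompress the slice [offset, offset + length); returns null if that
    // slice cannot be read as gzip data.
    static String readAt(byte[] archive, int offset, int length) {
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(archive, offset, length))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Toy stand-ins for records, not valid WARC.
        String[] records = {"record one\r\n\r\n", "record two\r\n\r\n", "record three\r\n\r\n"};

        // Offset and length of the second record in the per-record archive.
        int offset = gzip(records[0]).length;
        int length = gzip(records[1]).length;

        byte[] good = compressPerRecord(records);
        byte[] bad = compressWholeFile(records);

        System.out.println("per-record archive readable at offset: "
                + (readAt(good, offset, length) != null));
        System.out.println("single-stream archive readable at offset: "
                + (readAt(bad, offset, length) != null));
    }
}
```

The slice is bounded by a length as well as an offset because `GZIPInputStream` keeps reading into any following members when more input is available.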
+Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
+    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
+    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
+    at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
+    at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
+Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->\xffffff87@\r\xffffffa1\xffffffca\xffffff84\x1d\xffffffca\x0f0\xffffffb4\xffffff93\xfffffff9\xffffffc5\xfffffff3\xffffff89\xffffffeb?\x1b\xffffff87,q\xffffffed\xffffffb3!s\xffffffc1\x08\xffffff83\\xffffffe0T\xffffffadG\xffffffdcd5\x02\xffffffbaQ... (offset 3734)
+    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
+    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
+    at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)
+    at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)
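The bytes quoted in the exception do not look like a WARC header or like the start of a gzip member, which is what one would expect if offset 3734 now points into the middle of a single compressed stream. A quick way to check an offset before handing it to a reader is to look for the gzip magic number (`0x1f 0x8b`, per RFC 1952) at that position. The sketch below is an illustration only: the class `GzipOffsetCheck` and its method are hypothetical and not part of JWARC or this repo. A match is necessary but not sufficient, since those two bytes can also occur by chance inside compressed data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: does a gzip member plausibly start at this offset?
class GzipOffsetCheck {

    // True if the two bytes at `offset` are the gzip magic number 0x1f 0x8b.
    static boolean looksLikeGzipMember(byte[] data, int offset) {
        return offset >= 0
                && offset + 1 < data.length
                && (data[offset] & 0xff) == 0x1f
                && (data[offset + 1] & 0xff) == 0x8b;
    }

    // Usage: java GzipOffsetCheck <file> <offset>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: GzipOffsetCheck <file> <offset>");
            return;
        }
        long offset = Long.parseLong(args[1]);
        byte[] head = new byte[2];
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            // Seek to the candidate record offset and read the two magic bytes.
            file.seek(offset);
            file.readFully(head);
        }
        System.out.println("gzip member at offset " + offset + ": "
                + looksLikeGzipMember(head, 0));
    }
}
```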