Task 6 #12

Merged
lfoppiano merged 35 commits into main from luca/feature/part5
Jan 16, 2026

Conversation

@lfoppiano
Collaborator

Description

This PR implements Task 6 using cURL and Jwarc instead of cdx_toolkit.

Notes & open questions

  • I did not rename the make command, to avoid confusion. However, the name cdx_toolkit is no longer fully appropriate for the task in this edition of the whirlwind tour.
  • I'm not sure how to handle the comment at line 390 of Readme.md, which states that "cdx_toolkit allow searching over all crawls using the parameter -cc". It is probably a good idea to mention that the columnar index should be used in this case, to avoid a very heavy set of operations. What do you think?
  • When fetching the WARC file from data.commoncrawl.org, there is no warcinfo record, so I omitted that part (line 400 of Readme.md)
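
For reference, a minimal Python sketch of the kind of fetch discussed above, assuming the record is retrieved with an HTTP byte-range request built from an index entry's `filename`, `offset` and `length` fields (the exact cURL command is not shown in this description, so this is an assumption). The entry values and the helper name `build_range_request` are hypothetical. The second half simulates a WARC file locally as concatenated gzip members, to show why a range fetch of a single record would not include a leading warcinfo record.

```python
import gzip
from urllib.request import Request

DATA_HOST = "https://data.commoncrawl.org/"  # download host mentioned above


def build_range_request(filename: str, offset: int, length: int) -> Request:
    """Build (but do not send) a request for one record of a WARC file."""
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    last_byte = offset + length - 1
    return Request(DATA_HOST + filename,
                   headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical index entry, for illustration only (not a real capture).
entry = {"filename": "crawl-data/EXAMPLE/example.warc.gz",
         "offset": "1000", "length": "250"}
req = build_range_request(entry["filename"],
                          int(entry["offset"]), int(entry["length"]))
print(req.full_url)
print(req.get_header("Range"))

# Local simulation: a .warc.gz is a series of independent gzip members,
# one per record. Slicing out one member yields only that record.
members = [gzip.compress(b"warcinfo record"), gzip.compress(b"response record")]
archive = b"".join(members)
rec_offset, rec_length = len(members[0]), len(members[1])
sliced = archive[rec_offset:rec_offset + rec_length]
print(gzip.decompress(sliced))
```

Under these assumptions, the slice decompresses to the response record alone, which is consistent with the missing warcinfo record noted in the last bullet.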

@lfoppiano lfoppiano marked this pull request as ready for review January 9, 2026 17:30
Comment thread data/CC-MAIN-2024-22.warc.paths.gz
@lfoppiano lfoppiano merged commit 34c6c87 into main Jan 16, 2026
1 check passed
@lfoppiano lfoppiano deleted the luca/feature/part5 branch January 16, 2026 17:32
2 participants