Extend HostToDomainGraph to fold host-level graphs stripping the www. prefix #29

Description

@sebastian-nagel

Per commoncrawl/cc-pyspark#56, the Common Crawl web graphs preserve the www. prefix in host names. A tool that converts a graph including the www. prefix into one without it would be useful for a 1:1 comparison of how this change affects the graph structure and derived properties, such as the centrality rankings.

The class HostToDomainGraph already supports two aggregation levels:

  • (by default) registered domain - the domain name below the ICANN registry suffix defined by the public suffix list (PSL)
  • "private" domain (command-line flag --private-domains) - the domain name below any public suffix including those in the private section of the PSL.

Adding a third aggregation level, "host without www.", is simple:

  1. Implement the stripping of the www. prefix following how it was done in cc-pyspark (see Host-level link extraction: preserve the www. prefix in host names cc-pyspark#57).
  2. Expose this aggregation level via a command-line option.
    • Because two mutually exclusive boolean flags are cumbersome, we might refactor the code to use the option --aggregation-level with three supported values: registered-domain (default), private-domain, host-without-www.
    • But keep the current options backward-compatible, so that scripts and documentation do not need to be adapted immediately.
  3. Update the Javadoc and command-line help accordingly. Add a note that the new aggregation level "stretches" the definition of a "domain".
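
A rough sketch of steps 1 and 2, to make the proposal concrete. Everything here is hypothetical: the class `HostAggregationSketch`, the enum `AggregationLevel` and the methods `resolve` and `stripWwwPrefix` do not exist in HostToDomainGraph. The stripping rule (drop one leading `www.` if anything remains) is an assumption and may differ from what cc-pyspark#57 does. The sketch also assumes host names in the usual order (`www.example.com`); input in another notation would need the check adapted.

```java
import java.util.Locale;

/** Hypothetical sketch, not part of the existing code base. */
final class HostAggregationSketch {

    /** The three aggregation levels, keyed by the option values proposed above. */
    enum AggregationLevel {
        REGISTERED_DOMAIN("registered-domain"),
        PRIVATE_DOMAIN("private-domain"),
        HOST_WITHOUT_WWW("host-without-www");

        final String optionValue;

        AggregationLevel(String optionValue) {
            this.optionValue = optionValue;
        }

        /** Maps a value of the proposed --aggregation-level option to a level. */
        static AggregationLevel fromOption(String value) {
            String normalized = value.toLowerCase(Locale.ROOT);
            for (AggregationLevel level : values()) {
                if (level.optionValue.equals(normalized)) {
                    return level;
                }
            }
            throw new IllegalArgumentException("Unknown aggregation level: " + value);
        }
    }

    /**
     * Backward-compatible resolution: an explicit --aggregation-level value
     * wins; otherwise the legacy --private-domains flag selects the level,
     * and the default stays "registered domain".
     */
    static AggregationLevel resolve(String aggregationLevelOption, boolean privateDomainsFlag) {
        if (aggregationLevelOption != null) {
            return AggregationLevel.fromOption(aggregationLevelOption);
        }
        return privateDomainsFlag
                ? AggregationLevel.PRIVATE_DOMAIN
                : AggregationLevel.REGISTERED_DOMAIN;
    }

    /**
     * Removes a single leading "www." (assumed rule). Hosts that consist
     * only of the prefix, or that start with a similar label such as
     * "www2.", are returned unchanged.
     */
    static String stripWwwPrefix(String host) {
        final String prefix = "www.";
        if (host.length() > prefix.length() && host.startsWith(prefix)) {
            return host.substring(prefix.length());
        }
        return host;
    }
}
```

With this layout, existing invocations that pass only --private-domains (or no flag at all) would resolve to the same level as before, while new invocations could select host-without-www explicitly. How a conflict between the old flag and the new option should be reported is left open here; the sketch simply lets the new option take precedence.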
