Project restart: Crawler first step #53

@brainless

Let's restart this project step by step. I have already created a new branch.

Here are the tasks:

  • Rewrite Claude.md, keeping the development workflow and basic guidance for a Rust library and CLI
  • Clear the existing Rust codebase, TypeScript type bindings, and type management scripts
  • Release management, website, and GitHub workflows can stay
  • Write tests as mentioned below
  • Write the features and check them against the tests

Tests for the features to be built (features are described below)

  • Test trimming and cleaning of text from CLI arguments, or of parts extracted from them (as described in the features)
  • Test opening a browser through WebDriver, using https://example.com
  • Test error handling when WebDriver is not working (for example, by connecting to a non-configured port)
  • Test opening https://example.com and reading its HTML title
  • Thorough tests for the rules that build the HTML node tree from HTML sources
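
As a starting point for the "WebDriver is not working" test, here is a minimal sketch that only checks whether anything accepts a TCP connection at the expected address. It uses the standard library alone; `BrowserError` and `check_webdriver` are hypothetical names, not part of the project, and a real WebDriver client would do more than a TCP connect.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical error type for the crawler (an assumption, not from the issue).
#[derive(Debug, PartialEq)]
pub enum BrowserError {
    /// Nothing accepted a TCP connection at the given address.
    WebDriverUnreachable(SocketAddr),
}

/// Check that something is listening where the WebDriver server is expected.
/// This is only a reachability probe; it does not speak the WebDriver protocol.
pub fn check_webdriver(addr: SocketAddr, timeout: Duration) -> Result<(), BrowserError> {
    TcpStream::connect_timeout(&addr, timeout)
        .map(|_| ()) // the stream is dropped (closed) right away
        .map_err(|_| BrowserError::WebDriverUnreachable(addr))
}
```

A test could call `check_webdriver` with a local port where no WebDriver server is configured and assert that it returns `BrowserError::WebDriverUnreachable`.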

Features to develop

  • Create a CLI that accepts a --link argument; one or more --link arguments may be given, and duplicates must be checked
  • URL browser that uses WebDriver to open a given URL in a local browser
  • Define storage for URLs per domain: unique URLs only, with data per URL (described below)
  • Per URL, store the fetch status and the HTML node tree (described below)
  • Per URL, load the URL in the browser as needed and extract the page's HTML source by executing JavaScript in the browser
  • Then parse the source into a node tree using the rules below
  • When all links are fetched, show the HTML [head > title] of each page, taken from the node tree
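
The `--link` handling and the per-domain storage could look roughly like the sketch below. It uses only the standard library so it stays self-contained; the real CLI may well use an argument-parsing crate instead. All names here (`collect_links`, `domain_of`, `UrlStore`, `PageRecord`, `FetchStatus`) are hypothetical, and `domain_of` is a deliberately naive host extraction (it keeps any port or user info), not a full URL parser.

```rust
use std::collections::{BTreeMap, HashSet};

/// Collect the value of every `--link <url>` pair, trimmed, skipping duplicates
/// while keeping first-seen order. `args` excludes the program name.
pub fn collect_links<I: IntoIterator<Item = String>>(args: I) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut links = Vec::new();
    let mut args = args.into_iter();
    while let Some(arg) = args.next() {
        if arg != "--link" {
            return Err(format!("unexpected argument: {}", arg));
        }
        let value = args.next().ok_or("--link requires a value")?;
        let url = value.trim().to_string();
        if url.is_empty() {
            return Err("--link requires a non-empty value".to_string());
        }
        if seen.insert(url.clone()) {
            links.push(url);
        }
    }
    Ok(links)
}

/// Naive host extraction: the text between "://" and the next '/', '?' or '#'.
pub fn domain_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let host = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    if host.is_empty() { None } else { Some(host) }
}

#[derive(Debug, Clone, PartialEq)]
pub enum FetchStatus {
    Pending,
    Fetched,
    Failed(String),
}

/// Data kept per URL; the HTML node tree would be stored here as well.
#[derive(Debug)]
pub struct PageRecord {
    pub status: FetchStatus,
}

/// URLs grouped by domain; each URL appears at most once.
#[derive(Debug, Default)]
pub struct UrlStore {
    by_domain: BTreeMap<String, BTreeMap<String, PageRecord>>,
}

impl UrlStore {
    /// Returns true if the URL was newly added, false if it was a duplicate
    /// or no domain could be extracted from it.
    pub fn insert(&mut self, url: &str) -> bool {
        let domain = match domain_of(url) {
            Some(d) => d.to_string(),
            None => return false,
        };
        let pages = self.by_domain.entry(domain).or_default();
        if pages.contains_key(url) {
            return false;
        }
        pages.insert(url.to_string(), PageRecord { status: FetchStatus::Pending });
        true
    }

    /// All stored URLs for a domain, in sorted order.
    pub fn urls_for(&self, domain: &str) -> Vec<&str> {
        self.by_domain
            .get(domain)
            .map(|pages| pages.keys().map(String::as_str).collect())
            .unwrap_or_default()
    }
}
```

In a binary, `collect_links(std::env::args().skip(1))` would feed the deduplicated links into `UrlStore::insert`. Duplicates are compared as exact strings after trimming; whether `https://example.com` and `https://example.com/` should count as the same URL is left open here.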

Rules for HTML node tree

  • Per element, save the name, class, id and content
  • Per element's class, split by space and trim; save as a list of words
  • Per element's id, ignore ids that are numeric and seem
  • Ignore nodes that are blank
  • Ignore tags like script, style, noscript, svg, path, img, video, audio, etc.
  • Ignore tags that have the same parent nodes (including class and ID) and content
  • Merge the contents of two or more immediate siblings if they have text-only content
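
Some of the rules above can be sketched on an already-parsed tree, as below. This covers class splitting, ignored tags, blank nodes, and merging of adjacent text. It does not cover the id rule or the duplicate-tag rule, and it does not parse HTML. The `Node` type and the function names are hypothetical. Three readings are assumptions: "blank" is taken to mean whitespace-only text or an element left with no children, "merge siblings" is taken to mean joining adjacent text nodes, and the merged text is joined with a single space.

```rust
/// Tags skipped entirely. The issue's list ends in "etc.", so this is not exhaustive.
const IGNORED_TAGS: &[&str] = &[
    "script", "style", "noscript", "svg", "path", "img", "video", "audio",
];

#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Element {
        name: String,
        classes: Vec<String>,
        /// Kept as-is; id filtering is not sketched here.
        id: Option<String>,
        children: Vec<Node>,
    },
}

/// Split a raw `class` attribute into trimmed, non-empty words.
pub fn split_classes(attr: &str) -> Vec<String> {
    attr.split_whitespace().map(str::to_string).collect()
}

/// Convenience constructor for examples; takes the raw class attribute.
pub fn element(name: &str, class_attr: &str, children: Vec<Node>) -> Node {
    Node::Element {
        name: name.to_string(),
        classes: split_classes(class_attr),
        id: None,
        children,
    }
}

/// Apply the rules recursively; returns None when the node should be dropped.
pub fn clean(node: Node) -> Option<Node> {
    match node {
        Node::Text(text) => {
            let trimmed = text.trim();
            if trimmed.is_empty() {
                None
            } else {
                Some(Node::Text(trimmed.to_string()))
            }
        }
        Node::Element { name, classes, id, children } => {
            if IGNORED_TAGS.contains(&name.as_str()) {
                return None;
            }
            let mut kept: Vec<Node> = Vec::new();
            for child in children.into_iter().filter_map(clean) {
                // Merge a text node into the previous kept node if that is text too.
                if let Node::Text(next) = &child {
                    if let Some(Node::Text(prev)) = kept.last_mut() {
                        prev.push(' ');
                        prev.push_str(next);
                        continue;
                    }
                }
                kept.push(child);
            }
            if kept.is_empty() {
                return None; // nothing left inside: treated as blank
            }
            Some(Node::Element { name, classes, id, children: kept })
        }
    }
}
```

Note that under this reading, void elements such as `br` are dropped because they have no content; whether that is desired is a decision for the actual rules.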
