SmartCrawler provides a simple command-line interface to crawl web pages and extract structured HTML content.
smart-crawler --link <URL> [OPTIONS]

--link <URL>
- Description: URL to crawl (can be specified multiple times)
- Required: Yes
- Multiple values: Yes
Examples:
# Crawl a single URL
smart-crawler --link "https://example.com"
# Crawl multiple URLs
smart-crawler --link "https://example.com" --link "https://another.com"

--verbose
- Description: Enable verbose output showing the filtered HTML node tree
- Type: Flag (no value required)
- Default: Disabled
When enabled, this option shows the complete HTML tree structure with duplicate node filtering applied.
Example:
smart-crawler --link "https://example.com" --verbose

--template
- Description: Enable template detection mode to identify patterns like '{count} comments' in HTML content
- Type: Flag (no value required)
- Default: Disabled
When enabled, this option:
- Detects variable patterns in text content (e.g., "42 comments" becomes "{count} comments")
- Skips domain-wide duplicate filtering to show template patterns clearly
- Useful for identifying common content patterns across pages
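The idea behind template detection can be sketched with a one-line substitution. This is only an illustration of the "42 comments" to "{count} comments" mapping described above, not SmartCrawler's actual algorithm; the `to_template` helper and its regular expression are hypothetical.

```shell
# Hypothetical helper: replace the number in "<N> comments"-style text
# with a {count} placeholder. Illustration only; SmartCrawler's real
# detection logic is not documented here.
to_template() {
  sed -E 's/[0-9]+ (comments?)/{count} \1/g'
}

# Lines without a matching pattern pass through unchanged.
printf '%s\n' "42 comments" "1 comment" "No replies yet" | to_template
# prints:
#   {count} comments
#   {count} comment
#   No replies yet
```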
Example:
smart-crawler --link "https://example.com" --template --verbose

Help flag
- Description: Print help information
- Type: Flag

Version flag
- Description: Print version information
- Type: Flag
smart-crawler --link "https://news-site.com" --template --verbose

smart-crawler \
  --link "https://example.com" \
  --link "https://another.com" \
  --link "https://third-site.org" \
  --verbose

# Set log level for detailed debugging
RUST_LOG=debug smart-crawler --link "https://example.com" --verbose

SmartCrawler outputs crawling results in a structured format:
=== Crawling Results ===
URL: https://example.com
Title: Example Domain
Domain: example.com
---
With --verbose enabled, it also shows the HTML tree structure with duplicate filtering applied.
With --template enabled, it shows template patterns instead of actual values and skips duplicate filtering.
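If the results need to be post-processed, the plain (non-verbose) output shown above is simple enough to summarize with awk. The sketch below assumes the `Title:` and `Domain:` lines and the `---` separator appear exactly as in the sample; `summarize_results` is a hypothetical helper, not part of SmartCrawler.

```shell
# Hypothetical helper: print "domain<TAB>title" for each result block.
# Assumes the "Title:" / "Domain:" lines and "---" separator from the
# sample output; verbose or template output may need different handling.
summarize_results() {
  awk '
    /^Title: /  { title = substr($0, 8) }
    /^Domain: / { domain = substr($0, 9) }
    /^---$/     { if (domain != "") print domain "\t" title; title = ""; domain = "" }
  '
}

# Demonstration on the sample output instead of a live crawl.
printf '%s\n' \
  "=== Crawling Results ===" \
  "URL: https://example.com" \
  "Title: Example Domain" \
  "Domain: example.com" \
  "---" | summarize_results
```

In real use the input would come from a pipe, for example `smart-crawler --link "https://example.com" | summarize_results`.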
Exit codes:
- 0: Success
- 1: Error (invalid arguments, connection failure, etc.)
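In scripts, the exit code can be used to decide whether to continue. The following is a minimal sketch that relies only on the two documented codes; `crawl_status` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run a command and report its exit status.
# Assumes only the documented codes: 0 (success) and 1 (error).
crawl_status() {
  if "$@"; then
    echo "crawl succeeded"
  else
    status=$?
    echo "crawl failed with exit code $status" >&2
    return "$status"
  fi
}

# Intended use (not executed here):
#   crawl_status smart-crawler --link "https://example.com"
# Stand-in command so the sketch runs without SmartCrawler installed:
crawl_status true
```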
Environment variables:
- RUST_LOG: Set logging level (debug, info, warn, error)

Notes:
- URLs are automatically validated and deduplicated
- SmartCrawler requires a WebDriver server running on port 4444
- See the Getting Started guides for WebDriver setup instructions
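Because a missing WebDriver server is a likely cause of connection failures, a quick pre-flight check before crawling can save time. The sketch below assumes the server runs on `localhost` and answers the standard WebDriver `/status` endpoint; `webdriver_ready` is a hypothetical helper and requires `curl`.

```shell
# Hypothetical pre-flight check: succeed only if something answers the
# WebDriver /status endpoint. Host "localhost" is an assumption; the port
# defaults to the documented 4444.
webdriver_ready() {
  port="${1:-4444}"
  curl -sf --max-time 2 -o /dev/null "http://localhost:${port}/status"
}

if webdriver_ready; then
  echo "WebDriver is reachable on port 4444"
else
  echo "No WebDriver server found on port 4444" >&2
fi
```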