This guide covers how to set up a development environment, build from source, contribute to the project, and understand the codebase.
- Rust: Install from rustup.rs
- Git: For version control
- WebDriver: Firefox (GeckoDriver) or Chrome (ChromeDriver) for testing
# Clone the repository
git clone https://github.com/pixlie/SmartCrawler.git
cd SmartCrawler
# Build in release mode
cargo build --release
# The binary will be in target/release/smart-crawler
# Build in development mode (faster compilation, includes debug info)
cargo build
# Run directly with cargo
cargo run -- --link "https://example.com"
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test
cargo test test_name
# Run tests in a specific module
cargo test html_parser::tests
The project includes several types of tests:
- Unit tests: Located in each module (src/ files)
- Integration tests: In the tests/ directory
- Real-world tests: For testing against actual websites (normally ignored)
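The kinds of tests listed above follow ordinary Cargo conventions. As a rough sketch, a unit test module and a normally ignored test might look like the following. `count_opening_tags` is a hypothetical helper invented for illustration and is not part of SmartCrawler.

```rust
/// Counts opening tags in an HTML snippet (illustrative only, not a real parser).
pub fn count_opening_tags(html: &str) -> usize {
    let bytes = html.as_bytes();
    let mut count = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'<' {
            // A following '/' marks a closing tag; '!' marks a comment or doctype.
            match bytes.get(i + 1) {
                Some(b'/') | Some(b'!') | None => {}
                Some(_) => count += 1,
            }
        }
    }
    count
}

#[cfg(test)]
mod tests {
    use super::*;

    // Unit test: lives next to the code it checks and runs with `cargo test`.
    #[test]
    fn counts_only_opening_tags() {
        assert_eq!(count_opening_tags("<div><p>hi</p></div>"), 2);
    }

    // Ignored test: skipped by plain `cargo test`, run with `cargo test -- --ignored`.
    #[test]
    #[ignore]
    fn real_world_style_placeholder() {
        // A real-world test would load a live page through WebDriver here.
        assert!(count_opening_tags("<html></html>") >= 1);
    }
}
```

Integration tests use the same `#[test]` attribute but live in separate files under tests/ and can only call the crate's public API.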
# Run real-world tests (requires WebDriver)
cargo test --test real_world_tests -- --ignored
# Format code
cargo fmt
# Check formatting
cargo fmt --check
# Run linter
cargo clippy
# Run linter with warnings as errors
cargo clippy -- -D warnings
# Run with debug logging
RUST_LOG=debug cargo run -- --link "https://example.com"
# Run with specific module logging
RUST_LOG=smart_crawler::html_parser=debug cargo run -- --link "https://example.com"
# Build with profiling
cargo build --release --features profiling
# Run with timing information
RUST_LOG=info cargo run --release -- --link "https://example.com"
SmartCrawler/
├── src/
│ ├── main.rs # Main application entry point
│ ├── lib.rs # Library exports
│ ├── browser.rs # WebDriver browser integration
│ ├── cli.rs # Command-line argument parsing
│ ├── html_parser.rs # HTML parsing and tree building
│ ├── storage.rs # URL storage and duplicate detection
│ ├── template_detection.rs # Template pattern detection
│ └── utils.rs # Utility functions
├── tests/
│ └── real_world_tests.rs # Integration tests
├── docs/ # Documentation
├── CLAUDE.md # Development workflow guide
└── Cargo.toml # Project dependencies
- CLI Interface (cli.rs): Parses command-line arguments
- Browser Integration (browser.rs): WebDriver integration for page loading
- HTML Parser (html_parser.rs): Parses HTML into a structured tree
- Storage System (storage.rs): Manages URLs and duplicate detection
- Template Detection (template_detection.rs): Identifies content patterns
- Utilities (utils.rs): Common helper functions
CLI Arguments → Browser → HTML Source → HTML Parser → Storage → Duplicate Analysis → Output
↓
Template Detection
- Domain-level duplicate detection: Identifies similar content across pages
- Template pattern recognition: Converts variable content to template patterns
- HTML tree filtering: Shows structured view with duplicate marking
- Multi-URL crawling: Processes multiple URLs with smart discovery
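To make the first two features more concrete, here is a minimal sketch of how variable content could be turned into a template and then counted across the pages of one domain. Both functions and the `{number}` placeholder are assumptions made for illustration; the actual logic in template_detection.rs and storage.rs may work quite differently.

```rust
use std::collections::HashMap;

/// Replaces each run of ASCII digits with the placeholder `{number}`.
/// Hypothetical stand-in for template pattern recognition.
pub fn to_template(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut in_digits = false;
    for ch in text.chars() {
        if ch.is_ascii_digit() {
            // Emit the placeholder once per run of digits.
            if !in_digits {
                out.push_str("{number}");
                in_digits = true;
            }
        } else {
            in_digits = false;
            out.push(ch);
        }
    }
    out
}

/// Returns the templates that occur on at least `min_pages` distinct pages.
/// `pages` holds, for each page of one domain, the text snippets found on it.
pub fn domain_duplicates(pages: &[Vec<&str>], min_pages: usize) -> Vec<String> {
    let mut page_counts: HashMap<String, usize> = HashMap::new();
    for snippets in pages {
        // Deduplicate within a page so each page counts a template once.
        let mut seen: Vec<String> = snippets.iter().map(|s| to_template(s)).collect();
        seen.sort();
        seen.dedup();
        for template in seen {
            *page_counts.entry(template).or_insert(0) += 1;
        }
    }
    let mut duplicates: Vec<String> = page_counts
        .into_iter()
        .filter(|(_, n)| *n >= min_pages)
        .map(|(template, _)| template)
        .collect();
    duplicates.sort(); // HashMap order is arbitrary, so sort for stable output
    duplicates
}
```

With this sketch, "42 comments" on one page and "7 comments" on another both become "{number} comments" and are reported as a domain-level duplicate, while text that appears on a single page is not.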
- Fork the repository on GitHub
- Clone your fork:
git clone https://github.com/your-username/SmartCrawler.git
cd SmartCrawler
- Create a feature branch:
git checkout -b feature/your-feature-name
Follow the workflow described in CLAUDE.md:
- Create a new branch for each feature/fix
- Add tests for any new functionality
- Run formatters and linters before committing
- Write descriptive commit messages
- Push to your fork and create a pull request
- Follow Rust standard formatting (cargo fmt)
- Address all clippy warnings (cargo clippy)
- Write comprehensive tests for new features
- Document public APIs with doc comments
- Use meaningful variable and function names
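As an illustration of the last two points, a documented public function might look like the sketch below. `strip_fragment` is a hypothetical helper written for this guide, not an existing SmartCrawler API.

```rust
/// Returns `url` without its fragment (the part starting at `#`).
///
/// Two links that differ only in their fragment point at the same page,
/// so dropping the fragment helps avoid storing the same URL twice.
///
/// Hypothetical helper written for this guide; not an existing SmartCrawler API.
pub fn strip_fragment(url: &str) -> &str {
    // `split` always yields at least one item, so the fallback is never used.
    url.split('#').next().unwrap_or(url)
}
```

The first `///` line becomes the summary shown in the output of cargo doc, so it is worth keeping it to a single sentence that says what the function returns.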
type: brief description
- Detailed explanation of changes
- Reference any related issues
- Include any breaking changes
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Ensure tests pass: Run cargo test
- Check code quality: Run cargo fmt and cargo clippy
- Update documentation: Add or update relevant docs
- Describe your changes: Write a clear PR description
- Link related issues: Reference any GitHub issues
- Be constructive and respectful
- Focus on code quality and maintainability
- Test the changes locally when possible
- Ask questions if something isn't clear
Build failures:
- Ensure Rust is up to date: rustup update
- Check dependencies: cargo update
Test failures:
- Ensure WebDriver is running for integration tests
- Check that target websites are accessible
WebDriver issues:
- Verify WebDriver version matches browser version
- Check that port 4444 is available
- Ensure browser is installed and accessible
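When diagnosing the port item in the checklist above, it can help to test directly whether anything is accepting connections. The following is a small sketch using only the standard library; `is_listening` is a hypothetical helper and does not exist in SmartCrawler.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if something accepts TCP connections at `addr` within `timeout`.
/// This only shows that a listener exists, not that it is a WebDriver.
pub fn is_listening(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

For example, `is_listening(SocketAddr::from(([127, 0, 0, 1], 4444)), Duration::from_millis(500))` returning false before a test run suggests that no WebDriver has been started on port 4444. If it returns true while no WebDriver is running, another process is occupying the port.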
# Check dependencies
cargo tree
# Update dependencies
cargo update
# Clean build artifacts
cargo clean
# Verbose build output
cargo build --verbose
- Never commit API keys or secrets
- Validate all user inputs
- Respect robots.txt and website terms of service
- Use HTTPS for all network requests where possible
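As one way to apply the input validation guideline to a value such as the --link argument, the sketch below rejects anything that is not an absolute http or https URL with a host. `validate_link` is a hypothetical function using only the standard library. It is deliberately simple, and a real implementation would more likely rely on a dedicated URL parsing crate.

```rust
/// Checks that `input` looks like an absolute http(s) URL with a host.
/// Returns the trimmed input on success or a human-readable error.
/// Hypothetical helper; the scheme check is case-sensitive for brevity.
pub fn validate_link(input: &str) -> Result<&str, String> {
    let link = input.trim();
    // Accept only the two web schemes; `strip_prefix` returns the remainder.
    let rest = link
        .strip_prefix("https://")
        .or_else(|| link.strip_prefix("http://"))
        .ok_or_else(|| format!("unsupported or missing scheme: {link}"))?;
    // The host is everything up to the first '/', '?' or '#'.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if host.is_empty() {
        return Err(format!("missing host: {link}"));
    }
    // Whitespace or control characters inside a URL usually signal bad input.
    if link.chars().any(|c| c.is_whitespace() || c.is_control()) {
        return Err(format!("whitespace or control character in: {link:?}"));
    }
    Ok(link)
}
```

Returning a `Result` instead of panicking lets the caller report the problem and exit cleanly, which is generally the friendlier behavior for a command-line tool.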
If you discover a security vulnerability:
- Do NOT create a public GitHub issue
- Email the maintainers directly
- Provide detailed information about the vulnerability
- Allow time for a fix before public disclosure
- Use --release builds for performance testing
- Profile with perf or similar tools
- Monitor memory usage during development
- Test with various website sizes and structures
# Build optimized version
cargo build --release
# Time execution
time target/release/smart-crawler --link "https://example.com"
# Profile memory usage
valgrind target/release/smart-crawler --link "https://example.com"
Areas for contribution:
- Performance optimization: Faster HTML parsing and duplicate detection
- Additional output formats: JSON, CSV, XML export options
- Enhanced filtering: More sophisticated duplicate detection algorithms
- UI improvements: Better progress reporting and error messages
- Platform support: Windows-specific optimizations
- Documentation: More examples and use cases
- GitHub Issues: For bug reports and feature requests
- Discussions: For general questions and ideas
- Code Review: For feedback on contributions
- Documentation: For clarification on usage
Remember to search existing issues before creating new ones!
This development guide is continuously updated. If you find any information missing or outdated, please contribute improvements!