Crawlee Scraper Toolkit

A comprehensive TypeScript toolkit for building robust web scrapers with Crawlee, featuring maximum configurability, a plugin system, and a CLI generator.

🚀 Features

  • 🎯 TypeScript First: Full TypeScript support with comprehensive type definitions
  • 🔧 Maximum Configurability: Flexible configuration system with profiles and environment variables
  • 🔌 Plugin System: Extensible architecture with built-in plugins for retry, caching, metrics, and more
  • 🛠️ CLI Generator: Interactive command-line tool to generate scraper templates
  • 🌐 Multiple Navigation Strategies: Support for direct navigation, form submission, and API interception
  • ⚡ Browser Pool Management: Efficient browser instance pooling and resource management
  • 📊 Built-in Monitoring: Metrics collection, logging, and error handling
  • 🔄 Retry Logic: Configurable retry strategies with exponential backoff
  • 💾 Result Caching: Optional caching system to avoid redundant requests
  • 🎨 Multiple Templates: Pre-built templates for common scraping scenarios

📦 Installation

Prerequisites

  • Node.js >= 20.0.0
  • pnpm >= 8.0.0 (recommended package manager)

Install with pnpm (recommended)

pnpm add crawlee-scraper-toolkit

Install with npm

npm install crawlee-scraper-toolkit

πŸƒ Quick Start

1. Initialize a New Project

pnpm dlx crawlee-scraper init my-scraper-project
cd my-scraper-project
pnpm install

2. Generate Your First Scraper

After initializing and installing dependencies, you can use the CLI bundled with the project:

pnpm crawlee-scraper generate
# or npx crawlee-scraper generate

Follow the interactive prompts to configure your scraper.

3. Run the Scraper

pnpm crawlee-scraper run --scraper=my-scraper --input="search term"
# or npx crawlee-scraper run --scraper=my-scraper --input="search term"

🔧 Programmatic Usage

Basic Example

import { CrawleeScraperEngine, ScraperDefinition, configManager } from 'crawlee-scraper-toolkit';

// Define your scraper
const myScraper: ScraperDefinition<string, any> = {
  id: 'example-scraper',
  name: 'Example Scraper',
  url: 'https://example.com',
  navigation: { type: 'direct' },
  waitStrategy: { type: 'timeout', config: { duration: 2000 } },
  requiresCaptcha: false,
  parse: async (context) => {
    const { page } = context;
    const title = await page.textContent('h1');
    const description = await page.textContent('p');
    
    return {
      title,
      description,
      url: page.url(),
      timestamp: new Date().toISOString(),
    };
  },
};

// Initialize the engine
const engine = new CrawleeScraperEngine(
  configManager.getConfig(),
  configManager.getLogger()
);

// Register and execute
engine.register(myScraper);
const result = await engine.execute(myScraper, 'input-data');

console.log(result.data);
await engine.shutdown();

Advanced Configuration

import { createConfig, CrawleeScraperEngine, RetryPlugin, CachePlugin, configManager } from 'crawlee-scraper-toolkit';

// Build custom configuration
const config = createConfig()
  .browserPool({
    maxSize: 10,
    maxAge: 30 * 60 * 1000,
    launchOptions: {
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    },
  })
  .defaultOptions({
    retries: 5,
    timeout: 60000,
    loadImages: false,
  })
  .logging({
    level: 'info',
    format: 'json',
  })
  .build();

// Create the engine with the custom config. Here we reuse the logger from
// configManager; create a dedicated logger instance if this engine has specific needs.
const engine = new CrawleeScraperEngine(config, configManager.getLogger());

// Install plugins
engine.use(new RetryPlugin({ maxBackoffDelay: 30000 }));
engine.use(new CachePlugin({ defaultTtl: 10 * 60 * 1000 }));

📋 Scraper Templates

The toolkit includes several pre-built templates to get you started quickly:

Basic Template

Ideal for simple page scraping tasks. Allows configuration of selectors to extract data from static or dynamically rendered pages.

API Template

Designed for scenarios where data can be more reliably extracted by intercepting and processing API (e.g., JSON) responses triggered by website interactions.

Form Template

Useful for scrapers that need to interact with forms. Provides a structure for filling input fields, submitting forms, and then extracting data from the results.

Advanced Template

A comprehensive template that includes setup for custom hooks, plugins, and more complex configurations. Suitable for sophisticated scraping tasks requiring fine-grained control.

Infinite Scroll Template

Handles pages with infinite scrolling pagination. Ideal for sites that load more content as the user scrolls.
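As a sketch of what this template automates, the loop below keeps scrolling until the page height stops growing or a round limit is reached. The ScrollablePage interface and helper are illustrative, not the template's actual code; with Playwright they would map to page.evaluate and page.mouse.wheel calls.

```typescript
// Illustrative sketch (assumed names, not the toolkit's API): scroll until
// no new content loads or a round limit is hit.
interface ScrollablePage {
  contentHeight(): Promise<number>; // e.g. page.evaluate(() => document.body.scrollHeight)
  scrollToBottom(): Promise<void>;  // e.g. page.mouse.wheel(0, viewportHeight)
}

async function exhaustInfiniteScroll(page: ScrollablePage, maxRounds = 20): Promise<number> {
  let rounds = 0;
  let lastHeight = await page.contentHeight();
  while (rounds < maxRounds) {
    await page.scrollToBottom();
    const height = await page.contentHeight();
    if (height <= lastHeight) break; // height stopped growing: no new content
    lastHeight = height;
    rounds += 1;
  }
  return rounds;
}
```

The round limit guards against pages that genuinely never stop loading content.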

JS-Heavy Site Template

Provides a robust setup for sites that rely heavily on JavaScript to render content, focusing on advanced wait strategies and interaction patterns.
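A recurring building block for such sites is a poll-until-true wait: re-check a predicate until it holds or a timeout elapses. The helper below is an illustrative sketch of that pattern, not a function exported by the toolkit.

```typescript
// Sketch of a generic polling wait (assumed helper, not toolkit API):
// resolves true as soon as the predicate passes, false on timeout.
async function waitUntil(
  predicate: () => Promise<boolean> | boolean,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

In a real scraper the predicate would typically query the page, e.g. check that a results container has rendered its items.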

βš™οΈ Configuration

Configuration File

Create a scraper.config.yaml file:

browserPool:
  maxSize: 5
  maxAge: 1800000  # 30 minutes
  launchOptions:
    headless: true
    args:
      - "--no-sandbox"
      - "--disable-setuid-sandbox"

defaultOptions:
  retries: 3
  timeout: 30000
  loadImages: false

logging:
  level: info
  format: text

profiles:
  development:
    name: development
    config:
      browserPool:
        maxSize: 2
      logging:
        level: debug
  
  production:
    name: production
    config:
      browserPool:
        maxSize: 10
      logging:
        level: error
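Profiles layer their config on top of the base configuration. The deep-merge sketch below models the effective result for the development profile above; it is a simplified illustration, not the toolkit's actual merge implementation.

```typescript
// Simplified model of profile resolution (assumed behavior): nested objects
// merge key by key, scalars from the profile win.
type Config = { [key: string]: unknown };

function deepMerge(base: Config, override: Config): Config {
  const out: Config = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const prev = out[key];
    if (
      value && typeof value === 'object' && !Array.isArray(value) &&
      prev && typeof prev === 'object' && !Array.isArray(prev)
    ) {
      out[key] = deepMerge(prev as Config, value as Config);
    } else {
      out[key] = value;
    }
  }
  return out;
}

const base = { browserPool: { maxSize: 5, maxAge: 1800000 }, logging: { level: 'info' } };
const devProfile = { browserPool: { maxSize: 2 }, logging: { level: 'debug' } };
const effective = deepMerge(base, devProfile);
// effective.browserPool takes maxSize: 2 from the profile but keeps maxAge from the base
```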

Environment Variables

BROWSER_POOL_SIZE=5
BROWSER_HEADLESS=true
SCRAPING_MAX_RETRIES=3
LOG_LEVEL=info
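A sketch of how such variables could map onto the configuration, with fallbacks for unset values. Only the variable names come from the list above; the mapping function itself is illustrative.

```typescript
// Hypothetical mapping from environment variables to config values,
// falling back to the defaults when a variable is unset.
function configFromEnv(env: Record<string, string | undefined>) {
  return {
    browserPool: {
      maxSize: Number(env.BROWSER_POOL_SIZE ?? 5),
      headless: (env.BROWSER_HEADLESS ?? 'true') === 'true',
    },
    defaultOptions: {
      retries: Number(env.SCRAPING_MAX_RETRIES ?? 3),
    },
    logging: {
      level: env.LOG_LEVEL ?? 'info',
    },
  };
}
```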

🔌 Plugins

Built-in Plugins

  • RetryPlugin: Exponential backoff retry logic
  • CachePlugin: Result caching with TTL
  • ProxyPlugin: Proxy rotation support
  • RateLimitPlugin: Request rate limiting
  • MetricsPlugin: Performance metrics collection
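To illustrate the kind of schedule that exponential backoff produces: the delay doubles per attempt up to a cap. The baseDelay default and exact formula below are assumptions, not RetryPlugin's source; maxBackoffDelay mirrors the option name shown in Advanced Configuration.

```typescript
// Illustrative backoff schedule (assumed formula): delay = base * 2^attempt,
// capped at maxBackoffDelay.
function backoffDelay(attempt: number, baseDelay = 1000, maxBackoffDelay = 30000): number {
  return Math.min(baseDelay * 2 ** attempt, maxBackoffDelay);
}
// attempts 0..5 with the defaults give 1000, 2000, 4000, 8000, 16000, 30000 (capped)
```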

Custom Plugins

import { ScraperPlugin, ScraperEngine } from 'crawlee-scraper-toolkit';

class MyCustomPlugin implements ScraperPlugin {
  name = 'my-plugin';
  version = '1.0.0';

  install(scraper: ScraperEngine): void {
    scraper.addHook('beforeRequest', async (context) => {
      // Custom logic before each request
      console.log(`Processing: ${context.input}`);
    });
  }
}

engine.use(new MyCustomPlugin());

🎯 Navigation Strategies

Direct Navigation

navigation: {
  type: 'direct'
}

Form Submission

navigation: {
  type: 'form',
  config: {
    inputSelector: 'input[name="search"]',
    submitSelector: 'button[type="submit"]'
  }
}

API Interception

navigation: {
  type: 'api',
  config: {
    paramName: 'q'
  }
}

⏱️ Wait Strategies

Wait for Selector

waitStrategy: {
  type: 'selector',
  config: {
    selector: '.results'
  }
}

Wait for Response

waitStrategy: {
  type: 'response',
  config: {
    urlPattern: '/api/search'
  }
}

Wait for Timeout

waitStrategy: {
  type: 'timeout',
  config: {
    duration: 5000
  }
}

🎣 Hooks

Add custom logic at different stages of scraping:

const scraper: ScraperDefinition = {
  // ... other config
  hooks: {
    beforeRequest: [
      async (context) => {
        // Set custom headers
        await context.page.setExtraHTTPHeaders({
          'X-Custom-Header': 'value'
        });
      }
    ],
    afterRequest: [
      async (context) => {
        // Save screenshot
        await context.page.screenshot({
          path: `screenshots/${context.input}.png`
        });
      }
    ],
    onError: [
      async (context) => {
        console.error('Scraping failed:', context.error);
      }
    ]
  }
};

📊 Monitoring and Metrics

import { MetricsPlugin } from 'crawlee-scraper-toolkit';

const metricsPlugin = new MetricsPlugin();
engine.use(metricsPlugin);

// After scraping
const metrics = metricsPlugin.getMetrics();
console.log(`Success rate: ${metrics.successfulRequests / metrics.totalRequests * 100}%`);

πŸ› οΈ CLI Commands

The crawlee-scraper CLI provides several commands to manage your scraping projects. There are several ways to invoke it:

  • As a project dependency (as shown in the Quick Start): pnpm crawlee-scraper <command> or npx crawlee-scraper <command>.
  • Installed globally (e.g., pnpm add -g crawlee-scraper-toolkit): invoke commands directly with crawlee-scraper <command>.
  • From a local clone of this repository: run pnpm install && pnpm run build first so the compiled CLI is generated with its path aliases properly resolved.
  • Without any installation: pnpm dlx crawlee-scraper <command> executes the latest published version of the CLI.

Generate Scraper

pnpm crawlee-scraper generate --template=basic --name=my-scraper

Initialize Project

pnpm dlx crawlee-scraper init --name=my-project --template=advanced
# (Using pnpm dlx for init is recommended to get the latest project scaffolder)

Validate Configuration

pnpm crawlee-scraper validate --config=./config/scraper.yaml

Run Scraper

pnpm crawlee-scraper run --scraper=my-scraper --input="search term" --profile=production

Update Project Configurations (In Development)

pnpm crawlee-scraper update

Helps migrate existing project and scraper configurations to the latest version of the toolkit. This command is currently under development.

πŸ” Examples

Check the examples/ directory for complete working examples:

  • Basic web scraping
  • API data extraction
  • Form-based scraping
  • E-commerce product scraping
  • News article extraction

📚 Documentation

Comprehensive documentation is available in multiple formats:

🔗 Quick Links

🚀 Generate Documentation

# Generate all documentation
pnpm run docs:build

# Generate specific formats
pnpm run docs          # Markdown API docs
pnpm run docs:html     # HTML documentation
pnpm run docs:json     # JSON API schema

# Serve documentation locally
pnpm run docs:serve    # Available at http://localhost:8080

📖 Documentation Scripts

Command       Description
docs:build    Generate complete documentation suite
docs:watch    Watch mode for development
docs:serve    Serve docs locally with HTTP server
docs:clean    Clean documentation directory
docs:preview  Build and serve in one command
docs:package  Create compressed documentation archive

🚀 Release Process

This project uses semantic-release for fully automated versioning and publishing. All releases are handled automatically by CI/CD based on conventional commits.

For a more comprehensive guide to the release process, including advanced troubleshooting, emergency procedures, and configuration details, please see the Detailed Release Process Documentation.

📋 Conventional Commits

Use conventional commit messages to trigger automatic releases. The commit message should be structured as follows:

<type>(<scope>): <description>

[optional body]

[optional footer(s) e.g. BREAKING CHANGE:, Closes #issue]

Commit Types & Version Impact:

Type     | Description                                                            | Version Bump           | Example
feat     | A new feature                                                          | Minor (1.0.0 → 1.1.0)  | feat: add caching plugin
fix      | A bug fix                                                              | Patch (1.0.0 → 1.0.1)  | fix: resolve memory leak
feat!    | A commit that introduces a breaking API change (correlates with feat)  | Major (1.0.0 → 2.0.0)  | feat!: redesign configuration API
fix!     | A commit that introduces a breaking API change (correlates with fix)   | Major (1.0.0 → 2.0.0)  | fix!: change default behavior of existing method
perf     | A code change that improves performance                                | Patch (sometimes None) | perf: improve scraping speed by 20%
docs     | Documentation only changes                                             | None                   | docs: update README examples
style    | Changes that do not affect the meaning of the code (white-space, etc.) | None                   | style: fix code formatting
refactor | A code change that neither fixes a bug nor adds a feature              | None                   | refactor: simplify parser logic
test     | Adding missing tests or correcting existing tests                      | None                   | test: add unit tests for browser pool
chore    | Changes to the build process or auxiliary tools and libraries          | None                   | chore: update dependencies
ci       | Changes to CI configuration files and scripts                          | None                   | ci: update GitHub Actions workflow

Scope Examples: core, cli, docs, api, utils. e.g., feat(core): add browser pool management.

Breaking Changes: Must be indicated at the very beginning of the commit body or footer, starting with BREAKING CHANGE:.

feat!: redesign configuration API

BREAKING CHANGE: The configuration schema has changed.
Old format is no longer supported. See MIGRATION.md for upgrade guide.

Or (for non-! commits):

refactor: internal restructuring of user module

BREAKING CHANGE: User session handling has been modified.
Refer to the updated documentation for details.
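The bump rules above can be summarized as a toy classifier. semantic-release's real commit analyzer is considerably more sophisticated; treat this only as a model of the table in this section.

```typescript
// Toy model of the version-bump rules (not semantic-release's analyzer):
// a "!" after the type/scope or a BREAKING CHANGE: footer means major,
// feat means minor, fix/perf mean patch, everything else means no release.
type Bump = 'major' | 'minor' | 'patch' | 'none';

function bumpFor(header: string, body = ''): Bump {
  const match = /^(\w+)(!?)(\([^)]*\))?(!?):/.exec(header);
  if (!match) return 'none';
  const type = match[1];
  const breaking = match[2] === '!' || match[4] === '!' || body.includes('BREAKING CHANGE:');
  if (breaking) return 'major';
  if (type === 'feat') return 'minor';
  if (type === 'fix' || type === 'perf') return 'patch';
  return 'none';
}
```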

🔄 Automated Release Flow

  1. Development: Create feature branch, make changes with conventional commits
  2. Pull Request: Open PR to main branch
  3. CI Validation: Automated tests, linting, examples validation
  4. Release Preview: Comment on PR shows what would be released
  5. Merge: Merge PR to main branch
  6. Automatic Release: CI/CD automatically:
    • Analyzes commits since last release
    • Determines next version (patch/minor/major)
    • Generates CHANGELOG.md
    • Creates GitHub release with notes
    • Publishes to npm registry
    • Deploys documentation to GitHub Pages
    • Sends notifications

πŸ› οΈ Release Commands

# πŸ” Local Analysis (Recommended for developers)
pnpm run release:analyze        # Fast offline analysis of commits

# πŸ§ͺ Release Simulation
pnpm run release:dry           # Safe offline dry-run (no auth required)
pnpm run release:dry-full      # Full dry-run with GitHub/npm simulation (may fail locally)

# πŸ”„ CI/CD Integration  
pnpm run release:preview       # CI-based release preview
pnpm run release               # Automated release (CI only)

# πŸ› οΈ Manual Tools
pnpm run changelog             # Generate changelog manually
pnpm run release:legacy        # Emergency manual release
pnpm run health-check          # Validate CI/CD configuration

Command Details

Command          | Environment | Auth Required | Output
release:analyze  | Local       | ❌ No         | Commit analysis + version prediction
release:dry      | Local       | ❌ No         | Basic semantic-release simulation
release:dry-full | Local       | ⚠️ Optional   | Full simulation (may fail without auth)
release:preview  | CI/CD       | ✅ Yes        | Complete release preview
release          | CI/CD       | ✅ Yes        | Actual release

📊 Release Validation

Every release automatically runs:

  • ✅ ESLint code quality checks
  • ✅ Prettier formatting validation
  • ✅ Jest test suite with coverage
  • ✅ TypeScript compilation
  • ✅ Examples validation
  • ✅ Documentation generation
  • ✅ Security audit
  • ✅ Bundle size analysis

🎯 Best Practices

  • Use conventional commits for all changes
  • Write descriptive commit messages with clear scope
  • Include breaking change notes when applicable
  • Keep commits atomic and focused
  • Test locally before pushing
  • Let CI/CD handle releases (avoid manual versioning)

🔧 Troubleshooting Common Release Issues

  • Release Not Triggered:
    • Ensure your commit messages strictly follow the Conventional Commits format.
    • Verify the commit was pushed to the main branch (for production releases) or the relevant branch for previews.
    • Check that the commit message does not include [skip ci] or similar tags that prevent CI execution.
  • Incorrect Version Bump:
    • Carefully review your commit history to ensure feat, fix, and BREAKING CHANGE: annotations are used correctly according to the version impact table.
    • Ensure any breaking changes are clearly documented in the commit body or footer, starting with BREAKING CHANGE:.

🤝 Contributing

We welcome contributions! Please follow these guidelines:

💻 Development Setup

  1. Fork and Clone

    git clone https://github.com/devalexanderdaza/crawlee-scraper-toolkit.git
    cd crawlee-scraper-toolkit
  2. Core Prerequisites & Environment Setup

    • Node.js: This project uses Node.js version >=20.0.0 (as specified in .nvmrc). We recommend using nvm (Node Version Manager).
      # Install and use the project's Node.js version (reads .nvmrc)
      nvm install
      nvm use
      # Verify
      node --version
    • pnpm: This project uses pnpm (version >=8.0.0; see the packageManager field in package.json) as the package manager.
      # Install pnpm globally (if you don't have it)
      npm install -g pnpm@latest
      # Or enable corepack (recommended; bundled with Node.js >=16.10)
      # corepack enable && corepack prepare pnpm@latest --activate
      # Verify
      pnpm --version
      This project includes an .npmrc file with settings such as auto-install-peers=true. Key structure files: .nvmrc, .npmrc, pnpm-lock.yaml, tsconfig.json.
  3. Install Dependencies & Setup Hooks

    # Install all project dependencies using pnpm
    pnpm install
    
    # Setup Husky git hooks (for linting, commit messages, etc.)
    pnpm run prepare
  4. Verify Setup

    # Run tests to ensure everything is working
    pnpm test

πŸ› οΈ Common Development Commands

A list of common commands you might use during development:

  • pnpm dev: Run the application in development mode (e.g., with hot reloading, if applicable).
  • pnpm build: Build the project for production.
  • pnpm test: Run the full test suite.
  • pnpm test:watch: Run tests in interactive watch mode.
  • pnpm test:coverage: Generate a test coverage report.
  • pnpm lint: Check code for linting issues using ESLint.
  • pnpm lint:fix: Attempt to automatically fix linting issues.
  • pnpm format: Format code using Prettier.
  • pnpm example:news: Run the basic news scraper example.
  • pnpm example:products: Run the advanced product scraper example.
  • pnpm cli -- --help: Run the toolkit's CLI locally to see its commands. (Note: pnpm cli itself might run a default command; use -- to pass arguments to the CLI script).

🎯 Development Workflow

  1. Create Feature Branch

    git checkout -b feat/your-feature-name
  2. Make Changes following conventional commits:

    # Examples of good commit messages
    git commit -m "feat: add retry mechanism to browser pool"
    git commit -m "fix: resolve memory leak in scraper engine"  
    git commit -m "docs: add configuration examples"
    git commit -m "test: add integration tests for API scraper"
  3. Validate Changes

    pnpm run lint      # Check code style
    pnpm run test      # Run test suite
    pnpm run build     # Verify build works
  4. Submit Pull Request

    • Push your branch: git push origin feat/your-feature-name
    • Open PR against main branch
    • CI will automatically validate and show release preview
    • Address any feedback from reviewers

πŸ“ Commit Message Convention

We use Conventional Commits for automatic versioning:

<type>(<scope>): <description>

[optional body]

[optional footer(s)]

Types:

  • feat: New feature (minor version bump)
  • fix: Bug fix (patch version bump)
  • docs: Documentation changes
  • style: Code style changes (formatting, etc.)
  • refactor: Code refactoring
  • test: Adding or updating tests
  • chore: Maintenance tasks
  • perf: Performance improvements
  • ci: CI/CD changes

Breaking Changes:

feat!: redesign configuration API

BREAKING CHANGE: The configuration schema has changed.
Migration guide available in MIGRATION.md

πŸ›‘οΈ Quality Gates

All contributions must pass:

  • Pre-commit hooks: Linting and formatting
  • Commit message validation: Conventional commit format
  • CI pipeline: Tests, build, examples validation
  • Code review: Maintainer approval required

🚫 What Not to Do

  • ❌ Don't manually update version numbers
  • ❌ Don't edit CHANGELOG.md directly
  • ❌ Don't skip conventional commit format
  • ❌ Don't push directly to main branch
  • ❌ Don't bypass pre-commit hooks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Built on top of Crawlee and Playwright
  • Inspired by the need for a more configurable and extensible scraping toolkit
  • Thanks to all contributors and the open-source community

📞 Support

Made with ❤️ by devalexanderdaza
