Skip to content

Latest commit

 

History

History
156 lines (116 loc) · 4.79 KB

File metadata and controls

156 lines (116 loc) · 4.79 KB

TTB COLA Data Scraping Pipeline

A Python pipeline for scraping Certificate of Label Approval (COLA) data and images from the TTB Public COLA Registry, storing structured data in DuckDB with optional Google Cloud Storage backup.

Features

  • Automated COLA Discovery: Searches TTB's public COLA registry by date ranges
  • Comprehensive Data Extraction: Downloads COLA metadata, form details, and label images
  • Smart Resumption: Automatically resumes from where previous runs left off
  • Cloud Storage: Optional Google Cloud Storage integration of label image storage
  • Robust Error Handling: Retry logic, connection management, and detailed logging
  • Batch Processing: Efficient bulk operations for database updates

Quick Start

Prerequisites

  • Python 3.13+
  • uv package manager
  • MotherDuck account and token
  • (Optional) Google Cloud Storage service account

Installation

  1. Clone and setup environment:

    git clone <repository-url>
    cd cola-data-scraping
    uv sync
  2. Configure environment variables:

    cp example.env .env
    # Edit .env with your MotherDuck token
  3. Setup Google Cloud Storage (Optional):

    # Place your service account key file
    cp your-service-account.json gcp-service-account.json

Basic Usage

Run the full pipeline:

uv run code/scrape_cola_images.py

Common options:

# Limit processing for testing
uv run code/scrape_cola_images.py --limit 100

# Adjust batch sizes
uv run code/scrape_cola_images.py --cola-batch-size 500 --image-batch-size 500

Project Structure

cola-data-scraping/
├── code/
│   ├── scrape_cola_images.py   # Main scraping pipeline
│   ├── parse_cola_html.py      # HTML form parsing utilities
│   ├── process_cola_htmls.py   # Batch HTML processing
│   ├── upsert_to_db.py         # Database operations
│   └── utils.py                # Common utilities
├── docs/
│   ├── cola_data_model.dbml    # Database schema (DBML)
│   └── cola_data_model.sql     # Database schema (SQL)
├── data/                       # Downloaded images and data
├── logs/                       # Application logs
└── pyproject.toml              # Project dependencies

Configuration

Environment Variables

Variable Description Required
MOTHERDUCK_TOKEN MotherDuck database access token Yes

Script Configuration

Key configuration variables in scrape_cola_images.py:

# Date range (leave None for automatic detection)
COMPLETED_START_DATE = None  # Format: "MM/DD/YYYY"
COMPLETED_END_DATE = None    # Format: "MM/DD/YYYY"

# Processing limits
LIMIT_FOR_TESTING = None     # Limit to this number of records for testing or resource usage

# Batch sizes
COLA_BATCH_SIZE = 1000      # COLA records per batch
IMAGE_BATCH_SIZE = 1000     # Image records per batch

# Google Cloud Storage
GCS_ENABLED = True          # Enable/disable GCS uploads
BUCKET_NAME = 'ttb-cola-images'

Data Pipeline

The pipeline performs these operations:

  1. Date Range Detection: Automatically determines date ranges based on existing data
  2. COLA Search: Queries TTB registry for COLAs in specified date ranges
  3. Metadata Extraction: Downloads and parses COLA search results
  4. Detail Page Processing: Fetches individual COLA pages for complete data
  5. Image Downloads: Downloads label images with metadata
  6. Database Storage: Stores all data in MotherDuck database
  7. Cloud Backup: (Optional) Uploads images to Google Cloud Storage

Database Schema

The pipeline creates these tables in MotherDuck:

  • colas: Core COLA metadata (ID, permit, brand, dates, etc.)
  • cola_images: Image metadata and download information
  • cola_parsed_data: Structured form data extracted from HTML

See docs/cola_data_model.sql for complete schema.

Command Line Options

uv run code/scrape_cola_images.py [OPTIONS]

Options:
  --limit INTEGER              Limit number of COLAs to process (for testing)
  --cola-batch-size INTEGER    Batch size for COLA record inserts (default: 1000)
  --image-batch-size INTEGER   Batch size for image record inserts (default: 1000)
  --help                       Show help message

Logging

  • File logs: logs/cola_scraper.log (INFO level, rotated at 10MB)
  • Console logs: ERROR level only
  • Debug files: Saved to data/cola_images/DEBUG_*.html on errors

Dependencies

Key dependencies managed via pyproject.toml:

  • beautifulsoup4: HTML parsing
  • requests: HTTP client
  • duckdb: Database operations
  • pandas: Data manipulation
  • google-cloud-storage: Cloud storage (optional)
  • tqdm: Progress bars