Skip to content

TTB-OA/cola-data-scraping

Repository files navigation

TTB COLA Data Scraping Pipeline

A Python pipeline for scraping Certificate of Label Approval (COLA) data and images from the TTB Public COLA Registry, storing structured data in DuckDB with optional Google Cloud Storage backup.

Features

  • Automated COLA Discovery: Searches TTB's public COLA registry by date ranges
  • Comprehensive Data Extraction: Downloads COLA metadata, form details, and label images
  • Smart Resumption: Automatically resumes from where previous runs left off
  • Cloud Storage: Optional Google Cloud Storage integration of label image storage
  • Robust Error Handling: Retry logic, connection management, and detailed logging
  • Batch Processing: Efficient bulk operations for database updates

Quick Start

Prerequisites

  • Python 3.13+
  • uv package manager
  • MotherDuck account and token
  • (Optional) Google Cloud Storage service account

Installation

  1. Clone and setup environment:

    git clone <repository-url>
    cd cola-data-scraping
    uv sync
  2. Configure environment variables:

    cp example.env .env
    # Edit .env with your MotherDuck token
  3. Setup Google Cloud Storage (Optional):

    # Place your service account key file
    cp your-service-account.json gcp-service-account.json

Basic Usage

Run the full pipeline:

uv run code/scrape_cola_images.py

Common options:

# Limit processing for testing
uv run code/scrape_cola_images.py --limit 100

# Adjust batch sizes
uv run code/scrape_cola_images.py --cola-batch-size 500 --image-batch-size 500

Project Structure

cola-data-scraping/
├── code/
│   ├── scrape_cola_images.py   # Main scraping pipeline
│   ├── parse_cola_html.py      # HTML form parsing utilities
│   ├── process_cola_htmls.py   # Batch HTML processing
│   ├── upsert_to_db.py         # Database operations
│   └── utils.py                # Common utilities
├── docs/
│   ├── cola_data_model.dbml    # Database schema (DBML)
│   └── cola_data_model.sql     # Database schema (SQL)
├── data/                       # Downloaded images and data
├── logs/                       # Application logs
└── pyproject.toml              # Project dependencies

Configuration

Environment Variables

Variable Description Required
MOTHERDUCK_TOKEN MotherDuck database access token Yes

Script Configuration

Key configuration variables in scrape_cola_images.py:

# Date range (leave None for automatic detection)
COMPLETED_START_DATE = None  # Format: "MM/DD/YYYY"
COMPLETED_END_DATE = None    # Format: "MM/DD/YYYY"

# Processing limits
LIMIT_FOR_TESTING = None     # Limit to this number of records for testing or resource usage

# Batch sizes
COLA_BATCH_SIZE = 1000      # COLA records per batch
IMAGE_BATCH_SIZE = 1000     # Image records per batch

# Google Cloud Storage
GCS_ENABLED = True          # Enable/disable GCS uploads
BUCKET_NAME = 'ttb-cola-images'

Data Pipeline

The pipeline performs these operations:

  1. Date Range Detection: Automatically determines date ranges based on existing data
  2. COLA Search: Queries TTB registry for COLAs in specified date ranges
  3. Metadata Extraction: Downloads and parses COLA search results
  4. Detail Page Processing: Fetches individual COLA pages for complete data
  5. Image Downloads: Downloads label images with metadata
  6. Database Storage: Stores all data in MotherDuck database
  7. Cloud Backup: (Optional) Uploads images to Google Cloud Storage

Database Schema

The pipeline creates these tables in MotherDuck:

  • colas: Core COLA metadata (ID, permit, brand, dates, etc.)
  • cola_images: Image metadata and download information
  • cola_parsed_data: Structured form data extracted from HTML

See docs/cola_data_model.sql for complete schema.

Command Line Options

uv run code/scrape_cola_images.py [OPTIONS]

Options:
  --limit INTEGER              Limit number of COLAs to process (for testing)
  --cola-batch-size INTEGER    Batch size for COLA record inserts (default: 1000)
  --image-batch-size INTEGER   Batch size for image record inserts (default: 1000)
  --help                       Show help message

Logging

  • File logs: logs/cola_scraper.log (INFO level, rotated at 10MB)
  • Console logs: ERROR level only
  • Debug files: Saved to data/cola_images/DEBUG_*.html on errors

Dependencies

Key dependencies managed via pyproject.toml:

  • beautifulsoup4: HTML parsing
  • requests: HTTP client
  • duckdb: Database operations
  • pandas: Data manipulation
  • google-cloud-storage: Cloud storage (optional)
  • tqdm: Progress bars

About

Scrapes label characteristics and image files from the TTB Public COLA Registry and saves to database.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages