TTB COLA Data Scraping Pipeline

A Python pipeline for scraping Certificate of Label Approval (COLA) data and images from the TTB Public COLA Registry, storing structured data in DuckDB with optional Google Cloud Storage backup.

Features

Automated COLA Discovery: Searches TTB's public COLA registry by date ranges
Comprehensive Data Extraction: Downloads COLA metadata, form details, and label images
Smart Resumption: Automatically resumes from where previous runs left off
Cloud Storage: Optional Google Cloud Storage integration of label image storage
Robust Error Handling: Retry logic, connection management, and detailed logging
Batch Processing: Efficient bulk operations for database updates

Quick Start

Prerequisites

Python 3.13+
uv package manager
MotherDuck account and token
(Optional) Google Cloud Storage service account

Installation

Clone and setup environment:

git clone <repository-url>
cd cola-data-scraping
uv sync

Configure environment variables:

cp example.env .env
# Edit .env with your MotherDuck token

Setup Google Cloud Storage (Optional):

# Place your service account key file
cp your-service-account.json gcp-service-account.json

Basic Usage

Run the full pipeline:

uv run code/scrape_cola_images.py

Common options:

# Limit processing for testing
uv run code/scrape_cola_images.py --limit 100

# Adjust batch sizes
uv run code/scrape_cola_images.py --cola-batch-size 500 --image-batch-size 500

Project Structure

cola-data-scraping/
├── code/
│   ├── scrape_cola_images.py   # Main scraping pipeline
│   ├── parse_cola_html.py      # HTML form parsing utilities
│   ├── process_cola_htmls.py   # Batch HTML processing
│   ├── upsert_to_db.py         # Database operations
│   └── utils.py                # Common utilities
├── docs/
│   ├── cola_data_model.dbml    # Database schema (DBML)
│   └── cola_data_model.sql     # Database schema (SQL)
├── data/                       # Downloaded images and data
├── logs/                       # Application logs
└── pyproject.toml              # Project dependencies

Configuration

Environment Variables

Variable	Description	Required
`MOTHERDUCK_TOKEN`	MotherDuck database access token	Yes

Script Configuration

Key configuration variables in scrape_cola_images.py:

# Date range (leave None for automatic detection)
COMPLETED_START_DATE = None  # Format: "MM/DD/YYYY"
COMPLETED_END_DATE = None    # Format: "MM/DD/YYYY"

# Processing limits
LIMIT_FOR_TESTING = None     # Limit to this number of records for testing or resource usage

# Batch sizes
COLA_BATCH_SIZE = 1000      # COLA records per batch
IMAGE_BATCH_SIZE = 1000     # Image records per batch

# Google Cloud Storage
GCS_ENABLED = True          # Enable/disable GCS uploads
BUCKET_NAME = 'ttb-cola-images'

Data Pipeline

The pipeline performs these operations:

Date Range Detection: Automatically determines date ranges based on existing data
COLA Search: Queries TTB registry for COLAs in specified date ranges
Metadata Extraction: Downloads and parses COLA search results
Detail Page Processing: Fetches individual COLA pages for complete data
Image Downloads: Downloads label images with metadata
Database Storage: Stores all data in MotherDuck database
Cloud Backup: (Optional) Uploads images to Google Cloud Storage

Database Schema

The pipeline creates these tables in MotherDuck:

colas: Core COLA metadata (ID, permit, brand, dates, etc.)
cola_images: Image metadata and download information
cola_parsed_data: Structured form data extracted from HTML

See docs/cola_data_model.sql for complete schema.

Command Line Options

uv run code/scrape_cola_images.py [OPTIONS]

Options:
  --limit INTEGER              Limit number of COLAs to process (for testing)
  --cola-batch-size INTEGER    Batch size for COLA record inserts (default: 1000)
  --image-batch-size INTEGER   Batch size for image record inserts (default: 1000)
  --help                       Show help message

Logging

File logs: logs/cola_scraper.log (INFO level, rotated at 10MB)
Console logs: ERROR level only
Debug files: Saved to data/cola_images/DEBUG_*.html on errors

Dependencies

Key dependencies managed via pyproject.toml:

beautifulsoup4: HTML parsing
requests: HTTP client
duckdb: Database operations
pandas: Data manipulation
google-cloud-storage: Cloud storage (optional)
tqdm: Progress bars

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TTB COLA Data Scraping Pipeline

Features

Quick Start

Prerequisites

Installation

Basic Usage

Project Structure

Configuration

Environment Variables

Script Configuration

Data Pipeline

Database Schema

Command Line Options

Logging

Dependencies

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

TTB COLA Data Scraping Pipeline

Features

Quick Start

Prerequisites

Installation

Basic Usage

Project Structure

Configuration

Environment Variables

Script Configuration

Data Pipeline

Database Schema

Command Line Options

Logging

Dependencies