A Python pipeline for scraping Certificate of Label Approval (COLA) data and images from the TTB Public COLA Registry, storing structured data in DuckDB with optional Google Cloud Storage backup.
- Automated COLA Discovery: Searches TTB's public COLA registry by date ranges
- Comprehensive Data Extraction: Downloads COLA metadata, form details, and label images
- Smart Resumption: Automatically resumes from where previous runs left off
- Cloud Storage: Optional Google Cloud Storage integration of label image storage
- Robust Error Handling: Retry logic, connection management, and detailed logging
- Batch Processing: Efficient bulk operations for database updates
- Python 3.13+
- uv package manager
- MotherDuck account and token
- (Optional) Google Cloud Storage service account
-
Clone and setup environment:
git clone <repository-url> cd cola-data-scraping uv sync
-
Configure environment variables:
cp example.env .env # Edit .env with your MotherDuck token -
Setup Google Cloud Storage (Optional):
# Place your service account key file cp your-service-account.json gcp-service-account.json
Run the full pipeline:
uv run code/scrape_cola_images.pyCommon options:
# Limit processing for testing
uv run code/scrape_cola_images.py --limit 100
# Adjust batch sizes
uv run code/scrape_cola_images.py --cola-batch-size 500 --image-batch-size 500cola-data-scraping/
├── code/
│ ├── scrape_cola_images.py # Main scraping pipeline
│ ├── parse_cola_html.py # HTML form parsing utilities
│ ├── process_cola_htmls.py # Batch HTML processing
│ ├── upsert_to_db.py # Database operations
│ └── utils.py # Common utilities
├── docs/
│ ├── cola_data_model.dbml # Database schema (DBML)
│ └── cola_data_model.sql # Database schema (SQL)
├── data/ # Downloaded images and data
├── logs/ # Application logs
└── pyproject.toml # Project dependencies
| Variable | Description | Required |
|---|---|---|
MOTHERDUCK_TOKEN |
MotherDuck database access token | Yes |
Key configuration variables in scrape_cola_images.py:
# Date range (leave None for automatic detection)
COMPLETED_START_DATE = None # Format: "MM/DD/YYYY"
COMPLETED_END_DATE = None # Format: "MM/DD/YYYY"
# Processing limits
LIMIT_FOR_TESTING = None # Limit to this number of records for testing or resource usage
# Batch sizes
COLA_BATCH_SIZE = 1000 # COLA records per batch
IMAGE_BATCH_SIZE = 1000 # Image records per batch
# Google Cloud Storage
GCS_ENABLED = True # Enable/disable GCS uploads
BUCKET_NAME = 'ttb-cola-images'The pipeline performs these operations:
- Date Range Detection: Automatically determines date ranges based on existing data
- COLA Search: Queries TTB registry for COLAs in specified date ranges
- Metadata Extraction: Downloads and parses COLA search results
- Detail Page Processing: Fetches individual COLA pages for complete data
- Image Downloads: Downloads label images with metadata
- Database Storage: Stores all data in MotherDuck database
- Cloud Backup: (Optional) Uploads images to Google Cloud Storage
The pipeline creates these tables in MotherDuck:
colas: Core COLA metadata (ID, permit, brand, dates, etc.)cola_images: Image metadata and download informationcola_parsed_data: Structured form data extracted from HTML
See docs/cola_data_model.sql for complete schema.
uv run code/scrape_cola_images.py [OPTIONS]
Options:
--limit INTEGER Limit number of COLAs to process (for testing)
--cola-batch-size INTEGER Batch size for COLA record inserts (default: 1000)
--image-batch-size INTEGER Batch size for image record inserts (default: 1000)
--help Show help message- File logs:
logs/cola_scraper.log(INFO level, rotated at 10MB) - Console logs: ERROR level only
- Debug files: Saved to
data/cola_images/DEBUG_*.htmlon errors
Key dependencies managed via pyproject.toml:
- beautifulsoup4: HTML parsing
- requests: HTTP client
- duckdb: Database operations
- pandas: Data manipulation
- google-cloud-storage: Cloud storage (optional)
- tqdm: Progress bars