Test Suite Documentation

This document provides detailed information about the test suite for the Embucket Integration Test Framework.

Overview

The test suite lives in the tests/ directory. It contains integration tests and benchmarks that validate Embucket against standard datasets (ClickBench and TPC-H) and common use cases such as file-based storage and MERGE upserts.

Test Files

tests/example.sh

Purpose: Basic integration test demonstrating the standard test workflow.

What it tests:

Docker service startup
Database and schema initialization
ClickBench data loading (partitioned)
Spark Iceberg table creation
Query execution on both Embucket and Spark
Data equality verification between catalogs

Usage:

sh tests/example.sh

Flow:

Start Docker services with up
Initialize database with setup
Load ClickBench partitioned data
Create corresponding Spark tables
Run sample queries on both catalogs
Verify data equality
Clean up with down

When to use: As a template for creating new integration tests or for quick validation that the basic framework is working correctly.

tests/clickbench.sh

Purpose: Comprehensive ClickBench benchmark test.

What it tests:

Full ClickBench dataset loading
Execution of all 43 ClickBench queries
Performance measurement and benchmarking
Embucket's web analytics query performance

Usage:

sh tests/clickbench.sh

Data:

Uses 100 partitioned Parquet files
Total dataset: ~100 million rows
Queries: 43 web analytics queries from clickbench/queries.sql

Output:

Query execution times
Results saved to clickbench/results.csv

When to use: For comprehensive performance testing of Embucket against the ClickBench standard or when validating web analytics query patterns.

tests/clickbench_file.sh

Purpose: Test file-based storage integration (local filesystem) instead of S3.

What it tests:

File-based external volume creation
Data loading from local filesystem paths
Comparison with S3-based loading approach

Usage:

sh tests/clickbench_file.sh

Key difference: Uses volume_file and database_file commands from make.sh instead of the S3-based equivalents.

When to use:

Testing file-based storage backends
Validating that Embucket works with local filesystem access
Comparing performance between S3 and file-based ingestion

tests/tpch.sh

Purpose: TPC-H decision support benchmark test.

What it tests:

TPC-H data loading (scale factor 10 or 100)
Execution of all 22 TPC-H queries
Complex join and aggregation performance
Business intelligence query patterns

Usage:

sh tests/tpch.sh

Data:

Scale factor 10: ~10GB of data
Scale factor 100: ~100GB of data
Queries: 22 TPC-H decision support queries

Output:

Query execution times
Results saved to tpch/results.csv

When to use:

Comprehensive performance testing of decision support workloads
Validating complex joins, aggregations, and analytical queries
Comparing Embucket performance against TPC-H standards

tests/merge.sh

Purpose: Test MERGE (upsert) operations.

What it tests:

MERGE statement functionality
Update and insert operations in a single statement
Data consistency after MERGE operations
Incremental data loading patterns

Usage:

sh tests/merge.sh

What it validates:

Correct handling of matching rows (updates)
Correct handling of non-matching rows (inserts)
Data integrity after merge operations
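
For orientation, the kind of statement these checks exercise can be sketched as follows. This is an illustrative sketch, not the contents of tests/merge.sh: the merge_sql helper and the id and amount columns are hypothetical, and the statement uses the common Snowflake-style MERGE form.

```shell
# merge_sql TARGET SOURCE (hypothetical helper): print a MERGE statement that
# updates rows matching on id and inserts rows that have no match.
# The id and amount columns are illustrative assumptions.
merge_sql() {
    cat <<EOF
MERGE INTO $1 t
USING $2 s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
EOF
}
```

Assuming the snowsql function from make.sh is loaded, the statement could be run with something like snowsql "$(merge_sql demo.embucket.target demo.embucket.staging)", where both table names are placeholders.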

When to use:

Testing incremental data loading workflows
Validating MERGE statement compatibility
Testing upsert patterns common in data warehousing

Python Helper Scripts

Located in the scripts/ directory, these scripts assist with Iceberg table creation for Spark compatibility testing.

scripts/create_iceberg.py

Purpose: Create a single Iceberg table from the ClickBench data.

Usage:

python scripts/create_iceberg.py

What it does:

Connects to Spark with Iceberg catalog support
Creates a single Iceberg table from ClickBench data
Provides a Spark Iceberg baseline for comparison with Embucket

scripts/create_iceberg_partitioned.py

Purpose: Create partitioned Iceberg tables from ClickBench data.

Usage:

python scripts/create_iceberg_partitioned.py

What it does:

Creates partitioned Iceberg tables
Uses the same partitioning scheme as the source data
Enables partition pruning tests and performance comparisons

Test Workflow Guidelines

Standard Test Structure

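Most tests follow the pattern below. The script relies on source, which is a bash builtin, so on systems where sh is not bash (for example dash) run the tests with bash instead of sh.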

#!/bin/bash
source ./make.sh
source ./clickbench.sh  # or tpch.sh

# 1. Start infrastructure
up

# 2. Initialize database
setup

# 3. Load test data
clickbench_partitioned  # or other data loading function

# 4. Run test operations
# Your test-specific logic here

# 5. Verify results
equality table1 table2  # or custom verification

# 6. Cleanup
down
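
If a step in this pattern fails partway through, the final down may never run and containers can be left behind. The sketch below shows one possible guard using an EXIT trap. The run_test wrapper is hypothetical, and the up, setup, and down definitions are stand-ins so the snippet runs on its own; in a real test they would come from make.sh and are assumed to return non-zero on failure.

```shell
#!/bin/sh
# Stand-ins for the framework functions (assumption: the real ones are
# provided by make.sh and return non-zero on failure).
up()    { echo "up"; }
setup() { echo "setup"; }
down()  { echo "down"; }

# run_test STEP... (hypothetical helper): run each named step in order inside
# a subshell and always call down when that subshell exits, pass or fail.
run_test() {
    (
        trap down EXIT
        for step in "$@"; do
            "$step" || exit 1   # stop at the first failing step
        done
    )
}
```

A test could then call run_test up setup clickbench_partitioned and rely on the exit status, since the trap preserves the status of the failing step.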

Creating Custom Tests

To create a new test:

Copy an existing test as a template:
```
cp tests/example.sh tests/my_test.sh
```

Source the required script libraries:
```
source ./make.sh
source ./clickbench.sh  # and/or tpch.sh
```

Implement your test logic following the standard structure.

Make the test executable:
```
chmod +x tests/my_test.sh
```

Run your test:
```
sh tests/my_test.sh
```

Benchmarking

Running Benchmarks

ClickBench:

# After loading data
sh clickbench.sh benchmark

TPC-H:

# After loading data
sh tpch.sh benchmark

Interpreting Results

Benchmark results are saved to CSV files:

clickbench/results.csv - ClickBench query timings
tpch/results.csv - TPC-H query timings

Format:

query_number,execution_time_seconds,status
1,0.234,success
2,0.567,success
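
A results file in this format can be summarized with a few lines of awk. The sketch below assumes the three-column layout shown above, including the header row; the summarize_results name is hypothetical and not part of the framework.

```shell
# summarize_results FILE (hypothetical helper): print the number of queries,
# the number whose status is not "success", and the total execution time.
summarize_results() {
    awk -F, '
        NR == 1 { next }                 # skip the header row
        { n++; total += $2 }             # count queries and sum their times
        $3 != "success" { failed++ }     # count non-successful queries
        END { printf "queries=%d failed=%d total_seconds=%.3f\n", n, failed, total }
    ' "$1"
}
```

For example, summarize_results clickbench/results.csv prints a single summary line for the ClickBench run.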

Performance Considerations

First run: May be slower due to cold caches
Subsequent runs: Typically faster due to warm caches
Scale factors: Higher scale factors (TPC-H 100 vs. 10) significantly increase query times
Partitioning: Partitioned data can improve query performance when queries benefit from partition pruning
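
To see the cold-versus-warm effect directly, the same command can be timed over several consecutive runs. The sketch below is a rough illustration with whole-second resolution only; the time_runs helper is hypothetical, and it relies on date +%s being available.

```shell
# time_runs N CMD... (hypothetical helper): run CMD N times and print
# "run,elapsed_seconds" for each run. Output of CMD itself is discarded.
time_runs() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "$i,$((end - start))"
        i=$((i + 1))
    done
}
```

For example, time_runs 3 sh make.sh snowsql "SELECT COUNT(*) FROM demo.embucket.hits" would print three lines, and the first is the one most likely to reflect cold caches.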

Troubleshooting

Common Issues

Test fails at Docker startup:

Check that Docker is running: docker ps
Check port conflicts: Ensure ports 3000, 9000, 9001, 8474 are available
Review Docker logs: docker-compose logs

Data loading fails:

Verify data exists in correct directory (clickbench/ or tpch/)
Check MinIO is running: docker ps | grep minio
Verify S3 credentials are set: source s3.sh && echo $AWS_ACCESS_KEY_ID

Query execution fails:

Check Embucket is running: docker ps | grep embucket
Verify connection: sh make.sh snowsql "SELECT 1"
Check Snowflake CLI is installed: snow --version
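
Several of the checks above come down to whether a required tool is installed. A small preflight function can report all missing tools at once before a test starts. This is a sketch; the missing_commands name is hypothetical, and the list of tools to pass in is up to the caller.

```shell
# missing_commands CMD... (hypothetical helper): print each command that is
# not found on PATH and return non-zero if any of them is missing.
missing_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

Based on the tools mentioned in this section, a typical call would be missing_commands docker docker-compose snow.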

Data equality check fails:

This can be expected in some test scenarios, for example when the two catalogs represent the same data differently
Investigate with manual queries to understand differences
Check for precision/data type mismatches

Advanced Testing Patterns

Testing Storage Backends

Compare S3 vs file-based storage:

# S3-based (default)
sh tests/example.sh

# File-based
sh tests/clickbench_file.sh

Testing Different Data Sizes

For ClickBench:

# Small dataset (first partition only)
sh clickbench.sh clickbench_partitioned_small

# Full dataset (all 100 partitions)
sh clickbench.sh clickbench_partitioned

For TPC-H:

# Smaller scale (10GB)
sh tpch.sh load 10

# Larger scale (100GB)
sh tpch.sh load 100
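
When the scale factor comes from a variable (for example in CI), it can help to validate it before starting a long load. The sketch below accepts only the two scale factors this document mentions; the check_scale_factor name is hypothetical, and other values may be valid in the real script.

```shell
# check_scale_factor SF (hypothetical helper): succeed only for the scale
# factors documented here (10 and 100); otherwise explain on stderr and fail.
check_scale_factor() {
    case "$1" in
        10|100) return 0 ;;
        *)
            echo "unsupported TPC-H scale factor: '$1' (expected 10 or 100)" >&2
            return 1
            ;;
    esac
}
```

A caller could then write check_scale_factor "$SF" && sh tpch.sh load "$SF", where SF is a variable of your choosing.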

Testing Cross-Catalog Queries

# Create tables in both Embucket and Spark
clickbench_partitioned
clickbench_spark_partitioned

# Query both catalogs
snowsql "SELECT COUNT(*) FROM demo.embucket.hits"
sparksql "SELECT COUNT(*) FROM demo.spark.hits"

# Verify equality
equality demo.embucket.hits demo.spark.hits

Continuous Integration

For CI/CD pipelines, consider:

Quick smoke test:
```
sh tests/example.sh
```

Comprehensive benchmark:
```
sh tests/clickbench.sh && sh tests/tpch.sh
```

Clean up:
```
sh make.sh down
```

Resource monitoring:
- Monitor Docker container memory and CPU usage
- Track query execution times across builds
- Alert on performance regressions
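
One way to alert on regressions is to compare the current results CSV against a stored baseline and fail the build when a query slows down by more than a chosen factor. The sketch below assumes both files use the results format described under Interpreting Results; the find_regressions name, the baseline file, and the default factor of 1.5 are assumptions for illustration.

```shell
# find_regressions BASELINE CURRENT [FACTOR] (hypothetical helper): print each
# query whose time in CURRENT exceeds FACTOR times its time in BASELINE
# (default 1.5) and return non-zero if any such query is found.
find_regressions() {
    awk -F, -v factor="${3:-1.5}" '
        FNR == 1 { next }                    # skip the header of each file
        NR == FNR { base[$1] = $2; next }    # first file: remember baseline times
        ($1 in base) && $2 > base[$1] * factor {
            printf "query %s: %.3fs -> %.3fs\n", $1, base[$1], $2
            found = 1
        }
        END { exit (found ? 1 : 0) }
    ' "$1" "$2"
}
```

In a pipeline this might be called as find_regressions baseline/clickbench.csv clickbench/results.csv 2.0, where the baseline path is a placeholder for wherever earlier results are kept.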
