Adaptive Compression System

The S3 Cross-Region Compressor includes an intelligent adaptive compression system that automatically optimizes compression levels based on historical performance and cost-benefit analysis. This system is implemented by the Source Region Service component and represents one of the core features of the solution. This document explains how this system works and how to monitor and tune its performance.

Note: For implementation details specific to the Source Region Service's use of adaptive compression, see the Source Region documentation.

Overview

Rather than using a fixed compression level, the system dynamically adjusts compression settings to find the optimal balance between:

Compression ratio: Higher compression reduces transfer costs
Compression time: Lower CPU usage reduces compute costs
Data characteristics: Different data types benefit from different compression levels

The goal is to maximize the "benefit score" - a metric that balances transfer cost savings against computational costs.

System Architecture

flowchart TB
    NewObject[New Object] --> CompressionManager
    CompressionManager --> GetSettings[Get Settings from DynamoDB]
    GetSettings --> Decision{Settings Exist?}
    Decision -->|Yes| GetLevel[Get Optimal Level]
    Decision -->|No| CreateDefault[Create Default Settings]
    CreateDefault --> DefaultLevel[Use Default Level 12]
    GetLevel --> VersionCount[Calculate Version-Based Decay]
    VersionCount --> DecayRate[Adjust Exploration Rate]
    DecayRate --> ExplorationCheck{Run Exploration?}
    ExplorationCheck -->|Yes 12.5-25%| ExploreType[Exploration Tier]
    ExplorationCheck -->|No 75-87.5%| UseOptimal[Use Optimal Level]
    ExploreType -->|60% of budget ±1| TestAdjacent[Try Adjacent Level]
    ExploreType -->|25% of budget ±2| TestDouble[Try Double Jump]
    ExploreType -->|15% of budget ±3| TestTriple[Try Triple Jump]
    DefaultLevel --> Compress[Compress Object]
    TestAdjacent --> Compress
    TestDouble --> Compress
    TestTriple --> Compress
    UseOptimal --> Compress
    Compress --> MeasureResults[Measure Results]
    MeasureResults --> CalculateCostBenefit[Calculate Cost-Benefit]
    CalculateCostBenefit --> UpdateDynamoDB[Update DynamoDB]
    UpdateDynamoDB --> CheckOptimize{Need to Recalculate?}
    CheckOptimize -->|Yes| RecalculateOptimal[Calculate New Optimal Level]
    CheckOptimize -->|No| Complete[Complete]

Components

1. Compression Manager (`compression_manager.py`)

The Compression Manager acts as a facade for the adaptive compression system. It:

Retrieves optimal compression levels from DynamoDB
Updates metrics after compression operations
Implements retry logic and error handling

Key functions:

get_compression_level(): Retrieves the optimal level for a bucket/prefix
update_compression_metrics(): Records metrics and updates optimal levels
occasionally_test_new_level(): Implements the exploration strategy

2. Compression Optimizer (`compression_optimizer.py`)

The Compression Optimizer analyzes historical compression metrics to determine the optimal compression level for each bucket/prefix pattern. It:

Calculates weighted averages of benefit scores
Trims outliers to improve stability
Implements exploration vs. exploitation strategy

Key functions:

calculate_optimal_level(): Analyzes metrics to find the best compression level
occasionally_test_new_level(): Occasionally tests adjacent levels (exploration)

3. Cost-Benefit Calculator (`cost_benefit_calculator.py`)

The Cost-Benefit Calculator evaluates the tradeoff between:

Compute costs (time spent compressing)
Transfer savings (reduced data size)

Key functions:

calculate_metrics(): Computes all metrics for a compression operation
format_for_dynamodb(): Formats metrics for storage in DynamoDB

4. Compression Settings Repository (`compression_settings_repository.py`)

The Compression Settings Repository handles:

Storage of compression settings in DynamoDB
Retrieval of settings with caching
Optimistic concurrency control for updates

Key functions:

get_settings(): Retrieves settings for a specific key
update_settings(): Updates settings with optimistic locking
create_settings(): Creates initial settings for a new key

How Compression Level Selection Works

1. Initial Compression Level Selection

When an object is processed for the first time:

The Compression Manager checks if settings exist for the bucket/prefix
If no settings exist, it creates a new entry with the default level (12)
If settings exist, it retrieves the current optimal level based on per-object benefit calculations

2. CPU-Based Level Adjustment

The system selects compression levels based on relative CPU performance:

First, it identifies the best performing level based on average benefit per object
Then it adjusts this level based on the container's CPU performance relative to the fleet average:
- For faster CPUs (< 90% of average): Use best level + 1 (higher compression)
- For slower CPUs (> 110% of average): Use best level - 1 (lower compression)
- For average CPUs: Use the best level as-is
This ensures optimal resource utilization across a heterogeneous fleet of containers with varying performance characteristics.

3. Multi-Tier Adaptive Exploration with Version-Based Decay

To continuously improve and adapt to changing data patterns, the system uses a multi-tier exploration approach that balances discovery of new optimal levels with avoiding local optima. This approach also features an adaptive decay mechanism that gradually reduces exploration as more data is gathered:

Base Exploration Rate: The system starts with a 25% total exploration rate divided proportionally:
- 15% chance to explore adjacent levels (±1)
- 6.25% chance to explore double-jump levels (±2)
- 3.75% chance to explore triple-jump levels (±3)
Version-Based Decay: As more operations are processed (tracked by the version count):
- The exploration rate decays by 2% per 1000 versions processed
- Maximum decay is capped at 50% (reducing to minimum 12.5% exploration at 25,000+ versions)
- The proportional distribution across tiers is maintained throughout the decay
Balanced Exploration-Exploitation: This approach ensures:
- Aggressive exploration in the early stages when statistical confidence is low
- Gradual shift toward exploitation as the system builds confidence in its optimal levels
- Persistent minimum exploration to adapt to potential changes in data patterns
The remaining portion (75% initially, increasing to 87.5% over time) is used for exploitation of the chosen optimal level

3. Metrics Collection & Analysis

After each compression operation, the system:

Records original size, compressed size, and processing time
Uses AWS transfer costs to calculate transfer savings
Computes the benefit score as: net_benefit (transfer savings minus compute cost)
Stores these metrics in DynamoDB for future level selection

4. Optimal Level Recalculation

Optimal levels are recalculated:

Always when exploring a different compression level
Using average benefit per object calculation based on file count weighting
Requires minimum 10 trials per level for statistical significance

Compression Settings Storage

Compression settings are stored in a DynamoDB table with the following aggregated structure:

Attribute	Type	Description
BucketPrefix	String	Primary key - bucket/prefix pattern
SumCpuFactor	Number	Sum of all CPU factors recorded (for fleet average calculation)
LevelStats	Map	Map of level statistics as described below
TotalProcessedFiles	Number	Total number of objects processed
LastUpdated	Number	Timestamp of last update
Version	Number	Version number for optimistic concurrency control

The LevelStats map contains entries for each compression level that has been used:

"LevelStats": {
  "10": {
    "sum_benefit": 1.2345,
    "trials": 120,
    "objects": 1200
  },
  "11": {
    "sum_benefit": 2.3456,
    "trials": 210,
    "objects": 2100
  }
}

Each level statistic includes:

sum_benefit: Running sum of benefit scores for this level
trials: Number of batches processed with this level
objects: Total number of individual objects processed with this level

Cost-Benefit Calculation

The cost-benefit calculation uses the following factors:

Transfer Cost Calculation

transfer_savings = bytes_saved * transfer_cost_per_byte * num_regions

Where:

bytes_saved = original_size - compressed_size
transfer_cost_per_byte is derived from AWS pricing (configured via TRANSFER_COST_FACTOR)
num_regions is the number of target regions

Compute Cost Calculation

compute_cost = normalized_processing_time * compute_cost_per_minute / 60 * 1.025

Where:

normalized_processing_time is measured in seconds and normalized using CPU benchmarking
compute_cost_per_minute is derived from Fargate pricing (configured via FARGATE_COST_PER_MINUTE)
1.025 is a 2.5% overhead factor for compute operations

CPU Performance Normalization

The system automatically benchmarks CPU performance at startup to normalize metrics across different instance types:

Run a standardized zstd compression benchmark at startup
Calculate a CPU factor (higher for slower CPUs, lower for faster CPUs)
Apply this factor to normalize processing times across different hardware
Use the normalized time for cost-benefit calculations

This ensures consistent optimization decisions regardless of the underlying instance type or Fargate hardware generation.

Net Benefit

net_benefit = transfer_savings - compute_cost

Benefit Score

The final benefit score calculation is straightforward:

benefit_score = net_benefit

This focuses on the actual cost savings (transfer savings minus compute cost) as the primary optimization metric.

Memory Optimization Techniques

The compression system employs several memory optimization techniques to ensure efficient use of resources:

Dynamic Buffer Sizing

def get_available_memory():
    """Detect available memory in the container environment."""
    try:
        return psutil.virtual_memory().available
    except Exception:
        return 2 * (1024 ** 3)  # Assume 2GB as safe default

def calculate_buffer_sizes(available_memory):
    """Calculate optimal buffer sizes based on available memory."""
    # Use up to 15% of available memory for compression
    max_buffer_memory = available_memory * 0.15
    
    # 45% for read buffer, 55% for write buffer
    read_size = int(max_buffer_memory * 0.45)
    write_size = int(max_buffer_memory * 0.55)
    return read_size, write_size

At module initialization, the system:

Detects available system memory
Allocates up to 15% of available memory for compression buffers
Splits this allocation between read (45%) and write (55%) buffers
Uses these dynamic buffer sizes for all compression operations

Stream Processing

Rather than loading entire files into memory:

Files are processed in chunks using stream-based compression
Files are deleted immediately after being added to the TAR archive
The TAR file is deleted immediately after successful compression
The compressed archive is uploaded as soon as compression is complete

These techniques minimize memory usage and allow the system to handle files of any size without running out of memory.

Optimistic Concurrency Control

When updating compression settings in DynamoDB, the system uses optimistic concurrency control with automatic retries:

Read the current settings with a consistent read
Extract the current version number
Calculate new optimal level and update history
Attempt to update DynamoDB with a condition on the version number
If there's a version conflict (another process updated simultaneously):
- Back off with exponential delay (0.1s * retry count)
- Retry up to 5 times
- Log warning if all retries fail

This approach handles concurrent updates gracefully in distributed environments where multiple instances might process different objects with the same bucket/prefix pattern simultaneously.

Compression Level Guidelines

ZSTD compression levels range from 1-22:

Level	Compression	Speed	Recommended Use Case
1-3	Low	Very Fast	Time-critical, already compressed data
4-9	Medium	Fast	General purpose, balanced approach
10-15	High	Medium	Default range, good balance (12 is default)
16-19	Very High	Slow	Archival, text-heavy content
20-22	Ultra High	Very Slow	Maximum compression regardless of time

Statistical Approach for Optimal Level Calculation

When calculating the optimal compression level, the system employs statistical methods to ensure reliable results:

Grouping by Level: Metrics are grouped by compression level
Minimum Sample Size: Requires at least 10 trials per level for statistical significance

Weighted Averaging: Calculates average benefit per object using file count weighting

# For each level:
avg_benefit = stats['sum_benefit'] / objects  # where objects is the total number of objects processed with this level

Best Level Selection: Chooses the level with the highest average benefit per object

This approach balances statistical reliability with adaptation to changing data patterns, ensuring the system makes informed decisions even with varying file sizes and types.

Example Adaptive Behavior

Below is an example of how the system might adapt to different data types over time:

Data Type	Initial Level	Final Level	Comp. Ratio	Time per MB	Benefit Score
JSON/Text	12	16	8.3:1	0.12s	0.087
Images (JPG)	12	3	1.1:1	0.02s	0.012
Mixed Content	12	9	3.4:1	0.05s	0.042
Video Files	12	1	1.02:1	0.01s	0.007

As shown above, the system automatically selects:

Higher compression levels for highly compressible data (text/JSON)
Lower, faster levels for already compressed data (images/video)
Balanced levels for mixed content
Files are weighted by size, so larger files influence decisions more than smaller ones

Real-world Compression Ratio Examples

The following table shows real-world compression ratios achieved by the adaptive compression system across various file types and sizes. These measurements represent average compression ratios observed in a testing environment and demonstrate how the system performs with different data characteristics.

Prefix	Average Compression Ratio
100mb-snappy-parquet	1.2
10mb-snappy-parquet	1.2
100mb-mixed*	2.22
10mb-csv	2.71
100mb-csv	2.72
10mb-mixed*	2.93
10mb-ocr	3.91
100mb-ocr	3.97
1000mb-xml	3.98
100mb-xml	4.06
10mb-xml	4.22

*Mixed is a combination of all the other file formats

These metrics reveal several key insights:

File Format Impact: Already compressed formats like Snappy Parquet show minimal additional compression (1.2:1), while structured text formats like XML achieve much higher ratios (up to 4.22:1).
Size Consistency: Similar file types maintain relatively consistent compression ratios regardless of size (e.g., 10mb-csv vs 100mb-csv).
Mixed Content Behavior: Mixed content achieves moderate compression (2.22:1 to 2.93:1), reflecting the adaptive system's ability to balance different data types within the same files.
OCR Content: OCR-generated content (typically text with formatting) shows excellent compression potential (3.91:1 to 3.97:1).
XML Efficiency: XML files achieve the highest compression ratios (3.98:1 to 4.22:1) due to their verbose, tag-heavy structure.

These results were obtained using the dynamic compression optimizer, which automatically selected appropriate compression levels based on historical performance data for each file type. Note that your actual compression ratios may vary depending on the specific content characteristics and data patterns in your environment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adaptive Compression System

Overview

System Architecture

Components

1. Compression Manager (`compression_manager.py`)

2. Compression Optimizer (`compression_optimizer.py`)

3. Cost-Benefit Calculator (`cost_benefit_calculator.py`)

4. Compression Settings Repository (`compression_settings_repository.py`)

How Compression Level Selection Works

1. Initial Compression Level Selection

2. CPU-Based Level Adjustment

3. Multi-Tier Adaptive Exploration with Version-Based Decay

3. Metrics Collection & Analysis

4. Optimal Level Recalculation

Compression Settings Storage

Cost-Benefit Calculation

Transfer Cost Calculation

Compute Cost Calculation

CPU Performance Normalization

Net Benefit

Benefit Score

Memory Optimization Techniques

Dynamic Buffer Sizing

Stream Processing

Optimistic Concurrency Control

Compression Level Guidelines

Statistical Approach for Optimal Level Calculation

Example Adaptive Behavior

Real-world Compression Ratio Examples

FilesExpand file tree

ADAPTIVE_COMPRESSION.md

Latest commit

History

ADAPTIVE_COMPRESSION.md

File metadata and controls

Adaptive Compression System

Overview

System Architecture

Components

1. Compression Manager (compression_manager.py)

2. Compression Optimizer (compression_optimizer.py)

3. Cost-Benefit Calculator (cost_benefit_calculator.py)

4. Compression Settings Repository (compression_settings_repository.py)

How Compression Level Selection Works

1. Initial Compression Level Selection

2. CPU-Based Level Adjustment

3. Multi-Tier Adaptive Exploration with Version-Based Decay

3. Metrics Collection & Analysis

4. Optimal Level Recalculation

Compression Settings Storage

Cost-Benefit Calculation

Transfer Cost Calculation

Compute Cost Calculation

CPU Performance Normalization

Net Benefit

Benefit Score

Memory Optimization Techniques

Dynamic Buffer Sizing

Stream Processing

Optimistic Concurrency Control

Compression Level Guidelines

Statistical Approach for Optimal Level Calculation

Example Adaptive Behavior

Real-world Compression Ratio Examples

1. Compression Manager (`compression_manager.py`)

2. Compression Optimizer (`compression_optimizer.py`)

3. Cost-Benefit Calculator (`cost_benefit_calculator.py`)

4. Compression Settings Repository (`compression_settings_repository.py`)