Indexing Throughput Optimization Guide

Target Metrics

Single Client: > 1,000 reports/sec
Multi-Client: > 5,000 reports/sec

Optimizations Implemented

1. Elasticsearch Bulk API

Replaced individual index() calls with the bulk() API
Processes documents in chunks of 1000
Reduces network overhead and improves throughput by roughly 10-50x compared with per-document calls
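
The chunking step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper names `chunked` and `to_action`, the index name `phish-reports`, and the `url` field are assumptions. The action dictionaries follow the shape accepted by `elasticsearch.helpers.bulk`.

```python
from itertools import islice
from typing import Iterable, Iterator

CHUNK_SIZE = 1000  # chunk size stated above


def chunked(items: Iterable[dict], size: int = CHUNK_SIZE) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_action(report: dict, index: str = "phish-reports") -> dict:
    """Wrap a report in the action format used by elasticsearch.helpers.bulk.

    The index name is hypothetical.
    """
    return {"_index": index, "_source": report}


reports = [{"url": f"http://example.test/{i}"} for i in range(2500)]
batches = [[to_action(r) for r in chunk] for chunk in chunked(reports)]
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each inner list would then be sent as one bulk request, so 2,500 reports cost three round-trips instead of 2,500.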

2. Parallel Document Preparation

Uses ThreadPoolExecutor to prepare documents in parallel
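Risk score calculation for different documents runs concurrently
Reduces preparation time, most noticeably when the work waits on I/O or releases the GIL (pure-Python CPU-bound code gains less from threads in CPython)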
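
A minimal sketch of parallel preparation with ThreadPoolExecutor is shown here. The `risk_score` rule, the `prepare` helper, and the `url` field are hypothetical stand-ins, since the real scoring logic is not described in this guide.

```python
from concurrent.futures import ThreadPoolExecutor


def risk_score(report: dict) -> float:
    """Hypothetical rule-based score: one point per suspicious keyword."""
    keywords = ("login", "verify", "account")
    url = report.get("url", "").lower()
    return float(sum(word in url for word in keywords))


def prepare(report: dict) -> dict:
    """Return a copy of the report enriched with its risk score."""
    return {**report, "risk_score": risk_score(report)}


def prepare_all(reports: list[dict], max_workers: int = 8) -> list[dict]:
    # executor.map preserves input order, so results line up with `reports`.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(prepare, reports))


docs = prepare_all([
    {"url": "http://example.test/login/verify"},
    {"url": "http://example.test/home"},
])
print([d["risk_score"] for d in docs])  # [2.0, 0.0]
```

If scoring turns out to be dominated by pure-Python computation, swapping in `concurrent.futures.ProcessPoolExecutor` is one option worth measuring, since it has the same `map` interface.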

3. Async Operations

All Elasticsearch operations run in a thread pool
Non-blocking I/O for maximum concurrency
FastAPI async endpoints handle multiple requests efficiently
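
The pattern of running blocking calls in a thread pool from async code might look like the sketch below. `blocking_index` is a hypothetical stand-in for a synchronous Elasticsearch call; an `async def` FastAPI endpoint could await `index_batch` in the same way.

```python
import asyncio
import time


def blocking_index(batch: list[dict]) -> int:
    """Stand-in for a blocking Elasticsearch call; returns documents 'indexed'."""
    time.sleep(0.05)  # simulate network latency
    return len(batch)


async def index_batch(batch: list[dict]) -> int:
    # asyncio.to_thread runs the blocking call in the default thread pool,
    # so the event loop stays free to serve other requests.
    return await asyncio.to_thread(blocking_index, batch)


async def main() -> list[int]:
    batches = [[{"id": i}] * 10 for i in range(4)]
    # The four batches are indexed concurrently rather than one after another.
    return await asyncio.gather(*(index_batch(b) for b in batches))


results = asyncio.run(main())
print(results)  # [10, 10, 10, 10]
```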

4. Skip Duplicate Checks

Added skip_duplicate_check parameter for maximum throughput
When True, skips URL existence checks (saves ~50-100ms per document)
Use for bulk data ingestion where duplicates are acceptable
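
As an illustration of what the flag changes, the sketch below filters out known URLs unless the check is skipped. The function name `select_for_indexing` and the `url_exists` callback are assumptions; in the real service the existence check would be an Elasticsearch query.

```python
from typing import Callable


def select_for_indexing(
    reports: list[dict],
    url_exists: Callable[[str], bool],
    skip_duplicate_check: bool = False,
) -> list[dict]:
    """Return the reports to index, optionally dropping already-known URLs."""
    if skip_duplicate_check:
        # Fast path: no lookups at all, so duplicates may be indexed.
        return list(reports)
    return [r for r in reports if not url_exists(r["url"])]


known = {"http://example.test/a"}
incoming = [{"url": "http://example.test/a"}, {"url": "http://example.test/b"}]

print(len(select_for_indexing(incoming, known.__contains__)))        # 1
print(len(select_for_indexing(incoming, known.__contains__, True)))  # 2
```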

5. Batch Size Limits

Increased from 1,000 to 5,000 reports per request
Allows larger batches for better throughput
Elasticsearch handles large batches efficiently

6. Optimized Refresh Settings

Uses refresh=False during bulk operations
Refreshes once at the end of the bulk run, or not at all, leaving new documents to become searchable at the index's next scheduled refresh
Reduces index overhead during high-volume ingestion
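
The refresh strategy can be sketched as below. The call shapes `client.bulk(operations=..., refresh=False)` and `client.indices.refresh(index=...)` assume the elasticsearch-py 8.x client (7.x takes `body` instead of `operations`). `FakeClient` is a stand-in so the sketch runs without a cluster, and the index name is hypothetical.

```python
from types import SimpleNamespace


def bulk_then_refresh(client, index: str, chunks: list[list[dict]]) -> int:
    """Index every chunk with refresh disabled, then refresh once at the end."""
    indexed = 0
    for chunk in chunks:
        operations = []
        for doc in chunk:
            operations.append({"index": {"_index": index}})  # action line
            operations.append(doc)                            # source line
        client.bulk(operations=operations, refresh=False)
        indexed += len(chunk)
    client.indices.refresh(index=index)  # make everything searchable once
    return indexed


class FakeClient:
    """Stand-in that records calls instead of talking to Elasticsearch."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.indices = SimpleNamespace(refresh=self._refresh)

    def bulk(self, operations: list[dict], refresh: bool) -> None:
        self.calls.append(f"bulk(refresh={refresh})")

    def _refresh(self, index: str) -> None:
        self.calls.append(f"refresh({index})")


client = FakeClient()
total = bulk_then_refresh(client, "phish-reports", [[{"url": "a"}], [{"url": "b"}]])
print(total, client.calls)
```

Dropping the final `indices.refresh` call gives the "not at all" variant.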

Usage

High-Throughput Bulk Endpoint

# Single request with 5000 reports (skip duplicate checks)
curl -X POST "http://localhost:8000/report/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "reports": [...5000 reports...],
    "skip_duplicate_check": true
  }'
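
The same request can be built from Python using only the standard library. This is a sketch under the assumption that each report carries a `url` field; the helper name `build_bulk_request` is hypothetical, while the endpoint path and the `skip_duplicate_check` flag come from the curl example above.

```python
import json
import urllib.request


def build_bulk_request(
    reports: list[dict],
    base_url: str = "http://localhost:8000",
    skip_duplicate_check: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /report/bulk endpoint."""
    payload = {"reports": reports, "skip_duplicate_check": skip_duplicate_check}
    return urllib.request.Request(
        url=f"{base_url}/report/bulk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_bulk_request([{"url": "http://example.test/a"}])
print(request.get_method(), request.full_url)
# To send it against a running API instance:
#     urllib.request.urlopen(request)
```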

Benchmark Script

# Test single client throughput
python3 scripts/benchmark_throughput.py \
  --mode single \
  --total 10000 \
  --batch-size 1000

# Test multi-client throughput
python3 scripts/benchmark_throughput.py \
  --mode multi \
  --total 50000 \
  --clients 5 \
  --batch-size 1000

# Test both
python3 scripts/benchmark_throughput.py \
  --mode both \
  --total 10000

Performance Tips

1. For Maximum Throughput

Set skip_duplicate_check: true in bulk requests
Use batch sizes of 1000-5000 reports
Send from multiple clients in parallel
Ensure Elasticsearch cluster has adequate resources

2. Elasticsearch Configuration

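Suggested index settings: refresh_interval reduces refresh frequency, number_of_shards distributes load, and number_of_replicas balances performance against redundancy. number_of_shards can only be set when the index is created.

{
  "settings": {
    "refresh_interval": "30s",
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}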

3. API Configuration

Use multiple API instances behind a load balancer
Increase FastAPI worker count: uvicorn main_cloud_ready:app --workers 4
Ensure adequate CPU and memory resources

4. Network Optimization

Use HTTP/2 for better connection multiplexing where the load balancer or reverse proxy supports it (Uvicorn itself serves HTTP/1.1)
Keep connections alive (connection pooling)
Minimize request/response payload size

Expected Performance

Single Client

Baseline (old implementation): ~50-100 reports/sec
Optimized (new implementation): >1,000 reports/sec ✅
Improvement: 10-20x faster

Multi-Client (5 clients)

Baseline: ~200-500 reports/sec
Optimized: >5,000 reports/sec ✅
Improvement: 10-25x faster

Monitoring

The bulk endpoint returns throughput metrics:

{
  "status": "completed",
  "total": 5000,
  "successful": 5000,
  "failed": 0,
  "throughput_reports_per_sec": 1250.5,
  "elapsed_seconds": 3.998
}
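
A response like the one above could be assembled as in this sketch. Whether the service divides by successful or total documents is not stated here, so the use of `successful` is an assumption, as is the helper name `summarize`. The elapsed time would typically come from `time.perf_counter()` measured around the indexing call.

```python
def summarize(total: int, failed: int, elapsed_seconds: float) -> dict:
    """Build a response body with the same fields as the example above."""
    successful = total - failed
    # Guard against division by zero for empty or instantaneous runs.
    throughput = successful / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return {
        "status": "completed",
        "total": total,
        "successful": successful,
        "failed": failed,
        "throughput_reports_per_sec": round(throughput, 1),
        "elapsed_seconds": round(elapsed_seconds, 3),
    }


print(summarize(total=5000, failed=0, elapsed_seconds=4.0))
```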

Troubleshooting

Low Throughput Issues

Check Elasticsearch cluster health

curl http://localhost:9200/_cluster/health

Monitor API server resources
- CPU usage
- Memory usage
- Network I/O

Check Elasticsearch indexing rate

curl "http://localhost:9200/_cat/indices/phish-*?v&h=index,docs.count,indexing.index_total,indexing.index_time&s=indexing.index_time:desc"

Verify batch sizes
- Too small: More overhead
- Too large: Memory issues
- Optimal: 1000-5000 per batch

Common Bottlenecks

Network latency: Use local Elasticsearch or low-latency connection
CPU-bound operations: Risk score calculation (use rule-based scoring for speed)
Elasticsearch refresh: Too frequent refreshes slow down indexing
Memory: Insufficient heap size in Elasticsearch

Design Choices Supporting Throughput

1. 3-Node Cluster

Write load distributed across multiple nodes
Parallel indexing on different shards
Horizontal scalability

2. Hash-Based Sharding

Documents distributed evenly across primary shards
Prevents bottlenecks on single shard
Enables parallel processing
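
The idea behind hash-based routing can be illustrated as follows. `zlib.crc32` is a stand-in chosen only to keep the sketch dependency-free; Elasticsearch uses its own Murmur3-based routing function, so the shard numbers printed here will not match a real cluster. The `report-N` IDs are hypothetical.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 3  # same value as number_of_shards in the settings above


def shard_for(routing: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Map a routing value (by default the document ID) to a shard number.

    zlib.crc32 is an illustrative stand-in, not Elasticsearch's actual hash.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


counts = Counter(shard_for(f"report-{i}") for i in range(9000))
print(sorted(counts.items()))  # per-shard document counts
```

Because the shard is a pure function of the routing value, the same document always lands on the same shard, and different documents spread across all primaries.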

3. Asynchronous FastAPI

Non-blocking I/O operations
Handles concurrent requests efficiently
Thread pool for CPU-bound tasks

4. Bulk API

Single network round-trip for multiple documents
Reduced overhead per document
Elasticsearch optimizes bulk operations

Testing Results

Run the benchmark script to verify performance:

python3 scripts/benchmark_throughput.py --mode both --total 20000

Expected output:

Single client: >1000 reports/sec ✅
Multi-client (5): >5000 reports/sec ✅
