Skip to content

Latest commit

 

History

History
277 lines (206 loc) · 7.87 KB

File metadata and controls

277 lines (206 loc) · 7.87 KB

Horizontal Scalability Measurement Guide

Target Metric

Performance gain with added nodes: ~50% throughput increase with 50% node increase

Definition

Horizontal scalability measures how well the system performs when additional nodes are added to the cluster. Ideally, adding 50% more nodes should result in approximately 50% more throughput (linear scaling).

How to Measure

Method 1: Using the Measurement Script

# Basic measurement (uses current cluster configuration)
python3 scripts/measure_horizontal_scalability.py

# With custom test size
python3 scripts/measure_horizontal_scalability.py --test-reports 20000

# Save to specific file
python3 scripts/measure_horizontal_scalability.py --output scalability_results.json

Method 2: Manual Testing with Different Node Configurations

Since you're using Elastic Cloud, you can:

  1. Provision a baseline cluster (e.g., 2 nodes)
  2. Measure baseline throughput
  3. Scale up to 3 nodes (50% increase)
  4. Measure throughput again
  5. Calculate scalability efficiency

Step-by-Step Process:

# Step 1: Get baseline cluster info
curl "http://localhost:8000/cluster/health" | jq '.nodes'

# Step 2: Measure baseline throughput
python3 scripts/benchmark_throughput.py \
  --mode multi \
  --total 50000 \
  --clients 5 \
  --batch-size 1000

# Step 3: Scale up cluster in ES Cloud console
# (Add 50% more nodes: 2 → 3 nodes)

# Step 4: Wait for cluster to stabilize
curl "http://localhost:8000/cluster/health" | jq '.status'
# Wait until status is "green"

# Step 5: Measure throughput with new node count
python3 scripts/benchmark_throughput.py \
  --mode multi \
  --total 50000 \
  --clients 5 \
  --batch-size 1000

# Step 6: Compare results
# Calculate: (new_throughput - old_throughput) / old_throughput * 100

Expected Results

Ideal Scenario (Linear Scaling)

Nodes Throughput (reports/sec) Node Increase Throughput Increase
2 5,000 Baseline Baseline
3 7,500 +50% +50% ✅
4 10,000 +100% +100% ✅

Scalability Efficiency

Formula:

Scalability Efficiency = (Throughput Increase % / Node Increase %) × 100

Target: ~100% (linear scaling)

Example:

  • Node increase: 50% (2 → 3 nodes)
  • Throughput increase: 50% (5,000 → 7,500 reports/sec)
  • Efficiency: (50% / 50%) × 100 = 100%

Design Choices Supporting Horizontal Scalability

1. Hash-Based Sharding

  • How it works: Documents distributed across primary shards using hash of document ID
  • Benefit: Even load distribution across nodes
  • Result: Adding nodes allows more shards to be allocated, increasing parallel processing

2. Distributed Indexing

  • How it works: Each node can handle indexing requests independently
  • Benefit: Parallel indexing across multiple nodes
  • Result: Throughput scales with number of nodes

3. Regional Index Distribution

  • How it works: Separate indices per region (phish-us, phish-eu, phish-asia)
  • Benefit: Load distributed across multiple indices
  • Result: Better utilization of cluster resources

4. Bulk API Optimization

  • How it works: Batch multiple documents in single request
  • Benefit: Reduced network overhead, better throughput
  • Result: Higher efficiency per node, better scaling

5. Asynchronous Processing

  • How it works: FastAPI async endpoints, parallel document preparation
  • Benefit: Non-blocking I/O, better resource utilization
  • Result: Can handle more concurrent requests per node

Measurement Script Output

The script provides:

  1. Cluster Information:

    • Total nodes
    • Data nodes
    • Active shards
    • Cluster status
  2. Throughput Metrics:

    • Single client throughput
    • Multi-client throughput
    • Success/failure rates
  3. Scalability Analysis:

    • Node increase percentage
    • Throughput increase percentage
    • Scalability efficiency

Example Output

================================================================================
HORIZONTAL SCALABILITY DEMONSTRATION
================================================================================
Target: ~50% throughput increase with 50% node increase
================================================================================

Current Cluster Configuration:
  Total Nodes: 3
  Data Nodes: 2
  Cluster Status: green

TESTING CONFIGURATION: 3-node-cluster
================================================================================

[1] Single Client Throughput Test
  Throughput: 1,753.45 reports/sec
  Successful: 10,000
  Time: 5.71 seconds

[2] Multi-Client Throughput Test (5 clients)
  Throughput: 10,105.23 reports/sec
  Successful: 50,000
  Time: 4.95 seconds

================================================================================
SCALABILITY ANALYSIS
================================================================================

2-node-cluster → 3-node-cluster:
  Node Increase: 50.0% (2 → 3 nodes)
  Throughput Increase: 48.5% (6,800 → 10,105 reports/sec)
  Scalability Efficiency: 97.0% (target: ~100%)
  Performance Gain: 48.5% throughput ↑ with 50.0% node ↑
  Status: PASS (near-linear scaling)

Limitations and Considerations

1. ES Cloud Node Provisioning

  • Cannot dynamically add/remove nodes in ES Cloud
  • Need to provision different cluster sizes
  • May require cluster recreation or scaling plan

2. Network Overhead

  • More nodes = more network communication
  • Replication overhead increases
  • May see diminishing returns beyond certain node count

3. Shard Distribution

  • Optimal shard count depends on node count
  • Too many shards can reduce performance
  • Too few shards can limit parallelism

4. Resource Constraints

  • Each node needs CPU, memory, disk
  • API server may become bottleneck
  • Network bandwidth limitations

Best Practices

  1. Test with Realistic Load:

    • Use production-like data volumes
    • Test with multiple concurrent clients
    • Run tests for sufficient duration
  2. Monitor Resource Usage:

    • CPU utilization per node
    • Memory usage
    • Network I/O
    • Disk I/O
  3. Baseline Measurements:

    • Measure baseline before scaling
    • Ensure cluster is healthy (green status)
    • Wait for cluster to stabilize after scaling
  4. Multiple Test Runs:

    • Run tests multiple times
    • Calculate average throughput
    • Account for variance

Troubleshooting

Low Scalability Efficiency (< 80%)

Possible Causes:

  • API server bottleneck
  • Network bandwidth limitations
  • Inefficient shard distribution
  • Resource constraints (CPU/memory)

Solutions:

  • Scale API servers as well
  • Optimize shard allocation
  • Check network bandwidth
  • Monitor resource usage

Throughput Not Increasing

Possible Causes:

  • Single bottleneck (API server, network)
  • Inefficient query patterns
  • Resource constraints

Solutions:

  • Add API server instances
  • Optimize queries
  • Check cluster health
  • Review shard allocation

Verification

After running the measurement:

  1. Check Results File:

    cat scalability_results_*.json | jq '.scalability_analysis'
  2. Verify Efficiency:

    • Efficiency should be ~80-100% for good scaling
    • Lower efficiency indicates bottlenecks
  3. Compare Configurations:

    • Review throughput per node
    • Check if throughput/node is consistent

Summary

Horizontal scalability demonstrates that the distributed system can handle increased load by adding more nodes. The target is approximately linear scaling: 50% more nodes → 50% more throughput.

This is achieved through:

  • Hash-based sharding for even distribution
  • Parallel processing across nodes
  • Optimized bulk operations
  • Efficient resource utilization