Horizontal Scalability Measurement Guide

Target Metric

Performance gain with added nodes: ~50% throughput increase with 50% node increase

Definition

Horizontal scalability measures how well the system performs when additional nodes are added to the cluster. Ideally, adding 50% more nodes should result in approximately 50% more throughput (linear scaling).

How to Measure

Method 1: Using the Measurement Script

# Basic measurement (uses current cluster configuration)
python3 scripts/measure_horizontal_scalability.py

# With custom test size
python3 scripts/measure_horizontal_scalability.py --test-reports 20000

# Save to specific file
python3 scripts/measure_horizontal_scalability.py --output scalability_results.json

Method 2: Manual Testing with Different Node Configurations

Since you're using Elastic Cloud, you can:

Provision a baseline cluster (e.g., 2 nodes)
Measure baseline throughput
Scale up to 3 nodes (50% increase)
Measure throughput again
Calculate scalability efficiency

Step-by-Step Process:

# Step 1: Get baseline cluster info
curl "http://localhost:8000/cluster/health" | jq '.nodes'

# Step 2: Measure baseline throughput
python3 scripts/benchmark_throughput.py \
  --mode multi \
  --total 50000 \
  --clients 5 \
  --batch-size 1000

# Step 3: Scale up cluster in ES Cloud console
# (Add 50% more nodes: 2 → 3 nodes)

# Step 4: Wait for cluster to stabilize
curl "http://localhost:8000/cluster/health" | jq '.status'
# Wait until status is "green"

# Step 5: Measure throughput with new node count
python3 scripts/benchmark_throughput.py \
  --mode multi \
  --total 50000 \
  --clients 5 \
  --batch-size 1000

# Step 6: Compare results
# Calculate: (new_throughput - old_throughput) / old_throughput * 100

Expected Results

Ideal Scenario (Linear Scaling)

Nodes	Throughput (reports/sec)	Node Increase	Throughput Increase
2	5,000	Baseline	Baseline
3	7,500	+50%	+50% ✅
4	10,000	+100%	+100% ✅

Scalability Efficiency

Formula:

Scalability Efficiency = (Throughput Increase % / Node Increase %) × 100

Target: ~100% (linear scaling)

Example:

Node increase: 50% (2 → 3 nodes)
Throughput increase: 50% (5,000 → 7,500 reports/sec)
Efficiency: (50% / 50%) × 100 = 100% ✅

Design Choices Supporting Horizontal Scalability

1. Hash-Based Sharding

How it works: Documents distributed across primary shards using hash of document ID
Benefit: Even load distribution across nodes
Result: Adding nodes allows more shards to be allocated, increasing parallel processing

2. Distributed Indexing

How it works: Each node can handle indexing requests independently
Benefit: Parallel indexing across multiple nodes
Result: Throughput scales with number of nodes

3. Regional Index Distribution

How it works: Separate indices per region (phish-us, phish-eu, phish-asia)
Benefit: Load distributed across multiple indices
Result: Better utilization of cluster resources

4. Bulk API Optimization

How it works: Batch multiple documents in single request
Benefit: Reduced network overhead, better throughput
Result: Higher efficiency per node, better scaling

5. Asynchronous Processing

How it works: FastAPI async endpoints, parallel document preparation
Benefit: Non-blocking I/O, better resource utilization
Result: Can handle more concurrent requests per node

Measurement Script Output

The script provides:

Cluster Information:
- Total nodes
- Data nodes
- Active shards
- Cluster status
Throughput Metrics:
- Single client throughput
- Multi-client throughput
- Success/failure rates
Scalability Analysis:
- Node increase percentage
- Throughput increase percentage
- Scalability efficiency

Example Output

================================================================================
HORIZONTAL SCALABILITY DEMONSTRATION
================================================================================
Target: ~50% throughput increase with 50% node increase
================================================================================

Current Cluster Configuration:
  Total Nodes: 3
  Data Nodes: 2
  Cluster Status: green

TESTING CONFIGURATION: 3-node-cluster
================================================================================

[1] Single Client Throughput Test
  Throughput: 1,753.45 reports/sec
  Successful: 10,000
  Time: 5.71 seconds

[2] Multi-Client Throughput Test (5 clients)
  Throughput: 10,105.23 reports/sec
  Successful: 50,000
  Time: 4.95 seconds

================================================================================
SCALABILITY ANALYSIS
================================================================================

2-node-cluster → 3-node-cluster:
  Node Increase: 50.0% (2 → 3 nodes)
  Throughput Increase: 48.5% (6,800 → 10,105 reports/sec)
  Scalability Efficiency: 97.0% (target: ~100%)
  Performance Gain: 48.5% throughput ↑ with 50.0% node ↑
  Status: PASS (near-linear scaling)

Limitations and Considerations

1. ES Cloud Node Provisioning

Cannot dynamically add/remove nodes in ES Cloud
Need to provision different cluster sizes
May require cluster recreation or scaling plan

2. Network Overhead

More nodes = more network communication
Replication overhead increases
May see diminishing returns beyond certain node count

3. Shard Distribution

Optimal shard count depends on node count
Too many shards can reduce performance
Too few shards can limit parallelism

4. Resource Constraints

Each node needs CPU, memory, disk
API server may become bottleneck
Network bandwidth limitations

Best Practices

Test with Realistic Load:
- Use production-like data volumes
- Test with multiple concurrent clients
- Run tests for sufficient duration
Monitor Resource Usage:
- CPU utilization per node
- Memory usage
- Network I/O
- Disk I/O
Baseline Measurements:
- Measure baseline before scaling
- Ensure cluster is healthy (green status)
- Wait for cluster to stabilize after scaling
Multiple Test Runs:
- Run tests multiple times
- Calculate average throughput
- Account for variance

Troubleshooting

Low Scalability Efficiency (< 80%)

Possible Causes:

API server bottleneck
Network bandwidth limitations
Inefficient shard distribution
Resource constraints (CPU/memory)

Solutions:

Scale API servers as well
Optimize shard allocation
Check network bandwidth
Monitor resource usage

Throughput Not Increasing

Possible Causes:

Single bottleneck (API server, network)
Inefficient query patterns
Resource constraints

Solutions:

Add API server instances
Optimize queries
Check cluster health
Review shard allocation

Verification

After running the measurement:

Check Results File:

cat scalability_results_*.json | jq '.scalability_analysis'

Verify Efficiency:
- Efficiency should be ~80-100% for good scaling
- Lower efficiency indicates bottlenecks
Compare Configurations:
- Review throughput per node
- Check if throughput/node is consistent

Summary

Horizontal scalability demonstrates that the distributed system can handle increased load by adding more nodes. The target is approximately linear scaling: 50% more nodes → 50% more throughput.

This is achieved through:

Hash-based sharding for even distribution
Parallel processing across nodes
Optimized bulk operations
Efficient resource utilization

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Horizontal Scalability Measurement Guide

Target Metric

Definition

How to Measure

Method 1: Using the Measurement Script

Method 2: Manual Testing with Different Node Configurations

Step-by-Step Process:

Expected Results

Ideal Scenario (Linear Scaling)

Scalability Efficiency

Design Choices Supporting Horizontal Scalability

1. Hash-Based Sharding

2. Distributed Indexing

3. Regional Index Distribution

4. Bulk API Optimization

5. Asynchronous Processing

Measurement Script Output

Example Output

Limitations and Considerations

1. ES Cloud Node Provisioning

2. Network Overhead

3. Shard Distribution

4. Resource Constraints

Best Practices

Troubleshooting

Low Scalability Efficiency (< 80%)

Throughput Not Increasing

Verification

Summary

FilesExpand file tree

HORIZONTAL_SCALABILITY_GUIDE.md

Latest commit

History

HORIZONTAL_SCALABILITY_GUIDE.md

File metadata and controls

Horizontal Scalability Measurement Guide

Target Metric

Definition

How to Measure

Method 1: Using the Measurement Script

Method 2: Manual Testing with Different Node Configurations

Step-by-Step Process:

Expected Results

Ideal Scenario (Linear Scaling)

Scalability Efficiency

Design Choices Supporting Horizontal Scalability

1. Hash-Based Sharding

2. Distributed Indexing

3. Regional Index Distribution

4. Bulk API Optimization

5. Asynchronous Processing

Measurement Script Output

Example Output

Limitations and Considerations

1. ES Cloud Node Provisioning

2. Network Overhead

3. Shard Distribution

4. Resource Constraints

Best Practices

Troubleshooting

Low Scalability Efficiency (< 80%)

Throughput Not Increasing

Verification

Summary