Performance gain with added nodes: ~50% throughput increase with 50% node increase
Horizontal scalability measures how well the system performs when additional nodes are added to the cluster. Ideally, adding 50% more nodes should result in approximately 50% more throughput (linear scaling).
# Basic measurement (uses current cluster configuration)
python3 scripts/measure_horizontal_scalability.py
# With custom test size
python3 scripts/measure_horizontal_scalability.py --test-reports 20000
# Save to specific file
python3 scripts/measure_horizontal_scalability.py --output scalability_results.jsonSince you're using Elastic Cloud, you can:
- Provision a baseline cluster (e.g., 2 nodes)
- Measure baseline throughput
- Scale up to 3 nodes (50% increase)
- Measure throughput again
- Calculate scalability efficiency
# Step 1: Get baseline cluster info
curl "http://localhost:8000/cluster/health" | jq '.nodes'
# Step 2: Measure baseline throughput
python3 scripts/benchmark_throughput.py \
--mode multi \
--total 50000 \
--clients 5 \
--batch-size 1000
# Step 3: Scale up cluster in ES Cloud console
# (Add 50% more nodes: 2 → 3 nodes)
# Step 4: Wait for cluster to stabilize
curl "http://localhost:8000/cluster/health" | jq '.status'
# Wait until status is "green"
# Step 5: Measure throughput with new node count
python3 scripts/benchmark_throughput.py \
--mode multi \
--total 50000 \
--clients 5 \
--batch-size 1000
# Step 6: Compare results
# Calculate: (new_throughput - old_throughput) / old_throughput * 100| Nodes | Throughput (reports/sec) | Node Increase | Throughput Increase |
|---|---|---|---|
| 2 | 5,000 | Baseline | Baseline |
| 3 | 7,500 | +50% | +50% ✅ |
| 4 | 10,000 | +100% | +100% ✅ |
Formula:
Scalability Efficiency = (Throughput Increase % / Node Increase %) × 100
Target: ~100% (linear scaling)
Example:
- Node increase: 50% (2 → 3 nodes)
- Throughput increase: 50% (5,000 → 7,500 reports/sec)
- Efficiency: (50% / 50%) × 100 = 100% ✅
- How it works: Documents distributed across primary shards using hash of document ID
- Benefit: Even load distribution across nodes
- Result: Adding nodes allows more shards to be allocated, increasing parallel processing
- How it works: Each node can handle indexing requests independently
- Benefit: Parallel indexing across multiple nodes
- Result: Throughput scales with number of nodes
- How it works: Separate indices per region (
phish-us,phish-eu,phish-asia) - Benefit: Load distributed across multiple indices
- Result: Better utilization of cluster resources
- How it works: Batch multiple documents in single request
- Benefit: Reduced network overhead, better throughput
- Result: Higher efficiency per node, better scaling
- How it works: FastAPI async endpoints, parallel document preparation
- Benefit: Non-blocking I/O, better resource utilization
- Result: Can handle more concurrent requests per node
The script provides:
-
Cluster Information:
- Total nodes
- Data nodes
- Active shards
- Cluster status
-
Throughput Metrics:
- Single client throughput
- Multi-client throughput
- Success/failure rates
-
Scalability Analysis:
- Node increase percentage
- Throughput increase percentage
- Scalability efficiency
================================================================================
HORIZONTAL SCALABILITY DEMONSTRATION
================================================================================
Target: ~50% throughput increase with 50% node increase
================================================================================
Current Cluster Configuration:
Total Nodes: 3
Data Nodes: 2
Cluster Status: green
TESTING CONFIGURATION: 3-node-cluster
================================================================================
[1] Single Client Throughput Test
Throughput: 1,753.45 reports/sec
Successful: 10,000
Time: 5.71 seconds
[2] Multi-Client Throughput Test (5 clients)
Throughput: 10,105.23 reports/sec
Successful: 50,000
Time: 4.95 seconds
================================================================================
SCALABILITY ANALYSIS
================================================================================
2-node-cluster → 3-node-cluster:
Node Increase: 50.0% (2 → 3 nodes)
Throughput Increase: 48.5% (6,800 → 10,105 reports/sec)
Scalability Efficiency: 97.0% (target: ~100%)
Performance Gain: 48.5% throughput ↑ with 50.0% node ↑
Status: PASS (near-linear scaling)
- Cannot dynamically add/remove nodes in ES Cloud
- Need to provision different cluster sizes
- May require cluster recreation or scaling plan
- More nodes = more network communication
- Replication overhead increases
- May see diminishing returns beyond certain node count
- Optimal shard count depends on node count
- Too many shards can reduce performance
- Too few shards can limit parallelism
- Each node needs CPU, memory, disk
- API server may become bottleneck
- Network bandwidth limitations
-
Test with Realistic Load:
- Use production-like data volumes
- Test with multiple concurrent clients
- Run tests for sufficient duration
-
Monitor Resource Usage:
- CPU utilization per node
- Memory usage
- Network I/O
- Disk I/O
-
Baseline Measurements:
- Measure baseline before scaling
- Ensure cluster is healthy (green status)
- Wait for cluster to stabilize after scaling
-
Multiple Test Runs:
- Run tests multiple times
- Calculate average throughput
- Account for variance
Possible Causes:
- API server bottleneck
- Network bandwidth limitations
- Inefficient shard distribution
- Resource constraints (CPU/memory)
Solutions:
- Scale API servers as well
- Optimize shard allocation
- Check network bandwidth
- Monitor resource usage
Possible Causes:
- Single bottleneck (API server, network)
- Inefficient query patterns
- Resource constraints
Solutions:
- Add API server instances
- Optimize queries
- Check cluster health
- Review shard allocation
After running the measurement:
-
Check Results File:
cat scalability_results_*.json | jq '.scalability_analysis'
-
Verify Efficiency:
- Efficiency should be ~80-100% for good scaling
- Lower efficiency indicates bottlenecks
-
Compare Configurations:
- Review throughput per node
- Check if throughput/node is consistent
Horizontal scalability demonstrates that the distributed system can handle increased load by adding more nodes. The target is approximately linear scaling: 50% more nodes → 50% more throughput.
This is achieved through:
- Hash-based sharding for even distribution
- Parallel processing across nodes
- Optimized bulk operations
- Efficient resource utilization