This guide explains how to interpret the results from the InferMesh discrete-event simulator and understand the performance characteristics of different routing strategies.
When you run a simulation, the results are organized as follows:
```
results/
├── config.yaml            # Copy of simulation configuration
├── comparison.csv         # Summary comparison of all strategies
├── baseline_rr.json       # Detailed metrics for baseline round-robin
├── heuristic.json         # Detailed metrics for heuristic strategy
├── mesh.json              # Detailed metrics for mesh strategy
├── mesh_hedge.json        # Detailed metrics for mesh with hedging
├── adaptive_mesh.json     # Detailed metrics for adaptive mesh
├── predictive_mesh.json   # Detailed metrics for predictive mesh
├── hybrid_mesh.json       # Detailed metrics for hybrid mesh
└── ml_enhanced_mesh.json  # Detailed metrics for ML-enhanced mesh
```
Latency metrics:
- p50_latency: 50th percentile (median) latency, in milliseconds
- p95_latency: 95th percentile latency - most requests complete within this time
- p99_latency: 99th percentile latency - captures tail latency behavior
- avg_latency: Average latency across all requests
Interpretation:
- Lower values are better
- P95/P99 are more important than average for user experience
- Large gaps between P95 and P99 indicate inconsistent performance
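To spot inconsistent tails programmatically, one option is the P99/P95 ratio; this is a minimal sketch, assuming the per-strategy JSON layout shown later in this guide:

```python
def tail_consistency(latency):
    """P99/P95 ratio; values well above ~1.5 suggest inconsistent tail behavior."""
    return latency["p99"] / latency["p95"]

# Using the hybrid_mesh figures from the comparison table below:
print(round(tail_consistency({"p95": 183.0, "p99": 218.0}), 2))  # 1.19
```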
Utilization metrics:
- utilization_std_dev: Standard deviation of utilization across nodes (lower = better load balancing)
- avg_utilization: Average GPU utilization across all nodes
- max_utilization: Peak utilization observed on any node
Interpretation:
- Lower utilization variance indicates better load balancing
- Higher average utilization means better resource efficiency
- Max utilization shows if any nodes are overloaded
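Load balance can also be summarized as the coefficient of variation (standard deviation relative to the mean); a minimal sketch over the `utilization` block of a strategy's JSON:

```python
def imbalance(util):
    """Coefficient of variation of node utilization: std_dev relative to the mean.
    Lower values mean load is spread more evenly across nodes."""
    return util["std_dev"] / util["avg"] if util["avg"] else float("inf")

print(round(imbalance({"avg": 0.5, "std_dev": 0.05}), 2))  # 0.1
```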
Cost metrics:
- cost_per_1k_tokens: Cost per 1,000 tokens processed (in dollars)
- total_cost: Total simulation cost
- total_gpu_hours: Total GPU compute hours consumed
Interpretation:
- Lower cost per 1k tokens is better for efficiency
- Reflects both latency and resource utilization
- Important for production cost planning
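As a sketch of how this metric relates to the totals (the `total_tokens` argument here is illustrative; the simulator reports only the derived figure):

```python
def cost_per_1k_tokens(total_cost, total_tokens):
    """Dollars spent per 1,000 tokens processed."""
    return total_cost / (total_tokens / 1000.0)

# e.g. $10 spent over one million tokens:
print(cost_per_1k_tokens(10.0, 1_000_000))  # 0.01
```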
Hedging metrics:
- hedge_win_rate: Fraction (0-1) of hedge requests that completed before the primary request
- total_hedges: Total number of hedge requests sent
- hedge_wins: Number of successful hedge completions
- hedge_timeouts: Hedge requests that timed out
- hedge_cancellations: Hedge requests that were cancelled
Interpretation:
- Higher win rate indicates effective hedging
- Balance between latency improvement and resource waste
- Only applies to the `mesh_hedge` strategy
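The win rate is simply wins over total hedges; a quick sanity check, using the counts from the JSON example later in this guide:

```python
def hedge_win_rate(hedge):
    """Fraction of hedge requests that finished before the primary request."""
    return hedge["hedge_wins"] / hedge["total_hedges"] if hedge["total_hedges"] else 0.0

rate = hedge_win_rate({"total_hedges": 1500, "hedge_wins": 267})
print(round(rate, 3))  # 0.178
```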
Throughput metrics:
- total_requests: Total number of requests processed
- requests_per_second: Average request processing rate
- tokens_per_second: Average token processing rate
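A derived figure worth sanity-checking is average tokens per request, which should roughly match the configured workload; a minimal sketch:

```python
def tokens_per_request(throughput):
    """Average tokens processed per request, derived from the two rate metrics."""
    return throughput["tokens_per_second"] / throughput["requests_per_second"]

print(tokens_per_request({"tokens_per_second": 125000.0, "requests_per_second": 50.0}))  # 2500.0
```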
| Strategy | P95 Latency | P99 Latency | Cost/1K Tokens | Utilization Dev | Performance Grade |
|---|---|---|---|---|---|
| hybrid_mesh | 183ms | 218ms | $0.00032 | 0.0044 | A+ |
| predictive_mesh | 287ms | 315ms | $0.00066 | 0.0004 | A |
| baseline_rr | 384ms | 639ms | $0.00055 | 0.0079 | B+ |
| heuristic | 441ms | 2877ms | $0.00113 | 0.0259 | B |
| adaptive_mesh | 491ms | 1373ms | $0.00106 | 0.0222 | B |
| mesh_hedge | 551ms | 2563ms | $0.00092 | 0.0256 | B- |
| mesh | 663ms | 2365ms | $0.00093 | 0.0251 | C+ |
| ml_enhanced_mesh | 1894ms | 3605ms | $0.00405 | 0.0849 | D |
hybrid_mesh (A+):
- Strengths: Lowest latency, lowest cost, excellent consistency
- Use Case: Production deployments requiring optimal performance
- Trade-offs: Moderate complexity, good balance of all factors
predictive_mesh (A):
- Strengths: Very low latency, excellent consistency, low utilization variance
- Use Case: Workloads with predictable patterns, cost-sensitive deployments
- Trade-offs: Slightly higher cost than hybrid_mesh
baseline_rr (B+):
- Strengths: Simple, predictable, good cost efficiency
- Use Case: Simple deployments, when complexity is a concern
- Trade-offs: Higher latency than advanced strategies
ml_enhanced_mesh (D):
- Strengths: Advanced ML features, continuous learning
- Use Case: Research, long-running deployments where learning pays off
- Trade-offs: Significant computational overhead impacts latency
Each strategy's JSON file contains detailed metrics:

```json
{
  "latency": {
    "p50": 150.5,
    "p95": 183.2,
    "p99": 218.7,
    "avg": 145.8
  },
  "utilization": {
    "avg": 0.004360958440161376,
    "std_dev": 0.004360958440161376,
    "max": 0.15
  },
  "cost": {
    "total_cost": 0.12345,
    "cost_per_1k_tokens": 0.0003208122371509796,
    "total_gpu_hours": 42.67
  },
  "throughput": {
    "total_requests": 15000,
    "requests_per_second": 50.0,
    "tokens_per_second": 125000.0
  },
  "hedge_metrics": {
    "total_hedges": 1500,
    "hedge_wins": 267,
    "hedge_win_rate": 0.17767634140039618,
    "hedge_timeouts": 45,
    "hedge_cancellations": 1188
  }
}
```

```bash
# Look at the comparison CSV for an overview
cat results/comparison.csv | column -t -s,

# Examine top performers in detail
cat results/hybrid_mesh.json | jq '.latency'
cat results/predictive_mesh.json | jq '.cost'
```

```bash
# Compare across different scales
cargo run -p mesh-sim -- run --config small.yaml --output small_results/
cargo run -p mesh-sim -- run --config medium.yaml --output medium_results/
cargo run -p mesh-sim -- run --config large.yaml --output large_results/

# Compare results
diff small_results/comparison.csv medium_results/comparison.csv
```

For Production Use:
- Start with `hybrid_mesh` - best overall performance
- Test `predictive_mesh` if you have predictable workloads
- Use `baseline_rr` for simple, reliable deployments
For Research/Development:
- Compare all strategies to understand trade-offs
- Focus on `ml_enhanced_mesh` for long-term learning scenarios
- Experiment with different scales and workload patterns
For Cost Optimization:
- Compare `cost_per_1k_tokens` across strategies
- Consider utilization efficiency vs. latency trade-offs
- Factor in operational complexity costs
Common latency drivers:
- Queue buildup: Check utilization variance
- Poor load balancing: Look for high std_dev in utilization
- Computational overhead: ML strategies may have high decision costs
- Network penalties: Inter-cell routing overhead
Efficiency drivers:
- Lower utilization variance → Better resource efficiency
- Faster routing decisions → Lower computational overhead
- Fewer hedge requests → Less wasted work
- Better load prediction → More efficient resource allocation
Scale effects:
- Small scale (512 nodes): Simple strategies often competitive
- Medium scale (8K nodes): Advanced strategies show benefits
- Large scale (131K nodes): Network-aware routing becomes critical
If results look unexpected:
- Check configuration: Verify workload parameters match expectations
- Examine utilization: High variance indicates load balancing issues
- Review hedge metrics: Excessive hedging may indicate routing problems
- Compare scales: Some strategies perform differently at different scales
Common symptoms:
- High P99 latency: Look for queue buildup or poor load balancing
- High cost: Check for inefficient resource utilization
- Low throughput: Examine bottlenecks in routing or processing
- Inconsistent results: Verify RNG seed for reproducibility
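The checks above can be automated across a results directory; a minimal sketch with illustrative thresholds (the cutoff values are assumptions to tune per workload):

```python
import glob
import json

def flag_issues(results_dir, p99_ratio=2.0, util_std=0.02):
    """Scan every strategy JSON for the symptoms listed above.

    Flags inconsistent tail latency (p99 far above p95, a sign of queue
    buildup) and uneven load balancing (high utilization std_dev).
    """
    flags = {}
    for path in glob.glob(f"{results_dir}/*.json"):
        with open(path) as f:
            m = json.load(f)
        issues = []
        if m["latency"]["p99"] > p99_ratio * m["latency"]["p95"]:
            issues.append("inconsistent tail latency (possible queue buildup)")
        if m["utilization"]["std_dev"] > util_std:
            issues.append("uneven load balancing")
        if issues:
            flags[path] = issues
    return flags
```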
```bash
# Extract specific metrics across all strategies
for file in results/*.json; do
    strategy=$(basename "$file" .json)
    p95=$(jq -r '.latency.p95' "$file")
    cost=$(jq -r '.cost.cost_per_1k_tokens' "$file")
    echo "$strategy,$p95,$cost"
done
```

```python
import json
import pandas as pd

# Load all strategy results
strategies = {}
for strategy in ['hybrid_mesh', 'predictive_mesh', 'baseline_rr']:
    with open(f'results/{strategy}.json') as f:
        strategies[strategy] = json.load(f)

# Create comparison DataFrame (one row per strategy)
df = pd.DataFrame({
    strategy: {
        'p95_latency': data['latency']['p95'],
        'cost_per_1k': data['cost']['cost_per_1k_tokens'],
        'utilization': data['utilization']['avg']
    }
    for strategy, data in strategies.items()
}).T
print(df)
```

This analysis framework helps you understand routing strategy performance and make informed decisions for your InferMesh deployment.