Both evaluators now provide comprehensive performance metrics, making it easy to compare baseline performance against concurrent execution benefits.
| Feature | Sequential | Concurrent |
|---|---|---|
| Execution | One request at a time | Multiple parallel requests |
| Throughput | 1-3 prompts/sec (typical) | 5-20x sequential |
| Use Case | Baseline performance, small datasets | Load testing, large datasets |
| Output Suffix | _results | _concurrent_results |
| Extra Metrics | - | Queue time, worker distribution |

Use the sequential evaluator when you are:
- Establishing baseline API performance
- Working with small datasets (< 100 prompts)
- Testing scanner accuracy without load concerns
- Avoiding the risk of triggering auto-scaling
- Looking for simple, straightforward execution

Use the concurrent evaluator when you are:
- Load testing API capacity
- Processing large datasets efficiently (> 100 prompts)
- Simulating multi-user scenarios
- Studying API behavior under concurrent load
- Trying to complete an evaluation quickly
Both evaluators report:
- Throughput (prompts/sec)
- Request Latency (avg, min, max, p50, p95, p99)
- Total Elapsed Time
- Prompt Size Statistics
Concurrent evaluator adds:
- Queue Time: Time waiting for available worker
- Worker Distribution: How work is spread across threads
- Configuration Details: Worker count, rate limits
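
As a rough illustration of how the shared metrics could be derived from raw timings, the sketch below summarizes a list of per-request latencies. The names `percentile` and `summarize_latencies`, the sample values, and the linear-interpolation percentile method are assumptions made for this example; the evaluators may compute their percentiles differently.

```python
import statistics


def percentile(sorted_values, pct):
    """Linear-interpolation percentile over an already sorted list."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize_latencies(latencies_ms, elapsed_seconds):
    """Return throughput plus avg/min/max/p50/p95/p99 latency."""
    ordered = sorted(latencies_ms)
    return {
        "throughput": len(ordered) / elapsed_seconds,  # prompts/sec
        "avg": statistics.mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


# Hypothetical latencies in milliseconds, not taken from a real run.
summary = summarize_latencies([320.0, 410.0, 450.0, 520.0, 660.0], elapsed_seconds=2.5)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With only a handful of samples, p95 and p99 sit very close to the maximum, so percentile figures from small datasets should be read with caution.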
Performance Metrics:
===================
Throughput:
-----------
Overall: 2.11 prompts/sec
Total Prompts: 5
Total Elapsed Time: 2.37 seconds
Request Latency (API call time):
--------------------------------
Average: 472.01 ms
Minimum: 324.34 ms
Maximum: 659.15 ms
p50 (median): 410.45 ms
p95: 648.98 ms
p99: 657.12 ms
Total Requests: 5/5 valid responses
Concurrent Performance Metrics:
==============================
Configuration:
--------------
Concurrent Workers: 10
Rate Limit: None requests/sec
Total Prompts: 100
Total Elapsed Time: 15.32 seconds
Throughput:
-----------
Overall: 6.53 prompts/sec
Request Latency (API call time):
--------------------------------
Average: 245.12 ms
Minimum: 89.34 ms
Maximum: 1523.45 ms
p50 (median): 221.67 ms
p95: 456.78 ms
p99: 1234.56 ms
Queue Time (waiting for worker):
--------------------------------
Average: 12.34 ms
Maximum: 45.67 ms
Total Time (queue + request):
-----------------------------
Average: 257.46 ms
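
To make the relationship between queue time, request time, and worker distribution concrete, here is a minimal sketch built on Python's standard `ThreadPoolExecutor`. It is not the evaluator's actual implementation: `timed_call`, `run_concurrently`, and `fake_request` are hypothetical names, and the sleep stands in for a real API call.

```python
import threading
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def timed_call(task, submitted_at):
    """Run task, recording queue time, request time, and the worker thread."""
    started_at = time.perf_counter()
    task()
    finished_at = time.perf_counter()
    return {
        "queue_ms": (started_at - submitted_at) * 1000,   # waiting for a worker
        "request_ms": (finished_at - started_at) * 1000,  # time inside the call
        "worker": threading.current_thread().name,
    }


def run_concurrently(tasks, workers):
    """Submit every task to a thread pool and collect the timing records."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed_call, task, time.perf_counter()) for task in tasks]
        return [future.result() for future in futures]


def fake_request():
    time.sleep(0.01)  # stand-in for an API call (assumption)


records = run_concurrently([fake_request] * 8, workers=2)
avg_queue = sum(r["queue_ms"] for r in records) / len(records)
avg_total = sum(r["queue_ms"] + r["request_ms"] for r in records) / len(records)
distribution = Counter(r["worker"] for r in records)

print(f"Average queue time: {avg_queue:.2f} ms")
print(f"Average total time: {avg_total:.2f} ms")
print(f"Worker distribution: {dict(distribution)}")
```

Because there are more tasks than workers in this sketch, later tasks accumulate queue time while they wait for a free thread, which is why total time exceeds request time.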
Expected behavior:
- Concurrent throughput should be 5-20x higher than sequential
- If speedup is less than 3x, API may be bottlenecked
Example:
Sequential: 2.11 prompts/sec
Concurrent (10 workers): 15.5 prompts/sec
Speedup: 7.3x ✓ Good
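
The speedup check above can be sketched as a small helper. The 5x and 3x thresholds come from the expected behavior listed here; the function name `assess_speedup` and the wording of the verdict for the range between 3x and 5x are assumptions, since the guide does not label that range.

```python
def assess_speedup(sequential_tps, concurrent_tps):
    """Return (speedup, verdict) for two throughput figures in prompts/sec."""
    speedup = concurrent_tps / sequential_tps
    if speedup >= 5:
        verdict = "good"
    elif speedup >= 3:
        verdict = "below expected range"  # assumed label, not from the guide
    else:
        verdict = "possible API bottleneck"
    return speedup, verdict


speedup, verdict = assess_speedup(sequential_tps=2.11, concurrent_tps=15.5)
print(f"Speedup: {speedup:.1f}x ({verdict})")
```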
Expected behavior:
- Average latency should be similar in both modes
- If concurrent latency is significantly higher, the API is likely overloaded
Example:
Sequential avg: 472 ms
Concurrent avg: 485 ms
Difference: 2.8% ✓ Excellent (API not overloaded)
Sequential avg: 472 ms
Concurrent avg: 1250 ms
Difference: 165% ⚠️ Warning (API struggling under load)
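
The percentage difference in these examples is the relative change in average latency. A minimal sketch follows; `assess_latency` is a hypothetical helper, and the default 50% cut-off mirrors the rule of thumb this guide gives for an overloaded API rather than any hard limit.

```python
def latency_increase_pct(sequential_avg_ms, concurrent_avg_ms):
    """Relative increase of concurrent over sequential latency, in percent."""
    return (concurrent_avg_ms - sequential_avg_ms) / sequential_avg_ms * 100


def assess_latency(sequential_avg_ms, concurrent_avg_ms, overload_pct=50.0):
    """Return (increase_pct, status) using an adjustable overload threshold."""
    increase = latency_increase_pct(sequential_avg_ms, concurrent_avg_ms)
    status = "possible overload" if increase > overload_pct else "ok"
    return increase, status


for concurrent_avg in (485, 1250):
    increase, status = assess_latency(472, concurrent_avg)
    print(f"Concurrent avg {concurrent_avg} ms: {increase:+.1f}% ({status})")
```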
What they tell you:
- p95/p99 reveal worst-case user experience
- Large gap between p50 and p99 indicates inconsistent performance
- Concurrent p99 much higher than sequential suggests queuing issues
Example:
Sequential:
p50: 410 ms
p95: 649 ms
p99: 657 ms
Gap (p99-p50): 247 ms ✓ Consistent
Concurrent:
p50: 221 ms
p95: 456 ms
p99: 1523 ms
Gap (p99-p50): 1302 ms ⚠️ High variance
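
The gap figures above are simple differences, and a ratio can be a useful companion because it is independent of the absolute latency. The sketch below only reports both numbers; it does not classify them, since this guide does not define a cut-off between consistent and high-variance results. `percentile_gap` is a hypothetical name.

```python
def percentile_gap(p50_ms, p99_ms):
    """Return the absolute gap (ms) and the p99/p50 ratio."""
    return p99_ms - p50_ms, p99_ms / p50_ms


for label, p50, p99 in (("Sequential", 410, 657), ("Concurrent", 221, 1523)):
    gap, ratio = percentile_gap(p50, p99)
    print(f"{label}: gap {gap} ms, p99 is {ratio:.2f}x p50")
```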
Run sequential first to establish baseline:
python prompt_evaluator.py -i datasets/test.jsonl
Don't jump straight to high concurrency:
# Start low
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 5
# Increase gradually
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 10
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 20
Look for these patterns:
- ✓ Good: Concurrent throughput 5-20x sequential, similar latency
- ⚠️ Caution: Throughput improves but latency increases significantly
- ❌ Overload: High error rates, throughput doesn't improve
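
One way to apply these patterns after a ramp-up is to keep only the runs whose latency and error rate stay acceptable, then pick the one with the highest throughput. The sketch below assumes the results have already been collected into dictionaries by hand; `pick_sweet_spot`, the field names, the sample numbers, and the 1% error-rate limit are all hypothetical.

```python
def pick_sweet_spot(baseline, runs, max_latency_increase_pct=50.0, max_error_rate=0.01):
    """Return the acceptable run with the highest throughput, or None."""
    acceptable = []
    for run in runs:
        increase = (
            (run["avg_latency_ms"] - baseline["avg_latency_ms"])
            / baseline["avg_latency_ms"] * 100
        )
        # Reject runs that show the "Caution" or "Overload" patterns.
        if increase <= max_latency_increase_pct and run["error_rate"] <= max_error_rate:
            acceptable.append(run)
    return max(acceptable, key=lambda run: run["throughput"], default=None)


# Hypothetical ramp-up results, not taken from a real run.
baseline = {"throughput": 2.1, "avg_latency_ms": 465}
runs = [
    {"workers": 5, "throughput": 9.0, "avg_latency_ms": 470, "error_rate": 0.0},
    {"workers": 10, "throughput": 14.3, "avg_latency_ms": 478, "error_rate": 0.0},
    {"workers": 20, "throughput": 15.0, "avg_latency_ms": 900, "error_rate": 0.04},
]
best = pick_sweet_spot(baseline, runs)
print(f"Suggested concurrency: {best['workers']} workers")
```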
Keep a log of performance characteristics:
Dataset: combined_dataset.jsonl (1000 prompts)
Sequential:
- Throughput: 2.1 prompts/sec
- Avg latency: 465 ms
- p99: 890 ms
Concurrent (10 workers):
- Throughput: 14.3 prompts/sec (6.8x speedup)
- Avg latency: 478 ms (+2.8%)
- p99: 1205 ms (+35%)
- Conclusion: Sweet spot for this API
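
If a structured log is preferable to free-form notes, one option is to append each run as a line of JSON. This is only a sketch: the `RunRecord` fields and the helper names are assumptions chosen to match the example entries above, not an output format produced by the evaluators.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    dataset: str
    mode: str            # "sequential" or "concurrent"
    workers: int
    throughput: float    # prompts/sec
    avg_latency_ms: float
    p99_ms: float


def to_json_line(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record))


def from_json_line(line):
    """Rebuild a record from a line written by to_json_line."""
    return RunRecord(**json.loads(line))


def append_record(path, record):
    """Append a record to a JSON Lines log file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(to_json_line(record) + "\n")


record = RunRecord("combined_dataset.jsonl", "concurrent", 10, 14.3, 478.0, 1205.0)
print(to_json_line(record))
```

Calling `append_record("performance_log.jsonl", record)` after each run (the file name is arbitrary) builds up a history that can be compared over time.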
A: Some increase is normal due to:
- Queue time waiting for workers
- API processing multiple requests simultaneously
- Network contention
If the increase exceeds 50%, the API may be overloaded.
A: Usually yes, but:
- Check with infrastructure team first (auto-scaling concerns)
- Start with low concurrency (5-10 workers)
- Use rate limiting if API has known limits
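
For readers unfamiliar with client-side rate limiting, the sketch below shows the general idea of spacing out request starts across threads. It illustrates the concept only and is not how the concurrent evaluator implements its rate limit; the `RateLimiter` class and its parameters are hypothetical.

```python
import threading
import time


class RateLimiter:
    """Space out calls so request starts are at least 1/rate_per_sec apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self):
        """Block until the caller's reserved slot; return the delay in seconds."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval  # reserve the following slot
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve
        return delay


limiter = RateLimiter(rate_per_sec=50)
for _ in range(3):
    waited = limiter.acquire()
    print(f"waited {waited * 1000:.1f} ms before sending")
```

Each worker would call `acquire()` immediately before issuing its request, so the overall request rate stays near the configured limit regardless of the worker count.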
A: Normal variance due to:
- Network conditions
- API server load
- Other users on the system
Run multiple times and average results for stability.
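
Averaging repeated runs can be done by hand or with a few lines of code. The sketch below reports the mean and the sample standard deviation of throughput across runs; `aggregate_runs` and the sample figures are hypothetical.

```python
import statistics


def aggregate_runs(throughputs):
    """Return (mean, sample standard deviation) of per-run throughput."""
    mean = statistics.mean(throughputs)
    spread = statistics.stdev(throughputs) if len(throughputs) > 1 else 0.0
    return mean, spread


# Hypothetical throughput figures (prompts/sec) from three runs.
mean_tps, spread_tps = aggregate_runs([2.0, 2.2, 2.4])
print(f"Throughput: {mean_tps:.2f} ± {spread_tps:.2f} prompts/sec")
```

A large spread relative to the mean suggests the environment was unstable during the runs, and that more repetitions are needed before drawing conclusions.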
A: Yes, but consider:
- API performance may have changed
- Scanner models may have been updated
- Network conditions vary
It is best to re-run both the sequential and concurrent evaluators on the same day.
- Both evaluators now report consistent metrics for easy comparison
- Start with sequential to establish baseline
- Use concurrent for large datasets and load testing
- Compare throughput, latency, and percentiles to understand API behavior
- Document findings for capacity planning
See also:
- Evaluation Metrics Guide - Detailed metric explanations
- Concurrent Evaluation - Advanced concurrent usage