The concurrent prompt evaluator (prompt_evaluator_concurrent.py) simulates real-world multi-user scenarios by sending multiple API requests in parallel. This provides insights into:
- API performance under load: How response times change with concurrent requests
- Throughput metrics: Requests per second your API can handle
- Latency distributions: p50, p95, p99 percentiles for response times
- Rate limiting behavior: Identify throttling thresholds
- Queue time analysis: Time waiting for available workers
Running the concurrent evaluator can trigger infrastructure auto-scaling and may impact shared resources. Before using this tool in any environment, especially non-local development:
- Check with your platform/infrastructure team to:
  - Understand auto-scaling policies
  - Determine safe concurrency levels for your environment
  - Schedule the test during maintenance windows if necessary
  - Get approval for load testing
- Start conservatively:
  - Begin with --concurrency 5 or lower
  - Use --lines to limit dataset size
  - Monitor infrastructure metrics during the run
- Use rate limiting to smooth load:
  --rate-limit 20 # Maximum 20 requests per second
- Avoid peak hours when running high-concurrency tests
**Not following these guidelines could cause:**
- Unexpected cloud infrastructure costs
- Service disruptions to other teams or applications
- Cascading failures if auto-scaling can't keep up
# Run with 10 concurrent workers (default)
python prompt_evaluator_concurrent.py -i datasets/combined_dataset.jsonl
# Run with 20 concurrent workers
python prompt_evaluator_concurrent.py -i datasets/combined_dataset.jsonl --concurrency 20
# Run with rate limiting (50 requests/sec)
python prompt_evaluator_concurrent.py -i datasets/combined_dataset.jsonl --rate-limit 50
# Test with small dataset
python prompt_evaluator_concurrent.py -i datasets/combined_dataset.jsonl -l 50 -c 5

# Run concurrent evaluation
make run-concurrent DATASET=datasets/combined_dataset.jsonl
# With custom settings
make run-concurrent DATASET=datasets/combined_dataset.jsonl CONCURRENCY=20 LINES=100
# Quick test
make test-concurrent

The concurrent evaluator uses Python's ThreadPoolExecutor to manage a pool of worker threads that send API requests in parallel:
┌─────────────────────────────────────────────────────────┐
│                      Load Dataset                       │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                ConcurrentScanPromptsNode                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Worker 1 │  │ Worker 2 │  │ Worker N │               │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘               │
│       │             │             │                     │
│       └─────────────┴─────────────┘                     │
│            Shared Result Pool                           │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│          Calculate Metrics & Generate Reports           │
└─────────────────────────────────────────────────────────┘
- Thread Pool: Configurable number of concurrent workers
- Connection Pooling: Reuses HTTP connections for efficiency
- Thread-Safe Progress Tracking: Lock-protected counters
- Enhanced Timing: Tracks queue time, request time, and total time
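The features listed above can be illustrated with a minimal sketch. It is not the evaluator's actual code: `fake_api_call`, `timed_call`, and `run_concurrently` are hypothetical names, the API request is replaced by a short sleep, and connection pooling is left out. It only shows how a ThreadPoolExecutor, a lock-protected counter, and queue/request timing might fit together.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_api_call(prompt: str) -> str:
    """Hypothetical stand-in for the real API request."""
    time.sleep(0.01)
    return f"scanned: {prompt}"


def timed_call(prompt: str, submitted_at: float) -> dict:
    """Run one request and record how long it queued and how long it ran."""
    started_at = time.perf_counter()  # a worker has picked the task up
    outcome = fake_api_call(prompt)
    finished_at = time.perf_counter()
    queue_ms = (started_at - submitted_at) * 1000
    request_ms = (finished_at - started_at) * 1000
    return {
        "outcome": outcome,
        "queue_time_ms": queue_ms,
        "request_time_ms": request_ms,
        "total_time_ms": queue_ms + request_ms,
    }


def run_concurrently(prompts: list[str], concurrency: int = 10) -> list[dict]:
    completed = 0
    lock = threading.Lock()  # protects the shared progress counter

    def task(prompt: str, submitted_at: float) -> dict:
        nonlocal completed
        result = timed_call(prompt, submitted_at)
        with lock:
            completed += 1
        return result

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # The submit timestamp is taken here so queue time includes the wait
        # for a free worker.
        futures = [pool.submit(task, p, time.perf_counter()) for p in prompts]
        results = [f.result() for f in futures]  # results in submission order
    print(f"completed {completed}/{len(prompts)} prompts")
    return results


results = run_concurrently([f"prompt {i}" for i in range(20)], concurrency=5)
```

With 20 prompts and 5 workers, the first five tasks start almost immediately while later ones accumulate queue time, which is the effect the queue-time metric is meant to expose.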
Calculates performance metrics:
- Throughput: Total prompts processed per second
- Latency Percentiles: p50 (median), p95, p99
- Queue Statistics: Time waiting for available worker
- Request Distribution: How work is spread across workers
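As a rough illustration of how throughput and latency percentiles can be derived from raw timings, here is a sketch using the nearest-rank percentile method. The evaluator may use a different method (for example interpolation), so its numbers can differ slightly; `percentile` and `summarize` are hypothetical helpers and the latency values are made up.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(request_times_ms: list[float], elapsed_seconds: float) -> dict:
    """Throughput plus p50/p95/p99 for one run."""
    return {
        "throughput_per_sec": len(request_times_ms) / elapsed_seconds,
        "p50_ms": percentile(request_times_ms, 50),
        "p95_ms": percentile(request_times_ms, 95),
        "p99_ms": percentile(request_times_ms, 99),
    }


# Hypothetical data: 100 requests with latencies of 1..100 ms, finished in 10 s.
latencies = [float(i) for i in range(1, 101)]
summary = summarize(latencies, elapsed_seconds=10.0)
print(summary)
```

A large gap between p50 and p99 points to a small number of slow requests that an average alone would hide.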
Concurrent evaluation adds these fields to each result:
| Field | Description |
|---|---|
| queue_time_ms | Time waiting for an available worker (milliseconds) |
| request_time_ms | Actual API call duration (milliseconds) |
| total_time_ms | Queue time + request time (milliseconds) |
| worker_id | Which worker processed this request (0 to N-1) |
| timestamp | ISO 8601 timestamp when request was sent |
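The relationships between these fields can be checked mechanically. The sketch below is an assumption-laden example, not part of the tool: `check_timing_fields` is a hypothetical helper, the sample record is invented, and the tolerance is arbitrary.

```python
from datetime import datetime

TIMING_FIELDS = ("queue_time_ms", "request_time_ms", "total_time_ms", "worker_id", "timestamp")


def check_timing_fields(record: dict, concurrency: int, tolerance_ms: float = 0.01) -> list[str]:
    """Return a list of problems found in one result record (empty if none)."""
    missing = [name for name in TIMING_FIELDS if name not in record]
    if missing:
        return [f"missing fields: {', '.join(missing)}"]

    problems = []
    expected_total = record["queue_time_ms"] + record["request_time_ms"]
    if abs(record["total_time_ms"] - expected_total) > tolerance_ms:
        problems.append("total_time_ms is not queue_time_ms + request_time_ms")
    if not 0 <= record["worker_id"] < concurrency:
        problems.append("worker_id is outside 0..N-1")
    try:
        # Note: a trailing "Z" is only accepted by fromisoformat on Python 3.11+.
        datetime.fromisoformat(record["timestamp"])
    except ValueError:
        problems.append("timestamp is not ISO 8601")
    return problems


# Invented sample record for illustration only.
sample = {
    "queue_time_ms": 12.34,
    "request_time_ms": 245.12,
    "total_time_ms": 257.46,
    "worker_id": 3,
    "timestamp": "2024-01-01T12:00:00+00:00",
}
print(check_timing_fields(sample, concurrency=10))
```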
All output files include _concurrent suffix to distinguish from sequential runs:
results/
├── dataset_concurrent_results.jsonl
├── dataset_concurrent_false_positives.jsonl
└── dataset_concurrent_false_negatives.jsonl
Concurrent Performance Metrics:
==============================
Configuration:
--------------
Concurrent Workers: 10
Rate Limit: None requests/sec
Total Prompts: 100
Total Elapsed Time: 15.32 seconds
Throughput:
-----------
Overall: 6.53 prompts/sec
Request Latency (API call time):
--------------------------------
Average: 245.12 ms
Minimum: 89.34 ms
Maximum: 1523.45 ms
p50 (median): 221.67 ms
p95: 456.78 ms
p99: 1234.56 ms
Queue Time (waiting for worker):
--------------------------------
Average: 12.34 ms
Maximum: 45.67 ms
Total Time (queue + request):
-----------------------------
Average: 257.46 ms
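Two of the figures in the sample report follow directly from the others, which is a quick way to sanity-check a run. The snippet below simply redoes that arithmetic with the values shown above.

```python
# Values copied from the sample report above.
total_prompts = 100
elapsed_seconds = 15.32
avg_request_ms = 245.12
avg_queue_ms = 12.34

# Overall throughput = prompts processed / wall-clock time.
throughput = total_prompts / elapsed_seconds
# Average total time = average queue time + average request time.
avg_total_ms = avg_queue_ms + avg_request_ms

print(f"Overall: {throughput:.2f} prompts/sec")
print(f"Average total time: {avg_total_ms:.2f} ms")
```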
Determine how many concurrent requests your API can handle:
# Start with low concurrency
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 5
# Gradually increase
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 10
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 20
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 50

Watch for:
- Increasing latency: p95/p99 times grow significantly
- Error rates: API starts returning errors
- Throughput plateau: Adding workers doesn't improve throughput
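One way to make the "throughput plateau" check concrete is to record the throughput reported at each concurrency level and look for the first step that yields little gain. This is only a sketch: `find_plateau` is a hypothetical helper, the 10% threshold is an arbitrary assumption, and the numbers are invented.

```python
def find_plateau(runs: list[tuple[int, float]], min_gain: float = 0.10):
    """Return the concurrency level after which throughput stops improving.

    runs holds (concurrency, prompts_per_sec) pairs. The result is the first
    level whose next step up improves throughput by less than min_gain
    (relative), or None if every step still helps.
    """
    ordered = sorted(runs)
    for (c_prev, t_prev), (_c_next, t_next) in zip(ordered, ordered[1:]):
        if t_prev > 0 and (t_next - t_prev) / t_prev < min_gain:
            return c_prev
    return None


# Invented measurements: throughput flattens between 20 and 50 workers.
runs = [(5, 3.4), (10, 6.5), (20, 11.8), (50, 12.1)]
print(find_plateau(runs))
```

Here the sketch reports 20, meaning that going beyond 20 workers adds load without meaningfully increasing throughput.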
Run the same dataset with both evaluators:
# Sequential baseline
python prompt_evaluator.py -i datasets/test.jsonl
# Concurrent with 10 workers
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 10

Compare:
- Total execution time
- Average latency (should be similar if API is not overloaded)
- Throughput improvement
Identify API rate limiting thresholds:
# Without rate limiting
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 20
# With rate limiting
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 20 --rate-limit 50

If the unlimited version shows errors or throttling, use --rate-limit to stay within API limits.
Simulate your production traffic patterns:
# Example: API receives ~100 requests/sec from ~50 concurrent users
python prompt_evaluator_concurrent.py -i datasets/large_dataset.jsonl \
--concurrency 50 \
--rate-limit 100

Before running concurrent evaluation against shared environments:
- Notify your platform/infrastructure team
- Get explicit approval for the load test
- Provide concurrency levels and duration
- Confirm no scheduled maintenance or production deployments
This prevents unexpected service disruptions and cost overruns.
Begin with low concurrency and small datasets:
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -l 20 -c 5

- Watch CPU usage (threading overhead)
- Monitor network connections
- Check memory consumption with large datasets
- Use --rate-limit to avoid overwhelming the API
- Start with conservative concurrency levels
- Gradually increase to find optimal settings
- Check for errors: Review results for failed requests
- Look at latency trends: Compare p50 vs p95 vs p99
- Watch queue times: High queue time indicates worker bottleneck
- Compare with sequential: Ensure accuracy metrics match
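The first three checks above can be scripted against the results file. The sketch below assumes that failed requests carry an `outcome` value starting with "error" (as in the troubleshooting notes) and that each record has the timing fields described earlier; `load_jsonl` and `summarize_run` are hypothetical helpers and the sample records are invented.

```python
import json


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def average(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def summarize_run(results: list[dict]) -> dict:
    """Error rate plus a simple queue-versus-request comparison."""
    errors = sum(1 for r in results if str(r.get("outcome", "")).startswith("error"))
    avg_queue = average([r["queue_time_ms"] for r in results if "queue_time_ms" in r])
    avg_request = average([r["request_time_ms"] for r in results if "request_time_ms" in r])
    return {
        "error_rate": errors / len(results) if results else 0.0,
        "avg_queue_ms": avg_queue,
        "avg_request_ms": avg_request,
        # True when requests spend longer waiting for a worker than in the API call.
        "queue_bound": avg_queue > avg_request,
    }


# Invented records; in practice use load_jsonl("results/dataset_concurrent_results.jsonl").
sample_results = [
    {"outcome": "pass", "queue_time_ms": 10.0, "request_time_ms": 200.0},
    {"outcome": "pass", "queue_time_ms": 20.0, "request_time_ms": 200.0},
    {"outcome": "error: timeout", "queue_time_ms": 30.0, "request_time_ms": 200.0},
    {"outcome": "pass", "queue_time_ms": 40.0, "request_time_ms": 200.0},
]
report = summarize_run(sample_results)
print(report)
```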
| Argument | Short | Default | Description |
|---|---|---|---|
| --input | -i | Required | Input dataset file |
| --lines | -l | None | Limit number of prompts to process |
| --concurrency | -c | 10 | Number of concurrent workers |
| --rate-limit | | None | Maximum requests per second |
| --format | -f | json | Output format for FP/FN files |
| --results-format | -r | json | Output format for main results |
Guidelines:
- 5-10 workers: Safe starting point, good for testing
- 10-20 workers: Typical production simulation
- 20-50 workers: Heavy load testing
- 50+ workers: Stress testing (monitor closely)
Factors to consider:
- API rate limits
- Available CPU cores
- Network bandwidth
- Dataset size
Use --rate-limit to control request frequency:
# 50 requests per second maximum
python prompt_evaluator_concurrent.py -i dataset.jsonl --rate-limit 50

When to use:
- API has documented rate limits
- Testing specific throughput levels
- Avoiding overwhelming shared infrastructure
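This document does not describe how --rate-limit is implemented. As an illustration of the general idea, the sketch below shows one common approach: a thread-safe limiter that hands out evenly spaced time slots so that calls across all workers never exceed the configured rate. `RateLimiter` is a hypothetical class, not the evaluator's own.

```python
import threading
import time


class RateLimiter:
    """Spaces calls at least 1/rate seconds apart, across all threads."""

    def __init__(self, requests_per_second: float):
        self._interval = 1.0 / requests_per_second
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self) -> None:
        """Block until the caller's reserved time slot arrives."""
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)          # reserve the next free slot
            self._next_slot = slot + self._interval   # push the following slot back
        delay = slot - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other threads can reserve


# 11 calls at 50 requests/sec: the last slot is 10 intervals (0.2 s) after the first.
limiter = RateLimiter(requests_per_second=50)
start = time.monotonic()
for _ in range(11):
    limiter.acquire()
elapsed = time.monotonic() - start
print(f"11 calls took {elapsed:.2f} s")
```

In a worker pool, each worker would call `limiter.acquire()` just before sending its request, so the cap applies to the pool as a whole and not per worker.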
Symptoms: Many results show "outcome": "error: ..."
Solutions:
- Reduce concurrency: --concurrency 5
- Add rate limiting: --rate-limit 20
- Check API credentials and network connectivity
Symptoms: queue_time_ms >> request_time_ms
Solutions:
- High queue time means every worker is busy and requests are waiting for a free one; this is expected under load and is not an error by itself
- Consider increasing concurrency if the API can handle it
Symptoms: Adding workers doesn't increase prompts/sec
Solutions:
- API may have rate limiting
- Network bottleneck
- Try rate limiting to smooth out traffic
Symptoms: Script crashes or slows down significantly
Solutions:
- Process smaller chunks: --lines 100
- Reduce concurrency
- Use JSONL format for streaming (less memory)
Create a shell script to test multiple configurations:
#!/bin/bash
# Run the evaluator at several concurrency levels on the same 100-prompt sample
DATASET="datasets/test_dataset.jsonl"
for concurrency in 5 10 20 50; do
    echo "Testing with concurrency=$concurrency"
    python prompt_evaluator_concurrent.py \
        -i "$DATASET" \
        -c "$concurrency" \
        -l 100
done

import json
# Load concurrent results (one JSON object per line)
with open('results/dataset_concurrent_results.jsonl') as f:
    results = [json.loads(line) for line in f if line.strip()]
# Group request latencies by the worker that handled them
worker_latencies = {}
for result in results:
    worker_id = result['worker_id']
    if worker_id not in worker_latencies:
        worker_latencies[worker_id] = []
    worker_latencies[worker_id].append(result['request_time_ms'])
# Print average latency per worker
for worker_id, latencies in sorted(worker_latencies.items()):
    avg = sum(latencies) / len(latencies)
    print(f"Worker {worker_id}: {avg:.2f}ms average")

| Feature | Sequential | Concurrent |
|---|---|---|
| Processing | One at a time | Parallel workers |
| Speed | Baseline | 5-10x faster (typical) |
| Metrics | Basic latency | Enhanced timing & throughput |
| Use Case | Baseline accuracy | Load testing, performance |
| API Load | Minimal | Simulates production |
| Complexity | Simple | More complex |
Potential improvements for the concurrent evaluator:
- Comparison Mode: Run sequential and concurrent in one command
- Enhanced Reporting: PDF reports with latency histograms
- Timeline Visualization: Chart showing requests over time
- Async Support: Use asyncio for even higher concurrency
- Retry Logic: Automatic retry of failed requests
- Worker Affinity: Pin certain prompts to specific workers
When modifying the concurrent evaluator:
- Maintain thread safety (use locks for shared state)
- Update timing metrics consistently
- Add tests to test_concurrent_basic.py
- Update this documentation
- Follow existing code patterns from prompt_evaluator.py
- Main evaluator: prompt_evaluator.py
- Project documentation: WARP.md
- pocketflow library: https://github.com/pocketflow/pocketflow
- Python ThreadPoolExecutor: https://docs.python.org/3/library/concurrent.futures.html