# Extreme Performance Variability (CV > 10%) in Function Workloads Despite CPU Pinning
## Environment
- Firecracker Version: v1.1.0
- Host Kernel: 6.8.0-101-generic
- Guest OS: Ubuntu 22.04
- CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
- Setup: FaaS with gRPC, Firecracker + fc_vcpu threads pinned to dedicated core
## Problem Description
When running serverless functions via gRPC in Firecracker VMs, we observe significant performance instability despite strict CPU core pinning. The BFS graph traversal workload shows a Coefficient of Variation (CV) of ~213% in cycles, which makes it impossible to offer performance SLO guarantees.
## Performance Data (50 iterations per function)

| Function | Cycles CV | Instructions CV | IPC | Kernel Mode % |
| --- | --- | --- | --- | --- |
| fibonacci (compute-bound) | 8.90% | 15.35% | 1.123 | 98.9% |
| matmul (compute-bound) | 9.26% | 13.88% | 1.032 | 98.8% |
| bfs (memory-intensive) | 212.96% ⚠️ | 325.50% ⚠️ | 0.661 | 99.9% ⚠️ |
| image_processing | 5.57% | 11.63% | 0.639 | 99.7% |
| video_processing | 13.06% | 32.79% | 0.495 | 99.3% |
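For reference, CV here is the standard deviation of the per-run counts divided by their mean. A minimal sketch of how we compute it from a file with one cycle count per line (file name hypothetical):

```bash
# Sketch: CV from one cycle count per line in cycles.txt (hypothetical file);
# uses population standard deviation for simplicity
awk '{ s += $1; ss += $1*$1; n++ }
     END { m = s/n; printf "CV = %.2f%%\n", 100*sqrt(ss/n - m*m)/m }' cycles.txt
```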
## Key Observations
- Unacceptable variance: BFS shows up to 3x variation in execution time for identical inputs
- Kernel mode dominance: all workloads spend 98-99% of their cycles in kernel mode (expected: <20%)
  - User mode cycles: ~20K
  - Kernel mode cycles: ~1.5M-27M
## Suspected Root Causes
The extreme kernel mode percentage suggests:
- Excessive VM exits: triggered by gRPC network I/O (likely `EPT_VIOLATION`, `EXTERNAL_INTERRUPT`)
- Interrupt injection overhead: network packets triggering frequent guest interrupts (see the quick check below)
- Virtio-net inefficiency: possible lack of interrupt coalescing or ring-buffer batching
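One quick way to gauge the interrupt pressure from the guest side (a sketch; run inside the guest, and interrupt naming varies with the virtio device layout):

```bash
# Sample virtio interrupt counters twice, 1s apart, to estimate
# the per-second interrupt rate on the virtio-net queues
grep -i virtio /proc/interrupts; sleep 1; grep -i virtio /proc/interrupts
```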
For BFS specifically:
- Random memory access patterns amplify timing jitter from VM exits
- Hypothesis: Unpredictable VM exit timing disrupts cache locality
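To probe this, we could sample cache-miss counters alongside cycles while a BFS request is in flight; a sketch using generic perf event aliases (the exact PMU events differ per CPU):

```bash
# Attach counters to the Firecracker process for 5s while driving one BFS invocation
perf stat -e cycles,instructions,cache-references,cache-misses,LLC-load-misses \
  -p $FC_PID -- sleep 5
```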
## Reproduction Steps
```bash
# Start containerd
firecracker-containerd --config /etc/firecracker-containerd/config.toml

# Start a container
numactl --physcpubind=150 --membind=1 firecracker-ctr \
  --address /run/firecracker-containerd/containerd.sock \
  run \
  --snapshotter devmapper \
  --runtime aws.firecracker \
  --rm --tty --net-host \
  docker.io/library/fibonacci:latest fibonacci-test

# Pin Firecracker and its fc_vcpu threads to a dedicated core
taskset -cp 222 $FC_PID
for tid in $(ps -T -p $FC_PID | grep fc_vcpu | awk '{print $2}'); do
  taskset -cp 222 $tid
done

# Measure variance (repeat 50 times)
perf stat -e cycles:u,cycles:k,instructions:u,instructions:k \
  -p $FC_PID -- numactl --physcpubind=50 --membind=0 \
  /home/h00918771/vSwarm/tools/bin/grpcurl \
  --proto proto/fibonacci.proto -plaintext \
  -d '{"name":"256"}' ${VM_IP}:50051 fibonacci.Greeter.SayHello
```
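The 50 repetitions are driven by a wrapper along these lines (a sketch; the grpcurl invocation is as above, and per-run CSV output via perf's `-x` flag is assumed):

```bash
# Collect per-run user/kernel cycle and instruction counts for 50 invocations;
# $FC_PID and $VM_IP are set as in the steps above
for i in $(seq 1 50); do
  perf stat -x, -o "run_${i}.csv" \
    -e cycles:u,cycles:k,instructions:u,instructions:k \
    -p $FC_PID -- \
    grpcurl --proto proto/fibonacci.proto -plaintext \
      -d '{"name":"256"}' ${VM_IP}:50051 fibonacci.Greeter.SayHello
done
```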
## Questions
- Is this level of kernel mode overhead expected for gRPC workloads?
- What approach would you recommend for profiling VM exit reasons? (`perf kvm stat`?)
- Are there virtio-net tuning parameters for latency predictability? Ring sizes, interrupt coalescing, rate limiters? (a guest-side query is sketched below)
- Should we consider using `vhost-user` for this use case?
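For context on the virtio-net question, this is what we can query from inside the guest (a sketch; ethtool commonly reports coalescing as unsupported on virtio-net):

```bash
# Run inside the guest; device name assumed to be eth0
ethtool -c eth0   # interrupt coalescing settings
ethtool -g eth0   # RX/TX ring buffer sizes
```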
## Expected Behavior
For deterministic workloads (identical inputs):
- CV should be < 20% even for memory-intensive functions
- Application code should run primarily in user mode
- Performance should be reproducible across invocations
## Next Steps
We are prepared to:
- Run detailed VM exit profiling with `perf kvm stat` (sketch below)
- Test suggested configuration changes
- Share guest-side profiling data
- Contribute patches if we identify fixes
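The exit profiling we have in mind looks roughly like this (host side; assumes perf's KVM tooling is installed):

```bash
# Trace VM exits for ~30s, then aggregate by exit reason
sudo perf kvm stat record -p $FC_PID -- sleep 30
sudo perf kvm stat report --event vmexit
```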
Any guidance would be greatly appreciated! 🙏