src/vmaware.hpp
62 additions & 7 deletions
@@ -5141,9 +5141,27 @@ struct VM {
  * The hypervisor must return the time debt between the first loop and the second loop: always after step 3 and always before step 6
  * To counter this, VMAware simply keeps track of the latency between iterations, so no matter when the hypervisor restores the time debt, it sees the hidden latency
  *
+ * - Statistical Check -
+ * Now, they might try to downscale TSC only some of the time and pay the debt much later, so that when VMAware runs cpuid it sees, for example:
+ * fast fast fast fast fast fast fast fast fast fast fast fast fast fast slow (the time debt of all previous iterations is paid here, in the "slow" sample)
+ * VMAware sees 94% fast and 6% slow, assumes the 6% slow samples are normal kernel noise, and discards them, so it gets bypassed...
+ *
+ * Thus, when VMAware detects a spike that looks like a time debt payment, it redistributes it into the previous samples.
+ * After the hypervisor has paid all the time debt, ALL the hidden latency will have been redistributed into all samples, and will mathematically always cross the latency threshold
+ * They might try to keep playing with the time debt redistribution, e.g. never paying 20-30% of the time debt, but the thresholds in VMAware's code (and the excessive number of iterations)
+ * are calculated on purpose so that it's mathematically impossible (even with a patch that only downscales 1 cycle) to contaminate samples selectively without triggering any check:
+ * if a hypervisor sometimes doesn't pay the time debt (enough that the redistributed amount stays below the latency threshold), it trips the local or global ratio check,
+ * and if it does pay the time debt, at whatever interval (let's say at loop iteration 15, or 32, or 47...), it trips the cpuid latency check
+ *
  * So, what can they do now?
  * - Low Latency Check -
- * If they cant do technique 1 (hiding the latency), they might attempt technique 2 (making the VM as fast as possible)
+ * If they can't do technique 1 (hiding the latency), they might attempt technique 2 (making the VM as fast as possible)
  * Remember that we use the cpuid instruction, which always has high latency in a VM and returns results containing info about the CPU
  *
  * To do this, they might cache the results and give them back instantly when cpuid is executed, or recode the whole kernel to just make it handle the cpuid quickly
@@ -5516,12 +5534,20 @@ struct VM {
 u64 acc = 0;
 size_t idx = 0;
 
+// Track sparse latency spikes that may represent "time debt repayment".
+// Hypervisors hiding VMEXIT cost often subtract cycles on most CPUID calls
+// and restore them periodically, producing rare but large spikes.
+u64 last_spike_idx = 0;
+u64 spike_count = 0;
+
 // for each leaf do CPUID_ITER samples, then repeat
 while (state.load(std::memory_order_acquire) != 2) {
     for (size_t li = 0; li < n_leaves; ++li) {
+
         const unsigned int leaf = leaves[li];
 
         for (unsigned i = 0; i < CPUID_ITER; ++i) {
+
             // read rdtsc and accumulate delta
             const u64 now = __rdtsc();
@@ -5535,10 +5561,11 @@ struct VM {
 
             // store latency if buffer has space
             if (idx < samples.size()) {
+
                 u64 lat = cpuid(leaf);
 
-                // If a VMX Preemption Timer is delayed (firing after cpuid returns to bypass t3 inside our cpuid lambda),
-                // OR if the hypervisor intercepts RDTSC to restore the time debt instead of CPUID,
+                // If a VMX Preemption Timer is delayed (firing after cpuid returns to bypass t3 inside our cpuid lambda),
+                // OR if the hypervisor intercepts RDTSC to restore the time debt instead of CPUID,
                 // this outer total_overhead will include that hidden latency because it's at the end of this for loop
                 // if the hypervisor delays it even more, it will still be caught, so the total_overhead check still detects it
                 // this means that no matter where the time debt is restored, VMAware will always be able to see the hidden latency
@@ -5551,21 +5578,49 @@ struct VM {
                     lat = total_overhead;
                 }
 
+                // If the hypervisor hides VMEXIT latency statistically, most CPUID calls will appear
+                // very fast while occasional spikes repay the hidden cycles. MAD filtering in calculate_latency() later
+                // will discard these spikes as outliers, so we detect and redistribute them here
+                if (lat > cycle_threshold) {
+
+                    const size_t gap = idx - last_spike_idx;
+                    last_spike_idx = idx;
+                    spike_count++;
+
+                    // If spikes appear periodically relative to CPUID frequency,
+                    // treat them as time debt repayment instead of real interrupts
+                    if (gap > 1 && gap < 64 && spike_count > 2) {
+                        const u64 repay = lat / gap;
+
+                        // distribute the hidden cycles backwards across the recent samples, exactly the ones where spikes didn't occur
+                        for (size_t j = 1; j < gap && (idx >= j); ++j) {
+                            samples[idx - j] += repay;
+                        }
+
+                        lat = repay;
+                    }
+                }
+
                 samples[idx] = lat;
             }
+
             ++idx;
 
             // if thread 1 finished
-            if (state.load(std::memory_order_acquire) == 2) break;
+            if (state.load(std::memory_order_acquire) == 2)
+                break;
         }
 
-        if (state.load(std::memory_order_acquire) == 2) break;
+        if (state.load(std::memory_order_acquire) == 2)
+            break;
     }
 }
 
 // final rdtsc after detecting finish
 const u64 final_now = __rdtsc();
-if (final_now >= last) acc += (final_now - last);
+if (final_now >= last)
+    acc += (final_now - last);
+
 t2_end.store(acc, std::memory_order_release);
 });
@@ -5614,7 +5669,7 @@ struct VM {
 // this logic can be bypassed if the hypervisor downscales TSC on both cores, and that's precisely why we now do a Global Ratio check