Skip to content

Commit 2993f3c

Browse files
committed
Harden virtio-blk and virtio-net
Address ten correctness and safety issues across the virtqueue chain walking, descriptor validation, and back-pressure paths. Chain walking. Both devices walked exactly one (virtio-net) or three (virtio-blk) descriptors per request, so a guest that split a packet or a >4 KiB block I/O across multiple descriptors got partial transfers and the next chain descriptor was reinterpreted as a new request head. Both device emulators now snapshot the entire chain via VRING_DESC_F_NEXT, sized to the VIRTQ_SIZE-bounded stack array, and route data through readv/writev (net) or per-segment diskimg I/O (blk). Descriptor direction. Neither device validated VRING_DESC_F_WRITE before touching guest memory, so a buggy or malicious guest could hand a "readable-only" status descriptor and have the device write into memory it never offered for writes. Every consumer site now checks direction matches the request type and rejects mismatches with VIRTIO_BLK_S_IOERR or an empty USED publication. NULL guards. virtq_get_avail can legitimately return NULL on a malformed chain; the prior code dereferenced the result on virtio-blk's data and status reads, which is a host segfault, not a queue stall. Every consumer site now NULL-checks. On a malformed chain the device stalls the queue rather than publishing USED with chain[0].id (the buffer ID lives on the chain's last descriptor in packed virtqueues, so publishing with a head id risks pointing the driver at an unrelated in-flight chain). Always-publish-USED. Every error path in virtio-blk/virtio-net used to return without flipping USED on the head, leaking descriptors and eventually drifting the device's view of the ring out of sync with the driver's. Successful walks now always publish USED — IOERR with len=1 on validation failure, the real status byte plus device-writable byte count on success. Stack-array bounds. Per-iteration desc_snap[VIRTQ_SIZE] / iov[VIRTQ_SIZE] arrays are sized to the host-advertised queue depth. virtio-pci now clamps guest writes to queue_size at VIRTQ_SIZE and the walkers cap at VIRTQ_SIZE, so a guest cannot make the walk overrun the stack arrays. Atomic event flag stores. virtq_handle_avail reads guest_event->flags with __ATOMIC_ACQUIRE; the worker threads paired this with plain stores, so the compiler was free to tear or reorder against the surrounding completion writes. A new virtq_set_guest_event_flags helper replaces every plain store with __atomic_store_n / __ATOMIC_RELEASE. virtio-blk SG arithmetic. data_size widened from uint16_t to uint32_t so >64 KiB segments don't silently truncate before reaching diskimg I/O. Per-segment writable_total now accumulates in uint64_t with __builtin_add_overflow, capping at UINT32_MAX-1 so the reported used.len plus the trailing status byte fits the packed-ring uint32_t. VIRTIO_BLK_F_FLUSH. Without the negotiated FLUSH feature the Linux virtio-blk driver runs in writeback-without-barriers mode, so guest fsync(2) returns success while data sits in the host page cache. The device now advertises VIRTIO_BLK_F_FLUSH, dispatches T_FLUSH to a new diskimg_flush() (fdatasync), and fsyncs the backing file at exit. TX back-pressure. The TX worker previously required tapfd POLLOUT in its poll predicate; with a level-triggered ioeventfd POLLIN that meant 100% CPU until the TAP drained. The poll set now drops the TAP predicate by default, drains the ioeventfd on a guest kick, and arms POLLOUT only after a transient writev() EAGAIN. On EAGAIN the chain's next_avail_idx and used_wrap_count are rolled back so the TAP-blocked chain is retried on POLLOUT rather than silently dropped.
1 parent 2231254 commit 2993f3c

9 files changed

Lines changed: 441 additions & 156 deletions

File tree

src/diskimg.c

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,11 @@ ssize_t diskimg_write(struct diskimg *diskimg,
2222
return write(diskimg->fd, data, size);
2323
}
2424

25+
int diskimg_flush(struct diskimg *diskimg)
26+
{
27+
return fdatasync(diskimg->fd);
28+
}
29+
2530
int diskimg_init(struct diskimg *diskimg, const char *file_path)
2631
{
2732
diskimg->fd = open(file_path, O_RDWR);

src/diskimg.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,5 +17,6 @@ ssize_t diskimg_write(struct diskimg *diskimg,
1717
void *data,
1818
off_t offset,
1919
size_t size);
20+
int diskimg_flush(struct diskimg *diskimg);
2021
int diskimg_init(struct diskimg *diskimg, const char *file_path);
2122
void diskimg_exit(struct diskimg *diskimg);

src/virtio-blk.c

Lines changed: 167 additions & 83 deletions
Original file line numberDiff line numberDiff line change
@@ -70,106 +70,182 @@ static void virtio_blk_enable_vq(struct virtq *vq)
7070
dev->vq_thread_started = true;
7171
}
7272

73-
static ssize_t virtio_blk_write(struct virtio_blk_dev *dev,
74-
void *data,
75-
off_t offset,
76-
size_t size)
73+
/* Snapshot of one descriptor in a chain. We copy the volatile guest fields
74+
* once so subsequent decisions cannot tear against a concurrent guest write.
75+
*/
76+
struct desc_snap {
77+
uint64_t addr;
78+
uint32_t len;
79+
uint16_t flags;
80+
uint16_t id;
81+
};
82+
83+
/* Walk a chain starting at the supplied head, snapshotting each descriptor
84+
* into out[]. cap is the maximum supported chain length; the caller passes
85+
* vq->info.size to mirror the "seen >= size" guard in the reference VMM and
86+
* defend against a malformed chain. Returns the count on success, or 0 if the
87+
* chain is malformed (NULL on virtq_get_avail mid-chain, or longer than cap).
88+
* On any return the head has been consumed, so the caller is still obligated
89+
* to publish USED.
90+
*/
91+
static size_t virtio_blk_walk_chain(struct virtq *vq,
92+
struct vring_packed_desc *head,
93+
struct desc_snap *out,
94+
size_t cap)
7795
{
78-
return diskimg_write(dev->diskimg, data, offset, size);
96+
out[0].addr = head->addr;
97+
out[0].len = head->len;
98+
out[0].flags = head->flags;
99+
out[0].id = head->id;
100+
size_t n = 1;
101+
while (out[n - 1].flags & VRING_DESC_F_NEXT) {
102+
if (n >= cap)
103+
return 0;
104+
struct vring_packed_desc *next = virtq_get_avail(vq);
105+
if (!next)
106+
return 0;
107+
out[n].addr = next->addr;
108+
out[n].len = next->len;
109+
out[n].flags = next->flags;
110+
out[n].id = next->id;
111+
n++;
112+
}
113+
return n;
79114
}
80115

81-
static ssize_t virtio_blk_read(struct virtio_blk_dev *dev,
82-
void *data,
83-
off_t offset,
84-
size_t size)
116+
static uint8_t virtio_blk_handle_io(struct virtio_blk_dev *dev,
117+
vm_t *v,
118+
const struct virtio_blk_req *req,
119+
const struct desc_snap *chain,
120+
size_t n,
121+
bool needs_write,
122+
uint32_t *out_written)
85123
{
86-
return diskimg_read(dev->diskimg, data, offset, size);
124+
/* sector * 512 must not overflow before any segment is dispatched. */
125+
uint64_t cur_off;
126+
if (__builtin_mul_overflow(req->sector, (uint64_t) 512, &cur_off))
127+
return VIRTIO_BLK_S_IOERR;
128+
129+
/* writable_total is reported as used.len (uint32_t in the packed-ring
130+
* descriptor) plus one byte for the status descriptor we add later, so
131+
* accumulate in 64 bits and reject any total that wouldn't fit.
132+
*/
133+
uint64_t writable_total = 0;
134+
for (size_t i = 1; i < n - 1; i++) {
135+
const struct desc_snap *seg = &chain[i];
136+
bool is_writable = (seg->flags & VRING_DESC_F_WRITE) != 0;
137+
if (is_writable != needs_write)
138+
return VIRTIO_BLK_S_IOERR;
139+
140+
void *buf = vm_guest_buf(v, seg->addr, seg->len);
141+
if (!buf)
142+
return VIRTIO_BLK_S_IOERR;
143+
144+
uint64_t end;
145+
if (__builtin_add_overflow(cur_off, (uint64_t) seg->len, &end) ||
146+
end > (uint64_t) dev->diskimg->size)
147+
return VIRTIO_BLK_S_IOERR;
148+
149+
ssize_t got;
150+
if (needs_write)
151+
got = diskimg_read(dev->diskimg, buf, (off_t) cur_off, seg->len);
152+
else
153+
got = diskimg_write(dev->diskimg, buf, (off_t) cur_off, seg->len);
154+
if (got < 0 || (size_t) got != (size_t) seg->len)
155+
return VIRTIO_BLK_S_IOERR;
156+
157+
cur_off += seg->len;
158+
if (needs_write) {
159+
if (__builtin_add_overflow(writable_total, (uint64_t) seg->len,
160+
&writable_total) ||
161+
writable_total > (uint64_t) UINT32_MAX - 1)
162+
return VIRTIO_BLK_S_IOERR;
163+
}
164+
}
165+
*out_written = (uint32_t) writable_total;
166+
return VIRTIO_BLK_S_OK;
87167
}
88168

89169
static void virtio_blk_complete_request(struct virtq *vq)
90170
{
91171
struct virtio_blk_dev *dev = (struct virtio_blk_dev *) vq->dev;
92172
vm_t *v = container_of(dev, vm_t, virtio_blk_dev);
93-
uint8_t status;
94-
struct vring_packed_desc *desc;
95-
struct virtio_blk_req req;
173+
struct vring_packed_desc *head;
96174

97-
/* Wire-format header is type/reserved/sector only; the rest of struct
98-
* virtio_blk_req is host bookkeeping and must not be overwritten by the
99-
* guest. */
175+
/* Wire-format header is type/reserved/sector only; the trailing fields
176+
* of struct virtio_blk_req are host bookkeeping.
177+
*/
100178
const size_t hdr_sz = offsetof(struct virtio_blk_req, data);
101179

102-
while ((desc = virtq_get_avail(vq))) {
103-
struct vring_packed_desc *used_desc = desc;
104-
ssize_t io_bytes = 0;
105-
106-
void *hdr = vm_guest_buf(v, desc->addr, hdr_sz);
107-
if (!hdr || desc->len < hdr_sz)
108-
return;
109-
memcpy(&req, hdr, hdr_sz);
110-
if (req.type == VIRTIO_BLK_T_IN || req.type == VIRTIO_BLK_T_OUT) {
111-
if (!virtq_check_next(desc))
112-
return;
113-
desc = virtq_get_avail(vq);
114-
req.data_size = desc->len;
115-
req.data = vm_guest_buf(v, desc->addr, req.data_size);
116-
117-
/* Validate that the request fits in the backing store. Both the
118-
* shift (sector*512) and the addition (offset+data_size) must not
119-
* overflow, and the end must be within diskimg->size. Any failure
120-
* yields VIRTIO_BLK_S_IOERR with no data transferred. */
121-
uint64_t off, end;
122-
bool io_ok = false;
123-
if (req.data && !__builtin_mul_overflow(req.sector, 512, &off) &&
124-
!__builtin_add_overflow(off, req.data_size, &end) &&
125-
end <= (uint64_t) dev->diskimg->size) {
126-
if (req.type == VIRTIO_BLK_T_IN)
127-
io_bytes = virtio_blk_read(dev, req.data, (off_t) off,
128-
req.data_size);
129-
else
130-
io_bytes = virtio_blk_write(dev, req.data, (off_t) off,
131-
req.data_size);
132-
/* A short read/write leaves part of the guest buffer stale,
133-
* so treat anything less than the full request as IOERR. */
134-
io_ok = io_bytes >= 0 && (size_t) io_bytes == req.data_size;
135-
}
136-
status = io_ok ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
137-
} else {
138-
status = VIRTIO_BLK_S_UNSUPP;
139-
}
140-
if (!virtq_check_next(desc))
141-
return;
142-
desc = virtq_get_avail(vq);
143-
/* The status descriptor must advertise at least one device-writable
144-
* byte; otherwise we'd clobber memory the guest did not offer.
180+
while ((head = virtq_get_avail(vq))) {
181+
struct desc_snap chain[VIRTQ_SIZE];
182+
/* Walker cap is the array bound, not the guest-controlled
183+
* vq->info.size — virtio-pci clamps that on writes, but pass
184+
* VIRTQ_SIZE here too as defense in depth against ABI drift.
145185
*/
146-
if (desc->len < 1)
186+
size_t n = virtio_blk_walk_chain(vq, head, chain, VIRTQ_SIZE);
187+
if (n == 0) {
188+
/* Malformed chain — buffer ID lives on the last descriptor and we
189+
* never reached it. Publishing USED with chain[0].id risks pointing
190+
* the driver at an unrelated in-flight chain. Stalling the queue is
191+
* the lesser evil.
192+
*/
147193
return;
148-
req.status = vm_guest_buf(v, desc->addr, 1);
149-
if (!req.status)
150-
return;
151-
*req.status = status;
152-
/* used.len is total bytes the device wrote into device-writable buffers
153-
* across the chain: the 1-byte status is always written, plus the data
154-
* buffer on a successful IN. On any error we report only the status
155-
* byte so the guest does not consume stale data.
194+
}
195+
196+
/* Default response: IOERR using the chain's last-descriptor id (the
197+
* buffer ID) and len=1. Single-descriptor chains have head == last
198+
* so this is the head's id.
156199
*/
157-
size_t written = 1;
158-
if (status == VIRTIO_BLK_S_OK && req.type == VIRTIO_BLK_T_IN)
159-
written += (size_t) io_bytes;
160-
used_desc->len = (uint32_t) written;
161-
/* Buffer ID lives on the chain's last descriptor in packed virtqueues;
162-
* echo it back into the head/used slot so the driver can match the
163-
* completion to its in-flight request.
200+
uint8_t status_byte = VIRTIO_BLK_S_IOERR;
201+
uint16_t buffer_id = chain[n - 1].id;
202+
uint32_t used_len = 1;
203+
uint8_t *status_ptr = NULL;
204+
205+
if (n < 2)
206+
goto publish;
207+
208+
/* Last descriptor of the chain owns the buffer ID and is the status
209+
* descriptor; it must be device-writable with at least one byte.
164210
*/
165-
used_desc->id = desc->id;
166-
/* Single-writer slot until USED publishes, so a plain load of the
167-
* current flags is safe. Release-store the new value so id and len are
168-
* visible to the guest before the USED flag flip.
211+
const struct desc_snap *status_desc = &chain[n - 1];
212+
if (!(status_desc->flags & VRING_DESC_F_WRITE) || status_desc->len < 1)
213+
goto publish;
214+
status_ptr = vm_guest_buf(v, status_desc->addr, 1);
215+
if (!status_ptr)
216+
goto publish;
217+
218+
/* Header descriptor must be device-readable and span at least the
219+
* wire-format header.
169220
*/
170-
uint16_t new_flags =
171-
used_desc->flags ^ (uint16_t) (1U << VRING_PACKED_DESC_F_USED);
172-
__atomic_store_n(&used_desc->flags, new_flags, __ATOMIC_RELEASE);
221+
if ((chain[0].flags & VRING_DESC_F_WRITE) || chain[0].len < hdr_sz)
222+
goto publish;
223+
void *hdr = vm_guest_buf(v, chain[0].addr, hdr_sz);
224+
if (!hdr)
225+
goto publish;
226+
227+
struct virtio_blk_req req;
228+
memcpy(&req, hdr, hdr_sz);
229+
230+
if (req.type == VIRTIO_BLK_T_IN || req.type == VIRTIO_BLK_T_OUT) {
231+
bool needs_write = req.type == VIRTIO_BLK_T_IN;
232+
uint32_t writable = 0;
233+
status_byte = virtio_blk_handle_io(dev, v, &req, chain, n,
234+
needs_write, &writable);
235+
used_len = writable + 1;
236+
} else if (req.type == VIRTIO_BLK_T_FLUSH) {
237+
status_byte = diskimg_flush(dev->diskimg) < 0 ? VIRTIO_BLK_S_IOERR
238+
: VIRTIO_BLK_S_OK;
239+
used_len = 1;
240+
} else {
241+
status_byte = VIRTIO_BLK_S_UNSUPP;
242+
used_len = 1;
243+
}
244+
245+
publish:
246+
if (status_ptr)
247+
*status_ptr = status_byte;
248+
virtq_publish_used(head, buffer_id, used_len);
173249
__atomic_fetch_or(&dev->virtio_pci_dev.config.isr_cap.isr_status,
174250
VIRTIO_PCI_ISR_QUEUE, __ATOMIC_RELEASE);
175251
}
@@ -225,7 +301,11 @@ int virtio_blk_init_pci(struct virtio_blk_dev *virtio_blk_dev,
225301
virtio_pci_set_pci_hdr(dev, VIRTIO_PCI_DEVICE_ID_BLK, VIRTIO_BLK_PCI_CLASS,
226302
virtio_blk_dev->irq_num);
227303
virtio_pci_set_virtq(dev, virtio_blk_dev->vq, VIRTIO_BLK_VIRTQ_NUM);
228-
virtio_pci_add_feature(dev, 0);
304+
/* FLUSH is required for guest fsync to be honored: with the bit clear the
305+
* Linux driver runs in writeback-without-barrier mode and a host crash can
306+
* lose data the guest believed durable.
307+
*/
308+
virtio_pci_add_feature(dev, 1ULL << VIRTIO_BLK_F_FLUSH);
229309
virtio_pci_enable(dev);
230310
return 0;
231311
}
@@ -245,6 +325,10 @@ void virtio_blk_exit(struct virtio_blk_dev *dev)
245325
throw_err("Failed to wake virtio-blk worker");
246326
if (dev->vq_thread_started)
247327
pthread_join(dev->vq_avail_thread, NULL);
328+
/* Honor guest barrier semantics on clean shutdown: writes that came back as
329+
* VIRTIO_BLK_S_OK could still be in the host page cache.
330+
*/
331+
diskimg_flush(dev->diskimg);
248332
diskimg_exit(dev->diskimg);
249333
virtio_pci_exit(&dev->virtio_pci_dev);
250334
close(dev->irqfd);

src/virtio-blk.h

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,16 @@
1313
#define VIRTIO_BLK_VIRTQ_NUM 1
1414
#define VIRTIO_BLK_PCI_CLASS 0x018000
1515

16+
/* Wire-format header is the first three fields (type/reserved/sector); the
17+
* trailing host-only bookkeeping is filled in by the device emulator from the
18+
* descriptor chain and never read from guest memory.
19+
*/
1620
struct virtio_blk_req {
1721
uint32_t type;
1822
uint32_t reserved;
1923
uint64_t sector;
2024
uint8_t *data;
21-
uint16_t data_size;
25+
uint32_t data_size;
2226
uint8_t *status;
2327
};
2428

0 commit comments

Comments
 (0)