This guide covers how to emulate CXL device hot-unplug events and inject faults in QEMU for testing system resilience.
Your QEMU instance must be started with the QMP (QEMU Machine Protocol) or HMP (Human Monitor Protocol) interface enabled:
# Add to QEMU command line:
-monitor tcp:127.0.0.1:1234,server,nowait
# Or for QMP (JSON-based):
-qmp tcp:127.0.0.1:4444,server,nowait
The CXL device must be hotpluggable. In your launch_vm_nogdb.sh, the device is already set up correctly:
-device cxl-type1,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-type1-0
The id=cxl-type1-0 parameter is required for hot-unplug operations, because device_del addresses the device by its id.
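Before running any of the steps in this guide, it can help to confirm that the monitor sockets configured above are accepting connections. The following is a minimal sketch, not part of cxl_hotplug_test.sh; the helper name port_open is hypothetical, and the ports are the ones from the command-line options shown earlier.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: it only tests reachability and closes the
    connection immediately, without sending any monitor command.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling port_open("127.0.0.1", 1234) checks the HMP monitor and port_open("127.0.0.1", 4444) checks QMP. Each monitor socket serves one client at a time, so run the check before attaching nc, telnet, or a test script.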
The cxl_hotplug_test.sh script provides an interactive interface:
# Automatic test mode
bash cxl_hotplug_test.sh auto
# Interactive mode
bash cxl_hotplug_test.sh
# Using netcat
nc localhost 1234
# Or using telnet
telnet localhost 1234
(qemu) info pci
# Find your CXL device (look for bus 0d:00.0)
(qemu) device_del cxl-type1-0
# Device is now being removed
(qemu) info pci
# Verify device is gone
(qemu) device_add cxl-type1,id=cxl-type1-0,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1
# Device is added back
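The manual unplug and replug sequence above can be generated programmatically, so that the device_add line always matches the device's original properties. This is a sketch only; the function names are hypothetical, and the driver name and properties are the ones used in this guide.

```python
def hmp_device_add(driver: str, dev_id: str, **props: str) -> str:
    """Build an HMP device_add line from a driver, an id, and properties."""
    parts = [driver, f"id={dev_id}"]
    # Keyword arguments keep their insertion order (Python 3.7+).
    parts += [f"{key}={value}" for key, value in props.items()]
    return "device_add " + ",".join(parts)


def hmp_unplug_replug(driver: str, dev_id: str, **props: str) -> list:
    """Return the HMP lines for one unplug/replug cycle with checks."""
    return [
        "info pci",                                # device present
        f"device_del {dev_id}",                    # request removal
        "info pci",                                # device should be gone
        hmp_device_add(driver, dev_id, **props),   # add it back
        "info pci",                                # device present again
    ]
```

For example, hmp_unplug_replug("cxl-type1", "cxl-type1-0", bus="root_port13", memdev="cxl-mem1", lsa="cxl-lsa1") yields the same five commands as the session above, ready to be written to the monitor socket one line at a time.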
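QEMU supports PCIe AER error injection for testing error handling. Run help pcie_aer_inject_error in the monitor first to confirm the argument syntax and error names your build accepts: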
(qemu) pcie_aer_inject_error -c id=cxl-type1-0,error=COR_INTERNAL
Available correctable errors:
- COR_INTERNAL - Internal error
- COR_BAD_TLP - Bad TLP
- COR_BAD_DLLP - Bad DLLP
- COR_REPLAY_TIMER - Replay timer timeout
- COR_REPLAY_ROLLOVER - Replay rollover
(qemu) pcie_aer_inject_error -u id=cxl-type1-0,error=UNCOR_POISON_TLP
Available uncorrectable errors:
- UNCOR_POISON_TLP - Poisoned TLP
- UNCOR_UNSUPPORTED - Unsupported request
- UNCOR_ECRC - ECRC error
- UNCOR_MALFORMED_TLP - Malformed TLP
- UNCOR_COMPLETION_ABORT - Completion abort
(qemu) pcie_aer_inject_error -u -f id=cxl-type1-0,error=UNCOR_FLOW_CTRL
Inject poison into CXL memory to simulate memory corruption:
(qemu) cxl-inject-poison 0x0 0x40
# Inject poison at address 0x0, size 0x40 bytes
(qemu) cxl-inject-uncorrectable-error
(qemu) cxl-inject-correctable-error
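When driving poison injection from a script, the request can also be sent over the QMP socket. Upstream QEMU defines cxl-inject-poison as a QMP command for Type 3 devices with path, start, and length arguments. Whether the cxl-type1 device in this setup accepts it is an assumption, as is the 64-byte alignment check, which mirrors the cache-line granularity upstream enforces. The function name poison_command is hypothetical.

```python
CACHE_LINE = 64  # assumed poison granularity in bytes


def poison_command(dev_id: str, start: int, length: int) -> dict:
    """Build a QMP cxl-inject-poison request for a device given by id.

    Devices created with id=<dev_id> appear in the QOM tree under
    /machine/peripheral/<dev_id>, which is used as the path argument.
    """
    if length <= 0 or start % CACHE_LINE or length % CACHE_LINE:
        raise ValueError("start and length must be positive multiples of 64")
    return {
        "execute": "cxl-inject-poison",
        "arguments": {
            "path": f"/machine/peripheral/{dev_id}",
            "start": start,
            "length": length,
        },
    }
```

poison_command("cxl-type1-0", 0x0, 0x40) describes the same range as the monitor example above: one 64-byte line at offset 0. Serialize the result with json.dumps and send it after the qmp_capabilities handshake.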
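Simulate link down/up events. Upstream QEMU documents set_link for network adapters, so these commands assume a build that also accepts it for the CXL device: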
(qemu) set_link cxl-type1-0 off
# Link is now down
(qemu) set_link cxl-type1-0 on
# Link is back up
QEMU has built-in error injection capabilities. To enable them, add to QEMU command line:
-device pcie-root-port,id=root_port13,chassis=0,slot=0,aer=on
The aer=on option enables Advanced Error Reporting on the root port.
You can manipulate the memory backend to simulate various failures:
# Remove memory backend (advanced)
(qemu) object_del cxl-mem1
# This will cause memory access failures
For testing unexpected device removal:
# In host terminal
# Find QEMU process
ps aux | grep qemu
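# Send device removal via QMP (QMP rejects commands until qmp_capabilities has been sent)
printf '%s\n' '{"execute":"qmp_capabilities"}' \
'{"execute":"device_del", "arguments":{"id":"cxl-type1-0"}}' | \
nc localhost 4444
# 1. Inside VM - ensure no active I/O to CXL device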
sync
echo 1 > /sys/bus/pci/devices/0000:0d:00.0/remove
# 2. In QEMU monitor
device_del cxl-type1-0
# 3. Verify in VM
lspci | grep CXL
# Should show device is gone
# 1. Inside VM - actively using CXL device
dd if=/dev/cxl-mem of=/dev/null bs=1M &
# 2. In QEMU monitor - immediately remove device
device_del cxl-type1-0
# 3. Check kernel logs
dmesg | tail -50
# Look for error handling
# 1. Inside VM - start I/O workload
fio --name=test --rw=randwrite --bs=4k --size=1G \
--filename=/mnt/cxl_mount/testfile &
# 2. In QEMU monitor - inject errors
pcie_aer_inject_error -u id=cxl-type1-0,error=UNCOR_POISON_TLP
# 3. Monitor behavior
dmesg -w
# Create a script to repeatedly toggle link state
for i in {1..10}; do
echo "set_link cxl-type1-0 off" | nc localhost 1234
sleep 2
echo "set_link cxl-type1-0 on" | nc localhost 1234
sleep 2
done
# Watch PCI devices
watch -n 1 'lspci | grep CXL'
# Monitor kernel logs
dmesg -w | grep -i 'cxl\|pci\|aer'
# Check PCIe AER status
lspci -vvv -s 0d:00.0 | grep -A 20 'Advanced Error Reporting'
# Check CXL device status
ls -l /sys/bus/cxl/devices/
cat /sys/bus/cxl/devices/*/health_status
# List all devices
info qtree
# Show PCI topology
info pci
# Show QOM tree
info qom-tree
# List the properties the device type accepts
device_add cxl-type1,help
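For scripted runs, the guest-side log filter from the monitoring commands above (dmesg -w | grep -i 'cxl\|pci\|aer') can be reproduced in Python, which makes it easier to assert on specific messages afterwards. This is a minimal sketch; the function name is hypothetical and the pattern simply mirrors the grep expression.

```python
import re

# Same case-insensitive alternation as: grep -i 'cxl\|pci\|aer'
KERNEL_LOG_PATTERN = re.compile(r"cxl|pci|aer", re.IGNORECASE)


def filter_kernel_log(lines):
    """Return the log lines that mention CXL, PCI, or AER."""
    return [
        line.rstrip("\n")
        for line in lines
        if KERNEL_LOG_PATTERN.search(line)
    ]
```

The input can be any iterable of lines, for example the captured output of dmesg split with splitlines(). Like the grep expression, the pattern matches substrings, so pcieport and pciehp lines are included.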
Create a Python script for automated testing:
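#!/usr/bin/env python3
"""Hot-unplug the CXL device over QMP, wait, then hot-plug it back."""
import json
import socket
import time


def qmp_command(sock, reader, cmd):
    """Send one QMP command and return its reply, skipping async events."""
    sock.sendall(json.dumps(cmd).encode() + b'\n')
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError('QMP connection closed')
        msg = json.loads(line)
        # Events carry an 'event' key; replies carry 'return' or 'error'
        if 'return' in msg or 'error' in msg:
            return msg


sock = socket.create_connection(('localhost', 4444))
reader = sock.makefile('r')
# Read the greeting banner
reader.readline()
# Send capability negotiation
qmp_command(sock, reader, {'execute': 'qmp_capabilities'})
# Hot-unplug device
qmp_command(sock, reader, {
    'execute': 'device_del',
    'arguments': {'id': 'cxl-type1-0'}
})
# Give the guest time to process the removal
time.sleep(5)
# Hot-plug device back
qmp_command(sock, reader, {
    'execute': 'device_add',
    'arguments': {
        'driver': 'cxl-type1',
        'id': 'cxl-type1-0',
        'bus': 'root_port13',
        'memdev': 'cxl-mem1',
        'lsa': 'cxl-lsa1'
    }
})
sock.close()
When a CXL device is hot-unplugged, the kernel should: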
- Detect device removal via PCIe hotplug event
- Call the device driver's .remove() callback
- Flush any pending I/O operations
- Unmap memory regions
- Remove sysfs entries
- Log the event in dmesg
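From the host side, the end of the sequence above is visible on the QMP socket: device_del only requests removal, and QEMU emits a DEVICE_DELETED event once the guest has released the device. The sketch below separates complete QMP messages from a partially received buffer and checks for that event. The function names are hypothetical, and the code assumes QEMU's default one-message-per-line output.

```python
import json


def split_qmp_messages(buffer: bytes):
    """Split newline-delimited QMP output into (messages, remainder).

    The remainder is the trailing partial line, to be prepended to the
    next chunk read from the socket.
    """
    *complete, remainder = buffer.split(b"\n")
    messages = [json.loads(line) for line in complete if line.strip()]
    return messages, remainder


def device_deleted(messages, dev_id: str) -> bool:
    """Return True if a DEVICE_DELETED event names the given device id."""
    return any(
        msg.get("event") == "DEVICE_DELETED"
        and msg.get("data", {}).get("device") == dev_id
        for msg in messages
    )
```

A test script can keep reading from the socket until device_deleted(messages, "cxl-type1-0") returns True, rather than sleeping for a fixed interval before issuing device_add.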
# Inside VM
cat /proc/kallsyms | grep cxl | grep remove
# Check if device is in use
lsof | grep cxl
# Force removal (use with caution)
echo 1 > /sys/bus/pci/devices/0000:0d:00.0/remove
# Check if monitor port is open
netstat -tuln | grep 1234
# Restart QEMU with monitor enabled
# Add to launch script: -monitor tcp:127.0.0.1:1234,server,nowait
# Ensure AER is enabled in root port
# Modify launch script to add: aer=on to root port device
# Check kernel AER support
zcat /proc/config.gz | grep CONFIG_PCIEAER
The cxl_hotplug_test.sh script provides all these capabilities in an easy-to-use interface:
# Test hot-unplug with automatic recovery
QEMU_MONITOR_PORT=1234 bash cxl_hotplug_test.sh auto
# Interactive testing
bash cxl_hotplug_test.sh
- PCIe Base Specification (AER)
- CXL 3.0 Specification (Error Handling)
- QEMU Documentation: docs/pcie_aer.txt
- Linux Kernel: drivers/cxl/