Commit 3831df1

author
T. Andrew Davis
committed
Add kernel-related skills for documentation and review processes
- Introduced `kernel-locking` skill to guide on lock types, usage, and deadlock avoidance in kernel development.
- Added `kernel-memory-ordering` skill focusing on memory barriers, atomic operations, and the C/Rust memory model.
- Created `kernel-review-checklist` skill to provide a comprehensive pre-merge checklist for kernel patches, ensuring code quality and safety.
- Established `kernel-unsafe-discipline` skill to outline best practices for using unsafe blocks and functions in kernel code.
- Implemented `lapic-ipi` skill detailing LAPIC and x2APIC driver work, including IPI delivery and initialization sequences.
- Developed `madt-topology` skill for ACPI MADT parsing, focusing on SMP topology and LAPIC ID extraction.
- Introduced `msr-setup` skill for MSR programming during CPU bring-up, covering essential MSRs and their configurations.
- Added `per-cpu-layout` skill to address per-CPU struct layout, GS-base ABI, and related assembly interactions.
1 parent 2c3ee54 commit 3831df1

66 files changed

Lines changed: 3378 additions & 553 deletions

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
applyTo: ['hwinit/src/cpu/**/*.rs', 'hwinit/asm/cpu/**/*.s']
---

# CPU Code: Real Hardware First

This instruction applies to all changes in `hwinit/src/cpu/` and CPU ASM.

## The Contract

Any change to CPU code **must be validated on real hardware before merging.** QEMU is a liar. It will let code slide that triple-faults on a real Xeon.

Symptoms that indicate you've broken something on real hardware that QEMU hides:

- AP triple-faults after SIPI
- System locks up after N cores come online
- Scheduler crashes when running on an AP (works on the BSP)
- Per-CPU data corruption (GS-base, stack, PerCpu fields read as garbage)
- Cache coherency bugs: one core writes, another doesn't see it (QEMU has perfect coherency)
- LAPIC base remapping: firmware moved it; code assumes the default `0xFEE0_0000`

## Before You Start

1. Know the real-hardware symptom your change might cause
2. Build a mental model of what can break in what you're touching
3. Read the relevant skill (see Relevant Skills below)

## While You Code

- Every `unsafe` block needs a `// SAFETY:` comment; this is non-negotiable
- Every `Atomic*` operation needs a deliberate ordering choice; document why
- Every new lock needs a reason (see `kernel-locking` skill)
- No `unwrap()` / `expect()` / `panic!()` outside init paths
- Memory barriers (`fence`, `mfence`, `SeqCst` operations) need a comment explaining what they pair with

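As an illustration of the commenting rules above, here is a minimal sketch of a flag-plus-payload handoff. The names `STACK_TOP`, `STACK_READY`, `publish_stack_top`, `read_stack_top`, and `read_u32` are hypothetical and do not come from `hwinit`; the point is only the shape of the ordering and `SAFETY` comments.

```rust
use core::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Hypothetical mailbox: the BSP publishes a stack top, an AP consumes it.
static STACK_TOP: AtomicU64 = AtomicU64::new(0);
static STACK_READY: AtomicBool = AtomicBool::new(false);

/// BSP side: store the payload, then publish the flag.
pub fn publish_stack_top(top: u64) {
    // Relaxed is enough for the payload: the Release store below orders it.
    STACK_TOP.store(top, Ordering::Relaxed);
    // Release: pairs with the Acquire load in `read_stack_top`, so an AP that
    // observes `true` also observes the STACK_TOP store above.
    STACK_READY.store(true, Ordering::Release);
}

/// AP side: returns the stack top once the BSP has published it.
pub fn read_stack_top() -> Option<u64> {
    // Acquire: pairs with the Release store in `publish_stack_top`.
    if STACK_READY.load(Ordering::Acquire) {
        Some(STACK_TOP.load(Ordering::Relaxed))
    } else {
        None
    }
}

/// Volatile read of one `u32`, e.g. a status word another core may update.
pub fn read_u32(slot: &u32) -> u32 {
    // SAFETY: a shared reference is always non-null, aligned, and valid for
    // reads of its pointee, which is all `read_volatile` requires here.
    unsafe { core::ptr::read_volatile(slot) }
}
```

Each ordering names its partner, so a reviewer can check the pair without reconstructing the protocol from scratch.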
## After You Code: Pre-Merge Gates

1. **Static checks pass**:
   - `cargo fmt --check` clean
   - `cargo clippy -- -D warnings` clean (or a justified `#[allow]`)
   - `cargo check` succeeds

2. **Real hardware validation**:
   - Boot on real hardware (4+ cores minimum)
   - Verify no crashes, hangs, or data corruption
   - Serial log shows clean boot (no ERR codes)
   - System remains stable for ≥30 seconds with all cores online

3. **Code review gates** (use skills):
   - Run through `kernel-unsafe-discipline` skill checklist
   - Run through `kernel-memory-ordering` skill checklist (if touching Atomic or barriers)
   - Run through `kernel-review-checklist` skill checklist

## Relevant Skills

Depending on what you're changing:

- **Any CPU code**: `kernel-coding-style`, `kernel-review-checklist`
- **AP bringup / GDT / MSR**: `/hardened-ap-bringup` prompt (major refactoring needed)
- **AP trampoline, GDT, TSS, per-CPU**: `ap-trampoline`, `gdt-tss`, `per-cpu-layout` skills
- **LAPIC, IPI, timers**: `lapic-ipi` skill
- **ACPI MADT, topology**: `madt-topology` skill
- **CPU features (SSE, AVX, SMEP)**: `cpuid-feature-gate` skill
- **MSR programming**: `msr-setup` skill
- **Locking or synchronization**: `kernel-locking`, `kernel-memory-ordering` skills
- **Unsafe blocks**: `kernel-unsafe-discipline` skill

## Known Issues to Avoid

- **CR3 above 4 GB**: the 32-bit trampoline stage can only load a 32-bit CR3, so the page-table root must sit below 4 GB
- **AP triple-fault on SIPI**: Usually CR3 wrong, GDT not accessible, or paging not set up
- **Per-CPU offset mismatch**: asm uses hardcoded `gs:[0x20]` but you moved the field; add the field to `debug_assert_offsets()`
- **APIC ID > 0xFF**: the xAPIC destination field is only 8 bits wide; larger IDs need x2APIC mode
- **Cache coherency**: QEMU hides coherency bugs; real hardware exposes them
- **TD_READY not set**: AP either crashed before reaching `ap_rust_entry`, or never incremented `AP_ONLINE_COUNT`

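For the per-CPU offset mismatch in particular, a compile-time check can catch a moved field before it ever reaches hardware. The following is a sketch only: the `PerCpu` fields, their order, and the `PERCPU_*` values are assumptions made for illustration, not the real layout. It relies on `core::mem::offset_of!`, which needs Rust 1.77 or newer.

```rust
use core::mem::offset_of;

// Hypothetical layout; the real `PerCpu` fields and order may differ.
#[repr(C)]
pub struct PerCpu {
    pub self_ptr: u64,     // 0x00
    pub core_idx: u32,     // 0x08
    pub lapic_id: u32,     // 0x0C
    pub kernel_rsp: u64,   // 0x10
    pub user_rsp: u64,     // 0x18
    pub current_task: u64, // 0x20, what the asm would read as gs:[0x20]
}

// Offsets the assembly hardcodes (assumed values for this sketch).
pub const PERCPU_KERNEL_RSP: usize = 0x10;
pub const PERCPU_CURRENT_TASK: usize = 0x20;

// Compile-time check: moving a field breaks the build instead of the boot.
const _: () = assert!(offset_of!(PerCpu, kernel_rsp) == PERCPU_KERNEL_RSP);
const _: () = assert!(offset_of!(PerCpu, current_task) == PERCPU_CURRENT_TASK);

/// Runtime counterpart, in the spirit of `debug_assert_offsets()`.
pub fn debug_assert_offsets() {
    debug_assert_eq!(offset_of!(PerCpu, kernel_rsp), PERCPU_KERNEL_RSP);
    debug_assert_eq!(offset_of!(PerCpu, current_task), PERCPU_CURRENT_TASK);
}
```

The `const _` assertions cost nothing at runtime; the runtime function is kept only because the document already refers to one by that name.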
## How to Debug Real-Hardware Failures

1. Add serial logging to every significant step in the affected code path
2. Use unique log codes (see `log_error("AP", CODE, "...")` pattern)
3. Boot on real hardware with verbose logging
4. Identify the last successful log before the crash
5. Audit the next function: unsafe blocks, memory ordering, lock contention
6. Reference the appropriate skill for that subsystem

## What NOT to Do

- Do not assume BSP behavior works for APs (GDT, IDT, MSRs, and GS-base are all per-core)
- Do not assume QEMU coherency works on real hardware
- Do not add a "TODO: fix on real hardware" comment; either fix it or open an issue
- Do not merge with `#[allow(...)]` warnings without a strong, documented reason
- Do not allocate DMA memory from interrupt context
- Do not hold a spinlock across an allocation or I/O operation

## Questions?

If you're uncertain about whether a change is safe for real hardware, invoke the relevant skill or ask an agent before coding.

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
---
description: 'Harden and modularize AP bringup and CPU init sequence. Use when: AP bringup succeeds on QEMU but fails or corrupts state on real hardware, need to audit boot ordering, refactor AP trampoline or per-CPU initialization, fix "works until scheduler starts" instability. Establishes diagnosis procedures, modularization boundaries, and validates real-hardware correctness.'
argument-hint: "specific symptom (e.g. 'AP triple-faults at syscall_init', 'scheduler crashes after AP4 comes online')"
---

# Harden and Modularize AP Bringup

## Problem Statement

The AP bringup sequence **works on QEMU but fails or corrupts state on real hardware**. Symptoms include:

- AP triple-faults after SIPI
- AP comes online but scheduler crashes
- System locks up after N cores are online
- Per-CPU data corruption (GS-base not set, offsets wrong)
- Boot hangs instead of timing out

The existing code in `hwinit/src/cpu/ap_boot.rs` has four problems:

1. **Coupling**: trampoline setup, data block fill, and boot sequencing all live in one function
2. **Missing validation**: no checks that per-CPU state is coherent after AP init
3. **Ordering assumptions**: code assumes BSP init paths happen before APs, but doesn't enforce it
4. **Race windows**: the timing of the `AP_ONLINE_COUNT` increment relative to scheduler activation is ambiguous

## Goals

1. **Diagnose real-hardware failure mode**: understand what breaks where
2. **Modularize**: separate concerns (trampoline, GDT/TSS, LAPIC, MSR, PerCpu init)
3. **Enforce ordering**: explicit state machine for BSP → AP handoff
4. **Harden**: add validation, assertions, bounded errors
5. **Real-hardware-first**: boot and validate on actual hardware before merging

33+
## Scope
34+
35+
**In scope**:
36+
- `hwinit/src/cpu/ap_boot.rs` — orchestration
37+
- `hwinit/src/cpu/gdt.rs` — per-AP GDT/TSS setup
38+
- `hwinit/src/cpu/per_cpu.rs` — PerCpu init, AP_ONLINE_COUNT semantics
39+
- `hwinit/asm/cpu/ap_trampoline.s` — real-mode→LM transition
40+
- `hwinit/build.rs` — trampoline assembly, binary validation
41+
- BSP init sequence (kernel entry → scheduler)
42+
43+
**Out of scope**:
44+
- Scheduler modifications
45+
- Interrupt delivery optimization
46+
- ACPI topology discovery (use what `start_aps_from_list` gets)
47+
48+
## Diagnosis Procedure
49+
50+
Start here. Don't code yet.
51+
52+
### Step 1: Reproduce on Real Hardware
53+
54+
1. Boot MorpheusX with 4+ CPU cores on real hardware (ThinkPad, Xeon, whatever is available)
55+
2. Note the exact failure point:
56+
- Does serial log show APs coming online?
57+
- Which core # first fails?
58+
- Does the crash happen at a fixed point or random?
59+
3. Add detailed logging to `ap_rust_entry`:
60+
```rust
61+
log_ok("AP", 520, "ap_rust_entry: entering");
62+
log_ok("AP", 521, "gdt_init done");
63+
log_ok("AP", 522, "idt_load done");
64+
// ... one log per major step
65+
```
66+
4. Identify the exact line/function where the AP dies.
67+
68+
### Step 2: Cross-Reference Against Symptoms
69+
70+
Use the skills:
71+
72+
- **kernel-unsafe-discipline**: Are all `unsafe` blocks justified? Check if AP copies from trampoline safely.
73+
- **kernel-memory-ordering**: Is AP_ONLINE_COUNT increment synchronized correctly? Is TD_STACK write visible to AP?
74+
- **per-cpu-layout**: Are PERCPU_* offsets correct? Compare `PerCpu` struct layout against `gs:[offset]` in asm.
75+
- **gdt-tss**: Is per-AP GDT/TSS allocated before SIPI? Is RSP0 in TSS set correctly?
76+
- **kernel-locking**: Is there a race between AP coming online and BSP starting scheduler?
77+
78+
### Step 3: Root Cause Categories
79+
80+
Map symptom to likely cause:
81+
82+
| Symptom | Likely Cause | Diagnosis |
83+
|---------|--------------|-----------|
84+
| Triple-fault after SIPI | CR3 wrong, GDT not accessible, paging broken | Add logging before SIPI; check `setup_trampoline` CR3 path |
85+
| AP hangs at TD_READY poll | Stack ptr wrong, AP crashes silently | Verify stack allocation in `boot_single_ap` |
86+
| Crash in `syscall_init` | STAR selector mismatch with GDT, LSTAR points to BSP code | Audit GDT slot ordering vs STAR constants |
87+
| Scheduler hangs after AP3 | Per-CPU offset mismatch, GS-base wrong | Run `debug_assert_offsets()` and check GS-base MSR write |
88+
| Data corruption | PerCpu read/write races, no synchronization | Check AP_ONLINE_COUNT timing — is it set before scheduler reads? |
89+
90+
## Modularization Proposal
91+
92+
Split `ap_boot.rs` into clearer stages:
93+
94+
### Stage 0: BSP Validation (new function)
95+
```rust
96+
unsafe fn validate_bsp_preconditions() -> Result<(), ApBootError> {
97+
// Assert: GDT loaded, IDT loaded, paging on, LAPIC online
98+
// Assert: memory registry ready
99+
// Assert: scheduler NOT started yet
100+
// Return error if any contract violated
101+
}
102+
```
103+
104+
### Stage 1: Trampoline Prep (refactor from `setup_trampoline`)
105+
```rust
106+
unsafe fn prepare_trampoline_once() -> Result<TrampolineHandle, ApBootError> {
107+
// Reserve 0x8000
108+
// Copy trampoline binary
109+
// Zero data block
110+
// Return handle so we don't repeat this for every AP
111+
// (problem now: every AP boot re-does this work)
112+
}
113+
```
114+
115+
### Stage 2: Per-AP Resource Allocation (new function)
116+
```rust
117+
struct ApResources {
118+
stack_base: u64,
119+
gdt: &'static mut [GdtEntry; GDT_SIZE],
120+
tss: &'static mut Tss,
121+
}
122+
123+
unsafe fn allocate_ap_resources(core_idx: u32) -> Result<ApResources, ApBootError> {
124+
// Allocate stack (no SIPI yet)
125+
// Allocate per-AP GDT
126+
// Allocate per-AP TSS
127+
// Fill GDT+TSS
128+
// Return — AP cannot run yet
129+
}
130+
```
131+
132+
### Stage 3: Pre-SIPI Data Handoff (new function)
133+
```rust
134+
unsafe fn write_trampoline_handoff(resources: &ApResources, lapic_id: u32, core_idx: u32) -> Result<(), ApBootError> {
135+
// Fill TD_STACK, TD_GDT_PTR, TD_ENTRY64, TD_CORE_IDX, TD_LAPIC_ID
136+
// Fence (Acquire): ensure all writes reach memory
137+
// Return
138+
}
139+
```
140+
141+
### Stage 4: INIT/SIPI Sequence (extract to new function)
142+
```rust
143+
unsafe fn send_init_sipi_sequence(lapic_id: u32) -> Result<(), ApBootError> {
144+
// INIT assert, wait, SIPI 1, wait, SIPI 2, wait
145+
// Bounded timeouts
146+
// Return
147+
}
148+
```
149+
150+
### Stage 5: AP Readiness Poll (extract to new function)
151+
```rust
152+
unsafe fn wait_ap_online(core_idx: u32, timeout_us: u64) -> Result<(), ApBootError> {
153+
// Poll AP_ONLINE_COUNT with timeout
154+
// Return early if timeout
155+
// Return error code (not just bool) so caller can log which AP failed
156+
}
157+
```
158+
159+
Each stage is now reviewable, testable, and has a single responsibility.
160+
161+
## Validation Checklist
162+
163+
Before merging any changes:
164+
165+
- [ ] **Real hardware**: boots with all cores online on real hardware
166+
- [ ] **Diagnostic logs**: every major step in AP init is logged with unique code
167+
- [ ] **Modularization**: each function ≤ 50 lines, one purpose per function
168+
- [ ] **Error handling**: every fallible operation returns `Result`, no `unwrap`
169+
- [ ] **Memory safety**: run through `kernel-unsafe-discipline` skill checklist
170+
- [ ] **Ordering**: all `Atomic*` operations audited for `Acquire`/`Release` pairing (see `kernel-memory-ordering`)
171+
- [ ] **ABI**: `debug_assert_offsets` passes, every PerCpu change updates PERCPU_* constants
172+
- [ ] **Locking**: no new spinlock deadlock vectors (check against `kernel-locking` skill)
173+
- [ ] **Code style**: pass `cargo fmt`, `cargo clippy -D warnings`, no dead code
174+
## Success Criteria

- [ ] Boot on real 4-core or 8-core system without hang / corruption
- [ ] Scheduler runs on all cores
- [ ] No spurious crashes after APs are online
- [ ] Serial log is clean (no error codes during normal boot)
- [ ] All cores remain online for ≥ 10 seconds (stress-test stability)

## Recommended Reading

Before writing code:

1. **ap_boot.rs**: read the entire file top-to-bottom
2. **ap-trampoline skill**: understand CR3, stack, GDT handoff
3. **per-cpu-layout skill**: verify your understanding of PerCpu offsets
4. **kernel-memory-ordering skill**: reason through `AP_ONLINE_COUNT` synchronization
5. **kernel-review-checklist skill**: use as final validation gate

## Procedure

1. Use Step 1 (Reproduce) to nail down the real-hardware failure
2. Use Step 2 (Cross-Reference) to pick the most likely root cause
3. Design modularization per the proposal above (don't code; draw boxes)
4. Implement Stage 0 validation (minimal, just asserts)
5. Refactor into Stages 1–5 without changing behavior (structural only)
6. Add logging per Step 1 diagnostics
7. Test on real hardware; iterate until stable
8. Run full validation checklist
9. Ensure all commits are self-contained and reviewable

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
---
name: ap-trampoline
description: 'AP trampoline authoring, debugging, and layout. Use when writing or fixing asm/cpu/ap_trampoline.s, real-mode to protected-mode to long-mode AP entry sequence, GDT/CR3/stack handoff from trampoline data block, TRAMPOLINE_DATA_OFFSET layout, TD_CR3/TD_STACK/TD_ENTRY64/TD_READY fields, trampoline binary included via include_bytes!, AP triple-fault on real hardware, page 0x8000 setup, AP_TRAMPOLINE_PHYS.'
argument-hint: "Trampoline task or symptom (e.g. 'AP triple-faults after SIPI')"
---

# AP Trampoline

## When to Use
- Writing or modifying `asm/cpu/ap_trampoline.s`
- Debugging AP triple-faults after SIPI (especially when QEMU works, real hardware dies)
- Changing the trampoline data block layout (`TRAMPOLINE_DATA_OFFSET`, `TD_*` offsets)
- Updating `build.rs` trampoline assembly step
- AP boot hangs at the `TD_READY` poll

## Key Files
- `hwinit/asm/cpu/ap_trampoline.s`: the trampoline source
- `hwinit/src/cpu/ap_boot.rs`: BSP-side setup (`setup_trampoline`, `boot_single_ap`)
- `hwinit/build.rs`: assembles the trampoline flat binary into `OUT_DIR/ap_trampoline.bin`

## Trampoline Data Block Contract

The data block lives at `AP_TRAMPOLINE_PHYS + 0xF00` (= `0x8F00`).
The BSP writes it before sending the SIPI; the trampoline reads it in real/protected mode.

| Offset | Size | Field | Written by | Read by |
|--------|------|-------|------------|---------|
| +0x00 | 8 | `TD_CR3` | BSP | trampoline (32-bit) |
| +0x08 | 8 | `TD_ENTRY64` | BSP | trampoline (64-bit jmp) |
| +0x10 | 8 | `TD_STACK` | BSP | trampoline (RSP setup) |
| +0x18 | 4 | `TD_CORE_IDX` | BSP | `ap_rust_entry` arg 0 |
| +0x1C | 4 | `TD_LAPIC_ID` | BSP | `ap_rust_entry` arg 1 |
| +0x20 | 10 | `TD_GDT_PTR` | BSP | trampoline LGDT |
| +0x30 | 4 | `TD_READY` | BSP (0), AP (1) | BSP poll in `boot_single_ap` |

If you add fields: keep 8-byte alignment, and update both the `.s` and `ap_boot.rs` constants.

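The table above can be mirrored on the Rust side so that a layout change fails the build. This is a sketch, not the existing `ap_boot.rs` code: the `TrampolineData` struct and its field names are hypothetical, and the reading of `TD_GDT_PTR` as a 2-byte limit plus 8-byte base is inferred from its 10-byte size. Only the offsets and the `0x8000` / `0xF00` values come from the contract. `offset_of!` needs Rust 1.77 or newer.

```rust
use core::mem::{offset_of, size_of};

pub const AP_TRAMPOLINE_PHYS: u64 = 0x8000;
pub const TRAMPOLINE_DATA_OFFSET: u64 = 0xF00;

/// Hypothetical mirror of the data block; `repr(C)` fixes order and padding.
#[repr(C)]
pub struct TrampolineData {
    pub cr3: u64,          // +0x00 TD_CR3
    pub entry64: u64,      // +0x08 TD_ENTRY64
    pub stack: u64,        // +0x10 TD_STACK
    pub core_idx: u32,     // +0x18 TD_CORE_IDX
    pub lapic_id: u32,     // +0x1C TD_LAPIC_ID
    pub gdt_ptr: [u8; 10], // +0x20 TD_GDT_PTR (assumed: limit + base)
    pub _pad: [u8; 6],     // explicit padding up to +0x30
    pub ready: u32,        // +0x30 TD_READY
}

// One compile-time assertion per row of the table.
const _: () = assert!(offset_of!(TrampolineData, cr3) == 0x00);
const _: () = assert!(offset_of!(TrampolineData, entry64) == 0x08);
const _: () = assert!(offset_of!(TrampolineData, stack) == 0x10);
const _: () = assert!(offset_of!(TrampolineData, core_idx) == 0x18);
const _: () = assert!(offset_of!(TrampolineData, lapic_id) == 0x1C);
const _: () = assert!(offset_of!(TrampolineData, gdt_ptr) == 0x20);
const _: () = assert!(offset_of!(TrampolineData, ready) == 0x30);

/// Physical address of the data block.
pub const fn data_block_phys() -> u64 {
    AP_TRAMPOLINE_PHYS + TRAMPOLINE_DATA_OFFSET
}

/// Size of the mirrored block, including tail padding.
pub const fn data_block_size() -> usize {
    size_of::<TrampolineData>()
}
```

The assembly side still needs its own matching constants; the struct only guards the Rust half of the contract.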
## Real-Mode → Protected → Long Mode Sequence

1. **Real mode**: CPU starts at `0x0800:0000` (physical `0x8000`), 16-bit. Load a flat 32-bit GDT (from `TD_GDT_PTR`), enable PE in CR0.
2. **32-bit protected**: Far-jump to reload CS. Load `TD_CR3` into CR3; it **must be below 4 GB** because a 32-bit `mov` cannot carry a wider value (the check is in `setup_trampoline`). Enable PAE in CR4. Set IA32_EFER.LME. Enable paging (CR0.PG).
3. **64-bit long mode**: Far-jump with 64-bit code selector. Load `TD_STACK` into RSP. Call `TD_ENTRY64` with `(TD_CORE_IDX, TD_LAPIC_ID)` in `edi`/`esi`.

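The address arithmetic behind step 1 and the CR3 limit in step 2 can be written down as a short sketch. The helper names `sipi_vector`, `start_cs`, and `cr3_fits_32bit` are hypothetical; the arithmetic itself is the standard SIPI rule that the start address is the vector times 4 KiB, entered with CS = vector << 8 and IP = 0.

```rust
pub const AP_TRAMPOLINE_PHYS: u64 = 0x8000;

/// SIPI vector for a trampoline page: the physical address divided by 4 KiB.
/// Returns `None` unless the address is page-aligned and below 1 MiB.
pub const fn sipi_vector(phys: u64) -> Option<u8> {
    if phys % 0x1000 != 0 || phys >= 0x10_0000 {
        return None;
    }
    Some((phys >> 12) as u8)
}

/// Real-mode CS the AP starts with: `vector << 8`, with IP = 0.
pub const fn start_cs(vector: u8) -> u16 {
    (vector as u16) << 8
}

/// True if `cr3` fits in the 32-bit register the protected-mode stage
/// uses to load CR3.
pub const fn cr3_fits_32bit(cr3: u64) -> bool {
    cr3 <= u32::MAX as u64
}
```

For the page at `0x8000` the vector is `0x08`, which gives CS `0x0800`; shifting CS left by four bits yields the physical start address again.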
## Common Failure Modes

| Symptom | Likely Cause |
|---------|-------------|
| Triple-fault immediately after SIPI | CR3 > 4 GB, or GDT not accessible from trampoline |
| AP hangs, never sets `TD_READY` | Stack pointer wrong (stack_top vs stack_base confusion) |
| Works in QEMU, dies on real hardware | Cache not flushed before CPU reads trampoline data; add `WBINVD` or ensure WB mapping |
| SIPI fires, AP starts, crashes in Rust | `TD_ENTRY64` address wrong; `ap_rust_entry` calling convention mismatch |
| Second SIPI causes double-init | Normal: the Intel MP spec requires two SIPIs for reliability |

## Procedure

1. Read the current `ap_trampoline.s` and note section labels and data offsets.
2. Cross-check `TRAMPOLINE_DATA_OFFSET` and `TD_*` constants against the `.s` layout.
3. Make changes, then verify `build.rs` still assembles it as a flat binary (no ELF header).
4. Confirm the assembled binary is ≤ `0xF00` bytes (data block starts there).
5. Test with an `AP_TRAMPOLINE_BIN.len()` assertion in `setup_trampoline`.
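
Steps 3 to 5 could be backed by checks along these lines. This is a sketch under assumptions: the two-byte `AP_TRAMPOLINE_BIN` here is a dummy stand-in so the example is self-contained (the real constant would presumably come from `include_bytes!` on the `OUT_DIR/ap_trampoline.bin` artifact), and `trampoline_fits` and `looks_like_elf` are hypothetical helper names.

```rust
pub const TRAMPOLINE_DATA_OFFSET: usize = 0xF00;

// Dummy stand-in for the assembled trampoline (x86 `cli; hlt`).
// In the real build this would be the flat binary produced by build.rs.
pub const AP_TRAMPOLINE_BIN: &[u8] = &[0xFA, 0xF4];

// Compile-time guard: code must end before the data block begins.
const _: () = assert!(AP_TRAMPOLINE_BIN.len() <= TRAMPOLINE_DATA_OFFSET);

/// True if the binary leaves the data block at +0xF00 untouched.
pub fn trampoline_fits(bin: &[u8]) -> bool {
    bin.len() <= TRAMPOLINE_DATA_OFFSET
}

/// True if the blob starts with the ELF magic, i.e. it is not a flat binary.
pub fn looks_like_elf(bin: &[u8]) -> bool {
    bin.starts_with(b"\x7fELF")
}
```

A flat binary that accidentally keeps its ELF header would otherwise be copied to `0x8000` and executed from the header bytes, so rejecting the magic early is cheap insurance.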
