|
| 1 | +--- |
| 2 | +description: 'Harden and modularize AP bringup and CPU init sequence. Use when: AP bringup succeeds on QEMU but fails or corrupts state on real hardware, need to audit boot ordering, refactor AP trampoline or per-CPU initialization, fix "works until scheduler starts" instability. Establishes diagnosis procedures, modularization boundaries, and validates real-hardware correctness.' |
| 3 | +argument-hint: "specific symptom (e.g. 'AP triple-faults at syscall_init', 'scheduler crashes after AP4 comes online')" |
| 4 | +--- |
| 5 | + |
| 6 | +# Harden and Modularize AP Bringup |
| 7 | + |
| 8 | +## Problem Statement |
| 9 | + |
| 10 | +The AP bringup sequence **works on QEMU but fails/corrupts state on real hardware**. Symptoms include: |
| 11 | + |
| 12 | +- AP triple-faults after SIPI |
| 13 | +- AP comes online but scheduler crashes |
| 14 | +- System locks up after N cores are online |
| 15 | +- Per-CPU data corruption (GS-base not set, offsets wrong) |
| 16 | +- Boot hangs instead of timing out |
| 17 | + |
| 18 | +The existing code in `hwinit/src/cpu/ap_boot.rs` has: |
| 19 | + |
| 20 | +1. **Coupling**: trampoline setup, data block fill, and boot sequencing all live in one function |
| 21 | +2. **Missing validation**: no checks that per-CPU state is coherent after AP init |
| 22 | +3. **Ordering assumptions**: code assumes BSP init paths happen before APs, but doesn't enforce it |
| 23 | +4. **Race windows**: AP_ONLINE_COUNT increment timing is ambiguous with scheduler activation |
| 24 | + |
| 25 | +## Goals |
| 26 | + |
| 27 | +1. **Diagnose real-hardware failure mode** — understand what breaks where |
| 28 | +2. **Modularize**: separate concerns (trampoline, GDT/TSS, LAPIC, MSR, PerCpu init) |
| 29 | +3. **Enforce ordering**: explicit state machine for BSP → AP handoff |
| 30 | +4. **Harden**: add validation, assertions, bounded errors |
| 31 | +5. **Real-hardware-first**: boot and validate on actual hardware before merging |
| 32 | + |
| 33 | +## Scope |
| 34 | + |
| 35 | +**In scope**: |
| 36 | +- `hwinit/src/cpu/ap_boot.rs` — orchestration |
| 37 | +- `hwinit/src/cpu/gdt.rs` — per-AP GDT/TSS setup |
| 38 | +- `hwinit/src/cpu/per_cpu.rs` — PerCpu init, AP_ONLINE_COUNT semantics |
| 39 | +- `hwinit/asm/cpu/ap_trampoline.s` — real-mode→LM transition |
| 40 | +- `hwinit/build.rs` — trampoline assembly, binary validation |
| 41 | +- BSP init sequence (kernel entry → scheduler) |
| 42 | + |
| 43 | +**Out of scope**: |
| 44 | +- Scheduler modifications |
| 45 | +- Interrupt delivery optimization |
| 46 | +- ACPI topology discovery (use what `start_aps_from_list` gets) |
| 47 | + |
| 48 | +## Diagnosis Procedure |
| 49 | + |
| 50 | +Start here. Don't code yet. |
| 51 | + |
| 52 | +### Step 1: Reproduce on Real Hardware |
| 53 | + |
| 54 | +1. Boot MorpheusX with 4+ CPU cores on real hardware (ThinkPad, Xeon, whatever is available) |
| 55 | +2. Note the exact failure point: |
| 56 | + - Does serial log show APs coming online? |
| 57 | + - Which core # first fails? |
| 58 | + - Does the crash happen at a fixed point or random? |
| 59 | +3. Add detailed logging to `ap_rust_entry`: |
| 60 | + ```rust |
| 61 | + log_ok("AP", 520, "ap_rust_entry: entering"); |
| 62 | + log_ok("AP", 521, "gdt_init done"); |
| 63 | + log_ok("AP", 522, "idt_load done"); |
| 64 | + // ... one log per major step |
| 65 | + ``` |
| 66 | +4. Identify the exact line/function where the AP dies. |
| 67 | + |
| 68 | +### Step 2: Cross-Reference Against Symptoms |
| 69 | + |
| 70 | +Use the skills: |
| 71 | + |
| 72 | +- **kernel-unsafe-discipline**: Are all `unsafe` blocks justified? Check if AP copies from trampoline safely. |
| 73 | +- **kernel-memory-ordering**: Is AP_ONLINE_COUNT increment synchronized correctly? Is TD_STACK write visible to AP? |
| 74 | +- **per-cpu-layout**: Are PERCPU_* offsets correct? Compare `PerCpu` struct layout against `gs:[offset]` in asm. |
| 75 | +- **gdt-tss**: Is per-AP GDT/TSS allocated before SIPI? Is RSP0 in TSS set correctly? |
| 76 | +- **kernel-locking**: Is there a race between AP coming online and BSP starting scheduler? |
| 77 | + |
| 78 | +### Step 3: Root Cause Categories |
| 79 | + |
| 80 | +Map symptom to likely cause: |
| 81 | + |
| 82 | +| Symptom | Likely Cause | Diagnosis | |
| 83 | +|---------|--------------|-----------| |
| 84 | +| Triple-fault after SIPI | CR3 wrong, GDT not accessible, paging broken | Add logging before SIPI; check `setup_trampoline` CR3 path | |
| 85 | +| AP hangs at TD_READY poll | Stack ptr wrong, AP crashes silently | Verify stack allocation in `boot_single_ap` | |
| 86 | +| Crash in `syscall_init` | STAR selector mismatch with GDT, LSTAR points to BSP code | Audit GDT slot ordering vs STAR constants | |
| 87 | +| Scheduler hangs after AP3 | Per-CPU offset mismatch, GS-base wrong | Run `debug_assert_offsets()` and check GS-base MSR write | |
| 88 | +| Data corruption | PerCpu read/write races, no synchronization | Check AP_ONLINE_COUNT timing — is it set before scheduler reads? | |
| 89 | + |
| 90 | +## Modularization Proposal |
| 91 | + |
| 92 | +Split `ap_boot.rs` into clearer stages: |
| 93 | + |
| 94 | +### Stage 0: BSP Validation (new function) |
| 95 | +```rust |
| 96 | +unsafe fn validate_bsp_preconditions() -> Result<(), ApBootError> { |
| 97 | + // Assert: GDT loaded, IDT loaded, paging on, LAPIC online |
| 98 | + // Assert: memory registry ready |
| 99 | + // Assert: scheduler NOT started yet |
| 100 | + // Return error if any contract violated |
| 101 | +} |
| 102 | +``` |
| 103 | + |
| 104 | +### Stage 1: Trampoline Prep (refactor from `setup_trampoline`) |
| 105 | +```rust |
| 106 | +unsafe fn prepare_trampoline_once() -> Result<TrampolineHandle, ApBootError> { |
| 107 | + // Reserve 0x8000 |
| 108 | + // Copy trampoline binary |
| 109 | + // Zero data block |
| 110 | + // Return handle so we don't repeat this for every AP |
| 111 | + // (problem now: every AP boot re-does this work) |
| 112 | +} |
| 113 | +``` |
| 114 | + |
| 115 | +### Stage 2: Per-AP Resource Allocation (new function) |
| 116 | +```rust |
| 117 | +struct ApResources { |
| 118 | + stack_base: u64, |
| 119 | + gdt: &'static mut [GdtEntry; GDT_SIZE], |
| 120 | + tss: &'static mut Tss, |
| 121 | +} |
| 122 | + |
| 123 | +unsafe fn allocate_ap_resources(core_idx: u32) -> Result<ApResources, ApBootError> { |
| 124 | + // Allocate stack (no SIPI yet) |
| 125 | + // Allocate per-AP GDT |
| 126 | + // Allocate per-AP TSS |
| 127 | + // Fill GDT+TSS |
| 128 | + // Return — AP cannot run yet |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +### Stage 3: Pre-SIPI Data Handoff (new function) |
| 133 | +```rust |
| 134 | +unsafe fn write_trampoline_handoff(resources: &ApResources, lapic_id: u32, core_idx: u32) -> Result<(), ApBootError> { |
| 135 | + // Fill TD_STACK, TD_GDT_PTR, TD_ENTRY64, TD_CORE_IDX, TD_LAPIC_ID |
| 136 | + // Fence (Acquire): ensure all writes reach memory |
| 137 | + // Return |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +### Stage 4: INIT/SIPI Sequence (extract to new function) |
| 142 | +```rust |
| 143 | +unsafe fn send_init_sipi_sequence(lapic_id: u32) -> Result<(), ApBootError> { |
| 144 | + // INIT assert, wait, SIPI 1, wait, SIPI 2, wait |
| 145 | + // Bounded timeouts |
| 146 | + // Return |
| 147 | +} |
| 148 | +``` |
| 149 | + |
| 150 | +### Stage 5: AP Readiness Poll (extract to new function) |
| 151 | +```rust |
| 152 | +unsafe fn wait_ap_online(core_idx: u32, timeout_us: u64) -> Result<(), ApBootError> { |
| 153 | + // Poll AP_ONLINE_COUNT with timeout |
| 154 | + // Return early if timeout |
| 155 | + // Return error code (not just bool) so caller can log which AP failed |
| 156 | +} |
| 157 | +``` |
| 158 | + |
| 159 | +Each stage is now reviewable, testable, and has a single responsibility. |
| 160 | + |
| 161 | +## Validation Checklist |
| 162 | + |
| 163 | +Before merging any changes: |
| 164 | + |
| 165 | +- [ ] **Real hardware**: boots with all cores online on real hardware |
| 166 | +- [ ] **Diagnostic logs**: every major step in AP init is logged with unique code |
| 167 | +- [ ] **Modularization**: each function ≤ 50 lines, one purpose per function |
| 168 | +- [ ] **Error handling**: every fallible operation returns `Result`, no `unwrap` |
| 169 | +- [ ] **Memory safety**: run through `kernel-unsafe-discipline` skill checklist |
| 170 | +- [ ] **Ordering**: all `Atomic*` operations audited for `Acquire`/`Release` pairing (see `kernel-memory-ordering`) |
| 171 | +- [ ] **ABI**: `debug_assert_offsets` passes, every PerCpu change updates PERCPU_* constants |
| 172 | +- [ ] **Locking**: no new spinlock deadlock vectors (check against `kernel-locking` skill) |
| 173 | +- [ ] **Code style**: pass `cargo fmt`, `cargo clippy -D warnings`, no dead code |
| 174 | + |
| 175 | +## Success Criteria |
| 176 | + |
| 177 | +- [ ] Boot on real 4-core or 8-core system without hang / corruption |
| 178 | +- [ ] Scheduler runs on all cores |
| 179 | +- [ ] No spurious crashes after APs are online |
| 180 | +- [ ] Serial log is clean (no error codes during normal boot) |
| 181 | +- [ ] All cores remain online for ≥ 10 seconds (stress-test stability) |
| 182 | + |
| 183 | +## Recommended Reading |
| 184 | + |
| 185 | +Before starting code: |
| 186 | + |
| 187 | +1. **ap_boot.rs**: read the entire file top-to-bottom |
| 188 | +2. **ap-trampoline skill**: understand CR3, stack, GDT handoff |
| 189 | +3. **per-cpu-layout skill**: verify your understanding of PerCpu offsets |
| 190 | +4. **kernel-memory-ordering skill**: reason through AP_ONLINE_COUNT synchronization |
| 191 | +5. **kernel-review-checklist skill**: use as final validation gate |
| 192 | + |
| 193 | +## Procedure |
| 194 | + |
| 195 | +1. Use Step 1 (Reproduce) to nail down the real-hardware failure |
| 196 | +2. Use Step 2 (Cross-Reference) to pick the most likely root cause |
| 197 | +3. Design modularization per the proposal above (don't code; draw boxes) |
| 198 | +4. Implement Stage 0 validation (minimal, just asserts) |
| 199 | +5. Refactor into Stages 1–5 without changing behavior (structural only) |
| 200 | +6. Add logging per Step 1 diagnostics |
| 201 | +7. Test on real hardware; iterate until stable |
| 202 | +8. Run full validation checklist |
| 203 | +9. Ensure all commits are self-contained and reviewable |
0 commit comments