Poprdi
diff --git a/.github/instructions/cpu-code-real-hardware.instructions.md b/.github/instructions/cpu-code-real-hardware.instructions.md
Lines changed: 97 additions & 0 deletions
diff --git a/.github/prompts/hardened-ap-bringup.prompt.md b/.github/prompts/hardened-ap-bringup.prompt.md
Lines changed: 203 additions & 0 deletions
diff --git a/.github/skills/ap-trampoline/SKILL.md b/.github/skills/ap-trampoline/SKILL.md
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,97 @@
+---
+applyTo: ['hwinit/src/cpu/**/*.rs', 'hwinit/asm/cpu/**/*.s']
+---
+
+# CPU Code — Real Hardware First
+
+This instruction applies to all changes in `hwinit/src/cpu/` and CPU ASM.
+
+## The Contract
+
+Any change to CPU code **must be validated on real hardware before merging.** QEMU is a liar. It will let code slide that triple-faults on a real Xeon.
+
+Symptoms that indicate you've broken something on real hardware that QEMU hides:
+
+- AP triple-faults after SIPI
+- System locks up after N cores come online
+- Scheduler crashes when running on AP (works on BSP)
+- Per-CPU data corruption (GS-base, stack, PerCpu fields read as garbage)
+- Cache coherency bugs: one core writes, another doesn't see it (QEMU has perfect coherency)
+- LAPIC base remapping: firmware moved it; code assumes default
+
+## Before You Start
+
+1. Know the real-hardware symptom your change might cause
+2. Build a mental model of what can break in what you're touching
+3. Read the relevant skill (see the list below)
+
+## While You Code
+
+- Every `unsafe` block needs a `// SAFETY:` comment — non-negotiable
+- Every `Atomic*` operation needs a deliberate ordering choice; document why
+- Every new lock needs a reason (see `kernel-locking` skill)
+- No `unwrap()` / `expect()` / `panic!()` outside init paths
+- Memory barriers (fence, mfence, SeqCst) need a comment explaining what they pair with
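Review note: the rules above are easier to apply with a concrete shape in mind. The following is a minimal userspace sketch, not code from this repository; `Handoff` and `read_first` are hypothetical names. It shows one `// SAFETY:` comment and one documented Release/Acquire pair.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical one-shot handoff slot (not a type from the repository).
pub struct Handoff {
    payload: AtomicU64,
    ready: AtomicBool,
}

impl Handoff {
    pub const fn new() -> Self {
        Self {
            payload: AtomicU64::new(0),
            ready: AtomicBool::new(false),
        }
    }

    /// Writer side: store the payload, then publish it.
    pub fn publish(&self, value: u64) {
        // Relaxed: ordering for this store comes from the Release store below.
        self.payload.store(value, Ordering::Relaxed);
        // Release: pairs with the Acquire load in `try_take`, so a reader that
        // sees `ready == true` also sees the payload written above.
        self.ready.store(true, Ordering::Release);
    }

    /// Reader side: returns the payload only once it has been published.
    pub fn try_take(&self) -> Option<u64> {
        // Acquire: pairs with the Release store in `publish`.
        if self.ready.load(Ordering::Acquire) {
            Some(self.payload.load(Ordering::Relaxed))
        } else {
            None
        }
    }
}

/// Reads the first element without a bounds check, to show the comment shape.
pub fn read_first(buf: &[u32]) -> Option<u32> {
    if buf.is_empty() {
        return None;
    }
    // SAFETY: `buf` is non-empty (checked above), so `as_ptr()` points at a
    // valid, aligned, initialized `u32` that lives as long as the borrow.
    Some(unsafe { *buf.as_ptr() })
}
```

Each ordering comment names its partner operation, which is what lets a reviewer check the pairing without reading the whole file.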
+
+## After You Code — Pre-Merge Gates
+
+1. **Static checks pass**:
+   - `cargo fmt --check` clean
+   - `cargo clippy -- -D warnings` clean (or justified allow)
+   - `cargo check` succeeds
+
+2. **Real hardware validation**:
+   - Boot on real hardware (4+ cores minimum)
+   - Verify no crashes, hangs, or data corruption
+   - Serial log shows clean boot (no ERR codes)
+   - System remains stable for ≥30 seconds with all cores online
+
+3. **Code review gates** (use skills):
+   - Run through `kernel-unsafe-discipline` skill checklist
+   - Run through `kernel-memory-ordering` skill checklist (if touching Atomic or barriers)
+   - Run through `kernel-review-checklist` skill checklist
+
+## Relevant Skills
+
+Depending on what you're changing:
+
+- **Any CPU code**: `kernel-coding-style`, `kernel-review-checklist`
+- **AP bringup / GDT / MSR**: `/hardened-ap-bringup` prompt (major refactoring needed)
+- **AP trampoline, GDT, TSS, per-CPU**: `ap-trampoline`, `gdt-tss`, `per-cpu-layout` skills
+- **LAPIC, IPI, timers**: `lapic-ipi` skill
+- **ACPI MADT, topology**: `madt-topology` skill
+- **CPU features (SSE, AVX, SMEP)**: `cpuid-feature-gate` skill
+- **MSR programming**: `msr-setup` skill
+- **Locking or synchronization**: `kernel-locking`, `kernel-memory-ordering` skills
+- **Unsafe blocks**: `kernel-unsafe-discipline` skill
+
+## Known Issues to Avoid
+
+- **CR3 above 4 GB**: 32-bit trampoline can only load 32-bit CR3
+- **AP triple-fault on SIPI**: Usually CR3 wrong, GDT not accessible, or paging not set up
+- **Per-CPU offset mismatch**: asm uses hardcoded `gs:[0x20]` but you moved the field; add to `debug_assert_offsets()`
+- **x2APIC ID > 0xFF**: xAPIC destination field is 8-bit only; need x2APIC mode for large IDs
+- **Cache coherency**: QEMU hides coherency bugs; real hardware exposes them
+- **TD_READY not set**: AP either crashed before reaching `ap_rust_entry`, or never incremented `AP_ONLINE_COUNT`
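Review note: two of the issues above (CR3 above 4 GB, APIC ID above 0xFF in xAPIC mode) can be rejected before any SIPI is sent. This is a sketch under assumptions: `PrecheckError`, `check_trampoline_cr3`, and `check_xapic_destination` are hypothetical names, and the real check in `setup_trampoline` may look different.

```rust
/// Hypothetical error type for pre-SIPI checks.
#[derive(Debug, PartialEq, Eq)]
pub enum PrecheckError {
    /// The page-table root does not fit in the 32-bit load done by the trampoline.
    Cr3Above4G(u64),
    /// The destination ID does not fit the 8-bit xAPIC destination field.
    ApicIdNeedsX2Apic(u32),
}

/// Returns the CR3 value narrowed to 32 bits, or an error if it would be truncated.
pub fn check_trampoline_cr3(cr3: u64) -> Result<u32, PrecheckError> {
    if cr3 > u32::MAX as u64 {
        Err(PrecheckError::Cr3Above4G(cr3))
    } else {
        Ok(cr3 as u32)
    }
}

/// Rejects APIC IDs above 0xFF unless x2APIC mode is in use.
pub fn check_xapic_destination(apic_id: u32, x2apic_enabled: bool) -> Result<(), PrecheckError> {
    if apic_id > 0xFF && !x2apic_enabled {
        Err(PrecheckError::ApicIdNeedsX2Apic(apic_id))
    } else {
        Ok(())
    }
}
```

Returning the offending value in the error lets the caller put it in the serial log, which matters when the machine cannot be debugged interactively.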
+
+## How to Debug Real-Hardware Failures
+
+1. Add serial logging to every significant step in the affected code path
+2. Use unique log codes (see `log_error("AP", CODE, "...")` pattern)
+3. Boot on real hardware with verbose logging
+4. Identify the last successful log before the crash
+5. Audit the next function: unsafe blocks, memory ordering, lock contention
+6. Reference the appropriate skill for that subsystem
+
+## What NOT to Do
+
+- Do not assume BSP behavior works for APs (GDT, IDT, MSR, and GS-base state is all per-core)
+- Do not assume QEMU coherency works on real hardware
+- Do not add a "TODO: fix on real hardware" comment — either fix it or open an issue
+- Do not merge with `#[allow(...)]` warnings without a strong reason documented
+- Do not allocate DMA memory from interrupt context
+- Do not hold a spinlock across an allocation or I/O operation
+
+## Questions?
+
+If you're uncertain about whether a change is safe for real hardware, invoke the relevant skill or ask an agent before coding.
@@ -0,0 +1,203 @@
+---
+description: 'Harden and modularize AP bringup and CPU init sequence. Use when: AP bringup succeeds on QEMU but fails or corrupts state on real hardware, need to audit boot ordering, refactor AP trampoline or per-CPU initialization, fix "works until scheduler starts" instability. Establishes diagnosis procedures, modularization boundaries, and validates real-hardware correctness.'
+argument-hint: "specific symptom (e.g. 'AP triple-faults at syscall_init', 'scheduler crashes after AP4 comes online')"
+---
+
+# Harden and Modularize AP Bringup
+
+## Problem Statement
+
+The AP bringup sequence **works on QEMU but fails/corrupts state on real hardware**. Symptoms include:
+
+- AP triple-faults after SIPI
+- AP comes online but scheduler crashes
+- System locks up after N cores are online
+- Per-CPU data corruption (GS-base not set, offsets wrong)
+- Boot hangs instead of timing out
+
+The existing code in `hwinit/src/cpu/ap_boot.rs` has:
+
+1. **Coupling**: trampoline setup, data block fill, and boot sequencing all live in one function
+2. **Missing validation**: no checks that per-CPU state is coherent after AP init
+3. **Ordering assumptions**: code assumes BSP init paths happen before APs, but doesn't enforce it
+4. **Race windows**: AP_ONLINE_COUNT increment timing is ambiguous with scheduler activation
+
+## Goals
+
+1. **Diagnose real-hardware failure mode** — understand what breaks where
+2. **Modularize**: separate concerns (trampoline, GDT/TSS, LAPIC, MSR, PerCpu init)
+3. **Enforce ordering**: explicit state machine for BSP → AP handoff
+4. **Harden**: add validation, assertions, bounded errors
+5. **Real-hardware-first**: boot and validate on actual hardware before merging
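Review note: goal 3 asks for an explicit state machine for the BSP → AP handoff. One possible shape is sketched below. The phase names and the `BootState` type are assumptions made for illustration; the prompt does not define them.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical boot phases, in the only order they may occur.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BootPhase {
    BspInit = 0,
    ApsStarting = 1,
    ApsOnline = 2,
    SchedulerRunning = 3,
}

/// Global phase tracker; transitions are single-step and checked.
pub struct BootState(AtomicU8);

impl BootState {
    pub const fn new() -> Self {
        Self(AtomicU8::new(BootPhase::BspInit as u8))
    }

    /// Current phase as its raw discriminant.
    pub fn current_raw(&self) -> u8 {
        // Acquire: pairs with the AcqRel compare-exchange in `advance`.
        self.0.load(Ordering::Acquire)
    }

    /// Moves from `from` to the next phase `to`.
    /// Returns `Err(current)` if the step skips a phase or `from` is stale.
    pub fn advance(&self, from: BootPhase, to: BootPhase) -> Result<(), u8> {
        if to as u8 != from as u8 + 1 {
            return Err(self.current_raw());
        }
        // AcqRel on success: publishes everything done in `from` to whoever
        // observes `to`. Acquire on failure: we only read the current value.
        self.0
            .compare_exchange(from as u8, to as u8, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
    }
}
```

With this shape, "scheduler started before all APs were online" becomes a failed `advance` call that can be logged, instead of an ordering assumption.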
+
+## Scope
+
+**In scope**:
+- `hwinit/src/cpu/ap_boot.rs` — orchestration
+- `hwinit/src/cpu/gdt.rs` — per-AP GDT/TSS setup
+- `hwinit/src/cpu/per_cpu.rs` — PerCpu init, AP_ONLINE_COUNT semantics
+- `hwinit/asm/cpu/ap_trampoline.s` — real-mode→LM transition
+- `hwinit/build.rs` — trampoline assembly, binary validation
+- BSP init sequence (kernel entry → scheduler)
+
+**Out of scope**:
+- Scheduler modifications
+- Interrupt delivery optimization
+- ACPI topology discovery (use what `start_aps_from_list` gets)
+
+## Diagnosis Procedure
+
+Start here. Don't code yet.
+
+### Step 1: Reproduce on Real Hardware
+
+1. Boot MorpheusX with 4+ CPU cores on real hardware (ThinkPad, Xeon, whatever is available)
+2. Note the exact failure point:
+   - Does serial log show APs coming online?
+   - Which core # first fails?
+   - Does the crash happen at a fixed point or random?
+3. Add detailed logging to `ap_rust_entry`:
+   ```rust
+   log_ok("AP", 520, "ap_rust_entry: entering");
+   log_ok("AP", 521, "gdt_init done");
+   log_ok("AP", 522, "idt_load done");
+   // ... one log per major step
+   ```
+4. Identify the exact line/function where the AP dies.
+
+### Step 2: Cross-Reference Against Symptoms
+
+Use the skills:
+
+- **kernel-unsafe-discipline**: Are all `unsafe` blocks justified? Check if AP copies from trampoline safely.
+- **kernel-memory-ordering**: Is AP_ONLINE_COUNT increment synchronized correctly? Is TD_STACK write visible to AP?
+- **per-cpu-layout**: Are PERCPU_* offsets correct? Compare `PerCpu` struct layout against `gs:[offset]` in asm.
+- **gdt-tss**: Is per-AP GDT/TSS allocated before SIPI? Is RSP0 in TSS set correctly?
+- **kernel-locking**: Is there a race between AP coming online and BSP starting scheduler?
+
+### Step 3: Root Cause Categories
+
+Map symptom to likely cause:
+
+| Symptom | Likely Cause | Diagnosis |
+|---------|--------------|-----------|
+| Triple-fault after SIPI | CR3 wrong, GDT not accessible, paging broken | Add logging before SIPI; check `setup_trampoline` CR3 path |
+| AP hangs at TD_READY poll | Stack ptr wrong, AP crashes silently | Verify stack allocation in `boot_single_ap` |
+| Crash in `syscall_init` | STAR selector mismatch with GDT, LSTAR points to BSP code | Audit GDT slot ordering vs STAR constants |
+| Scheduler hangs after AP3 | Per-CPU offset mismatch, GS-base wrong | Run `debug_assert_offsets()` and check GS-base MSR write |
+| Data corruption | PerCpu read/write races, no synchronization | Check AP_ONLINE_COUNT timing — is it set before scheduler reads? |
+
+## Modularization Proposal
+
+Split `ap_boot.rs` into clearer stages:
+
+### Stage 0: BSP Validation (new function)
+```rust
+unsafe fn validate_bsp_preconditions() -> Result<(), ApBootError> {
+    // Assert: GDT loaded, IDT loaded, paging on, LAPIC online
+    // Assert: memory registry ready
+    // Assert: scheduler NOT started yet
+    // Return error if any contract violated
+}
+```
+
+### Stage 1: Trampoline Prep (refactor from `setup_trampoline`)
+```rust
+unsafe fn prepare_trampoline_once() -> Result<TrampolineHandle, ApBootError> {
+    // Reserve 0x8000
+    // Copy trampoline binary
+    // Zero data block
+    // Return handle so we don't repeat this for every AP
+    // (problem now: every AP boot re-does this work)
+}
+```
+
+### Stage 2: Per-AP Resource Allocation (new function)
+```rust
+struct ApResources {
+    stack_base: u64,
+    gdt: &'static mut [GdtEntry; GDT_SIZE],
+    tss: &'static mut Tss,
+}
+
+unsafe fn allocate_ap_resources(core_idx: u32) -> Result<ApResources, ApBootError> {
+    // Allocate stack (no SIPI yet)
+    // Allocate per-AP GDT
+    // Allocate per-AP TSS
+    // Fill GDT+TSS
+    // Return — AP cannot run yet
+}
+```
+
+### Stage 3: Pre-SIPI Data Handoff (new function)
+```rust
+unsafe fn write_trampoline_handoff(resources: &ApResources, lapic_id: u32, core_idx: u32) -> Result<(), ApBootError> {
+    // Fill TD_STACK, TD_GDT_PTR, TD_ENTRY64, TD_CORE_IDX, TD_LAPIC_ID
+    // Fence (Release): order all handoff writes before the SIPI that follows
+    // Return
+}
+```
+
+### Stage 4: INIT/SIPI Sequence (extract to new function)
+```rust
+unsafe fn send_init_sipi_sequence(lapic_id: u32) -> Result<(), ApBootError> {
+    // INIT assert, wait, SIPI 1, wait, SIPI 2, wait
+    // Bounded timeouts
+    // Return
+}
+```
+
+### Stage 5: AP Readiness Poll (extract to new function)
+```rust
+unsafe fn wait_ap_online(core_idx: u32, timeout_us: u64) -> Result<(), ApBootError> {
+    // Poll AP_ONLINE_COUNT with a bounded timeout
+    // On timeout, return Err carrying an error code (not just a bool)
+    // so the caller can log which AP failed
+}
+```
+
+Each stage is now reviewable, testable, and has a single responsibility.
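Review note: the Stage 5 poll can be sketched in userspace as follows. This is an illustration only: a spin budget stands in for `timeout_us` (a kernel version would read a calibrated timer), and `WaitError` and `wait_online_bounded` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical error: which core timed out and what the counter read.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    Timeout { core_idx: u32, observed: u32 },
}

/// Spins until `counter >= expected` or the spin budget is exhausted.
pub fn wait_online_bounded(
    counter: &AtomicU32,
    expected: u32,
    core_idx: u32,
    max_spins: u32,
) -> Result<(), WaitError> {
    // Acquire: pairs with the Release increment the AP is assumed to perform
    // after it has finished its per-CPU setup.
    let mut observed = counter.load(Ordering::Acquire);
    let mut spins = 0u32;
    while observed < expected {
        if spins == max_spins {
            // Bounded failure: report instead of hanging the boot.
            return Err(WaitError::Timeout { core_idx, observed });
        }
        std::hint::spin_loop();
        spins += 1;
        observed = counter.load(Ordering::Acquire);
    }
    Ok(())
}
```

The error carries both the core index and the last observed count, so the log line distinguishes "AP never started" from "an earlier AP is still missing".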
+
+## Validation Checklist
+
+Before merging any changes:
+
+- [ ] **Real hardware**: boots with all cores online on real hardware
+- [ ] **Diagnostic logs**: every major step in AP init is logged with unique code
+- [ ] **Modularization**: each function ≤ 50 lines, one purpose per function
+- [ ] **Error handling**: every fallible operation returns `Result`, no `unwrap`
+- [ ] **Memory safety**: run through `kernel-unsafe-discipline` skill checklist
+- [ ] **Ordering**: all `Atomic*` operations audited for `Acquire`/`Release` pairing (see `kernel-memory-ordering`)
+- [ ] **ABI**: `debug_assert_offsets` passes, every PerCpu change updates PERCPU_* constants
+- [ ] **Locking**: no new spinlock deadlock vectors (check against `kernel-locking` skill)
+- [ ] **Code style**: pass `cargo fmt`, `cargo clippy -D warnings`, no dead code
+
+## Success Criteria
+
+- [ ] Boot on real 4-core or 8-core system without hang / corruption
+- [ ] Scheduler runs on all cores
+- [ ] No spurious crashes after APs are online
+- [ ] Serial log is clean (no error codes during normal boot)
+- [ ] All cores remain online for ≥ 10 seconds (stress-test stability)
+
+## Recommended Reading
+
+Before starting code:
+
+1. **ap_boot.rs**: read the entire file top-to-bottom
+2. **ap-trampoline skill**: understand CR3, stack, GDT handoff
+3. **per-cpu-layout skill**: verify your understanding of PerCpu offsets
+4. **kernel-memory-ordering skill**: reason through AP_ONLINE_COUNT synchronization
+5. **kernel-review-checklist skill**: use as final validation gate
+
+## Procedure
+
+1. Use Step 1 (Reproduce) to nail down the real-hardware failure
+2. Use Step 2 (Cross-Reference) to pick the most likely root cause
+3. Design modularization per the proposal above (don't code; draw boxes)
+4. Implement Stage 0 validation (minimal, just asserts)
+5. Refactor into Stages 1–5 without changing behavior (structural only)
+6. Add logging per Step 1 diagnostics
+7. Test on real hardware; iterate until stable
+8. Run full validation checklist
+9. Ensure all commits are self-contained and reviewable
@@ -0,0 +1,60 @@
+---
+name: ap-trampoline
+description: 'AP trampoline authoring, debugging, and layout. Use when writing or fixing asm/cpu/ap_trampoline.s, real-mode to protected-mode to long-mode AP entry sequence, GDT/CR3/stack handoff from trampoline data block, TRAMPOLINE_DATA_OFFSET layout, TD_CR3/TD_STACK/TD_ENTRY64/TD_READY fields, trampoline binary included via include_bytes!, AP triple-fault on real hardware, page 0x8000 setup, AP_TRAMPOLINE_PHYS.'
+argument-hint: "Trampoline task or symptom (e.g. 'AP triple-faults after SIPI')"
+---
+
+# AP Trampoline
+
+## When to Use
+- Writing or modifying `asm/cpu/ap_trampoline.s`
+- Debugging AP triple-faults after SIPI (especially when QEMU works, real hardware dies)
+- Changing the trampoline data block layout (`TRAMPOLINE_DATA_OFFSET`, `TD_*` offsets)
+- Updating `build.rs` trampoline assembly step
+- AP boot hangs at the `TD_READY` poll
+
+## Key Files
+- `hwinit/asm/cpu/ap_trampoline.s` — the trampoline source
+- `hwinit/src/cpu/ap_boot.rs` — BSP-side setup (`setup_trampoline`, `boot_single_ap`)
+- `hwinit/build.rs` — assembles the trampoline flat binary into `OUT_DIR/ap_trampoline.bin`
+
+## Trampoline Data Block Contract
+
+The data block lives at `AP_TRAMPOLINE_PHYS + 0xF00` (= `0x8F00`).
+The BSP writes before firing SIPI; the trampoline reads in real/protected mode.
+
+| Offset | Size | Field | Written by | Read by |
+|--------|------|-------|------------|---------|
+| +0x00 | 8 | `TD_CR3` | BSP | trampoline (32-bit) |
+| +0x08 | 8 | `TD_ENTRY64` | BSP | trampoline (64-bit jmp) |
+| +0x10 | 8 | `TD_STACK` | BSP | trampoline (RSP setup) |
+| +0x18 | 4 | `TD_CORE_IDX` | BSP | `ap_rust_entry` arg 0 |
+| +0x1C | 4 | `TD_LAPIC_ID` | BSP | `ap_rust_entry` arg 1 |
+| +0x20 | 10 | `TD_GDT_PTR` | BSP | trampoline LGDT |
+| +0x30 | 4 | `TD_READY` | BSP (0), AP (1) | BSP poll in `boot_single_ap` |
+
+If you add fields: keep 8-byte alignment, update both the `.s` and `ap_boot.rs` constants.
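Review note: the table can be mirrored as constants plus a consistency check, so a layout change that overlaps fields or runs past the page fails early. The constant values below are copied from the table; that `ap_boot.rs` spells them this way is an assumption, and `TD_FIELDS` and `layout_is_consistent` are hypothetical.

```rust
pub const AP_TRAMPOLINE_PHYS: usize = 0x8000;
pub const TRAMPOLINE_DATA_OFFSET: usize = 0xF00;

/// (name, offset within the data block, size in bytes), in table order.
pub const TD_FIELDS: [(&str, usize, usize); 7] = [
    ("TD_CR3", 0x00, 8),
    ("TD_ENTRY64", 0x08, 8),
    ("TD_STACK", 0x10, 8),
    ("TD_CORE_IDX", 0x18, 4),
    ("TD_LAPIC_ID", 0x1C, 4),
    ("TD_GDT_PTR", 0x20, 10),
    ("TD_READY", 0x30, 4),
];

/// Checks that fields are ascending, non-overlapping, aligned to
/// min(size rounded up to a power of two, 8), and inside the 4 KiB page.
pub fn layout_is_consistent(fields: &[(&str, usize, usize)]) -> bool {
    let mut prev_end = 0usize;
    for &(_name, offset, size) in fields {
        let align = size.next_power_of_two().min(8);
        if size == 0 || offset < prev_end || offset % align != 0 {
            return false;
        }
        prev_end = offset + size;
    }
    TRAMPOLINE_DATA_OFFSET + prev_end <= 0x1000
}
```

The same table could drive a comparison against symbols exported by the assembler, but that depends on how `build.rs` is set up and is not shown here.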
+
+## Real-Mode → Protected → Long Mode Sequence
+
+1. **Real mode**: CPU starts at `0x0800:0000` (physical `0x8000`), 16-bit. Load a flat 32-bit GDT (from `TD_GDT_PTR`), enable PE in CR0.
+2. **32-bit protected**: Far-jump to flush CS. Load `TD_CR3` into CR3; the value **must be below 4 GB** because the 32-bit code can only load a 32-bit CR3 (the check is in `setup_trampoline`). Enable PAE in CR4. Set IA32_EFER.LME. Enable paging (CR0.PG).
+3. **64-bit long mode**: Far-jump with 64-bit code selector. Load `TD_STACK` into RSP. Call `TD_ENTRY64` with `(TD_CORE_IDX, TD_LAPIC_ID)` in `edi`/`esi`.
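Review note: the real-mode entry address follows from the SIPI vector. A SIPI with vector `VV` starts the AP at `CS = VV00h`, `IP = 0`, which is physical address `VV000h`. The helpers below are hypothetical and only do the arithmetic; they do not exclude low-memory ranges that firmware may already occupy.

```rust
/// SIPI vector for a trampoline page, if the address is usable:
/// it must be 4 KiB aligned and below 1 MiB.
pub fn sipi_vector_for(phys: u32) -> Option<u8> {
    if phys % 0x1000 != 0 || phys >= 0x10_0000 {
        return None;
    }
    Some((phys >> 12) as u8)
}

/// Real-mode (CS, IP) the AP starts with for a given SIPI vector.
pub fn real_mode_entry(vector: u8) -> (u16, u16) {
    ((vector as u16) << 8, 0)
}

/// Real-mode linear address: segment * 16 + offset.
pub fn real_mode_linear(seg: u16, off: u16) -> u32 {
    ((seg as u32) << 4) + off as u32
}
```

For the trampoline page at `0x8000` this gives vector `0x08` and an entry of `0x0800:0000`.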
+
+## Common Failure Modes
+
+| Symptom | Likely Cause |
+|---------|-------------|
+| Triple-fault immediately after SIPI | CR3 > 4 GB, or GDT not accessible from trampoline |
+| AP hangs, never sets `TD_READY` | Stack pointer wrong (stack_top vs stack_base confusion) |
+| Works in QEMU, dies on real hardware | Cache not flushed before CPU reads trampoline data; add `WBINVD` or ensure WB mapping |
+| SIPI fires, AP starts, crashes in Rust | `TD_ENTRY64` address wrong; `ap_rust_entry` calling convention mismatch |
+| Second SIPI causes double-init | Normal — Intel MP spec requires two SIPIs for reliability |
+
+## Procedure
+
+1. Read the current `ap_trampoline.s` and note section labels and data offsets.
+2. Cross-check `TRAMPOLINE_DATA_OFFSET` and `TD_*` constants against the `.s` layout.
+3. Make changes, then verify `build.rs` still assembles it as a flat binary (no ELF header).
+4. Confirm the assembled binary is ≤ `0xF00` bytes (data block starts there).
+5. Test with `AP_TRAMPOLINE_BIN.len()` assertion in `setup_trampoline`.
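Review note: steps 3 to 5 can be folded into one check on the assembled bytes. This is a sketch, assuming the binary is available as a byte slice (for example via `include_bytes!`); `check_trampoline_bin` is a hypothetical helper, not a function from `ap_boot.rs`.

```rust
pub const TRAMPOLINE_DATA_OFFSET: usize = 0xF00;

/// Validates a trampoline image. On success returns the number of
/// spare bytes left before the data block.
pub fn check_trampoline_bin(bin: &[u8]) -> Result<usize, &'static str> {
    if bin.is_empty() {
        Err("trampoline binary is empty")
    } else if bin.starts_with(b"\x7fELF") {
        // A flat binary must start with code, not an ELF header.
        Err("trampoline binary has an ELF header; expected a flat binary")
    } else if bin.len() > TRAMPOLINE_DATA_OFFSET {
        Err("trampoline code overlaps the data block")
    } else {
        Ok(TRAMPOLINE_DATA_OFFSET - bin.len())
    }
}
```

Running the same check in `build.rs` would turn an oversized trampoline into a build failure instead of a boot-time one.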