[Draft] Fangrui/xc7k480t bring up#14

Open. mpskex wants to merge 3 commits into main from fangrui/xc7k480t-bring-up

mpskex (Owner) commented May 14, 2026

Bring up XC7K480T with 4 GB of on-board DDR.

Setup: 1K PEs with 32 × 8-bit elements per vector word

mpskex added 3 commits May 10, 2026 09:54
… clock

Three compounding fixes that together achieve formal timing closure
(WNS = +0.003 ns, 0 failing endpoints) for the K=32 MMALU on
xc7k480tffg1156-2 at 200 MHz fabric / 250 MHz PCIe.

## DataFeeder per-lane buffer_accum (src/main/scala/alu/mma/sa/dataFeeder.scala)

The original Pipe(Vec(n, SInt(accum_nbits.W)), 2*n-1) generated one valid
register (_v_reg) shared by all lanes, with fanout=1025 (n=32 lanes,
2n-1=63 pipeline stages).  Vivado placed this FF in the centre of the die
with 7.9 ns of net delay against an 8 ns two-cycle MCP budget, giving
WNS = −0.782 ns.

Replace with n individual Pipe(SInt(accum_nbits.W), 2*n-1) instances.
Chisel deduplicates to one Pipe63_SInt32 module × 32 instances.  Each
instance has its own private _v_reg chain; max fanout per valid signal
drops to 2.  Result: 94% reduction in failing endpoints (22,608→1,299),
WNS improves from −0.782 to −0.265 ns.
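
The refactor relies on n independent pipes behaving exactly like one vector-wide pipe. The following is a minimal behavioural sketch in Python, not the Chisel source: the `Pipe` class, `run_shared`, `run_per_lane`, and the small n=4 stimulus are all assumptions made for illustration. It checks the equivalence cycle by cycle.

```python
from collections import deque


class Pipe:
    """Behavioural model of a fixed-latency valid/data pipe (illustrative only)."""

    def __init__(self, latency):
        # One (valid, data) slot per register stage, all invalid at reset.
        self.stages = deque([(False, None)] * latency, maxlen=latency)

    def step(self, valid, data):
        """Advance one clock: return the oldest slot, then shift in the new one."""
        out = self.stages[0]
        self.stages.append((valid, data if valid else None))
        return out


def run_shared(latency, stimulus):
    """One pipe carries the whole vector behind a single valid chain."""
    pipe = Pipe(latency)
    return [pipe.step(valid, tuple(vec)) for valid, vec in stimulus]


def run_per_lane(n, latency, stimulus):
    """n pipes, one per lane, each with its own private valid chain."""
    pipes = [Pipe(latency) for _ in range(n)]
    outputs = []
    for valid, vec in stimulus:
        lanes = [pipe.step(valid, x) for pipe, x in zip(pipes, vec)]
        out_valid = lanes[0][0]  # every lane is fed the same valid input
        out_data = tuple(d for _, d in lanes) if out_valid else None
        outputs.append((out_valid, out_data))
    return outputs


n = 4                # small stand-in for the 32 lanes in this PR
latency = 2 * n - 1  # same 2n-1 relation as the DataFeeder pipe
stimulus = [(i % 3 != 0, [10 * i + k for k in range(n)]) for i in range(20)]

shared = run_shared(latency, stimulus)
per_lane = run_per_lane(n, latency, stimulus)
print("outputs identical:", shared == per_lane)
```

The model says nothing about placement or fanout. It only shows that splitting the valid chain per lane leaves the data and valid timing unchanged, which is what makes the change safe functionally.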

## 200 MHz fabric clock (ip/vivado/xc7k480t/)

Remaining 1,299 violations were dominated by paths that are fundamentally
limited by placement spread at 250 MHz (4 ns budget):
  - xdma_inst → dma_master_inst (563 src paths, placement-dependent)
  - calib_sync2_reg fanout resets (238 paths, 3.5–4.1 ns route delay)
  - MMPE accumulator carry chain (195 paths, ~4.3 ns logic + routing)
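
As a quick cross-check of "dominated by", the three listed groups can be summed against the 1,299 remaining violations. This sketch treats the per-group path counts as directly comparable to failing endpoints, which is an assumption; the dictionary keys are just labels copied from the list above.

```python
# Counts copied from the commit message; labels are for display only.
remaining_violations = 1299
groups = {
    "xdma_inst -> dma_master_inst": 563,
    "calib_sync2_reg fanout resets": 238,
    "MMPE accumulator carry chain": 195,
}

listed = sum(groups.values())
unlisted = remaining_violations - listed
share_pct = 100.0 * listed / remaining_violations

for name, count in groups.items():
    print(f"{name:32s} {count:5d}  {100.0 * count / remaining_violations:5.1f}%")
print(f"listed groups: {listed}, other paths: {unlisted}, share: {share_pct:.1f}%")
```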

Fix: insert clk_wiz_0 (MMCM, 250→200 MHz) and two AXI clock converters:
  - axi_cc_xdma_in: 128-bit AXI4, XDMA M_AXI (250 MHz) → fabric (200 MHz)
  - axi_cc_byp_in:   32-bit AXI4L, BYPASS proto_conv out → ctrl_lite

MMALU, DMA master, and ctrl_lite now run at fabric_aclk (200 MHz).
axi_clkconv_xdma and axi_clkconv_npu slave sides also move to 200 MHz.
The three-group set_clock_groups constraint correctly isolates:
  {userclk2, userclk1} | {clk_out1_clk_wiz_0_1} | {clk_pll_i, clk_pll_i_1}
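
The effect of the three-group constraint can be sketched as a small lookup. This is a model of the intent only, not Vivado's implementation. It assumes that paths between clocks in different groups are excluded from timing, that paths within a group stay timed, and that a clock listed in no group is unaffected. The function names are hypothetical.

```python
# Group membership copied from the constraint above.
CLOCK_GROUPS = [
    {"userclk2", "userclk1"},
    {"clk_out1_clk_wiz_0_1"},
    {"clk_pll_i", "clk_pll_i_1"},
]


def group_index(clock):
    """Return the index of the group containing `clock`, or None if unlisted."""
    for index, members in enumerate(CLOCK_GROUPS):
        if clock in members:
            return index
    return None


def is_timed(src_clock, dst_clock):
    """True if a path from src_clock to dst_clock is still analysed."""
    src, dst = group_index(src_clock), group_index(dst_clock)
    if src is None or dst is None:
        # Assumption: clocks outside every group are not cut from anything.
        return True
    return src == dst


print(is_timed("userclk2", "userclk1"))              # same group
print(is_timed("userclk2", "clk_out1_clk_wiz_0_1"))  # PCIe -> fabric crossing
print(is_timed("clk_pll_i", "clk_pll_i_1"))          # both MIG clocks
```

Crossings that the model reports as untimed are the ones that must be handled structurally, here by the AXI clock converters.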

At 200 MHz the 2-cycle MMALU MCP budget is 10 ns (was 8 ns), giving all
carry-chain and routing paths comfortable margin.  Result: WNS = +0.003 ns,
WHS = +0.017 ns, 0 failing endpoints.
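
The budget figures follow directly from the clock period. A minimal worked example, ignoring clock uncertainty, skew, and setup time (so the real requirement is slightly tighter than shown); the helper names are made up for this sketch:

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def mcp_budget_ns(freq_mhz, cycles):
    """Ideal path budget under an N-cycle multicycle-path constraint."""
    return cycles * period_ns(freq_mhz)


# "~4.3 ns logic + routing" quoted for the MMPE accumulator carry chain.
CARRY_CHAIN_NS = 4.3

for freq in (250, 200):
    single = period_ns(freq)
    double = mcp_budget_ns(freq, 2)
    margin = single - CARRY_CHAIN_NS
    print(f"{freq} MHz: period {single:.1f} ns, 2-cycle budget {double:.1f} ns, "
          f"single-cycle margin for a 4.3 ns path {margin:+.1f} ns")
```

A 4.3 ns path cannot fit in one 4 ns cycle at 250 MHz but does fit in one 5 ns cycle at 200 MHz, which matches the direction of the result reported above.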

## calib_sync MCP extension (constrs/npu_top.xdc)

Extended the existing 2-cycle MCP from c0/c1_calib_sync2_reg to also
cover dma_master_inst/* destinations (previously only mmalu_inst/*).
Diagnosis showed 238 paths from the same source fanout reaching
dma_master_inst acc_buf/out_buf CE pins. The source is a legitimately
slow-moving one-shot reset signal, so the 2-cycle constraint is
functionally safe.
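
The change in MCP coverage can be illustrated by matching destination names against the old and new pattern lists. This sketch uses Python's `fnmatch` as a stand-in for Vivado's pattern matching, which differs in detail, and the endpoint names are hypothetical examples rather than names taken from the netlist.

```python
from fnmatch import fnmatchcase

# Destination patterns before and after this change.
OLD_DEST_PATTERNS = ["mmalu_inst/*"]
NEW_DEST_PATTERNS = ["mmalu_inst/*", "dma_master_inst/*"]


def covered(endpoint, patterns):
    """True if the endpoint name matches any destination pattern."""
    return any(fnmatchcase(endpoint, pattern) for pattern in patterns)


# Hypothetical endpoint names, for illustration only.
endpoints = [
    "mmalu_inst/pe_0/acc_reg/CE",
    "dma_master_inst/acc_buf_reg/CE",
    "dma_master_inst/out_buf_reg/CE",
    "ctrl_lite_inst/ctrl_reg/D",
]

newly_covered = [
    e for e in endpoints
    if covered(e, NEW_DEST_PATTERNS) and not covered(e, OLD_DEST_PATTERNS)
]
still_single_cycle = [e for e in endpoints if not covered(e, NEW_DEST_PATTERNS)]

print("newly covered:", newly_covered)
print("still single-cycle:", still_single_cycle)
```

Keeping the pattern list explicit, rather than widening it to every destination, leaves unrelated endpoints under the default single-cycle requirement.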

Add docs/implementations/FPGA_XC7K480T.md covering the full FPGA
verification platform for the K=32 NPU on xc7k480tffg1156-2.

Document focuses on the timing closure refinement journey:

Architecture
- PCIe Gen2×8 XDMA + dual DDR3 MIG (ECC) + K=32 MMALU
- 200 MHz fabric clock (clk_wiz_0 MMCM from 250 MHz XDMA userclk2)
- Four clock domains: userclk2 (250 MHz), fabric_aclk (200 MHz), clk_pll_i and clk_pll_i_1 (133 MHz)
- AXI topology: clkconv-first (Tier-2.5), then dwidth at 133 MHz

Timing closure history table (K=16 baseline → formal closure)
- Documents each WNS/TNS/failing-endpoint data point
- Links each number to the specific architectural or constraint change

Key techniques documented
- Per-lane DataFeeder refactor: eliminates 1025-fanout Pipe(Vec) v_reg
- 200 MHz fabric insertion: clk_wiz_0 + axi_cc_xdma_in + axi_cc_byp_in
- Three-group set_clock_groups for correct inter-domain path exclusion
- calib_sync 2-cycle MCP covering both mmalu_inst/* and dma_master_inst/*

Build instructions: two-step Vivado batch flow (create_project.tcl → impl)
Final metrics: WNS=+0.003 ns, 0 failing endpoints, 18 MB bitstream
…alone build

## Summary

Full FPGA bring-up for the Chisel NPU on xc7k480tffg1156-2 PCIe carrier:
cold-boot root cause found and fixed, V0–V9 validation ladder passes 9/9,
standalone self-contained build (no external Vivado project needed).

## Cold-boot root causes (fixed)

Two independent bugs caused PCIe to fail to enumerate after cold power-on:

1. CONFIG_MODE=BPI16 in XDC set COR0 MATCH_CYCLE=2 and DRIVEDONE=1,
   delaying FPGA DONE past the AMD FCH link-training window.
   Fix: override CONFIG_MODE=SPIx1 + CONFIGRATE=3 before write_bitstream.
   Target COR0 = 0x02003fe5.

2. mig_sys_rst_n driven from axi_aresetn (PCIe-derived) created a
   chicken-and-egg deadlock: PCIe cannot train because MIG is in reset,
   but MIG only exits reset after PCIe trains.
   Fix: drive mig_sys_rst_n from board reset pin.
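
The second root cause is a cycle in the "waits for" relation between reset releases. A minimal sketch of that reasoning as a dependency-graph check follows. The node names are hypothetical labels for the events described above, and `find_cycle` is an ordinary depth-first search, not a tool from this repository.

```python
def find_cycle(deps):
    """Return a list of nodes forming a dependency cycle, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in deps.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # Back edge: the cycle runs from nxt to the top of the stack.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in list(deps):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Each key waits for every event in its list (hypothetical event names).
before = {
    "pcie_link_trained": ["mig_out_of_reset"],
    "mig_out_of_reset": ["mig_sys_rst_n_released"],
    "mig_sys_rst_n_released": ["axi_aresetn_released"],
    "axi_aresetn_released": ["pcie_link_trained"],
}
# After the fix, mig_sys_rst_n depends only on the board reset pin.
after = dict(before, mig_sys_rst_n_released=["board_reset_released"])

print("before:", find_cycle(before))
print("after: ", find_cycle(after))
```

Re-sourcing a single edge of the graph is enough to break the loop, which mirrors the one-line nature of the fix.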

## Bisect ladder (V0–V9, all 9/9 PASS)

Each step adds one NPU component on top of the reference XDMA+MIG design
to confirm cold-boot safety of each incremental addition:

  V0  baseline        Reference design, no changes
  V1  no_mb           Remove MicroBlaze + peripherals
  V2  bypass_ctrl     BYPASS path → ctrl_lite (AXI-Lite CTRL register)
  V3  mmcm            Fabric MMCM (125 MHz)
  V4  byp_cdc         BYPASS clock domain crossing
  V5  xdma_cc         XDMA→fabric clock domain crossing
  V6  no_smc          Remove SmartConnect; NPU flat AXI chain to MIG C0
  V7  dma_master      npu_dma_master → MIG C1
  V8  npu_stub        Chisel mmalu_stub.v (synthesis shape test)
  V9  npu_full        Real Chisel top.sv (K=32, N=8, firtool-1.62.1)

SmartConnect architectural note: the reference SmartConnect handles
200→133 MHz CDC internally via its aclk1 port. Routing M00_AXI through
an external converter breaks aclk1 domain inference. V6 removes the
SmartConnect entirely, replacing it with the NPU's own flat converter
chain (axi_cc_xdma_in → axi_clkconv_xdma → axi_dwidth_xdma → MIG C0).

## Standalone build (no external project needed)

bootstrap_project.tcl creates ip/vivado/xc7k480t/proj/ from committed
XCI sources using recreate_bd.tcl (generated via write_bd_tcl from the
final V9 BD state). Key bootstrap steps:
  - source recreate_bd.tcl (creates BD from TCL, no design_info lock-in)
  - generate_target all (regenerates IP HDL and OOC synthesis)
  - wait for ALL OOC synthesis runs before write_checkpoint
  - copy IP DCPs from IP_OUTPUT_DIR (gen/) to IP_DIR (srcs/) so
    link_design in impl can resolve them

## Files added

ip/vivado/xc7k480t/
  scripts/           build_v0..v9, migrate_lib, bootstrap, apply procs
  src/               npu_ctrl_lite.v, npu_dma_master.v, npu_subsys.v, stub
  constrs/           IO_Port.xdc
  README.md          FPGA bring-up guide

ip/vivado/xc7k480t.reference/
  src/bd/            top.bd, top.bda, 14× IP XCI files, mig_a.prj
  scripts/           recreate_bd.tcl, build.tcl, create_project.tcl
  constrs/           IO_Port.xdc

driver/linux/        XDMA kernel driver source mirror (Xilinx, GPL-2.0)
tool/hw/             Serial console, flash, bringup, smoke test scripts
pytest.ini           hw + slow marker declarations
AGENTS.md            Repo guidance for AI agents
docs/implementations/FPGA_XC7K480T.md  FPGA implementation notes