[Draft] Fangrui/xc7k480t bring up by mpskex · Pull Request #14 · mpskex/chisel-npu

mpskex · 2026-05-14T15:54:14Z

Bring up XC7K480T with 4 GB of on-board DDR.

Setup: 1K PEs, with 32 × 8-bit elements per vector word.

… clock

Three compounding fixes that together achieve formal timing closure (WNS = +0.003 ns, 0 failing endpoints) for the K=32 MMALU on xc7k480tffg1156-2 at 200 MHz fabric / 250 MHz PCIe.

## DataFeeder per-lane buffer_accum (src/main/scala/alu/mma/sa/dataFeeder.scala)

The original Pipe(Vec(n, SInt(accum_nbits.W)), 2*n-1) generated a single shared valid register (_v_reg) with fanout=1025 (n=32 lanes × 2n-1=63 pipeline stages). Vivado placed this FF in the centre of the die with 7.9 ns of net delay against an 8 ns two-cycle MCP budget → WNS = -0.782 ns.

Replace with n individual Pipe(SInt(accum_nbits.W), 2*n-1) instances. Chisel deduplicates to one Pipe63_SInt32 module × 32 instances. Each instance has its own private _v_reg chain; max fanout per valid signal drops to 2.

Result: 94% reduction in failing endpoints (22,608→1,299), WNS improves from -0.782 to -0.265 ns.

## 200 MHz fabric clock (ip/vivado/xc7k480t/)

The remaining 1,299 violations were dominated by paths that are fundamentally limited by placement spread at 250 MHz (4 ns budget):
- xdma_inst → dma_master_inst (563 src paths, placement-dependent)
- calib_sync2_reg fanout resets (238 paths, 3.5–4.1 ns route delay)
- MMPE accumulator carry chain (195 paths, ~4.3 ns logic + routing)

Fix: insert clk_wiz_0 (MMCM, 250→200 MHz) and two AXI clock converters:
- axi_cc_xdma_in: 128-bit AXI4, XDMA M_AXI (250 MHz) → fabric (200 MHz)
- axi_cc_byp_in: 32-bit AXI4L, BYPASS proto_conv out → ctrl_lite

MMALU, DMA master, and ctrl_lite now run at fabric_aclk (200 MHz). The slave sides of axi_clkconv_xdma and axi_clkconv_npu also move to 200 MHz.

The three-group set_clock_groups constraint correctly isolates:
{userclk2, userclk1} | {clk_out1_clk_wiz_0_1} | {clk_pll_i, clk_pll_i_1}

At 200 MHz the 2-cycle MMALU MCP budget is 10 ns (was 8 ns), giving all carry-chain and routing paths comfortable margin.

Result: WNS = +0.003 ns, WHS = +0.017 ns, 0 failing endpoints.
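As an illustration of the three-group isolation described above, a constraint of this shape would separate the PCIe, fabric, and MIG clock domains. This is only a sketch: the clock names are the ones quoted in the commit message, but the `-asynchronous` flag is an assumption (the message does not say which relationship flag the real constraint uses), and the actual line in the repository's XDC may differ.

```tcl
# Sketch only. Clock names are taken from the commit message above.
# Assumption: the groups are declared asynchronous to each other; the
# commit message does not state which set_clock_groups flag is used.
set_clock_groups -asynchronous \
    -group [get_clocks {userclk2 userclk1}] \
    -group [get_clocks clk_out1_clk_wiz_0_1] \
    -group [get_clocks {clk_pll_i clk_pll_i_1}]
```

With a constraint like this, paths between groups are excluded from timing analysis, so every crossing has to go through a real synchroniser such as the AXI clock converters listed above.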
## calib_sync MCP extension (constrs/npu_top.xdc)

Extended the existing 2-cycle MCP from c0/c1_calib_sync2_reg to also cover dma_master_inst/* destinations (previously only mmalu_inst/*). Diagnosis showed 238 paths from the same source fanout reaching dma_master_inst acc_buf/out_buf CE pins. The source is a legitimately slow-moving one-shot reset signal, so the 2-cycle constraint is functionally safe.
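A 2-cycle multicycle path of the kind described here might be written roughly as follows. The cell-name patterns are hypothetical stand-ins for the real hierarchy in constrs/npu_top.xdc, and the matching hold constraint is an assumption: the commit message only mentions the 2-cycle setup relaxation.

```tcl
# Sketch only. The name patterns below are illustrative assumptions;
# the real instance paths in constrs/npu_top.xdc may differ.
set calib_src [get_cells -hierarchical -filter {NAME =~ "*calib_sync2_reg*"}]
set slow_dst  [get_cells -hierarchical \
    -filter {NAME =~ "*mmalu_inst/*" || NAME =~ "*dma_master_inst/*"}]

# Allow two destination-clock cycles for setup on these paths.
set_multicycle_path 2 -setup -from $calib_src -to $slow_dst

# Assumption: a matching hold constraint of N-1 keeps the hold check at
# the launch edge, which is the usual companion to a setup multicycle.
set_multicycle_path 1 -hold -from $calib_src -to $slow_dst
```

At the 200 MHz fabric clock, a 2-cycle setup constraint gives these paths 2 × 5 ns = 10 ns, compared with 2 × 4 ns = 8 ns at 250 MHz.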

Add docs/implementations/FPGA_XC7K480T.md covering the full FPGA verification platform for the K=32 NPU on xc7k480tffg1156-2. The document focuses on the timing closure refinement journey:

Architecture
- PCIe Gen2×8 XDMA + dual DDR3 MIG (ECC) + K=32 MMALU
- 200 MHz fabric clock (clk_wiz_0 MMCM from 250 MHz XDMA userclk2)
- Four clock domains: userclk2 (250 MHz), fabric_aclk (200 MHz), clk_pll_i (133 MHz)
- AXI topology: clkconv-first (Tier-2.5), then dwidth at 133 MHz

Timing closure history table (K=16 baseline → formal closure)
- Documents each WNS/TNS/failing-endpoint data point
- Links each number to the specific architectural or constraint change

Key techniques documented
- Per-lane DataFeeder refactor: eliminates 1025-fanout Pipe(Vec) v_reg
- 200 MHz fabric insertion: clk_wiz_0 + axi_cc_xdma_in + axi_cc_byp_in
- Three-group set_clock_groups for correct inter-domain path exclusion
- calib_sync 2-cycle MCP covering both mmalu_inst/* and dma_master_inst/*

Build instructions: two-step Vivado batch flow (create_project.tcl → impl)

Final metrics: WNS = +0.003 ns, 0 failing endpoints, 18 MB bitstream

…alone build

## Summary

Full FPGA bring-up for the Chisel NPU on xc7k480tffg1156-2 PCIe carrier: cold-boot root cause found and fixed, V0–V9 validation ladder passes 9/9, standalone self-contained build (no external Vivado project needed).

## Cold-boot root causes (fixed)

Two independent bugs caused PCIe to fail to enumerate after cold power-on:

1. CONFIG_MODE=BPI16 in XDC set COR0 MATCH_CYCLE=2 and DRIVEDONE=1, delaying FPGA DONE past the AMD FCH link-training window. Fix: override CONFIG_MODE=SPIx1 + CONFIGRATE=3 before write_bitstream. Target COR0 = 0x02003fe5.
2. mig_sys_rst_n driven from axi_aresetn (PCIe-derived) created a chicken-and-egg deadlock: PCIe cannot train because MIG is in reset, but MIG only exits reset after PCIe trains. Fix: drive mig_sys_rst_n from the board reset pin.

## Bisect ladder (V0–V9, all 9/9 PASS)

Each step adds one NPU component on top of the reference XDMA+MIG design to confirm cold-boot safety of each incremental addition:

| Step | Name | Change |
|------|------|--------|
| V0 | baseline | Reference design, no changes |
| V1 | no_mb | Remove MicroBlaze + peripherals |
| V2 | bypass_ctrl | BYPASS path → ctrl_lite (AXI-Lite CTRL register) |
| V3 | mmcm | Fabric MMCM (125 MHz) |
| V4 | byp_cdc | BYPASS clock domain crossing |
| V5 | xdma_cc | XDMA→fabric clock domain crossing |
| V6 | no_smc | Remove SmartConnect; NPU flat AXI chain to MIG C0 |
| V7 | dma_master | npu_dma_master → MIG C1 |
| V8 | npu_stub | Chisel mmalu_stub.v (synthesis shape test) |
| V9 | npu_full | Real Chisel top.sv (K=32, N=8, firtool-1.62.1) |

SmartConnect architectural note: the reference SmartConnect handles 200→133 MHz CDC internally via its aclk1 port. Routing M00_AXI through an external converter breaks aclk1 domain inference. V6 removes the SmartConnect entirely, replacing it with the NPU's own flat converter chain (axi_cc_xdma_in → axi_clkconv_xdma → axi_dwidth_xdma → MIG C0).

## Standalone build (no external project needed)

bootstrap_project.tcl creates ip/vivado/xc7k480t/proj/ from committed XCI sources using recreate_bd.tcl (generated via write_bd_tcl from the final V9 BD state).
Key bootstrap steps:
- source recreate_bd.tcl (creates BD from TCL, no design_info lock-in)
- generate_target all (regenerates IP HDL and OOC synthesis)
- wait for ALL OOC synthesis runs before write_checkpoint
- copy IP DCPs from IP_OUTPUT_DIR (gen/) to IP_DIR (srcs/) so link_design in impl can resolve them

## Files added

- ip/vivado/xc7k480t/
  - scripts/: build_v0..v9, migrate_lib, bootstrap, apply procs
  - src/: npu_ctrl_lite.v, npu_dma_master.v, npu_subsys.v, stub
  - constrs/: IO_Port.xdc
  - README.md: FPGA bring-up guide
- ip/vivado/xc7k480t.reference/
  - src/bd/: top.bd, top.bda, 14× IP XCI files, mig_a.prj
  - scripts/: recreate_bd.tcl, build.tcl, create_project.tcl
  - constrs/: IO_Port.xdc
- driver/linux/: XDMA kernel driver source mirror (Xilinx, GPL-2.0)
- tool/hw/: Serial console, flash, bringup, smoke test scripts
- pytest.ini: hw + slow marker declarations
- AGENTS.md: Repo guidance for AI agents
- docs/implementations/FPGA_XC7K480T.md: FPGA implementation notes
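For cold-boot root cause 1, the configuration override applied before write_bitstream could look something like the sketch below. The property values are the ones named in the commit message. The output file name is hypothetical, and whether the repository's build script sets exactly these properties in this form is an assumption.

```tcl
# Sketch only. Run on the open implemented design, before write_bitstream.
# Values (SPIx1, CONFIGRATE 3) come from the commit message.
set_property CONFIG_MODE SPIx1 [current_design]
set_property BITSTREAM.CONFIG.CONFIGRATE 3 [current_design]

# Hypothetical output name; the real build script may use a different path.
write_bitstream -force npu_top.bit
```

Placing the override immediately before write_bitstream means it takes effect even if an earlier XDC file still sets CONFIG_MODE to BPI16, since the later property assignment wins.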

mpskex added 3 commits May 10, 2026 09:54

mpskex wants to merge 3 commits into main from fangrui/xc7k480t-bring-up
