[Draft] Fangrui/xc7k480t bring up #14
Open
mpskex wants to merge 3 commits into
… clock
Three compounding fixes that together achieve formal timing closure
(WNS = +0.003 ns, 0 failing endpoints) for the K=32 MMALU on
xc7k480tffg1156-2 at 200 MHz fabric / 250 MHz PCIe.
## DataFeeder per-lane buffer_accum (src/main/scala/alu/mma/sa/dataFeeder.scala)
The original Pipe(Vec(n, SInt(accum_nbits.W)), 2*n-1) generated a single
shared valid register (_v_reg) with fanout=1025 (n=32 lanes × 2n-1=63
pipeline stages). Vivado placed this FF in the centre of the die with
7.9 ns of net delay against an 8 ns two-cycle MCP budget → WNS=-0.782 ns.
Replace with n individual Pipe(SInt(accum_nbits.W), 2*n-1) instances.
Chisel deduplicates to one Pipe63_SInt32 module × 32 instances. Each
instance has its own private _v_reg chain; max fanout per valid signal
drops to 2. Result: 94% reduction in failing endpoints (22,608→1,299),
WNS improves from −0.782 to −0.265 ns.
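The before/after figures quoted above can be recomputed with a few lines of Python. This is only an arithmetic check on the numbers in this commit message; the helper name `reduction_percent` is hypothetical and not part of the repo.

```python
def reduction_percent(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Figures quoted in the commit message (not re-measured here).
failing_before, failing_after = 22_608, 1_299
wns_before_ns, wns_after_ns = -0.782, -0.265

endpoint_reduction = reduction_percent(failing_before, failing_after)
wns_gain_ns = wns_after_ns - wns_before_ns

print(f"failing endpoints: {failing_before} -> {failing_after} "
      f"({endpoint_reduction:.1f}% fewer)")
print(f"WNS gain: {wns_gain_ns:+.3f} ns")
```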
## 200 MHz fabric clock (ip/vivado/xc7k480t/)
Remaining 1,299 violations were dominated by paths that are fundamentally
limited by placement spread at 250 MHz (4 ns budget):
- xdma_inst → dma_master_inst (563 src paths, placement-dependent)
- calib_sync2_reg fanout resets (238 paths, 3.5–4.1 ns route delay)
- MMPE accumulator carry chain (195 paths, ~4.3 ns logic + routing)
Fix: insert clk_wiz_0 (MMCM, 250→200 MHz) and two AXI clock converters:
- axi_cc_xdma_in: 128-bit AXI4, XDMA M_AXI (250 MHz) → fabric (200 MHz)
- axi_cc_byp_in: 32-bit AXI4L, BYPASS proto_conv out → ctrl_lite
MMALU, DMA master, and ctrl_lite now run at fabric_aclk (200 MHz).
axi_clkconv_xdma and axi_clkconv_npu slave sides also move to 200 MHz.
The three-group set_clock_groups constraint correctly isolates:
{userclk2, userclk1} | {clk_out1_clk_wiz_0_1} | {clk_pll_i, clk_pll_i_1}
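As a rough mental model of what the three groups exclude, the sketch below treats the constraint as an asynchronous grouping: paths between clocks in different groups are not timed, paths within a group are. Whether the XDC uses -asynchronous is an assumption (the commit message does not say), and `is_timed` is a hypothetical helper, not a Vivado API.

```python
from typing import List, Optional, Set

# Groups as listed in the commit message.
CLOCK_GROUPS: List[Set[str]] = [
    {"userclk2", "userclk1"},
    {"clk_out1_clk_wiz_0_1"},
    {"clk_pll_i", "clk_pll_i_1"},
]


def group_of(clock: str) -> Optional[int]:
    """Index of the group containing `clock`, or None if it is ungrouped."""
    for index, group in enumerate(CLOCK_GROUPS):
        if clock in group:
            return index
    return None


def is_timed(src_clock: str, dst_clock: str) -> bool:
    """True if a path from src_clock to dst_clock is still timed.

    Simplified model: clocks outside every group stay timed against all
    others; clocks in different groups are cut from each other.
    """
    src_group, dst_group = group_of(src_clock), group_of(dst_clock)
    if src_group is None or dst_group is None:
        return True
    return src_group == dst_group


print(is_timed("userclk2", "userclk1"))              # same group
print(is_timed("userclk2", "clk_out1_clk_wiz_0_1"))  # different groups
```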
At 200 MHz the 2-cycle MMALU MCP budget is 10 ns (was 8 ns), giving all
carry-chain and routing paths comfortable margin. Result: WNS = +0.003 ns,
WHS = +0.017 ns, 0 failing endpoints.
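The budget figures follow directly from the clock period. A minimal sketch of that arithmetic, ignoring setup time, clock skew and uncertainty (which the real timing report includes); the function names are hypothetical:

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def mcp_budget_ns(freq_mhz: float, cycles: int) -> float:
    """Ideal setup budget for an N-cycle multicycle path."""
    return cycles * period_ns(freq_mhz)


single_250 = mcp_budget_ns(250.0, 1)  # single-cycle budget at 250 MHz
single_200 = mcp_budget_ns(200.0, 1)  # single-cycle budget at 200 MHz
mcp2_250 = mcp_budget_ns(250.0, 2)    # 2-cycle MCP at 250 MHz
mcp2_200 = mcp_budget_ns(200.0, 2)    # 2-cycle MCP at 200 MHz

# Carry-chain delay quoted in the commit message (approximate).
carry_chain_ns = 4.3

print(f"1-cycle budget: {single_250} ns -> {single_200} ns")
print(f"2-cycle budget: {mcp2_250} ns -> {mcp2_200} ns")
print(f"carry chain fits in one 250 MHz cycle: {carry_chain_ns <= single_250}")
print(f"carry chain fits in one 200 MHz cycle: {carry_chain_ns <= single_200}")
```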
## calib_sync MCP extension (constrs/npu_top.xdc)
Extended the existing 2-cycle MCP from c0/c1_calib_sync2_reg to also
cover dma_master_inst/* destinations (previously only mmalu_inst/*).
Diagnosis showed 238 paths from the same source fanout reaching
dma_master_inst acc_buf/out_buf CE pins. The source is a legitimately
slow-moving one-shot reset signal, so the 2-cycle constraint is
functionally safe.
Add docs/implementations/FPGA_XC7K480T.md covering the full FPGA
verification platform for the K=32 NPU on xc7k480tffg1156-2. Document
focuses on the timing closure refinement journey:
Architecture
- PCIe Gen2×8 XDMA + dual DDR3 MIG (ECC) + K=32 MMALU
- 200 MHz fabric clock (clk_wiz_0 MMCM from 250 MHz XDMA userclk2)
- Four clock domains: userclk2 (250), fabric_aclk (200), clk_pll_i (133)
- AXI topology: clkconv-first (Tier-2.5), then dwidth at 133 MHz
Timing closure history table (K=16 baseline → formal closure)
- Documents each WNS/TNS/failing-endpoint data point
- Links each number to the specific architectural or constraint change
Key techniques documented
- Per-lane DataFeeder refactor: eliminates 1025-fanout Pipe(Vec) v_reg
- 200 MHz fabric insertion: clk_wiz_0 + axi_cc_xdma_in + axi_cc_byp_in
- Three-group set_clock_groups for correct inter-domain path exclusion
- calib_sync 2-cycle MCP covering both mmalu_inst/* and dma_master_inst/*
Build instructions: two-step Vivado batch flow (create_project.tcl → impl)
Final metrics: WNS=+0.003 ns, 0 failing endpoints, 18 MB bitstream
…alone build
## Summary
Full FPGA bring-up for the Chisel NPU on xc7k480tffg1156-2 PCIe carrier:
cold-boot root causes found and fixed, V0–V9 validation ladder passes 9/9,
standalone self-contained build (no external Vivado project needed).
## Cold-boot root causes (fixed)
Two independent bugs caused PCIe to fail to enumerate after cold power-on:
1. CONFIG_MODE=BPI16 in XDC set COR0 MATCH_CYCLE=2 and DRIVEDONE=1,
   delaying FPGA DONE past the AMD FCH link-training window.
   Fix: override CONFIG_MODE=SPIx1 + CONFIGRATE=3 before write_bitstream.
   Target COR0 = 0x02003fe5.
2. mig_sys_rst_n driven from axi_aresetn (PCIe-derived) created a
   chicken-and-egg deadlock: PCIe cannot train because MIG is in reset,
   but MIG only exits reset after PCIe trains.
   Fix: drive mig_sys_rst_n from board reset pin.
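The second bug is a dependency cycle, which can be illustrated with a small graph check. This is a sketch only: the node names (`pcie_link_up`, `mig_out_of_reset`, and so on) are hypothetical labels for the conditions described above, not signals in the design, and `find_cycle` is not part of the repo.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a node list, or None if acyclic.

    deps maps each condition to the conditions it waits on.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, []):
            colour = state.get(nxt, WHITE)
            if colour == GREY:  # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if colour == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Before the fix: MIG reset release waits on the PCIe-derived axi_aresetn.
before_fix = {
    "pcie_link_up": ["mig_out_of_reset"],
    "mig_out_of_reset": ["axi_aresetn_released"],
    "axi_aresetn_released": ["pcie_link_up"],
}

# After the fix: MIG reset release waits only on the board reset pin.
after_fix = {
    "pcie_link_up": ["mig_out_of_reset"],
    "mig_out_of_reset": ["board_reset_released"],
    "board_reset_released": [],
}

print(find_cycle(before_fix))
print(find_cycle(after_fix))
```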
## Bisect ladder (V0–V9, all 9/9 PASS)
Each step adds one NPU component on top of the reference XDMA+MIG design
to confirm cold-boot safety of each incremental addition:
V0  baseline     Reference design, no changes
V1  no_mb        Remove MicroBlaze + peripherals
V2  bypass_ctrl  BYPASS path → ctrl_lite (AXI-Lite CTRL register)
V3  mmcm         Fabric MMCM (125 MHz)
V4  byp_cdc      BYPASS clock domain crossing
V5  xdma_cc      XDMA→fabric clock domain crossing
V6  no_smc       Remove SmartConnect; NPU flat AXI chain to MIG C0
V7  dma_master   npu_dma_master → MIG C1
V8  npu_stub     Chisel mmalu_stub.v (synthesis shape test)
V9  npu_full     Real Chisel top.sv (K=32, N=8, firtool-1.62.1)
SmartConnect architectural note: the reference SmartConnect handles
200→133 MHz CDC internally via its aclk1 port. Routing M00_AXI through
an external converter breaks aclk1 domain inference. V6 removes the
SmartConnect entirely, replacing it with the NPU's own flat converter
chain (axi_cc_xdma_in → axi_clkconv_xdma → axi_dwidth_xdma → MIG C0).
## Standalone build (no external project needed)
bootstrap_project.tcl creates ip/vivado/xc7k480t/proj/ from committed
XCI sources using recreate_bd.tcl (generated via write_bd_tcl from the
final V9 BD state). Key bootstrap steps:
- source recreate_bd.tcl (creates BD from TCL, no design_info lock-in)
- generate_target all (regenerates IP HDL and OOC synthesis)
- wait for ALL OOC synthesis runs before write_checkpoint
- copy IP DCPs from IP_OUTPUT_DIR (gen/) to IP_DIR (srcs/) so
  link_design in impl can resolve them
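The last step amounts to mirroring checkpoint files from one tree into another. The actual bootstrap is a Tcl script; the Python sketch below only illustrates the idea, assuming each IP keeps its DCP under the same relative path in both trees. `copy_ip_dcps` and its arguments are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def copy_ip_dcps(ip_output_dir: Path, ip_dir: Path) -> List[Path]:
    """Copy every *.dcp under ip_output_dir to the same relative path in ip_dir.

    Returns the destination paths, sorted by source path.
    """
    copied: List[Path] = []
    for dcp in sorted(ip_output_dir.rglob("*.dcp")):
        dest = ip_dir / dcp.relative_to(ip_output_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(dcp, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

A call such as `copy_ip_dcps(Path("gen"), Path("srcs"))` would be run only after all OOC synthesis runs have finished, matching the ordering in the list above.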
## Files added
ip/vivado/xc7k480t/
  scripts/   build_v0..v9, migrate_lib, bootstrap, apply procs
  src/       npu_ctrl_lite.v, npu_dma_master.v, npu_subsys.v, stub
  constrs/   IO_Port.xdc
  README.md  FPGA bring-up guide
ip/vivado/xc7k480t.reference/
  src/bd/    top.bd, top.bda, 14× IP XCI files, mig_a.prj
  scripts/   recreate_bd.tcl, build.tcl, create_project.tcl
  constrs/   IO_Port.xdc
driver/linux/  XDMA kernel driver source mirror (Xilinx, GPL-2.0)
tool/hw/       Serial console, flash, bringup, smoke test scripts
pytest.ini     hw + slow marker declarations
AGENTS.md      Repo guidance for AI agents
docs/implementations/FPGA_XC7K480T.md  FPGA implementation notes
Bring up the XC7K480T with 4 GB of on-board DDR.
Setup: 1K PEs with 32 × 8-bit elements per vector word.