|
1 | 1 | # linreg |
2 | 2 |
|
3 | | -Hardware-accelerated vectorized gradient descent for linear regression. |
| 3 | +Hardware-accelerated gradient descent for linear regression on an FPGA. |
4 | 4 |
|
5 | | -## Architecture |
| 5 | +The design pairs a **MicroBlaze soft processor** with a **custom AXI4-Stream IP core** that computes the gradient descent update rule in a single combinational pass: |
6 | 6 |
|
7 | | -To speed up linear regression one can leverage the inherent parallelism in matrix multiplication which custom hardware can effectively do in parallel. |
| 7 | +``` |
| 8 | +θ := θ − (α/m) · Xᵀ · (Xθ − Y) |
| 9 | +``` |
8 | 10 |
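As a software reference for the update rule above, here is a minimal floating-point sketch in plain Python. The function name `gradient_descent_step` and the list-of-lists layout for `X` are illustrative assumptions, not part of the repository; the hardware core works in fixed point and computes all rows in parallel.

```python
def gradient_descent_step(X, Y, theta, alpha):
    """One batch update: theta := theta - (alpha/m) * X^T * (X*theta - Y).

    X is an m-by-n list of rows, Y has m entries, theta has n entries.
    (Hypothetical helper for illustration only.)
    """
    m = len(X)
    n = len(theta)
    # residual = X*theta - Y, one entry per training sample
    residual = [sum(X[i][j] * theta[j] for j in range(n)) - Y[i] for i in range(m)]
    # gradient = X^T * residual, one entry per parameter
    gradient = [sum(X[i][j] * residual[i] for i in range(m)) for j in range(n)]
    # scale by alpha/m and step against the gradient
    return [theta[j] - (alpha / m) * gradient[j] for j in range(n)]


if __name__ == "__main__":
    X = [[1, 1], [1, 2]]  # first column is the bias term
    Y = [1, 2]
    print(gradient_descent_step(X, Y, [0, 0], 0.5))
```

A reference like this could be used to compute expected values when checking simulation output.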
|
9 | | -Our design uses Xilinx's MicroBlaze processor (to control the flow of the gradient descent algorithm) and a custom IP Core (to compute the parameter updates) interfaced by AXI Stream. |
| 11 | +On a small dataset this yields an **8× speedup** over a pure software MicroBlaze implementation. |
10 | 12 |
|
11 | 13 |  |
12 | 14 |
|
13 | | -### Custom IP Core |
| 15 | +--- |
| 16 | + |
| 17 | +## Source Navigation |
| 18 | + |
| 19 | +``` |
| 20 | +ip_repo/gradientdescent_1.0/ |
| 21 | +├── src/ ← core algorithm (portable VHDL-2008) |
| 22 | +│ ├── Types.vhd Q20.11 fixed-point types and arithmetic |
| 23 | +│ ├── MiniBatchGradientDescent.vhd top-level combinational pipeline |
| 24 | +│ ├── matrix_multiply_by_vector.vhd m×n parallel multiply-accumulate |
| 25 | +│ ├── matrix_transpose.vhd pure wire routing, zero logic cost |
| 26 | +│ ├── vector_subtract.vhd parallel element-wise subtraction |
| 27 | +│ ├── vector_multiply_by_scalar.vhd parallel Q20.11 scalar scaling |
| 28 | +│ └── gradientdescent_testbench.vhd stimulus-only simulation testbench |
| 29 | +└── hdl/ ← AXI4-Stream interface wrappers |
| 30 | + ├── gradientdescent_v1_0.vhd top-level IP wrapper (exposes m, n generics) |
| 31 | + ├── gradientdescent_v1_0_S00_AXIS.vhd slave: decodes 6-instruction protocol, |
| 32 | + │ holds X/Y/θ/α registers, latches |
| 33 | + │ theta_new → theta for multi-iteration runs |
| 34 | + └── gradientdescent_v1_0_M00_AXIS.vhd master: streams theta_new back word-by-word |
| 35 | +
|
| 36 | +linreg.sdk/microblaze/src/ ← MicroBlaze C application |
| 37 | + ├── helloworld.c main loop: drives accelerator, reads results, benchmarks |
| 38 | + ├── instructions.c/h encodes opcodes into 32-bit FSL words (putfsl) |
| 39 | + └── linreg.c/h convergence check and result printing |
| 40 | +
|
| 41 | +linreg.srcs/constrs_1/.../Nexys4_Master.xdc ← constraints (only clock pin E3/100 MHz is active) |
| 42 | +linreg.srcs/sources_1/bd/design_1/ ← Xilinx-generated block design (MicroBlaze, AXI interconnect, BRAM, UART) |
| 43 | +``` |
| 44 | + |
| 45 | +### Key design details |
| 46 | + |
| 47 | +**Fixed-point arithmetic** — all values use Q20.11 (32-bit signed, scale factor 2¹¹ = 2048). `element_multiply` uses a 64-bit intermediate and right-shifts 11 bits; all other ops work directly on signed 32-bit values. |
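The Q20.11 multiply described above can be modelled in a few lines of Python. This is a sketch under assumptions: the helper names (`to_fixed`, `to_float`, `wrap32`) are hypothetical, and it assumes the right shift is arithmetic and that the result is truncated to 32 bits, which may not match `Types.vhd` bit for bit.

```python
FRAC_BITS = 11           # Q20.11: 11 fractional bits
SCALE = 1 << FRAC_BITS   # 2**11 = 2048


def to_fixed(x):
    """Convert a float to a Q20.11 integer (rounding to nearest)."""
    return int(round(x * SCALE))


def to_float(q):
    """Convert a Q20.11 integer back to a float."""
    return q / SCALE


def wrap32(v):
    """Truncate to 32 bits and reinterpret as a signed two's-complement value."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def element_multiply(a, b):
    """Multiply two Q20.11 values: wide product, then shift right by 11 bits.

    Python integers are unbounded, so a * b plays the role of the
    64-bit intermediate; >> on a negative int is an arithmetic shift.
    """
    return wrap32((a * b) >> FRAC_BITS)


if __name__ == "__main__":
    product = element_multiply(to_fixed(1.5), to_fixed(-2.25))
    print(product, to_float(product))
```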
| 48 | + |
| 49 | +**Instruction set** — the slave interface decodes the top 3 bits of each 32-bit AXI-Stream word: |
| 50 | + |
| 51 | +| Opcode (bits 31:29) | Instruction | Payload | |
| 52 | +|---|---|---| |
| 53 | +| `000` | Store `X[i][j]` | bits 28:26 = row, 25:23 = col, 22:0 = value | |
| 54 | +| `001` | Store `Y[i]` | bits 28:26 = index, 25:0 = value | |
| 55 | +| `010` | Store `θ[i]` | bits 28:26 = index, 25:0 = value | |
| 56 | +| `011` | Run iteration | bits 28:0 = count; if >1, latches theta_new → theta each cycle | |
| 57 | +| `100` | Reset | clears all registers | |
| 58 | +| `101` | Store `α` | bits 28:0 = learning rate | |
| 59 | + |
| 60 | +**MicroBlaze ↔ IP communication** — uses the Fast Simplex Link (FSL) bus via `putfsl()`. Results are streamed back over the AXI master interface, one 32-bit word per clock cycle. |
| 61 | + |
| 62 | +**Timing** — a fixed-interval timer fires every 100,000 clock cycles; the ISR increments `irqCount`. Elapsed time: `T = period × irqCount × fit`, where `period` is the clock period and `fit` is the timer interval in clock cycles (100,000 here). |
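The elapsed-time formula can be worked through numerically. The sketch below assumes `fit` denotes the number of clock cycles between timer interrupts (100,000) and uses the 100 MHz clock named in the constraints file; `elapsed_seconds` is a hypothetical helper, not a function from the MicroBlaze application.

```python
def elapsed_seconds(irq_count, clock_hz=100_000_000, cycles_per_irq=100_000):
    """T = period * irqCount * fit, with period = 1 / clock_hz.

    Assumes `fit` is the timer interval in clock cycles.
    """
    return irq_count * cycles_per_irq / clock_hz


if __name__ == "__main__":
    # Under these assumptions each interrupt corresponds to 1 ms.
    print(elapsed_seconds(250))
```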
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## Build on macOS with Docker |
| 67 | + |
| 68 | +Vivado does not run natively on macOS. Use Docker to run it in an Ubuntu container. |
| 69 | + |
| 70 | +### 1. Simulate (no Vivado needed) |
| 71 | + |
| 72 | +Install the open-source VHDL toolchain natively: |
| 73 | + |
| 74 | +```bash |
| 75 | +brew install ghdl gtkwave |
| 76 | +``` |
14 | 77 |
|
15 | | -Our custom IP core is a coprocessor in the true sense of the word because the MicroBlaze processor sends it instructions according to an instruction set. |
| 78 | +Simulate the core algorithm: |
16 | 79 |
|
17 | | -The coprocessor receives, decodes and processes these instructions on its slave interface. The result (a vector of elements) is buffered to the MicroBlaze processor one element (a 32 bit word) at a time. |
| 80 | +```bash |
| 81 | +cd ip_repo/gradientdescent_1.0 |
18 | 82 |
|
19 | | -This results in a fairly clean, stateless implementation of the coprocessor's slave interface. |
| 83 | +ghdl -a --std=08 \ |
| 84 | + src/Types.vhd \ |
| 85 | + src/matrix_transpose.vhd \ |
| 86 | + src/vector_subtract.vhd \ |
| 87 | + src/vector_multiply_by_scalar.vhd \ |
| 88 | + src/matrix_multiply_by_vector.vhd \ |
| 89 | + src/MiniBatchGradientDescent.vhd \ |
| 90 | + src/gradientdescent_testbench.vhd |
20 | 91 |
|
21 | | -### MicroBlaze processor |
| 92 | +ghdl -e --std=08 MiniBatchGradientDescentTest |
| 93 | +ghdl -r --std=08 MiniBatchGradientDescentTest --vcd=sim.vcd |
| 94 | +gtkwave sim.vcd |
| 95 | +``` |
22 | 96 |
|
23 | | -The MicroBlaze processor sends the coprocessor instructions (according to the instruction set) to store each element of the matrix X, vector Y, learning rate α, and parameters θ. After this, in a loop, it instructs the coprocessor to run iterations of gradient descent. |
| 97 | +To add self-checking assertions (the current testbench is stimulus-only), use [cocotb](https://www.cocotb.org/): |
24 | 98 |
|
25 | | -The software running on the MicroBlaze processor checks if the algorithm has converged on every iteration. A useful (but not totally accurate) approximation to declare convergence is: when the difference between the updated θ vector and the previous θ vector is below a certain threshold, then the algorithm has converged, otherwise, it hasn’t. |
| 99 | +```bash |
| 100 | +pip install cocotb |
| 101 | +``` |
26 | 102 |
|
27 | | -## Timing Performance |
| 103 | +```python |
| 104 | +# test_gradient_descent.py |
| 105 | +import cocotb |
| 106 | +from cocotb.triggers import Timer |
28 | 107 |
|
29 | | -A fixed interval timer is configured to fire events every 100000-th clock cycle. The MicroBlaze processor is interrupted on each one of these events, upon which an interrupt handler is called to increment a global counter `irqCount`. |
| 108 | +@cocotb.test() |
| 109 | +async def test_convergence(dut): |
| 110 | + await Timer(1, units="ns") |
| 111 | + theta0 = int(dut.theta_new[0].value) / 2048 # Q20.11 → float |
| 112 | + assert abs(theta0 - expected_theta0) < 0.01 # expected_theta0 is not defined in this snippet; supply a reference value |
| 113 | +``` |
30 | 114 |
|
31 | | -We can reset this counter when we want to start timing some task and simply look at the counter when we know said task has finished. |
| 115 | +### 2. Synthesize with Vivado in Docker |
32 | 116 |
|
33 | | -We can estimate the time elapsed `T` as a function of the interrupt counter `irqCount`, the clock’s period `period` and the frequency of the fixed interval timer `fit`. |
| 117 | +```bash |
| 118 | +# Start an Ubuntu container with the project mounted |
| 119 | +docker run -it --rm \ |
| 120 | + -v "$(pwd)":/workspace \ |
| 121 | + ubuntu:22.04 bash |
| 122 | +``` |
34 | 123 |
|
| 124 | +Inside the container, install Vivado 2024.x (download from AMD; the free Standard edition, formerly WebPACK, covers Artix-7), then run synthesis non-interactively: |
| 125 | + |
| 126 | +```tcl |
| 127 | +# synth_core.tcl |
| 128 | +read_vhdl -vhdl2008 [glob ip_repo/gradientdescent_1.0/src/*.vhd] |
| 129 | +read_xdc linreg.srcs/constrs_1/imports/Downloads/Nexys4_Master.xdc |
| 130 | +synth_design -top MiniBatchGradientDescent -part xc7a100tcsg324-1 |
| 131 | +opt_design |
| 132 | +place_design |
| 133 | +route_design |
| 134 | +write_bitstream -force output.bit |
35 | 135 | ``` |
36 | | -T = period × irqCount × fit |
| 136 | + |
| 137 | +```bash |
| 138 | +vivado -mode batch -source synth_core.tcl |
| 139 | +``` |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## Flash to FPGA |
| 144 | + |
| 145 | +[openFPGALoader](https://trabucayre.github.io/openFPGALoader/) supports common FPGA boards via their on-board FTDI USB chip — no Vivado Hardware Manager or proprietary drivers needed. |
| 146 | + |
| 147 | +```bash |
| 148 | +# macOS |
| 149 | +brew install openfpgaloader |
| 150 | + |
| 151 | +# Ubuntu / Debian |
| 152 | +sudo apt install openfpgaloader |
37 | 153 | ``` |
38 | 154 |
|
39 | | -## Results |
| 155 | +### Common boards |
| 156 | + |
| 157 | +```bash |
| 158 | +# Detect connected board |
| 159 | +openFPGALoader --detect |
| 160 | + |
| 161 | +# Nexys4 / Nexys4 DDR (Artix-7) |
| 162 | +openFPGALoader -b nexys4 output.bit # SRAM (lost on power cycle) |
| 163 | +openFPGALoader -b nexys4 --write-flash output.bit # SPI flash (persistent) |
40 | 164 |
|
41 | | -On a small dataset we observed 8x speedup in performance relative to a pure software implementation running on the MicroBlaze processor alone. On larger datasets it is expected that this difference is more significant. |
| 165 | +# Basys3 (Artix-7) |
| 166 | +openFPGALoader -b basys3 output.bit |
| 167 | + |
| 168 | +# Arty A7-35T / A7-100T (Artix-7) |
| 169 | +openFPGALoader -b arty output.bit |
| 170 | +openFPGALoader -b arty_a7_100 output.bit |
| 171 | + |
| 172 | +# iCE40 boards (e.g. iCEBreaker) |
| 173 | +openFPGALoader -b icebreaker output.bit |
| 174 | + |
| 175 | +# ECP5 boards (e.g. ULX3S) |
| 176 | +openFPGALoader -b ulx3s output.bit |
| 177 | +``` |
42 | 178 |
|
43 | | -However, there is still room for improvement. Rather than computing the gradient over all the samples of the dataset, one could compute the gradient for every k-subsample of the dataset in parallel and combine them together, effectively doing what is called map-reduce batch gradient descent. |
| 179 | +> **Note:** this design targets the Nexys4 (xc7a100tcsg324-1). To target a different Artix-7 board, change the `-part` flag in `synth_core.tcl` and the `-b` flag above. Other FPGA families require porting the block design (MicroBlaze, AXI interconnect). |