Commit fb2ed3a

refactor
1 parent 9944d68 commit fb2ed3a

1 file changed

Lines changed: 155 additions & 19 deletions

README.md

@@ -1,43 +1,179 @@
11
# linreg
22

3-
Hardware-accelerated vectorized gradient descent for linear regression.
3+
Hardware-accelerated gradient descent for linear regression on an FPGA.
44

5-
## Architecture
5+
The design pairs a **MicroBlaze soft processor** with a **custom AXI4-Stream IP core** that computes the gradient descent update rule in a single combinational pass:
66

7-
To speed up linear regression one can leverage the inherent parallelism in matrix multiplication which custom hardware can effectively do in parallel.
7+
```
8+
θ := θ − (α/m) · Xᵀ · (Xθ − Y)
9+
```
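
As a software reference for the update rule above, here is a minimal pure-Python sketch of one iteration in floating point. The function name `gradient_descent_step` and the list-of-lists data layout are illustrative assumptions, not taken from the repository; the IP core evaluates the same expression in fixed point.

```python
def gradient_descent_step(X, Y, theta, alpha):
    """One update: theta := theta - (alpha / m) * X^T * (X * theta - Y)."""
    m = len(X)      # number of samples (rows of X)
    n = len(theta)  # number of parameters (columns of X)

    # residual = X * theta - Y, one entry per sample
    residual = [
        sum(X[i][j] * theta[j] for j in range(n)) - Y[i]
        for i in range(m)
    ]

    # gradient = X^T * residual, one entry per parameter
    gradient = [
        sum(X[i][j] * residual[i] for i in range(m))
        for j in range(n)
    ]

    # scale by alpha / m and step against the gradient
    return [theta[j] - (alpha / m) * gradient[j] for j in range(n)]
```

Each list comprehension corresponds to one stage of the hardware pipeline (matrix-by-vector multiply, subtraction, transposed multiply, scalar scaling), which makes the sketch usable as a golden model when checking simulation output.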
810

9-
Our design uses Xilinx's MicroBlaze processor (to control the flow of the gradient descent algorithm) and a custom IP Core (to compute the parameter updates) interfaced by AXI Stream.
11+
On a small dataset this yields an **8× speedup** over a pure software MicroBlaze implementation.
1012

1113
![](https://i.imgur.com/IPszi4A.png)
1214

13-
### Custom IP Core
15+
---
16+
17+
## Source Navigation
18+
19+
```
20+
ip_repo/gradientdescent_1.0/
21+
├── src/ ← core algorithm (portable VHDL-2008)
22+
│ ├── Types.vhd Q20.11 fixed-point types and arithmetic
23+
│ ├── MiniBatchGradientDescent.vhd top-level combinational pipeline
24+
│ ├── matrix_multiply_by_vector.vhd m×n parallel multiply-accumulate
25+
│ ├── matrix_transpose.vhd pure wire routing, zero logic cost
26+
│ ├── vector_subtract.vhd parallel element-wise subtraction
27+
│ ├── vector_multiply_by_scalar.vhd parallel Q20.11 scalar scaling
28+
│ └── gradientdescent_testbench.vhd stimulus-only simulation testbench
29+
└── hdl/ ← AXI4-Stream interface wrappers
30+
├── gradientdescent_v1_0.vhd top-level IP wrapper (exposes m, n generics)
31+
├── gradientdescent_v1_0_S00_AXIS.vhd slave: decodes 6-instruction protocol,
32+
│ holds X/Y/θ/α registers, latches
33+
│ theta_new → theta for multi-iteration runs
34+
└── gradientdescent_v1_0_M00_AXIS.vhd master: streams theta_new back word-by-word
35+
36+
linreg.sdk/microblaze/src/ ← MicroBlaze C application
37+
├── helloworld.c main loop: drives accelerator, reads results, benchmarks
38+
├── instructions.c/h encodes opcodes into 32-bit FSL words (putfsl)
39+
└── linreg.c/h convergence check and result printing
40+
41+
linreg.srcs/constrs_1/.../Nexys4_Master.xdc ← constraints (only clock pin E3/100 MHz is active)
42+
linreg.srcs/sources_1/bd/design_1/ ← Xilinx-generated block design (MicroBlaze, AXI interconnect, BRAM, UART)
43+
```
44+
45+
### Key design details
46+
47+
**Fixed-point arithmetic** — all values use Q20.11 (32-bit signed, scale factor 2¹¹ = 2048). `element_multiply` uses a 64-bit intermediate and right-shifts 11 bits; all other ops work directly on signed 32-bit values.
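
The Q20.11 behaviour described above can be modelled in a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, rounding to nearest on conversion and wraparound on overflow are guesses about the hardware, and only the 2048 scale factor and the 11-bit right shift come from the text.

```python
FRAC_BITS = 11
SCALE = 1 << FRAC_BITS  # 2**11 = 2048


def to_signed32(value):
    """Wrap an integer into the 32-bit two's-complement range (assumed overflow behaviour)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def to_fixed(x):
    """Convert a float to Q20.11, rounding to nearest (assumption)."""
    return to_signed32(int(round(x * SCALE)))


def to_float(q):
    """Convert a Q20.11 integer back to a float."""
    return q / SCALE


def element_multiply(a, b):
    """Multiply two Q20.11 values via a wide intermediate, then shift right by 11."""
    product = a * b  # fits in 64 bits when both inputs are 32-bit
    return to_signed32(product >> FRAC_BITS)  # Python's >> is an arithmetic shift
```

For example, `to_float(element_multiply(to_fixed(1.5), to_fixed(-2.0)))` evaluates to `-3.0`. Addition and subtraction need no rescaling, which is why only the multiply uses the wider intermediate.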
48+
49+
**Instruction set** — the slave interface decodes the top 3 bits of each 32-bit AXI-Stream word:
50+
51+
| Opcode (bits 31:29) | Instruction | Payload |
52+
|---|---|---|
53+
| `000` | Store `X[i][j]` | bits 28:26 = row, 25:23 = col, 22:0 = value |
54+
| `001` | Store `Y[i]` | bits 28:26 = index, 25:0 = value |
55+
| `010` | Store `θ[i]` | bits 28:26 = index, 25:0 = value |
56+
| `011` | Run iteration | bits 28:0 = count; if >1, latches theta_new → theta each cycle |
57+
| `100` | Reset | clears all registers |
58+
| `101` | Store `α` | bits 28:0 = learning rate |
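
The table above can be turned into a small encoder for building test vectors on a host machine. The sketch below is illustrative only: the function names are hypothetical, and truncating negative values to the payload width (two's complement) is an assumption about how the slave interprets the value field.

```python
OP_STORE_X, OP_STORE_Y, OP_STORE_THETA, OP_RUN, OP_RESET, OP_STORE_ALPHA = range(6)


def _field(value, width):
    """Keep the low `width` bits of value (two's-complement truncation, assumed)."""
    return value & ((1 << width) - 1)


def store_x(row, col, value):
    """Opcode 000: bits 28:26 = row, 25:23 = col, 22:0 = value."""
    return (OP_STORE_X << 29) | (_field(row, 3) << 26) | (_field(col, 3) << 23) | _field(value, 23)


def store_y(index, value):
    """Opcode 001: bits 28:26 = index, 25:0 = value."""
    return (OP_STORE_Y << 29) | (_field(index, 3) << 26) | _field(value, 26)


def store_theta(index, value):
    """Opcode 010: bits 28:26 = index, 25:0 = value."""
    return (OP_STORE_THETA << 29) | (_field(index, 3) << 26) | _field(value, 26)


def run(count):
    """Opcode 011: bits 28:0 = iteration count."""
    return (OP_RUN << 29) | _field(count, 29)


def reset():
    """Opcode 100: no payload."""
    return OP_RESET << 29


def store_alpha(value):
    """Opcode 101: bits 28:0 = learning rate."""
    return (OP_STORE_ALPHA << 29) | _field(value, 29)


def opcode(word):
    """Extract bits 31:29 from an instruction word."""
    return (word >> 29) & 0b111
```

Note that the 3-bit row, column, and index fields limit addressing to 8 rows and 8 columns per instruction, and the value field for `X` is narrower (23 bits) than for `Y` and `θ` (26 bits).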
59+
60+
**MicroBlaze ↔ IP communication** — uses the Fast Simplex Link (FSL) bus via `putfsl()`. Results are streamed back over the AXI master interface, one 32-bit word per clock cycle.
61+
62+
**Timing** — a fixed-interval timer fires every 100,000 clock cycles; the ISR increments `irqCount`. Elapsed time: `T = period × irqCount × fit`.
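
A short worked example of the elapsed-time formula, as a sketch. It assumes `period` is the clock period in seconds (10 ns for the 100 MHz clock named in the constraints file) and reads `fit` as the timer interval in clock cycles (100,000); the function and parameter names are hypothetical.

```python
def elapsed_seconds(irq_count, clock_hz=100_000_000, fit_cycles=100_000):
    """T = period * irqCount * fit, with period = 1 / clock_hz."""
    period = 1.0 / clock_hz  # seconds per clock cycle
    return period * irq_count * fit_cycles
```

Under these assumptions each interrupt corresponds to about 1 ms, so an `irqCount` of 250 means roughly 0.25 s of elapsed time. The resolution of the measurement is therefore one timer interval.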
63+
64+
---
65+
66+
## Build on macOS with Docker
67+
68+
Vivado does not run natively on macOS. Use Docker to run it in an Ubuntu container.
69+
70+
### 1. Simulate (no Vivado needed)
71+
72+
Install the open-source VHDL toolchain natively:
73+
74+
```bash
75+
brew install ghdl gtkwave
76+
```
1477

15-
Our custom IP core is a coprocessor in the true sense of the word because the MicroBlaze processor sends it instructions according to an instruction set.
78+
Simulate the core algorithm:
1679

17-
The coprocessor receives, decodes and processes these instructions on its slave interface. The result (a vector of elements) is buffered to the MicroBlaze processor one element (a 32 bit word) at a time.
80+
```bash
81+
cd ip_repo/gradientdescent_1.0
1882

19-
This results in a fairly clean, stateless implementation of the coprocessor's slave interface.
83+
ghdl -a --std=08 \
84+
src/Types.vhd \
85+
src/matrix_transpose.vhd \
86+
src/vector_subtract.vhd \
87+
src/vector_multiply_by_scalar.vhd \
88+
src/matrix_multiply_by_vector.vhd \
89+
src/MiniBatchGradientDescent.vhd \
90+
src/gradientdescent_testbench.vhd
2091

21-
### MicroBlaze processor
92+
ghdl -e --std=08 MiniBatchGradientDescentTest
93+
ghdl -r --std=08 MiniBatchGradientDescentTest --vcd=sim.vcd
94+
gtkwave sim.vcd
95+
```
2296

23-
The MicroBlaze processor sends the coprocessor instructions (according to the instruction set) to store each element of the matrix X, vector Y, learning rate α, and parameters θ. After this, in a loop, it instructs the coprocessor to run iterations of gradient descent.
97+
To add self-checking assertions (the current testbench is stimulus-only), use [cocotb](https://www.cocotb.org/):
2498

25-
The software running on the MicroBlaze processor checks if the algorithm has converged on every iteration. A useful (but not totally accurate) approximation to declare convergence is: when the difference between the updated θ vector and the previous θ vector is below a certain threshold, then the algorithm has converged, otherwise, it hasn’t.
99+
```bash
100+
pip install cocotb
101+
```
26102

27-
## Timing Performance
103+
```python
104+
# test_gradient_descent.py
105+
import cocotb
106+
from cocotb.triggers import Timer
28107

29-
A fixed interval timer is configured to fire events every 100000-th clock cycle. The MicroBlaze processor is interrupted on each one of these events, upon which an interrupt handler is called to increment a global counter `irqCount`.
108+
@cocotb.test()
109+
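async def test_convergence(dut):
110+
    await Timer(1, units="ns")  # let the combinational logic settle
111+
    theta0 = int(dut.theta_new[0].value) / 2048 # Q20.11 → float
112+
    assert abs(theta0 - expected_theta0) < 0.01  # expected_theta0: placeholder, define it from a reference run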
113+
```
30114

31-
We can reset this counter when we want to start timing some task and simply look at the counter when we know said task has finished.
115+
### 2. Synthesize with Vivado in Docker
32116

33-
We can estimate the time elapsed `T` as a function of the interrupt counter `irqCount`, the clock’s period `period` and the frequency of the fixed interval timer `fit`.
117+
```bash
118+
# Start an Ubuntu container with the project mounted
119+
docker run -it --rm \
120+
-v $(pwd):/workspace \
121+
ubuntu:22.04 bash
122+
```
34123

124+
Inside the container, install Vivado 2024.x (download from AMD; the free Standard edition, formerly WebPACK, includes Artix-7 devices), then run synthesis non-interactively:
125+
126+
```tcl
127+
# synth_core.tcl
128+
read_vhdl -vhdl2008 [glob ip_repo/gradientdescent_1.0/src/*.vhd]
129+
read_xdc linreg.srcs/constrs_1/imports/Downloads/Nexys4_Master.xdc
130+
synth_design -top MiniBatchGradientDescent -part xc7a100tcsg324-1
131+
opt_design
132+
place_design
133+
route_design
134+
write_bitstream -force output.bit
35135
```
36-
T = period × irqCount × fit
136+
137+
```bash
138+
vivado -mode batch -source synth_core.tcl
139+
```
140+
141+
---
142+
143+
## Flash to FPGA
144+
145+
[openFPGALoader](https://trabucayre.github.io/openFPGALoader/) supports common FPGA boards via their on-board FTDI USB chip — no Vivado Hardware Manager or proprietary drivers needed.
146+
147+
```bash
148+
# macOS
149+
brew install openfpgaloader
150+
151+
# Ubuntu / Debian
152+
sudo apt install openfpgaloader
37153
```
38154

39-
## Results
155+
### Common boards
156+
157+
```bash
158+
# Detect connected board
159+
openFPGALoader --detect
160+
161+
# Nexys4 / Nexys4 DDR (Artix-7)
162+
openFPGALoader -b nexys4 output.bit # SRAM (lost on power cycle)
163+
openFPGALoader -b nexys4 --write-flash output.bit # SPI flash (persistent)
40164

41-
On a small dataset we observed 8x speedup in performance relative to a pure software implementation running on the MicroBlaze processor alone. On larger datasets it is expected that this difference is more significant.
165+
# Basys3 (Artix-7)
166+
openFPGALoader -b basys3 output.bit
167+
168+
# Arty A7-35T / A7-100T (Artix-7)
169+
openFPGALoader -b arty output.bit
170+
openFPGALoader -b arty_a7_100 output.bit
171+
172+
# iCE40 boards (e.g. iCEBreaker)
173+
openFPGALoader -b icebreaker output.bit
174+
175+
# ECP5 boards (e.g. ULX3S)
176+
openFPGALoader -b ulx3s output.bit
177+
```
42178

43-
However, there is still room for improvement. Rather than computing the gradient over all the samples of the dataset, one could compute the gradient for every k-subsample of the dataset in parallel and combine them together, effectively doing what is called map-reduce batch gradient descent.
179+
> **Note:** this design targets the Nexys4 (xc7a100tcsg324-1). To target a different Artix-7 board, change the `-part` flag in `synth_core.tcl` and the `-b` flag above. Other FPGA families require porting the block design (MicroBlaze, AXI interconnect).
