Commit 4f98535

Merge remote-tracking branch 'origin/dev' into tlb-pipe
2 parents b120818 + b631f0b commit 4f98535

36 files changed

Lines changed: 2829 additions & 608 deletions

.circleci/build-toolchains.sh

Lines changed: 1 addition & 1 deletion
@@ -28,5 +28,5 @@ if [ ! -d "$HOME/$1-install" ]; then
     cd $HOME
 
     # init all submodules including the tools (doesn't use CI_MAKE_PROC due to mem. constraints)
-    CHIPYARD_DIR="$LOCAL_CHIPYARD_DIR" NPROC=$CI_MAKE_PROC $LOCAL_CHIPYARD_DIR/scripts/build-toolchains.sh esp-tools
+    CHIPYARD_DIR="$LOCAL_CHIPYARD_DIR" NPROC=$CI_MAKE_NPROC $LOCAL_CHIPYARD_DIR/scripts/build-toolchains.sh esp-tools
 fi

.circleci/defaults.sh

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 #############
 
 # make parallelism
-CI_MAKE_NPROC=8
+CI_MAKE_NPROC=4
 LOCAL_MAKE_NPROC=$CI_MAKE_NPROC
 
 # verilator version

CHIPYARD.hash

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-939e3a9f94d5bfef9671f49c37cd3acd5fc26128
+1e2f778a6705033d67ccbcc932e66083e4646f15

README.md

Lines changed: 14 additions & 2 deletions
@@ -18,7 +18,7 @@ Gemmini is implemented as a RoCC accelerator with non-standard RISC-V custom ins
 
 At the heart of the accelerator lies a systolic array which performs matrix multiplications. By default, the matrix multiplication support both _output-stationary_ and _weight-stationary_ dataflows, which programmers can pick between at runtime. However, the dataflow can also be hardened at elaboration time.
 
-The systolic array's inputs and outputs are stored in an explicity managed scratchpad, made up of banked SRAMs. A DMA engine facilitates the tranfer of data between main memory and the scratchpad.
+The systolic array's inputs and outputs are stored in an explicity managed scratchpad, made up of banked SRAMs. A DMA engine facilitates the transfer of data between main memory and the scratchpad.
 
 Because weight-stationary dataflows require an accumulator outside the systolic array, we add a final SRAM bank, equipped with adder units, which can be conceptually considered an extension of the scratchpad memory space. The systolic array can store results to any address in the accumulator, and can also read new inputs from any address in the accumulator. The DMA engine can also tranfer data directly between the accumulator and main memory, which is often necessary to load in biases.
 
@@ -75,7 +75,7 @@ The ``software`` directory of the generator includes the aforementioned library
 The Gemmini generator generates a C header file based on the generator parameters. This header files gets compiled together with the matrix multiplication library to tune library performance. The generated header file can be found under ``software/gemmini-rocc-tests/include/gemmini_params.h``
 
 Gemmini can also be used to run ONNX-specified neural-networks through a port of Microsoft's ONNX-Runtime framework. The port is included as the [onnxruntime-riscv](https://github.com/pranav-prakash/onnxruntime-riscv) repository submoduled in the `software` directory.
-To start using ONNX-Runtime, run `git submodule update --init --recursive software/onnxruntime-riscv`, and read the documentation at [here](https://github.com/pranav-prakash/onnxruntime-riscv/blob/systolic/systolic_runner/docs).
+To start using ONNX-Runtime, run `git submodule update --init --recursive software/onnxruntime-riscv`, and read the documentation [here](https://github.com/pranav-prakash/onnxruntime-riscv/blob/systolic/systolic_runner/docs).
 
 ## Build and Run Gemmini Tests
 
@@ -317,3 +317,15 @@ This section describes an additional set of RoCC instructions that configure and
 ### `COMPUTE_CISC` runs a complete hardware tiling sequence with the configured A, B, C, D, M, N, K, RPT_BIAS values
 **Format:** `compute_cisc`
 - `funct` = 17
+
+# Citing Gemmini
+If Gemmini helps you in your academic research, you are encouraged to cite our paper. Here is an example bibtex:
+```
+@article{genc2019gemmini,
+  title={Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures},
+  author={Genc, Hasan and Haj-Ali, Ameer and Iyer, Vighnesh and Amid, Alon and Mao, Howard and Wright, John and Schmidt, Colin and Zhao, Jerry and Ou, Albert and Banister, Max and Shao, Yakun Sophia and Nikolic, Borivoje and Stoica, Ion and Asanovic, Krste},
+  journal={arXiv preprint arXiv:1911.09925},
+  year={2019}
+}
+```
+
SPIKE.hash

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-3db7a449d97bf40a101ef541089054e6af59d7df
+bc3222e351cdd645b6fd2605fd9611e3bc0d9cae

src/main/scala/gemmini/AccumulatorMem.scala

Lines changed: 150 additions & 67 deletions
@@ -17,19 +17,21 @@ class AccumulatorReadReq[T <: Data](n: Int, shift_width: Int, scale_t: T) extend
   override def cloneType: this.type = new AccumulatorReadReq(n, shift_width, scale_t.cloneType).asInstanceOf[this.type]
 }
 
-class AccumulatorReadResp[T <: Data: Arithmetic](rdataType: Vec[Vec[T]], fullDataType: Vec[Vec[T]]) extends Bundle {
-  val data = rdataType.cloneType
-  val full_data = fullDataType.cloneType
+class AccumulatorReadResp[T <: Data: Arithmetic, U <: Data](fullDataType: Vec[Vec[T]], scale_t: U, shift_width: Int) extends Bundle {
+  val data = fullDataType.cloneType
   val fromDMA = Bool()
-
-  override def cloneType: this.type = new AccumulatorReadResp(rdataType.cloneType, fullDataType.cloneType).asInstanceOf[this.type]
+  val scale = scale_t.cloneType
+  val relu6_shift = UInt(shift_width.W)
+  val act = UInt(2.W)
+  val acc_bank_id = UInt(2.W) // TODO don't hardcode
+  override def cloneType: this.type = new AccumulatorReadResp(fullDataType.cloneType, scale_t, shift_width).asInstanceOf[this.type]
 }
 
-class AccumulatorReadIO[T <: Data: Arithmetic, U <: Data](n: Int, shift_width: Int, rdataType: Vec[Vec[T]], fullDataType: Vec[Vec[T]], scale_t: U) extends Bundle {
-  val req = Decoupled(new AccumulatorReadReq(n, shift_width, scale_t))
-  val resp = Flipped(Decoupled(new AccumulatorReadResp(rdataType.cloneType, fullDataType.cloneType)))
+class AccumulatorReadIO[T <: Data: Arithmetic, U <: Data](n: Int, shift_width: Int, fullDataType: Vec[Vec[T]], scale_t: U) extends Bundle {
+  val req = Decoupled(new AccumulatorReadReq[U](n, shift_width, scale_t))
+  val resp = Flipped(Decoupled(new AccumulatorReadResp[T, U](fullDataType, scale_t, shift_width)))
 
-  override def cloneType: this.type = new AccumulatorReadIO(n, shift_width, rdataType.cloneType, fullDataType.cloneType, scale_t.cloneType).asInstanceOf[this.type]
+  override def cloneType: this.type = new AccumulatorReadIO(n, shift_width, fullDataType.cloneType, scale_t.cloneType).asInstanceOf[this.type]
 }
 
 class AccumulatorWriteReq[T <: Data: Arithmetic](n: Int, t: Vec[Vec[T]]) extends Bundle {
@@ -42,16 +44,19 @@ class AccumulatorWriteReq[T <: Data: Arithmetic](n: Int, t: Vec[Vec[T]]) extends
   override def cloneType: this.type = new AccumulatorWriteReq(n, t).asInstanceOf[this.type]
 }
 
-class AccumulatorMemIO [T <: Data: Arithmetic, U <: Data](n: Int, t: Vec[Vec[T]], rdata: Vec[Vec[T]], scale_t: U) extends Bundle {
-  val read = Flipped(new AccumulatorReadIO(n, log2Ceil(t.head.head.getWidth), rdata, t, scale_t))
+class AccumulatorMemIO [T <: Data: Arithmetic, U <: Data](n: Int, t: Vec[Vec[T]], scale_t: U) extends Bundle {
+  val read = Flipped(new AccumulatorReadIO(n, log2Ceil(t.head.head.getWidth), t, scale_t))
   // val write = Flipped(new AccumulatorWriteIO(n, t))
   val write = Flipped(Decoupled(new AccumulatorWriteReq(n, t)))
 
-  override def cloneType: this.type = new AccumulatorMemIO(n, t, rdata, scale_t).asInstanceOf[this.type]
+  override def cloneType: this.type = new AccumulatorMemIO(n, t, scale_t).asInstanceOf[this.type]
 }
 
-class AccumulatorMem[T <: Data, U <: Data](n: Int, t: Vec[Vec[T]], rdataType: Vec[Vec[T]], mem_pipeline: Int, scale_args: ScaleArguments[T, U], read_small_data: Boolean, read_full_data: Boolean)
-                                          (implicit ev: Arithmetic[T]) extends Module {
+class AccumulatorMem[T <: Data, U <: Data](
+  n: Int, t: Vec[Vec[T]], scale_args: ScaleArguments[T, U],
+  acc_singleported: Boolean, num_acc_sub_banks: Int
+)
+  (implicit ev: Arithmetic[T]) extends Module {
   // TODO Do writes in this module work with matrices of size 2? If we try to read from an address right after writing
   // to it, then we might not get the written data. We might need some kind of cooldown counter after addresses in the
   // accumulator have been written to for configurations with such small matrices
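
The new `num_acc_sub_banks` parameter splits the single-ported accumulator into several smaller memories. Later in this diff, the helpers `isThisBank` and `getBankIdx` select a sub-bank from the low address bits and use the remaining high bits as the row index inside that sub-bank. The following is a minimal plain-Scala sketch of that address split, not the Chisel code itself. The object name `SubBankAddr` and the use of `Int` addresses are assumptions made for illustration, and the sketch assumes the sub-bank count is a power of two, as the bit slicing in the diff implies.

```scala
// Software model (plain Scala, not Chisel) of the sub-bank address split.
// SubBankAddr is a hypothetical name used only for this illustration.
object SubBankAddr {
  // Integer ceil(log2(x)), standing in for Chisel's log2Ceil.
  def log2Ceil(x: Int): Int = {
    require(x > 0, "x must be positive")
    if (x == 1) 0 else 32 - Integer.numberOfLeadingZeros(x - 1)
  }

  // True when the low address bits select sub-bank `bank`.
  // Assumes numSubBanks is a power of two.
  def isThisBank(addr: Int, bank: Int, numSubBanks: Int): Boolean =
    (addr & (numSubBanks - 1)) == bank

  // Row index inside the selected sub-bank: the address with the
  // bank-select bits shifted out.
  def getBankIdx(addr: Int, numSubBanks: Int): Int =
    addr >> log2Ceil(numSubBanks)
}
```

With four sub-banks, consecutive addresses land in consecutive sub-banks, so address 5 maps to sub-bank 1, row 1, and address 13 maps to sub-bank 1, row 3.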
@@ -64,9 +69,8 @@ class AccumulatorMem[T <: Data, U <: Data](n: Int, t: Vec[Vec[T]], rdataType: Ve
   import ev._
 
   // TODO unify this with TwoPortSyncMemIO
-  val io = IO(new AccumulatorMemIO(n, t, rdataType, scale_args.multiplicand_t))
+  val io = IO(new AccumulatorMemIO(n, t, scale_args.multiplicand_t))
 
-  val mem = TwoPortSyncMem(n, t, t.getWidth / 8) // TODO We assume byte-alignment here. Use aligned_to instead
 
   // For any write operation, we spend 2 cycles reading the existing address out, buffering it in a register, and then
   // accumulating on top of it (if necessary)
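
The comment above describes a read-modify-write: the old row is read out, and the incoming data is either added to it or written over it, which is what `Mux(acc_buf, w_sum, wdata_buf)` selects in the hardware. A minimal plain-Scala sketch of that choice follows. It deliberately ignores the two-cycle delay, the byte mask, and fixed-width overflow, and the names `AccWriteModel`, `Row`, and `nextRow` are assumptions made for illustration.

```scala
// Software model of the accumulate-or-overwrite choice made on each write.
// AccWriteModel, Row and nextRow are hypothetical names for this sketch.
object AccWriteModel {
  // One accumulator row, flattened to a sequence of elements
  // (the hardware uses Vec[Vec[T]]).
  type Row = Vector[Int]

  // Mirrors Mux(acc_buf, w_sum, wdata_buf): when `acc` is set, add the new
  // data element-wise to the old row; otherwise replace the old row.
  def nextRow(old: Row, wdata: Row, acc: Boolean): Row = {
    require(old.length == wdata.length, "rows must have the same length")
    if (acc) old.zip(wdata).map { case (rv, wv) => rv + wv }
    else wdata
  }
}
```

For example, `AccWriteModel.nextRow(Vector(1, 2, 3), Vector(10, 20, 30), acc = true)` yields `Vector(11, 22, 33)`, while the same call with `acc = false` yields the new data unchanged.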
@@ -75,83 +79,162 @@ class AccumulatorMem[T <: Data, U <: Data](n: Int, t: Vec[Vec[T]], rdataType: Ve
   val acc_buf = ShiftRegister(io.write.bits.acc, 2)
   val mask_buf = ShiftRegister(io.write.bits.mask, 2)
   val w_buf_valid = ShiftRegister(io.write.fire(), 2)
-
-  val w_sum = VecInit((RegNext(mem.io.rdata) zip wdata_buf).map { case (rv, wv) =>
+  val acc_rdata = Wire(t)
+  acc_rdata := DontCare
+  val read_rdata = Wire(t)
+  read_rdata := DontCare
+  val block_read_req = WireInit(false.B)
+  val w_sum = VecInit((RegNext(acc_rdata) zip wdata_buf).map { case (rv, wv) =>
     VecInit((rv zip wv).map(t => t._1 + t._2))
   })
 
-  mem.io.waddr := waddr_buf
-  mem.io.wen := w_buf_valid
-  mem.io.wdata := Mux(acc_buf, w_sum, wdata_buf)
-  mem.io.mask := mask_buf
-
-  mem.io.raddr := Mux(io.write.fire() && io.write.bits.acc, io.write.bits.addr, io.read.req.bits.addr)
-  mem.io.ren := io.read.req.fire() || (io.write.fire() && io.write.bits.acc)
-
-  class PipelinedRdataAndActT extends Bundle {
-    val data = mem.io.rdata.cloneType
-    val full_data = mem.io.rdata.cloneType
-    val scale = io.read.req.bits.scale.cloneType
-    val relu6_shift = io.read.req.bits.relu6_shift.cloneType
-    val act = io.read.req.bits.act.cloneType
-    val fromDMA = io.read.req.bits.fromDMA.cloneType
+  if (!acc_singleported) {
+    val mem = TwoPortSyncMem(n, t, t.getWidth / 8) // TODO We assume byte-alignment here. Use aligned_to instead
+    mem.io.waddr := waddr_buf
+    mem.io.wen := w_buf_valid
+    mem.io.wdata := Mux(acc_buf, w_sum, wdata_buf)
+    mem.io.mask := mask_buf
+    acc_rdata := mem.io.rdata
+    read_rdata := mem.io.rdata
+    mem.io.raddr := Mux(io.write.fire() && io.write.bits.acc, io.write.bits.addr, io.read.req.bits.addr)
+    mem.io.ren := io.read.req.fire() || (io.write.fire() && io.write.bits.acc)
+  } else {
+    val mask_len = t.getWidth / 8
+    val mask_elem = UInt((t.getWidth / mask_len).W)
+    val reads = Wire(Vec(2, Decoupled(UInt())))
+    reads(0).valid := io.write.valid && io.write.bits.acc
+    reads(0).bits := io.write.bits.addr
+    reads(0).ready := true.B
+    reads(1).valid := io.read.req.valid
+    reads(1).bits := io.read.req.bits.addr
+    reads(1).ready := true.B
+    block_read_req := !reads(1).ready
+    for (i <- 0 until num_acc_sub_banks) {
+      def isThisBank(addr: UInt) = addr(log2Ceil(num_acc_sub_banks)-1,0) === i.U
+      def getBankIdx(addr: UInt) = addr >> log2Ceil(num_acc_sub_banks)
+      val mem = SyncReadMem(n / num_acc_sub_banks, Vec(mask_len, mask_elem))
+
+      val ren = WireInit(false.B)
+      val raddr = WireInit(getBankIdx(reads(0).bits))
+      val nEntries = 3
+      // Writes coming 2 cycles after read leads to bad bank behavior
+      // Add another buffer here
+      class W_Q_Entry[T <: Data](mask_len: Int, mask_elem: T) extends Bundle {
+        val valid = Bool()
+        val data = Vec(mask_len, mask_elem)
+        val mask = Vec(mask_len, Bool())
+        val addr = UInt(log2Ceil(n/num_acc_sub_banks).W)
+        override def cloneType: this.type = new W_Q_Entry(mask_len, mask_elem).asInstanceOf[this.type]
+      }
+      val w_q = Reg(Vec(nEntries, new W_Q_Entry(mask_len, mask_elem)))
+      for (e <- w_q) {
+        when (e.valid) {
+          assert(!(
+            io.write.valid && io.write.bits.acc &&
+            isThisBank(io.write.bits.addr) && getBankIdx(io.write.bits.addr) === e.addr &&
+            ((io.write.bits.mask.asUInt & e.mask.asUInt) =/= 0.U)
+          ))
+          when (io.read.req.valid && isThisBank(io.read.req.bits.addr) && getBankIdx(io.read.req.bits.addr) === e.addr) {
+            reads(1).ready := false.B
+          }
+        }
+      }
+      val w_q_head = RegInit(1.U(nEntries.W))
+      val w_q_tail = RegInit(1.U(nEntries.W))
+      when (reset.asBool) {
+        w_q.foreach(_.valid := false.B)
+      }
+      val wen = WireInit(false.B)
+      val wdata = Mux1H(w_q_head.asBools, w_q.map(_.data))
+      val wmask = Mux1H(w_q_head.asBools, w_q.map(_.mask))
+      val waddr = Mux1H(w_q_head.asBools, w_q.map(_.addr))
+      when (wen) {
+        w_q_head := w_q_head << 1 | w_q_head(nEntries-1)
+        for (i <- 0 until nEntries) {
+          when (w_q_head(i)) {
+            w_q(i).valid := false.B
+          }
+        }
+      }
+
+      when (w_buf_valid && isThisBank(waddr_buf)) {
+        assert(!((w_q_tail.asBools zip w_q.map(_.valid)).map({ case (h,v) => h && v }).reduce(_||_)))
+        w_q_tail := w_q_tail << 1 | w_q_tail(nEntries-1)
+        for (i <- 0 until nEntries) {
+          when (w_q_tail(i)) {
+            w_q(i).valid := true.B
+            w_q(i).data := Mux(acc_buf, w_sum, wdata_buf).asTypeOf(Vec(mask_len, mask_elem))
+            w_q(i).mask := mask_buf
+            w_q(i).addr := getBankIdx(waddr_buf)
+          }
+        }
+
+      }
+      val bank_rdata = mem.read(raddr, ren && !wen).asTypeOf(t)
+      when (RegNext(ren && reads(0).valid && isThisBank(reads(0).bits))) {
+        acc_rdata := bank_rdata
+      } .elsewhen (RegNext(ren)) {
+        read_rdata := bank_rdata
+      }
+      when (wen) {
+        mem.write(waddr, wdata, wmask)
+      }
+      // Three requestors, 1 slot
+      // Priority is incoming reads for RMW > writes from RMW > incoming reads
+      when (reads(0).valid && isThisBank(reads(0).bits)) {
+        ren := true.B
+        when (isThisBank(reads(1).bits)) {
+          reads(1).ready := false.B
+        }
+      } .elsewhen ((w_q_head.asBools zip w_q.map(_.valid)).map({ case (h,v) => h && v }).reduce(_||_)) {
+        wen := true.B
+        when (isThisBank(reads(1).bits)) {
+          reads(1).ready := false.B
+        }
+      } .otherwise {
+        ren := isThisBank(reads(1).bits)
+        raddr := getBankIdx(reads(1).bits)
+      }
+    }
   }
 
-  val q = Module(new Queue(new PipelinedRdataAndActT, 1, true, true))
-  q.io.enq.bits.data := mem.io.rdata
-  q.io.enq.bits.full_data := mem.io.rdata
+  val q = Module(new Queue(new AccumulatorReadResp(t, scale_args.multiplicand_t, log2Ceil(t.head.head.getWidth)), 1, true, true))
+  q.io.enq.bits.data := read_rdata
   q.io.enq.bits.scale := RegNext(io.read.req.bits.scale)
   q.io.enq.bits.relu6_shift := RegNext(io.read.req.bits.relu6_shift)
   q.io.enq.bits.act := RegNext(io.read.req.bits.act)
   q.io.enq.bits.fromDMA := RegNext(io.read.req.bits.fromDMA)
+  q.io.enq.bits.acc_bank_id := DontCare
   q.io.enq.valid := RegNext(io.read.req.fire())
 
-  val p = Pipeline(q.io.deq, mem_pipeline, Seq.fill(mem_pipeline)((x: PipelinedRdataAndActT) => x) :+ {
-    x: PipelinedRdataAndActT =>
-      val activated_rdata = VecInit(x.data.map(v => VecInit(v.map { e =>
-        // val e_scaled = e >> x.shift
-        val e_scaled = scale_args.scale_func(e, x.scale)
-        val e_clipped = e_scaled.clippedToWidthOf(rdataType.head.head)
-        val e_act = MuxCase(e_clipped, Seq(
-          (x.act === Activation.RELU) -> e_clipped.relu,
-          (x.act === Activation.RELU6) -> e_clipped.relu6(x.relu6_shift)))
 
-        e_act
-      })))
+  val p = q.io.deq
 
-      val result = WireInit(x)
-      result.data := activated_rdata
+  io.read.resp.bits.data := p.bits.data
+  io.read.resp.bits.fromDMA := p.bits.fromDMA
+  io.read.resp.bits.relu6_shift := p.bits.relu6_shift
+  io.read.resp.bits.act := p.bits.act
+  io.read.resp.bits.scale := p.bits.scale
+  io.read.resp.bits.acc_bank_id := DontCare // This is set in Scratchpad
+  io.read.resp.valid := p.valid
+  p.ready := io.read.resp.ready
 
-      result
-  })
 
   val q_will_be_empty = (q.io.count +& q.io.enq.fire()) - q.io.deq.fire() === 0.U
   io.read.req.ready := q_will_be_empty && (
     // Make sure we aren't accumulating, which would take over both ports
     !(io.write.fire() && io.write.bits.acc) &&
     // Make sure we aren't reading something that is still being written
     !(RegNext(io.write.fire()) && RegNext(io.write.bits.addr) === io.read.req.bits.addr) &&
-    !(w_buf_valid && waddr_buf === io.read.req.bits.addr)
-  )
-  io.read.resp.bits.data := p.bits.data
-  io.read.resp.bits.full_data := p.bits.full_data
-  io.read.resp.bits.fromDMA := p.bits.fromDMA
-  io.read.resp.valid := p.valid
-  p.ready := io.read.resp.ready
+    !(w_buf_valid && waddr_buf === io.read.req.bits.addr) &&
+    !block_read_req
+  )
 
-  if (read_small_data)
-    io.read.resp.bits.data := p.bits.data
-  else
-    io.read.resp.bits.data := 0.U.asTypeOf(p.bits.data) // TODO make this DontCare instead
 
-  if (read_full_data)
-    io.read.resp.bits.full_data := p.bits.full_data
-  else
-    io.read.resp.bits.full_data := 0.U.asTypeOf(q.io.enq.bits.full_data) // TODO make this DontCare instead
 
   // io.write.current_waddr.valid := mem.io.wen
   // io.write.current_waddr.bits := mem.io.waddr
-  io.write.ready := !io.write.bits.acc || (!(io.write.bits.addr === mem.io.waddr && mem.io.wen) &&
+  io.write.ready := !io.write.bits.acc || (!(io.write.bits.addr === waddr_buf && w_buf_valid) &&
     !(io.write.bits.addr === RegNext(io.write.bits.addr) && RegNext(io.write.fire())))
 
   // assert(!(io.read.req.valid && io.write.en && io.write.acc), "reading and accumulating simultaneously is not supported")