fastmachinelearning
diff --git a/1_getting_started/1a_train_keras.ipynb b/1_getting_started/1a_train_keras.ipynb
Lines changed: 15 additions & 5 deletions
diff --git a/1_getting_started/1b_train_pytorch.ipynb b/1_getting_started/1b_train_pytorch.ipynb
Lines changed: 16 additions & 6 deletions
diff --git a/2_quantization/2a_qkeras.ipynb b/2_quantization/2a_qkeras.ipynb
Lines changed: 16 additions & 3 deletions
diff --git a/3_advanced_config/3a_reuse_factor.ipynb b/3_advanced_config/3a_reuse_factor.ipynb
Lines changed: 13 additions & 3 deletions
diff --git a/3_advanced_config/3b_profiling.ipynb b/3_advanced_config/3b_profiling.ipynb
Lines changed: 16 additions & 6 deletions
@@ -42,15 +42,15 @@
     "\n",
     "Jets are collimated sprays of particles produced when quarks or gluons are knocked out of colliding protons at the LHC. Identifying the origin of a jet in real time is a core task for LHC trigger systems, which must decide within a few microseconds whether to keep or discard each collision event.\n",
     "\n",
-    "The dataset contains 16 high-level jet substructure observables derived from simulated proton-proton collisions at √s = 13 TeV. These include energy correlation functions, N-subjettiness ratios, a groomed jet mass, and constituent multiplicity. The goal is to classify each jet into one of five categories:\n",
+    "The dataset contains 16 high-level jet substructure observables derived from simulated proton-proton collisions at \u221as = 13 TeV. These include energy correlation functions, N-subjettiness ratios, a groomed jet mass, and constituent multiplicity. The goal is to classify each jet into one of five categories:\n",
     "\n",
     "| Label | Jet origin |\n",
     "|-------|------------|\n",
     "| `g`   | Gluon |\n",
     "| `q`   | Light quark |\n",
-    "| `w`   | W boson decay (W → qq') |\n",
-    "| `z`   | Z boson decay (Z → qq') |\n",
-    "| `t`   | Top quark decay (t → bqq') |"
+    "| `w`   | W boson decay (W \u2192 qq') |\n",
+    "| `z`   | Z boson decay (Z \u2192 qq') |\n",
+    "| `t`   | Top quark decay (t \u2192 bqq') |"
    ]
   },
   {
@@ -204,7 +204,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "An accuracy of ~75% is expected for this 5-class problem — random guessing gives only 20%, and some classes (notably gluon vs. light quark) are physically very similar and genuinely hard to separate even with more sophisticated methods.\n",
+    "An accuracy of ~75% is expected for this 5-class problem \u2014 random guessing gives only 20%, and some classes (notably gluon vs. light quark) are physically very similar and genuinely hard to separate even with more sophisticated methods.\n",
     "\n",
     "The ROC (Receiver Operating Characteristic) curve shows, for each class, the trade-off between signal efficiency (true positive rate) and background efficiency (false positive rate) as the decision threshold is varied. The area under the curve (AUC) ranges from 0.5 (random classifier) to 1.0 (perfect). Higher and further to the upper-left is better."
    ]
@@ -225,6 +225,16 @@
     "\n",
     "Your model is trained and saved. Open **`1c_hls4ml_synth.ipynb`** to convert it to an FPGA design with hls4ml."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c38f1d6",
+   "metadata": {},
+   "source": [
+    "## Further reading\n",
+    "\n",
+    "For more details, see: Duarte, Han, Harris et al., \"Fast inference of deep neural networks in FPGAs for particle physics\", JINST 13 P07027 (2018), [arXiv:1804.06913](https://arxiv.org/abs/1804.06913)"
+   ]
   }
  ],
  "metadata": {
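
Reviewer sketch (not part of the notebook diff): the ROC and AUC description in the markdown cell of this file can be checked with a few lines of NumPy. The helper names `roc_curve_binary` and `auc_trapezoid` are hypothetical, the toy labels and scores are made up, and the code assumes there are no tied scores. The notebook itself may well compute its curves with a library routine instead.

```python
import numpy as np


def roc_curve_binary(is_signal, score):
    """One-vs-rest ROC points for a single class (assumes no tied scores)."""
    is_signal = np.asarray(is_signal, dtype=bool)
    # Sweep the decision threshold from the highest score downwards.
    order = np.argsort(-np.asarray(score, dtype=float), kind="stable")
    hits = is_signal[order]
    # Signal efficiency (true positive rate) after accepting each candidate.
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    # Background efficiency (false positive rate) at the same thresholds.
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / (~hits).sum()])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example: 1 marks the class of interest, 0 marks everything else.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_binary(labels, scores)
print("AUC =", auc_trapezoid(fpr, tpr))
```

For the five-class tagger, the same function would be called once per class, with `is_signal` set to "true label equals this class" and `score` set to that class's predicted probability.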
 
@@ -43,15 +43,15 @@
     "\n",
     "Jets are collimated sprays of particles produced when quarks or gluons are knocked out of colliding protons at the LHC. Identifying the origin of a jet in real time is a core task for LHC trigger systems, which must decide within a few microseconds whether to keep or discard each collision event.\n",
     "\n",
-    "The dataset contains 16 high-level jet substructure observables derived from simulated proton-proton collisions at √s = 13 TeV. These include energy correlation functions, N-subjettiness ratios, a groomed jet mass, and constituent multiplicity. The goal is to classify each jet into one of five categories:\n",
+    "The dataset contains 16 high-level jet substructure observables derived from simulated proton-proton collisions at \u221as = 13 TeV. These include energy correlation functions, N-subjettiness ratios, a groomed jet mass, and constituent multiplicity. The goal is to classify each jet into one of five categories:\n",
     "\n",
     "| Label | Jet origin |\n",
     "|-------|------------|\n",
     "| `g`   | Gluon |\n",
     "| `q`   | Light quark |\n",
-    "| `w`   | W boson decay (W → qq') |\n",
-    "| `z`   | Z boson decay (Z → qq') |\n",
-    "| `t`   | Top quark decay (t → bqq') |"
+    "| `w`   | W boson decay (W \u2192 qq') |\n",
+    "| `z`   | Z boson decay (Z \u2192 qq') |\n",
+    "| `t`   | Top quark decay (t \u2192 bqq') |"
    ]
   },
   {
@@ -129,7 +129,7 @@
    "outputs": [],
    "source": [
     "class JetTagger(nn.Module):\n",
-    "    \"\"\"Simple 3-hidden-layer jet tagger: 16 → 64 → 32 → 32 → 5.\"\"\"\n",
+    "    \"\"\"Simple 3-hidden-layer jet tagger: 16 \u2192 64 \u2192 32 \u2192 32 \u2192 5.\"\"\"\n",
     "    def __init__(self):\n",
     "        super().__init__()\n",
     "        self.fc1 = nn.Linear(16, 64)\n",
@@ -223,7 +223,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "An accuracy of ~75% is expected for this 5-class problem — random guessing gives only 20%, and some classes (notably gluon vs. light quark) are physically very similar and genuinely hard to separate even with more sophisticated methods.\n",
+    "An accuracy of ~75% is expected for this 5-class problem \u2014 random guessing gives only 20%, and some classes (notably gluon vs. light quark) are physically very similar and genuinely hard to separate even with more sophisticated methods.\n",
     "\n",
     "The ROC (Receiver Operating Characteristic) curve shows, for each class, the trade-off between signal efficiency (true positive rate) and background efficiency (false positive rate) as the decision threshold is varied. The area under the curve (AUC) ranges from 0.5 (random classifier) to 1.0 (perfect). Higher and further to the upper-left is better."
    ]
@@ -244,6 +244,16 @@
     "\n",
     "Your model is trained and saved. Open **`1c_hls4ml_synth.ipynb`** to convert it to an FPGA design with hls4ml."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2603bd5",
+   "metadata": {},
+   "source": [
+    "## Further reading\n",
+    "\n",
+    "For more details, see: Duarte, Han, Harris et al., \"Fast inference of deep neural networks in FPGAs for particle physics\", JINST 13 P07027 (2018), [arXiv:1804.06913](https://arxiv.org/abs/1804.06913)"
+   ]
   }
  ],
  "metadata": {
 
@@ -7,7 +7,7 @@
    "source": [
     "# Part 2a: Quantization with QKeras\n",
     "\n",
-    "In this notebook we retrain the jet tagger from Part 1 using **QKeras** (Quantized Keras). With quantization-aware training (QAT), the model is trained with low-precision, fixed-point weights, so the optimizer can correct the effect of quantization during training rather than after — enabling lower precisions (and thus resource consumption) without sacrificing accuracy.\n",
+    "In this notebook we retrain the jet tagger from Part 1 using **QKeras** (Quantized Keras). With quantization-aware training (QAT), the model is trained with low-precision, fixed-point weights, so the optimizer can correct the effect of quantization during training rather than after \u2014 enabling lower precisions (and thus resource consumption) without sacrificing accuracy.\n",
     "\n",
     "Make sure you have run `1_getting_started/1a_train_keras.ipynb` first, as we load its saved data and use its model as a baseline for comparison."
    ]
@@ -178,7 +178,7 @@
     "The baseline is shown with solid lines, the quantized model with dashed lines.\n",
     "\n",
     "We should also check that hls4ml can respect the choice to use 6-bits throughout the model and match the accuracy. We'll generate a configuration from this quantized model and plot its performance as the dotted line.\n",
-    "The generated configuration is printed out. You'll notice that it uses 7 bits for the type, but we specified 6 — that's because QKeras doesn't count the sign bit, so the type that actually gets used needs 1 more.\n",
+    "The generated configuration is printed out. You'll notice that it uses 7 bits for the type, but we specified 6 \u2014 that's because QKeras doesn't count the sign bit, so the type that actually gets used needs 1 more.\n",
     "\n",
     "We also use the `OutputRoundingSaturationMode` optimizer pass of hls4ml to set the Activation layers to round rather than truncate the cast. This is important for getting good model accuracy at small bit precision. "
    ]
@@ -294,7 +294,7 @@
     "\n",
     "For a more accurate picture of resource consumption, you should run **Vivado synthesis** (`vsynth`). This invokes the full Vivado synthesis flow on the generated RTL, producing estimates that are much closer to what you would see after implementation (place-and-route).\n",
     "\n",
-    "**This step can take 10–20 minutes.**"
+    "**This step can take 10\u201320 minutes.**"
    ]
   },
   {
@@ -316,6 +316,19 @@
    "source": [
     "hls4ml.report.read_vivado_report('../hls4ml_prjs/hls4ml_prj_quantized_part2')"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14989120",
+   "metadata": {},
+   "source": [
+    "## Further reading\n",
+    "\n",
+    "For more details, see:\n",
+    "\n",
+    "- Coelho Jr., Kuusela, Zhuang et al., \"Ultra Low-latency, Low-area Inference Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml\", arXiv (2020), [arXiv:2006.10159](http://arxiv.org/abs/2006.10159)\n",
+    "- Coelho Jr., Kuusela, Li et al., \"Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors\", Nature Machine Intelligence (2021), [doi:10.1038/s42256-021-00356-5](https://www.nature.com/articles/s42256-021-00356-5)"
+   ]
   }
  ],
  "metadata": {
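
Reviewer sketch (not part of the notebook diff): the remark in this notebook that activations should round rather than truncate can be illustrated without hls4ml. The `quantize` helper below is hypothetical. It models truncation as rounding toward negative infinity and rounding as round-half-up to the nearest step, and it ignores saturation entirely. It only shows why truncation introduces a systematic negative bias of about half a step, which becomes significant when few fractional bits are available.

```python
import numpy as np


def quantize(x, frac_bits, mode):
    """Snap x onto a grid with spacing 2**-frac_bits (no saturation modelled)."""
    step = 2.0 ** -frac_bits
    scaled = np.asarray(x, dtype=float) / step
    if mode == "truncate":
        q = np.floor(scaled)        # always moves values downwards
    elif mode == "round":
        q = np.floor(scaled + 0.5)  # moves values to the nearest grid point
    else:
        raise ValueError("mode must be 'truncate' or 'round'")
    return q * step


frac_bits = 5                       # assumed small precision for illustration
step = 2.0 ** -frac_bits
x = np.linspace(-1.0, 1.0, 10000, endpoint=False)

trn_bias = float(np.mean(quantize(x, frac_bits, "truncate") - x))
rnd_bias = float(np.mean(quantize(x, frac_bits, "round") - x))
print(f"step = {step}, truncation bias = {trn_bias:.5f}, rounding bias = {rnd_bias:.5f}")
```

With truncation the average error sits near minus half a step, and that offset is passed on to every following layer. With rounding the average error is close to zero.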
 
@@ -85,7 +85,7 @@
    "cell_type": "markdown",
    "id": "cell-5",
    "metadata": {},
-   "source": "## What is the ReuseFactor?\n\nIn the default (`ReuseFactor = 1`) configuration, hls4ml instantiates one multiplier for every weight in the network. All multiplications for a given layer happen in a single clock cycle, giving the minimum possible latency — but using the most multipliers.\n\nSetting `ReuseFactor = N` tells hls4ml to time-multiplex the same multiplier hardware across `N` weight-input pairs. This means the layer takes `N` clock cycles to compute instead of one, but uses roughly `1/N` as many multipliers.\n\n![Reuse factor diagram](../images/part3a_reuse_factor.png)\n\nThe reuse factor must evenly divide the number of weights in each layer. For example, the first layer has `16 × 64 = 1024` weights, so valid reuse factors include 1, 2, 4, 8, 16, 32, 64, etc.\n\nChanging the reuse factor does **not** change the model accuracy — the same arithmetic is performed, just spread over more clock cycles. We will verify this below."
+   "source": "## What is the ReuseFactor?\n\nIn the default (`ReuseFactor = 1`) configuration, hls4ml instantiates one multiplier for every weight in the network. All multiplications for a given layer happen in a single clock cycle, giving the minimum possible latency \u2014 but using the most multipliers.\n\nSetting `ReuseFactor = N` tells hls4ml to time-multiplex the same multiplier hardware across `N` weight-input pairs. This means the layer takes `N` clock cycles to compute instead of one, but uses roughly `1/N` as many multipliers.\n\n![Reuse factor diagram](../images/part3a_reuse_factor.png)\n\nThe reuse factor must evenly divide the number of weights in each layer. For example, the first layer has `16 \u00d7 64 = 1024` weights, so valid reuse factors include 1, 2, 4, 8, 16, 32, 64, etc.\n\nChanging the reuse factor does **not** change the model accuracy \u2014 the same arithmetic is performed, just spread over more clock cycles. We will verify this below."
   },
   {
    "cell_type": "markdown",
@@ -94,7 +94,7 @@
    "source": [
     "## Set ReuseFactor = 4\n",
     "\n",
-    "Let's create a new configuration with `ReuseFactor = 4` set globally. Note that we use `granularity='model'` here, which applies one set of defaults to all layers — equivalent to the config from Part 1c."
+    "Let's create a new configuration with `ReuseFactor = 4` set globally. Note that we use `granularity='model'` here, which applies one set of defaults to all layers \u2014 equivalent to the config from Part 1c."
    ]
   },
   {
@@ -152,7 +152,7 @@
    "source": [
     "## Compare\n",
     "\n",
-    "Changing the reuse factor only affects resource usage and latency — not accuracy. Let's verify that the accuracy and ROC curves are identical."
+    "Changing the reuse factor only affects resource usage and latency \u2014 not accuracy. Let's verify that the accuracy and ROC curves are identical."
    ]
   },
   {
@@ -230,6 +230,16 @@
     "print('ReuseFactor = 1 (Part 1c baseline):')\n",
     "hls4ml.report.read_vivado_report('../hls4ml_prjs_/hls4ml_prj_base_part1')"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63bbdc7c",
+   "metadata": {},
+   "source": [
+    "## Further reading\n",
+    "\n",
+    "For more details, see: Schulte, Ramhorst, Sun et al., \"hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware\", ACM Trans. Reconfigurable Technol. Syst. (2026), [doi:10.1145/3801979](https://dl.acm.org/doi/abs/10.1145/3801979)"
+   ]
   }
  ],
  "metadata": {
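
Reviewer sketch (not part of the notebook diff): the divisibility rule and the `1/N` multiplier scaling described in the "What is the ReuseFactor?" cell can be worked through numerically. The layer shapes below assume the 16, 64, 32, 32, 5 architecture mentioned in the PyTorch notebook of this PR. The helpers `divisors` and `multipliers` are hypothetical and apply only the rule as stated in the text. The checks hls4ml actually performs may differ.

```python
from functools import reduce
from math import gcd


def divisors(n):
    """All positive integers that divide n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def multipliers(n_weights, reuse_factor):
    """Approximate multiplier count for one layer at a given reuse factor."""
    if n_weights % reuse_factor != 0:
        raise ValueError("reuse factor must evenly divide the number of weights")
    return n_weights // reuse_factor


# Assumed dense layer shapes (inputs, outputs) for the jet tagger.
layer_shapes = [(16, 64), (64, 32), (32, 32), (32, 5)]
n_weights = [n_in * n_out for n_in, n_out in layer_shapes]

# Reuse factors usable as a single global setting must divide every layer.
common_reuse_factors = divisors(reduce(gcd, n_weights))

print("weights per layer:", n_weights)
print("reuse factors valid for all layers:", common_reuse_factors)
print("multipliers at ReuseFactor = 4:", [multipliers(n, 4) for n in n_weights])
```

Under these assumptions the small output layer is what limits the choice of a global reuse factor, which is one reason to set the value per layer when more aggressive sharing is wanted.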
 
@@ -109,7 +109,7 @@
     "\n",
     "With `auto` precision:\n",
     "- `weight` and `bias` default to the global model precision (`ap_fixed<16,6>`).\n",
-    "- `result` and `accum` are computed to be wide enough to hold the worst-case accumulation without overflow — these are often much wider than 16 bits. While ensuring no accuracy loss, it may come at the expense of resources.\n",
+    "- `result` and `accum` are computed to be wide enough to hold the worst-case accumulation without overflow \u2014 these are often much wider than 16 bits. While ensuring no accuracy loss, it may come at the expense of resources.\n",
     "\n",
     "This configuration is a useful starting point for manual tuning -- you can inspect the profiling plots below to decide which layers can safely use narrower types and accordingly update the config."
    ]
@@ -141,7 +141,7 @@
     "The `hls4ml.model.profiling.numerical` function plots the distribution of weights and biases as a box-and-whisker chart. The **grey boxes** show the range representable with the data types set in the hls4ml config.\n",
     "\n",
     "The rule of thumb:\n",
-    "- The grey box should cover the full whisker **to the right** (large values) — otherwise weights saturate or wrap around.\n",
+    "- The grey box should cover the full whisker **to the right** (large values) \u2014 otherwise weights saturate or wrap around.\n",
     "- It is acceptable for the box not to reach the left whisker (small values): those weights are simply rounded to zero, which is *often* harmless.\n",
     "\n",
     "Providing data (here the first 1000 test samples for speed) also shows the same distributions at the **output of each layer**, which reveals whether the activation dynamic range is well-matched to the fixed-point type.\n",
@@ -182,9 +182,9 @@
    "source": [
     "## Customise precision\n",
     "\n",
-    "After inspecting the profiling plot, let's try narrowing the weight precision of `fc1` from 16 bits to 8 bits (`ap_fixed<8,2>` — 8 total bits, 2 integer bits). This reduces the multiplier width and can save significant LUT and DSP resources.\n",
+    "After inspecting the profiling plot, let's try narrowing the weight precision of `fc1` from 16 bits to 8 bits (`ap_fixed<8,2>` \u2014 8 total bits, 2 integer bits). This reduces the multiplier width and can save significant LUT and DSP resources.\n",
     "\n",
-    "**Note on the output layer:** Using `auto` precision can produce an accumulator at the output of the last fully-connected layer that is wider than the softmax look-up tables can handle. We therefore manually cap it with `fixed<16,6,RND,SAT>`, which also enables rounding and saturation — important when narrowing any type that feeds into a non-linear function."
+    "**Note on the output layer:** Using `auto` precision can produce an accumulator at the output of the last fully-connected layer that is wider than the softmax look-up tables can handle. We therefore manually cap it with `fixed<16,6,RND,SAT>`, which also enables rounding and saturation \u2014 important when narrowing any type that feeds into a non-linear function."
    ]
   },
   {
@@ -254,7 +254,7 @@
    "source": [
     "## Compile, trace, predict\n",
     "\n",
-    "Compile the hls4ml model and call `hls_model.trace` instead of `hls_model.predict`. This returns both the final predictions **and** a dictionary of intermediate layer outputs — one array per layer, keyed by layer name.\n",
+    "Compile the hls4ml model and call `hls_model.trace` instead of `hls_model.predict`. This returns both the final predictions **and** a dictionary of intermediate layer outputs \u2014 one array per layer, keyed by layer name.\n",
     "\n",
     "We collect the same dictionary from the original model for comparison. We only trace the first 1000 samples since tracing is slower than a plain forward pass."
    ]
@@ -303,7 +303,7 @@
    "source": [
     "## Inspect\n",
     "\n",
-    "We can now print, plot, or otherwise compare the output of each layer between the original model and the hls4ml fixed-point emulation. This makes it easy to spot which layer first deviates — a sign that the precision there is too narrow.\n",
+    "We can now print, plot, or otherwise compare the output of each layer between the original model and the hls4ml fixed-point emulation. This makes it easy to spot which layer first deviates \u2014 a sign that the precision there is too narrow.\n",
     "\n",
     "Let's print the first-layer output for the very first test sample."
    ]
@@ -353,6 +353,16 @@
     "leg = Legend(ax, lines, labels=[MODEL_TYPE, 'hls4ml (8-bit fc1)'], loc='lower right', frameon=False)\n",
     "ax.add_artist(leg)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6afa959e",
+   "metadata": {},
+   "source": [
+    "## Further reading\n",
+    "\n",
+    "For more details, see: Schulte, Ramhorst, Sun et al., \"hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware\", ACM Trans. Reconfigurable Technol. Syst. (2026), [doi:10.1145/3801979](https://dl.acm.org/doi/abs/10.1145/3801979)"
+   ]
   }
  ],
  "metadata": {
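
Reviewer sketch (not part of the notebook diff): the grey boxes in the profiling plots correspond to the representable range of each fixed-point type, which can be computed directly. This assumes the usual signed `ap_fixed<W,I>` convention, where `W` is the total width and `I` is the number of integer bits including the sign bit. The helper name `fixed_point_range` is hypothetical.

```python
def fixed_point_range(width, integer):
    """Return (lowest, highest, step) for a signed ap_fixed<width, integer> type."""
    frac_bits = width - integer
    step = 2.0 ** -frac_bits             # smallest representable increment
    lowest = -(2.0 ** (integer - 1))     # most negative value
    highest = 2.0 ** (integer - 1) - step
    return lowest, highest, step


for width, integer in [(16, 6), (8, 2)]:
    lowest, highest, step = fixed_point_range(width, integer)
    print(f"ap_fixed<{width},{integer}>: [{lowest}, {highest}] in steps of {step}")
```

Values above `highest` saturate or wrap, which is why the grey box should cover the right-hand whisker. Values much smaller in magnitude than `step` quantize to zero, which is the left-hand case the notebook describes as often harmless.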