Commit 3e2bc38

chore: update docs with dnn details
1 parent 742c5fb commit 3e2bc38

3 files changed

Lines changed: 451 additions & 558 deletions

docs/DNN/dnn.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# Deep Neural Networks

Implementation details for our dense DNN model.

## Why 128 Neurons?

Each hidden layer uses 128 neurons.
The choice of 128 is somewhat arbitrary, since it is a hyperparameter, but it follows some common practices:

- Power of 2: 128 is a power of 2 (like 32, 64, 256), which can help with memory alignment and GPU optimization.
- Balanced Complexity: It's large enough to capture meaningful patterns in the data (like edges or shapes in MNIST), but not so large that it invites overfitting or excessive computation.
- Empirical Performance: Through trial and error, people have found that 128 often works well for small datasets like MNIST.
- Historical Precedent: Many tutorials and papers use 128 as a default starting point for hidden units.
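
To make the "balanced complexity" point concrete, the sketch below counts trainable parameters for a few hidden-layer widths. The architecture is an assumption, not something this document states: 784 inputs (a flattened 28x28 MNIST image), two hidden layers of equal width, and 10 output classes. `dense_param_count` is a hypothetical helper written for this illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per input/output pair, plus one bias per output unit.
        total += n_in * n_out + n_out
    return total


# Assumed architecture: 784 inputs, two hidden layers, 10 output classes.
for width in (32, 64, 128, 256):
    sizes = [784, width, width, 10]
    print(f"hidden width {width}: {dense_param_count(sizes)} parameters")
```

Under these assumptions, a width of 128 gives 118,282 parameters, most of them in the first layer, which is small enough to train quickly on a CPU.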

---

## Model Compilation

### Why the Adam Optimizer Is a Good Choice

Adam stands for Adaptive Moment Estimation, and it's one of the most popular optimization algorithms in deep learning.

#### Key Advantages

- Adaptive Learning Rates: Each weight gets its own effective step size that adapts during training, which often speeds up convergence.
- Combines Momentum & RMSProp: Uses both momentum (to accelerate SGD) and adaptive scaling (to handle noisy gradients).
- Robust to Hyperparameters: Works well with default settings (like learning rate = 0.001), so less tuning is needed.
- Good Performance on MNIST: For simple datasets like MNIST, Adam usually converges quickly and reliably.

#### How It Works (Simplified)

Adam keeps moving averages of the gradients (first moment) and of the squared gradients (second moment), and scales each parameter update using these statistics. The averaging damps oscillations, and the per-parameter scaling keeps updates at a useful size even when raw gradients are very small or very large.

In short: Adam is fast, stable, and works well out-of-the-box, especially for small networks and datasets like MNIST.
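
The moving-average idea can be sketched for a single scalar parameter as below. This is a minimal pure-Python illustration of the standard Adam update with the commonly used defaults (learning rate 0.001, beta1 = 0.9, beta2 = 0.999); it is not the framework's implementation, and `adam_step` and `minimize_quadratic` are hypothetical names chosen for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimize_quadratic(x0=1.0, steps=100):
    """Run Adam on f(x) = x**2, whose gradient is 2 * x."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, 2 * x, m, v, t)
    return x


print(minimize_quadratic())  # moves from 1.0 toward the minimum at 0
```

Note that the first update has a size close to the learning rate regardless of the gradient's magnitude, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is roughly 1 in absolute value at step 1.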

---

### sparse_categorical_crossentropy

#### Why This Loss Function?

This loss is specifically designed for multi-class classification problems where:

- The labels (y_train, y_test) are integers (e.g., 0 through 9).
- The output layer uses softmax activation to produce a probability distribution over classes.
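
As a quick illustration of the second condition, here is a minimal pure-Python softmax that turns raw output scores (logits) into a probability distribution. The logit values are made up for the example, and `softmax` here is a hand-written sketch rather than a library call.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit receives the largest probability, and the outputs always sum to 1, which is what the loss function expects.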

#### What Does It Do?

It compares the predicted probability distribution (from softmax) with the true label (e.g., class 3). The loss for one sample is the negative log of the probability assigned to the true class, so it grows as that probability shrinks.

#### Example

If your model predicts:

```python
[0.05, 0.02, 0.03, 0.8, ...]  # High probability on class 3
```

and the true label is `3`, the loss will be low because the model assigned high probability to the correct class.

But if the model says:

```python
[0.4, 0.3, 0.2, 0.1, ...]  # Not confident about class 3
```

then the loss will be higher.
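
The two cases can be checked numerically with a minimal per-sample version of the loss. This is a sketch, not the framework's implementation (which also clips probabilities to avoid `log(0)`). The entries beyond the first four in each vector are hypothetical filler so that each vector has ten values summing to 1; only the probability at the true class affects the result.

```python
import math


def sparse_categorical_crossentropy(probs, true_class):
    """Per-sample loss: negative log of the probability of the true class."""
    return -math.log(probs[true_class])


# First four values are from the example; the rest are hypothetical filler.
confident = [0.05, 0.02, 0.03, 0.8, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
unsure = [0.4, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(sparse_categorical_crossentropy(confident, 3))  # about 0.22
print(sparse_categorical_crossentropy(unsure, 3))     # about 2.30
```

The second loss is roughly ten times the first, even though both predictions are over the same ten classes.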

---

### Why Track Accuracy?

Accuracy measures how often the model makes the correct prediction.
It's easy to understand: e.g., "97% accuracy" means 97% of predictions were correct.

#### When Accuracy Might Be Misleading

On imbalanced datasets (e.g., 90% of samples are class 0), accuracy can be misleading, because a model that always predicts the majority class still scores highly. For MNIST, the classes are roughly balanced, so accuracy is a valid and useful metric.
You can also add more metrics (like precision, recall, F1-score) if you want deeper insight into performance per class.
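
A small synthetic example shows the problem. The labels below are made up to match the 90% / 10% split mentioned above, and the "model" is a stand-in that always predicts class 0; `accuracy` and `recall` are hand-written helpers for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not preds_for_cls:
        return 0.0
    return sum(p == cls for p in preds_for_cls) / len(preds_for_cls)


# Synthetic imbalanced data: 90 samples of class 0, 10 samples of class 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a "model" that always predicts the majority class

print(accuracy(y_true, y_pred))   # 0.9
print(recall(y_true, y_pred, 1))  # 0.0
```

Accuracy looks respectable at 90%, yet the model never identifies a single class 1 sample, which per-class recall exposes immediately.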

---

## Model Fitting

### Why Use 3 Epochs?

Let's first define what an epoch is:
An epoch is one full pass through the entire training dataset.
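
In terms of weight updates, one epoch consists of many mini-batch steps. The sketch below works this out under two assumptions that this document does not state: the standard MNIST training split of 60,000 images is used in full, and the batch size is 32 (the Keras `fit` default). `updates_for` is a hypothetical helper.

```python
import math


def updates_for(num_samples, batch_size, epochs):
    """Return (steps per epoch, total weight updates) for mini-batch training."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Assumed: 60,000 training images, batch size 32, 3 epochs.
steps, total = updates_for(60_000, 32, 3)
print(steps, total)  # 1875 5625
```

So even a short 3-epoch run applies several thousand optimizer updates, which helps explain why a small network can learn MNIST quickly.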

#### Why 3 Might Be Used

- Speed: Training for only 3 epochs is fast, which is useful for quick experimentation or testing code.
- Avoid Overfitting: Fewer epochs limit how closely the model can fit noise in the training data (not really a concern for MNIST).
- Baseline Start: People often start with a small number of epochs to see if the model learns at all before increasing.

#### Is 3 Enough?

For MNIST, which is a simple and clean dataset, even 1 epoch might give decent results (~90%+ accuracy). Approximate accuracy by number of epochs:

- 1 epoch -> ~90-92%
- 3 epochs -> ~95-97%
- 5-10 epochs -> ~98%+

So while 3 epochs is better than 1, it's still on the lower side for achieving the best possible performance. Usually, people train MNIST models for 5-10 epochs to reach near-optimal accuracy.
File renamed without changes.

mnist_modelling.ipynb

Lines changed: 346 additions & 558 deletions
Large diffs are not rendered by default.
