---
title: Activation Functions
sidebar_label: Activation Functions
description: Why we need non-linearity and a deep dive into Sigmoid, Tanh, ReLU, and Softmax.
tags:
---
An Activation Function is a mathematical formula applied to the output of a neuron. Its primary job is to introduce non-linearity into the network. Without activation functions, no matter how many layers you add, your neural network would behave like a simple linear regression model.
Real-world data is rarely a straight line. If we only used linear transformations ($z = Wx + b$), stacking layers would buy us nothing: the composition of linear functions is itself just another linear function.
Non-linear activation functions allow the network to "bend" the decision boundary to fit complex patterns like images, sound, and human language.
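To see why stacking alone doesn't help, here is a minimal NumPy sketch (the weights, biases, and input are made-up illustrative values): two linear layers with no activation collapse into one equivalent linear layer.

```python
import numpy as np

# Two "layers" with no activation function (illustrative weights)
W1, b1 = np.array([[2.0, -1.0], [0.5, 1.0]]), np.array([1.0, 0.0])
W2, b2 = np.array([[1.0, 1.0], [-1.0, 2.0]]), np.array([0.0, 1.0])

x = np.array([3.0, -2.0])

# Passing the input through both layers...
h = W1 @ x + b1
y_stacked = W2 @ h + b2

# ...gives the same result as ONE linear layer with combined weights
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
y_single = W_combined @ x + b_combined

print(np.allclose(y_stacked, y_single))  # True
```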
The Sigmoid function squashes any input value into a range between 0 and 1.
- Formula: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Best For: The output layer of binary classification models.
- Downside: It suffers from the Vanishing Gradient problem; for very high or low inputs, the gradient is almost zero, which kills learning.
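A small NumPy sketch makes the saturation visible. The derivative of the sigmoid is $\sigma(z)(1 - \sigma(z))$, and for large positive or negative inputs it is effectively zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))       # [~0.00005, 0.5, ~0.99995]
print(sigmoid_grad(z))  # [~0.00005, 0.25, ~0.00005]
                        # near-zero gradient at the extremes: learning stalls
```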
ReLU is the default choice for hidden layers in modern deep learning.
- Formula: $f(z) = \max(0, z)$
- Pros: It is computationally very efficient and helps prevent vanishing gradients.
- Cons: "Dying ReLU" — if a neuron's input is always negative, it stays at 0 and never updates its weights again.
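A quick sketch (inputs chosen for illustration) shows both the function and its gradient. Note the zero gradient everywhere the input is negative, which is exactly what traps a "dying" neuron:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))       # [0. 0. 0. 2.]
print(relu_grad(z))  # [0. 0. 0. 1.]
                     # a neuron whose input stays negative gets zero
                     # gradient forever: the "Dying ReLU" problem
```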
Tanh is similar to Sigmoid, but it squashes values between -1 and 1.
- Formula: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
- Pros: It is "zero-centered," meaning the average output is closer to 0, which often makes training faster than Sigmoid.
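A short NumPy comparison (illustrative inputs) shows the zero-centered property side by side with Sigmoid:

```python
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.tanh(z))                  # [-0.96 -0.76  0.    0.76  0.96]
print(np.tanh(z).mean())           # 0.0 -> outputs centered on zero
print((1 / (1 + np.exp(-z))).mean())  # 0.5 -> sigmoid outputs skew positive
```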
| Function | Range | Common Use Case | Main Issue |
|---|---|---|---|
| Sigmoid | (0, 1) | Binary Classification Output | Vanishing Gradient |
| Tanh | (-1, 1) | Hidden Layers (legacy) | Vanishing Gradient |
| ReLU | [0, ∞) | Hidden Layers (Standard) | Dying Neurons |
| Softmax | (0, 1) | Multi-class Output | Only used in Output layer |
When you have more than two categories (e.g., classifying an image as a Cat, Dog, or Bird), we use Softmax in the final layer. It turns the raw outputs (logits) into a probability distribution that sums up to 1.0.
$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

Where:
- $\mathbf{z}$ = vector of raw class scores (logits)
- $K$ = total number of classes
- $\sigma(\mathbf{z})_i$ = probability of class $i$
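Here is a minimal NumPy implementation of the formula above (the logits are made-up values for the Cat/Dog/Bird example). Subtracting the maximum logit before exponentiating is a standard trick to avoid overflow; it does not change the result:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability (result is unchanged)
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores for Cat, Dog, Bird
probs = softmax(logits)
print(probs)        # [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -> a valid probability distribution
```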
In Keras, you specify the activation for each layer:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# Using ReLU for hidden layers and Sigmoid for a binary output
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Alternatively, for multi-class output (3 classes), the final
# layer would use Softmax instead:
# model.add(Dense(3, activation='softmax'))
```

Now that you know how neurons fire, how do we measure how "wrong" their firing pattern is compared to the ground truth?