Notebook

Why Do Neural Networks Need Activation Functions?

January 31, 2025

Understanding Why Depth Needs Non-Linearity

Deep learning models are built from layers of neurons, but one of the most important ingredients that makes them work effectively is the activation function.

If you’ve ever wondered:

  • Why can’t we just stack many linear layers?
  • What does an activation function actually do?
  • How does setting some values to 0 help learning?

This post will break it all down in an intuitive way.


1. What Happens Without Activation Functions?

At their core, neural networks perform a series of mathematical transformations on input data. A simple linear layer applies the transformation:

[ y = Wx + b ]

where:

  • ( W ) is a weight matrix that scales and mixes the inputs,
  • ( x ) is the input vector (e.g., an image, text embedding, etc.),
  • ( b ) is a bias term that shifts the output.
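
In PyTorch (one of the frameworks mentioned at the end of this post), this transformation is exactly what nn.Linear computes. A minimal sketch, with layer sizes chosen arbitrarily for illustration:

    import torch
    import torch.nn as nn

    # A single linear layer: y = Wx + b
    linear = nn.Linear(in_features=4, out_features=3)  # sizes are arbitrary

    x = torch.randn(4)   # input vector
    y = linear(x)        # computes W @ x + b
    print(y.shape)       # torch.Size([3])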

If we stack multiple linear layers, you might expect the network to learn more complex representations. However, something surprising happens:

[ h_1 = W_1 x + b_1 ]
[ h_2 = W_2 h_1 + b_2 ]

Substituting ( h_1 ):

[ h_2 = W_2 (W_1 x + b_1) + b_2 ]
[ h_2 = (W_2 W_1) x + (W_2 b_1 + b_2) ]

This is still just one big linear function: the two layers are equivalent to a single layer with weight ( W_2 W_1 ) and bias ( W_2 b_1 + b_2 ). No matter how many linear layers we stack, the result collapses back to a single linear transformation, so we never gain the ability to model complex, non-linear patterns.

This means:

  • The network can only separate data using straight lines (or hyperplanes in higher dimensions).
  • It cannot learn the curved decision boundaries needed for real-world problems.
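
You can check this collapse numerically. The sketch below stacks two nn.Linear layers with no activation in between and verifies that a single layer with weight ( W_2 W_1 ) and bias ( W_2 b_1 + b_2 ) gives the same output (layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    lin1 = nn.Linear(4, 5)
    lin2 = nn.Linear(5, 3)
    x = torch.randn(4)

    # Two stacked linear layers, no activation in between
    h2 = lin2(lin1(x))

    # The equivalent single linear layer
    W = lin2.weight @ lin1.weight            # W2 W1
    b = lin2.weight @ lin1.bias + lin2.bias  # W2 b1 + b2
    h2_collapsed = W @ x + b

    print(torch.allclose(h2, h2_collapsed, atol=1e-6))  # True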

2. How Activation Functions Break This Limitation

To make neural networks truly powerful, we need to introduce non-linearity. This is where activation functions come in.

When we apply an activation function ( f(x) ) after each layer:

[ h_1 = f(W_1 x + b_1) ]
[ h_2 = f(W_2 h_1 + b_2) ]

Now the layers can no longer be collapsed into one big linear equation. Each layer transforms the data differently depending on the activation function, which allows the network to model complex, non-linear relationships.
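
Concretely, the same two-layer stack with activations in between looks like this (a minimal sketch; sizes are arbitrary). With the non-linearities in place, there is no single weight matrix and bias that reproduce the whole stack for every input:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    model = nn.Sequential(
        nn.Linear(4, 5),
        nn.ReLU(),       # h1 = f(W1 x + b1)
        nn.Linear(5, 3),
        nn.ReLU(),       # h2 = f(W2 h1 + b2)
    )

    x = torch.randn(4)
    h2 = model(x)
    print(h2.shape)      # torch.Size([3])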

But why does this work?


3. How Does an Activation Function “Break It Up”?

The key to activation functions is that they change how information flows between layers.

Without activation functions:

  • Each layer predictably transforms the input in a fixed, linear way.
  • The next layer can always adjust for the previous layer’s transformation, meaning the network behaves like one big layer.

With activation functions:

  • Certain values get modified (or set to zero).
  • This forces the next layer to adapt in different ways depending on the input.
  • Instead of a single, predictable transformation, the network now learns piecewise, non-linear functions that approximate curves.

Example: ReLU Activation

[ \text{ReLU}(x) = \max(0, x) ]

If ( x ) is negative, it becomes 0 (neuron turns off). If ( x ) is positive, it stays the same (neuron passes information forward).

This breaks linearity because some neurons completely stop contributing in certain areas of the input space. The next layer must now adapt based on which neurons are active, rather than always seeing the same structured transformation.

This is what lets deep networks approximate curves instead of just drawing straight lines.
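
A quick way to see the piecewise behaviour in code (a minimal sketch):

    import torch

    x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
    out = torch.relu(x)
    print(out)       # negatives are zeroed, positives pass through unchanged
    print(out > 0)   # which "neurons" are active for this particular input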


4. Why ReLU or GELU?

ReLU (Rectified Linear Unit) is the most common activation function because:

  • It’s simple and fast (just a max operation).
  • It doesn’t saturate for positive inputs, which mitigates the vanishing-gradient problem that affects Sigmoid and Tanh.
  • It creates sparse activations, helping the model learn more efficiently.

However, ReLU has a problem: if a neuron always receives negative inputs, its output and its gradient are both zero, so it stops learning entirely. Such neurons are said to “die”.

GELU (Gaussian Error Linear Unit) is a smoother alternative used in models like Transformers (GPT, BERT). It behaves like ReLU for large values but gradually scales small negative values instead of cutting them off completely. This leads to better training stability in deep models.
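
You can see the difference by evaluating both on a few small values (a minimal sketch): ReLU cuts every negative input to zero, while GELU lets small negative values through at a reduced scale.

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0])

    print(F.relu(x))   # all negatives become exactly 0
    print(F.gelu(x))   # small negatives are scaled down smoothly, not zeroed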

  • Use ReLU for most applications (CNNs, general deep learning).
  • Use GELU for deep architectures where smooth activation helps, like NLP models.

5. Final Thoughts: Activation Functions Are the Key to Depth

If there’s one thing to take away from this post, it’s this:

  • Without activation functions, neural networks are just big linear equations.
  • Activation functions “break up” predictability, enabling deep networks to learn complex patterns.
  • ReLU is simple and fast; GELU is smoother and better for deep architectures.

Deep learning isn’t just about stacking layers—it’s about stacking them in a way that introduces the right kinds of transformations. Activation functions are the secret sauce that make deep learning truly deep.


Want to Go Deeper?

  • Try experimenting with activation functions in PyTorch or TensorFlow.
  • Look at how ReLU and GELU behave visually (a quick plotting sketch follows below).
  • Read more about how activation functions impact learning dynamics.
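
As a starting point, here’s a small plotting sketch (it assumes PyTorch and matplotlib are installed):

    import torch
    import torch.nn.functional as F
    import matplotlib.pyplot as plt

    x = torch.linspace(-4, 4, 200)

    plt.plot(x.numpy(), F.relu(x).numpy(), label="ReLU")
    plt.plot(x.numpy(), F.gelu(x).numpy(), label="GELU")
    plt.axhline(0, color="gray", linewidth=0.5)
    plt.legend()
    plt.title("ReLU vs GELU")
    plt.show()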

Written by Spenser Filler, who thinks... too much sometimes