
In lesson 11, the forward pass gave us predictions. But a neural network is not useful just because it can produce an output once. The real question is: how does it improve? How does it look at a wrong prediction and become slightly better next time?

This lesson completes that missing half. It explains the full learning loop: make a prediction, measure how wrong it was, compute how each parameter contributed to that error, then update all parameters a little so the next prediction is better.

In triage terms, backpropagation is the mechanism that tells the model not only that it was wrong, but which internal pathways contributed most to the mistake.

One core term appears immediately in this lesson: cross-entropy is the loss function that measures how far the model’s predicted probability distribution is from the correct target distribution. It punishes confident wrong predictions especially strongly.

Core learnings about backpropagation

  • Cross-entropy quantifies classification error in probability space.
  • Backpropagation uses chain-rule factorization to compute gradients efficiently layer by layer.
  • Gradient descent updates each weight in the direction that reduces loss.
  • Optimization quality depends on learning rate, batch design, and gradient health.

The full training loop first

Before zooming in on gradients, it helps to see the whole process from a bird’s-eye view. Training a neural network consists of repeating these four steps over and over:

  1. Forward pass: send the input through the network to get a prediction.
  2. Compute loss: compare the prediction with the correct answer and measure the error.
  3. Backpropagation: work backward through the network to compute how much each weight contributed to the error.
  4. Parameter update: slightly adjust every weight and bias to reduce future error.

That is the central thread of the entire lesson. Every concept below belongs to exactly one part of this loop. The loss tells us how wrong the network was. The gradient tells us which direction each parameter should move. Backpropagation is the efficient bookkeeping method that computes those gradients for the whole network. And gradient descent is the rule that actually changes the parameters.
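The four steps of the loop can be sketched in a few lines of NumPy. This is a minimal illustration on a tiny one-layer softmax classifier; the data, sizes, and learning rate here are all made up for demonstration, not taken from the lesson's triage network:

```python
import numpy as np

# Illustrative data: 8 samples, 4 features, 3 classes (e.g. Benign/Urgent/Critical)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = rng.integers(0, 3, size=8)
W = np.zeros((4, 3))   # weights
b = np.zeros(3)        # biases
eta = 0.1              # learning rate

for step in range(500):
    # 1. Forward pass: logits -> softmax probabilities
    z = X @ W + b
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # 2. Compute loss: mean cross-entropy over the batch
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    # 3. Backpropagation: gradient of the loss w.r.t. W and b
    d = p.copy()
    d[np.arange(len(y)), y] -= 1.0   # softmax + cross-entropy gradient
    d /= len(y)
    gW, gb = X.T @ d, d.sum(axis=0)
    # 4. Parameter update: step downhill
    W -= eta * gW
    b -= eta * gb
```

With all weights starting at zero, the initial loss is exactly log(3) (a uniform guess over three classes); each pass through the loop nudges it lower.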

What is a gradient, actually?

Before any equations: a gradient is simply the slope of a surface at a particular point, extended to many dimensions.

Imagine the network’s total error (called loss) as a landscape of hills and valleys. Every combination of all the weights in the network corresponds to one point on that landscape, and the height of that point is how large the error is. When the weights are wrong, you are on a hilltop (high error). When the weights are right, you are in a valley (low error).

The gradient at any point tells you two things:

  1. Which direction is uphill: which way would make the error worse.
  2. How steep the slope is: how much moving in that direction would change the error.

Gradient descent works by doing the opposite of uphill: it computes the gradient (the direction and steepness of “uphill”) and takes a small step in the downhill direction. Repeat this many times and the error decreases step by step until you reach a valley.

The learning rate η (eta) controls how large each downhill step is:

  • Too large a learning rate: you overshoot the valley, bounce back and forth, and may never converge or even diverge (error keeps rising)
  • Too small a learning rate: each step is tiny and training takes an extremely long time
  • A good learning rate: you move steadily downhill and reach a valley in a reasonable number of steps
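All three regimes can be seen on the simplest possible landscape, the 1-D loss L(w) = w², whose gradient is 2w. The specific learning-rate values below are illustrative:

```python
# Gradient descent on L(w) = w^2, gradient 2w, starting from w = 1.
def descend(eta, w=1.0, steps=30):
    for _ in range(steps):
        w = w - eta * 2 * w   # w_new = w - eta * gradient
    return abs(w)             # distance from the minimum at w = 0

print(descend(0.1))    # good rate: shrinks steadily toward the minimum
print(descend(0.001))  # too small: barely moved after 30 steps
print(descend(1.1))    # too large: overshoots and |w| grows every step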

Backpropagation is the algorithm that computes “how much did each individual weight contribute to the current error?”, which tells you how to adjust each weight independently to reduce that error.

Read the symbols before the formulas

Before the equations, map each symbol to one concrete role:

  • c indexes one class (for example, Benign or Critical).
  • C is the total number of classes.
  • ŷ_c is the model probability for class c.
  • y_c is the target indicator/probability for class c.
  • η (eta) is the learning rate, i.e., step size for each update.
  • ∂L/∂w means “how much loss changes if weight w changes slightly.”

This notation is the bridge between “the model was wrong” and “which exact parameters should move.”

Loss and gradient update in one view

This section covers step 2 and step 4 of the training loop: first measure the error, then update the parameters.

For a sample with target distribution y and prediction ŷ, cross-entropy is:

L = −∑_{c=1}^{C} y_c log(ŷ_c)

Why the logarithm? Because it heavily penalizes confident wrong predictions. If the correct diagnosis is “Critical” but the model assigns only 1% probability to it (ŷ_c = 0.01), then log(0.01) ≈ −4.6, making the loss very large. If the model correctly assigns 90% probability (ŷ_c = 0.9), then log(0.9) ≈ −0.1, contributing very little to the loss. In other words: being confidently wrong is punished far more than being uncertain. This is exactly the behavior you want in a diagnostic system.
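The two numbers are quick to verify. Since the true class has y_c = 1, its contribution to the loss is simply −log of the probability the model assigned to it:

```python
import math

# Cross-entropy contribution -log(p) for the probability given to the true class
confident_wrong = -math.log(0.01)  # model gave only 1% to the correct class
fairly_right = -math.log(0.9)      # model gave 90% to the correct class

print(round(confident_wrong, 2))   # ≈ 4.61
print(round(fairly_right, 2))      # ≈ 0.11
```

A factor of roughly forty in loss between the two cases, for a single prediction.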

A parameter update step is conceptually simple: new weight = old weight - learning rate × gradient.

If we name the gradient for weight w as g_w, we can write the same update in plain notation as w_new = w - eta × g_w.

In the network visualization, the predicted class probability ŷ_c corresponds to the output probabilities shown after the forward pass. If the correct class is “Critical” but the model puts low probability there, L grows. The gradient for each edge weight, written in calculus as the loss change with respect to w, tells how that specific connection should change to move probability mass toward the correct class on future passes.

So what does this tell us in practice? Backpropagation translates one clinical prediction error into distributed corrective signals across all model parameters.

Chain rule over layered computations

This section is step 3 of the training loop: once we know the error, how do we assign responsibility for that error to millions of internal parameters?

To compute how much error came from weight w deep inside the network, the gradient must travel backward through all the layers between w and the output. Each layer’s contribution is multiplied together using the chain rule from calculus.

The chain rule says: if A affects B and B affects C, then the rate at which A affects C equals the rate at which A affects B multiplied by the rate at which B affects C. In other words, you can chain (multiply) local rates of change together. In a network, each layer’s local derivative describes how much that layer amplifies or reduces error signals passing through it.

The core identity is the chain rule in action: ∂L/∂w = (∂L/∂a) × (∂a/∂z) × (∂z/∂w), i.e., the loss change with respect to w equals the loss change with respect to a, times the activation change with respect to z, times the pre-activation change with respect to w.

Read this as a causal chain:

  • z is a neuron’s pre-activation weighted sum.
  • a is that neuron’s post-activation output.
  • The product of local derivatives tells how influence propagates from output error back to weight w.

This factorization is computed repeatedly from output layers back toward input layers. Intermediate activations from the forward pass are cached because they are needed to evaluate these local derivatives.
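The whole chain can be checked by hand on a single neuron. This sketch uses made-up values and a squared-error loss (simpler than cross-entropy, but the chain-rule mechanics are identical), and verifies the analytic gradient against a finite-difference estimate:

```python
import math

# One neuron: z = w*x + b, a = sigmoid(z), loss L = (a - t)^2.
# All values below are made up for illustration.
w, b, x, t = 0.5, -0.3, 1.2, 1.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass (cache z and a; backprop needs them)
z = w * x + b
a = sigmoid(z)
L = (a - t) ** 2

# Backward pass: multiply the three local derivatives (chain rule)
dL_da = 2 * (a - t)        # loss change per unit activation
da_dz = a * (1 - a)        # sigmoid's local derivative
dz_dw = x                  # pre-activation change per unit weight
dL_dw = dL_da * da_dz * dz_dw

# Sanity check: numerical (finite-difference) gradient
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - t) ** 2
numeric = (L_plus - L) / eps
print(dL_dw, numeric)      # the two values should nearly match
```

Note how the backward pass reuses the cached a from the forward pass; this is exactly the caching of intermediate activations described above.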

In deep models, this repeated multiplication can cause gradient shrinking or explosion, which is why architecture and optimizer choices matter.

Practical walkthrough: one training-step interpretation

Use the triage network component as a conceptual anchor:

  1. Select Meningitis signs present and run a forward pass.
  2. Read output probabilities and identify deviation from expected target.
  3. Interpret high loss as strong correction pressure on relevant pathways.
  4. Repeat with Benign presentation and compare output shifts.
  5. Relate repeated cycles to mini-batch training behavior over many samples.

What this means in practice: model learning is many tiny corrections, not one large rewrite, and convergence quality depends on stable gradient flow.

One training step visualized

The component below shows forward activation flow. The backward pass is conceptual here, but every shown weight is one parameter that would receive a gradient update under training.

Triage Network: Learning in Action

Each forward pass produces a prediction. The backward pass (not visualised) adjusts every weight to reduce error.

[Interactive component: Patient Scenarios, Layer Activations, Output Layer]

Training diagnostics that matter

In practice, we do not compute the gradient using the entire dataset at once. Instead, we pick a random mini-batch, a small random subset of 32 to 256 training examples, compute the gradient on that subset, update the weights, then pick the next mini-batch and repeat. This is called stochastic gradient descent (SGD) or mini-batch gradient descent. It has two advantages: it is much faster per step than processing all data, and the random noise in each batch’s gradient actually helps the network escape local traps in the error landscape.
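A typical mini-batch iteration pattern looks like the sketch below: shuffle the sample indices once per epoch, then walk through them in consecutive slices. The dataset size and batch size are illustrative:

```python
import numpy as np

# Illustrative dataset: 1000 samples, 10 features, batches of 64
rng = np.random.default_rng(0)
n_samples, batch_size = 1000, 64
X = rng.normal(size=(n_samples, 10))

order = rng.permutation(n_samples)         # fresh shuffle each epoch
n_batches = 0
for start in range(0, n_samples, batch_size):
    idx = order[start:start + batch_size]  # one random mini-batch
    batch = X[idx]
    # ... forward pass, loss, backward pass, update using `batch` ...
    n_batches += 1

print(n_batches)  # ceil(1000 / 64) = 16 updates per epoch
```

One full pass over the shuffled data is an epoch; here it yields 16 weight updates instead of one, which is where the speed advantage comes from.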

Monitor training loss, validation loss, and gradient norms together. Diverging train/validation trends imply overfitting. Collapsing gradient norms signal vanishing gradient risk. Exploding gradients motivate clipping or optimizer adjustments.
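The gradient-norm diagnostics above are cheap to implement. A common remedy for exploding gradients is clipping by global norm: if the combined norm of all gradients exceeds a threshold, rescale them so it equals the threshold. A minimal sketch, with made-up gradient values:

```python
import numpy as np

def global_norm(grads):
    # Combined L2 norm over all gradient arrays in the model
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

def clip_by_global_norm(grads, max_norm):
    norm = global_norm(grads)
    if norm > max_norm:
        scale = max_norm / norm          # shrink all gradients uniformly
        grads = [g * scale for g in grads]
    return grads, norm

# Made-up "exploding" gradients for a 3x3 weight matrix and a bias vector
grads = [np.full((3, 3), 4.0), np.full(3, 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, global_norm(clipped))
```

Logging `norm_before` every few steps gives the gradient-norm curve to monitor: a collapse toward zero warns of vanishing gradients, a spike warns of explosion.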

Relation to earlier lessons

  1. Lesson 11 described the forward computational path.
  2. Lesson 12 adds the backward error-propagation path required for learning.
  3. Together they complete the optimization loop introduced in lesson 9.

Concrete bridge: the forward pass answers “What does the model predict now?” Backpropagation answers “How must each parameter change next?”

Notation quick reference

Symbol | Meaning | Detailed link
--- | --- | ---
L | loss value for current sample/batch | Loss and gradient update in one view
y_c | target probability for class c | Loss and gradient update in one view
ŷ_c | predicted probability for class c | Loss and gradient update in one view
w | model weight parameter | Loss and gradient update in one view
η | learning rate | Loss and gradient update in one view
∂L/∂w | gradient of loss wrt parameter w | Loss and gradient update in one view
a | post-activation value | Chain rule over layered computations
z | pre-activation value | Chain rule over layered computations


What comes next

In lesson 13, we move from training mechanics to architectural scaling: convolutions, residual design, and transfer learning in deep visual models.


References and Further Reading

  • Rumelhart, D., Hinton, G., and Williams, R. “Learning Representations by Back-Propagating Errors.” Nature 323, 1986.
  • Kingma, D. and Ba, J. “Adam: A Method for Stochastic Optimization.” ICLR, 2015.

This is Lesson 12 of 18 in the AI Starter Course.