
In lesson 10, decision trees gave us transparent step-by-step splits. Neural networks keep the same supervised learning goal, but they solve it in a very different way. Instead of learning an explicit chain of yes/no questions, they learn many small numerical detectors that work together.

For triage classification, that means the model can represent subtle interactions among symptom intensity, neck stiffness, headache severity, and oxygen status without handcrafting branch logic.

Two later terms are worth defining now. Softmax is the final transformation that turns raw output scores into probabilities that are all positive and sum to 1. Vanishing gradients describes the problem where backward learning signals become so small across many layers that early layers stop learning effectively.

Core learnings about feedforward neural networks

  • A neural network is a stack of simple computational units connected in layers.
  • Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function.
  • Hidden layers build intermediate representations that can capture complex interactions.
  • Softmax outputs class probabilities for multi-class diagnosis.
  • Forward propagation defines prediction; backpropagation (next lesson) defines learning.

What a neural network is

Before defining layers, neurons, or weights, it helps to answer the most basic question first: what is a neural network?

A neural network is a mathematical model made of many tiny computation units connected together. You give it some input numbers, those numbers flow through the network from left to right, and the network produces output numbers at the end.

For a triage example, the input numbers might be:

  • temperature,
  • stiff-neck indicator,
  • headache severity,
  • oxygen level.

The output numbers might be probabilities for diagnoses such as:

  • Low Acuity,
  • Viral Illness,
  • Neurology Consult,
  • Bacterial Meningitis.

So at the highest level, a neural network is still just the same mapping idea from lesson 1: it takes an input x and produces an output y. The difference is how it builds that mapping. A decision tree builds it with explicit branching rules. A neural network builds it with many learned numerical transformations composed together.

What a neural network looks like

Visually, a feedforward neural network looks like several vertical columns of circles connected by lines:

  • the input layer on the left,
  • one or more hidden layers in the middle,
  • the output layer on the right.

Each circle is called a neuron. Each line is a weighted connection between two neurons. Information flows from left to right only, which is why this model is called feedforward.

If you look at the network visualization in this lesson, that is exactly what you are seeing: values enter on the left as patient features, get transformed across hidden layers, and emerge on the right as diagnosis probabilities.

Why use a neural network at all?

At this point a sensible beginner question is: if decision trees already classify patients, why do we need neural networks?

The reason is that many real patterns are too complex to describe well with a short sequence of human-readable branch rules. Symptoms often interact in nonlinear ways. A high fever may mean one thing alone, something else when combined with neck stiffness, and something else again when paired with low oxygen and a severe headache.

Neural networks are useful because they can learn many interacting detectors at once, then combine them into higher-level patterns. That makes them much more flexible than simple linear models and often more powerful than single decision trees.

The building blocks: layers, neurons, weights, and biases

Once we know what the whole network is, we can break it into parts.

What a layer is

A layer is a group of neurons that all receive the same inputs and compute their outputs in parallel. The outputs of one layer become the inputs of the next layer.

Imagine the input layer as a row of sensor readings. The first hidden layer contains several neurons, each looking at all those readings and producing one new number. The second hidden layer then takes those new numbers and builds even more abstract features from them.

Why would we stack multiple layers? Because each layer can learn a different level of abstraction:

  • First layer: basic combinations like “fever + neck stiffness”
  • Second layer: larger patterns like “signals that together suggest infection”
  • Output layer: final diagnosis probabilities

This is the basic idea of deep learning: deeper layers build on what earlier layers discovered.

What a neuron is

A neuron is one small computational unit inside the network. It receives several input numbers, combines them into one score, and sends one output number forward.

You can think of a neuron as a tiny detector. One neuron might become sensitive to “high temperature plus severe headache.” Another might respond strongly to “neck stiffness plus low oxygen.” The network learns these detectors automatically from data.

What weights and biases are

Each neuron must decide how much it trusts each incoming signal. It does this with weights and a bias:

  • A weight wji is a single number attached to the connection between input i and neuron j. A large positive weight means “this input matters a lot and pushes the neuron to activate.” A weight near zero means “this input barely matters.” A negative weight means “this input suppresses the neuron.”
  • A bias bj is a constant number added after the weighted sum. It lets the neuron shift its activation threshold. Think of it as the neuron’s built-in default tendency to fire or stay silent.

Both weights and biases start as random numbers and are adjusted during training (lesson 12) until the network makes accurate predictions.
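To make the weighted-sum-plus-bias idea concrete, here is a minimal sketch for one neuron. The feature order, weight values, and bias are hypothetical placeholders, not numbers from a trained triage model:

```python
# A minimal sketch of one neuron's weighted sum plus bias.
# Feature order, weights, and bias are illustrative, not from a trained model.
inputs = [38.9, 1.0, 8.0, 94.0]    # temperature, stiff neck, headache, oxygen
weights = [0.4, 1.2, 0.3, -0.05]   # hypothetical learned weights
bias = -10.0                       # hypothetical bias

# z = sum_i (w_i * x_i) + b
z = sum(w * x for w, x in zip(weights, inputs)) + bias
print(z)  # roughly 4.46
```

Training (lesson 12) would nudge these weights and the bias until outputs like z line up with correct diagnoses.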

Why activation functions are needed

If a neuron only computed a weighted sum of its inputs, then every layer would be a linear operation. The problem is that stacking multiple linear operations still gives you one overall linear operation. In other words, depth would add complexity to the diagram, but not real expressive power.

The solution is to apply a nonlinear activation function after each weighted sum. This breaks the linearity and allows the network to learn curved, interaction-heavy decision boundaries, things like “high fever AND severe headache AND neck stiffness together have a special meaning that no single feature has alone.”
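The collapse of stacked linear layers can be checked numerically. This sketch uses NumPy with arbitrary random weights to show that two linear layers in sequence equal one combined linear layer:

```python
import numpy as np

# Sketch: stacking two purely linear layers collapses to one linear map.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Two linear layers applied in sequence ...
h = W1 @ x + b1
y_two_layers = W2 @ h + b2

# ... equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
y_one_layer = W @ x + b

print(np.allclose(y_two_layers, y_one_layer))  # depth alone adds nothing
```

Inserting a nonlinearity such as ReLU between the two layers breaks this equivalence, which is exactly why activation functions matter.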

ReLU: the most common activation function

The most widely used activation function today is ReLU (Rectified Linear Unit), defined as ReLU(z) = max(0, z).

This is the simplest possible nonlinearity: if the input z is positive, output it unchanged; if it is zero or negative, output zero.

Think of it as a gate: the neuron either fires or stays silent. If the weighted sum of inputs does not cross zero, nothing passes through. If it does, the signal continues. This simple rule makes deep networks much easier to train.
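The gate behavior is a one-liner in code; this sketch just applies the definition to a few sample values:

```python
def relu(z):
    """ReLU(z) = max(0, z): pass positive values through, zero out the rest."""
    return max(0.0, z)

# Negative and zero inputs are silenced; positive inputs pass unchanged.
print([relu(z) for z in [-2.0, -0.5, 0.0, 1.5, 3.0]])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```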

Sigmoid: the original choice

The original activation function was the sigmoid (written with the Greek letter σ), defined as σ(z) = 1 / (1 + e^(-z)).

Sigmoid squashes any input to a value between 0 and 1. It looks like a smooth S-curve. That made it attractive early on, especially when people wanted neuron outputs to resemble probabilities.

But sigmoid has a major downside in deep networks: for very large positive or negative inputs, its slope becomes tiny, so learning slows down dramatically. This is one reason ReLU became standard in hidden layers.
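The slowdown is visible in the sigmoid's slope, which is σ(z)(1 − σ(z)) and peaks at 0.25 when z = 0. This sketch evaluates the slope at a few points to show how fast it shrinks:

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), squashing any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Derivative sigma(z) * (1 - sigma(z)); largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The slope collapses for large |z| -- the root of vanishing gradients.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, round(sigmoid_slope(z), 6))
```

At z = 10 the slope is already below 0.0001, so a neuron saturated at that input passes almost no learning signal backward.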

How they differ: a quick comparison

| Feature | ReLU | Sigmoid |
| --- | --- | --- |
| Output range | 0 to positive infinity | 0 to 1 |
| Shape | broken line (hockey stick) | smooth S-curve |
| Trains efficiently in deep networks? | Yes | Often no (vanishing gradients) |
| Typical use today | Hidden layers | Binary output, gates |

The neuron equation in the triage setting

With the components defined, we can now write the full equation for one neuron’s output:

a_j = \phi\left(\sum_{i=1}^{n} w_{ji} x_i + b_j\right)

Read this left to right. First, inside the parentheses: multiply each input xi by its weight wji, sum all of those products together, then add the bias bj. The result is a single number z that represents how strongly this neuron is being activated before nonlinearity. Then ϕ (the activation function, such as ReLU) is applied to produce the final output aj.

In this expression, indices matter:

  • i indexes input features (i = 1 might be temperature, i = 2 stiff neck, etc.).
  • j indexes neurons in the current layer (“which detector are we computing?”).
  • n is the number of input features feeding this neuron.

With that indexing in place: xi are triage input features, wji are learned weights into neuron j, bj is a bias term, and ϕ is the activation function. In the visualization below, each edge corresponds to one weight term wjixi, and each node value corresponds to an activation aj after nonlinearity.

So what does this tell us in practice? A hidden neuron acts like a learned clinical feature detector whose response depends on a weighted combination of multiple raw inputs.
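The neuron equation translates directly into a few lines of NumPy. The feature values and weights below are hypothetical, chosen only to sketch a detector tuned to "fever plus neck stiffness":

```python
import numpy as np

def neuron_output(x, w_j, b_j, phi=lambda z: np.maximum(0.0, z)):
    """a_j = phi(sum_i w_ji * x_i + b_j), with ReLU as the default phi."""
    z = np.dot(w_j, x) + b_j   # weighted sum plus bias (pre-activation)
    return phi(z)              # nonlinearity produces the activation

# Illustrative numbers only, not a trained model.
x = np.array([39.5, 1.0, 6.0, 97.0])   # temperature, stiff neck, headache, oxygen
w_j = np.array([0.5, 2.0, 0.1, -0.1])  # hypothetical weights into neuron j
b_j = -12.0                            # hypothetical bias

print(neuron_output(x, w_j, b_j))  # roughly 0.65
```

With these weights, the fever and stiff-neck terms push the pre-activation above zero, so the detector fires with a small positive activation.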

From hidden activations to diagnosis probabilities

After values flow through all hidden layers, the final layer produces one raw score zk per diagnosis class. These raw scores can be any real number, positive or negative. To convert them into probabilities that sum to 1, we apply the softmax function:

\hat{y}_k = \frac{\exp(z_k)}{\sum_{m=1}^{K} \exp(z_m)}

Why the exponential function? First, exp(x) is always positive, so all raw scores become positive numbers. Second, it magnifies differences, so higher-scoring classes receive more probability mass. The denominator then normalizes the result so the probabilities add up to 1.

Index interpretation for this equation:

  • k identifies one specific output class whose probability we are computing.
  • m is the summation index over all classes in the denominator.
  • K is the total number of classes.

In the triage model, each ŷk is the predicted probability of one diagnosis class. In the output panel, those percentages are exactly these softmax probabilities.

So what does this tell us in practice? The network does not output one hard label by default; it outputs a full uncertainty distribution over candidate outcomes.
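Softmax is short enough to implement directly. One practical detail worth showing: subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow. The raw scores below are hypothetical, not outputs of a trained model:

```python
import numpy as np

def softmax(z):
    """y_hat_k = exp(z_k) / sum_m exp(z_m), shifted by max(z) for stability."""
    e = np.exp(z - np.max(z))  # shifting cancels out after normalization
    return e / e.sum()

# Hypothetical raw scores for the four triage classes:
# Low Acuity, Viral Illness, Neurology Consult, Bacterial Meningitis.
z = np.array([1.2, 0.3, 2.5, 4.0])
probs = softmax(z)
print(probs.round(3), probs.sum())  # probabilities sum to 1
```

Note how the class with the highest raw score (here the last one) receives most of the probability mass, while the others keep small but nonzero shares.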

Practical walkthrough: tracing one patient through the network

Walk through the visualization once using the "Our triage patient (meningitis)" scenario:

  1. Select the patient scenario in the sidebar.
  2. Watch input activations populate the first layer.
  3. Follow activation flow through hidden layer 1 and hidden layer 2.
  4. Observe output probabilities and identify the top prediction.
  5. Compare output spread across classes to assess model confidence structure.

What this means in practice: layered representations allow the model to combine weak signals into stronger diagnostic evidence patterns.
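The five steps above can be sketched end to end in code. This forward pass mirrors the lesson's architecture (4 inputs, two ReLU hidden layers, a 4-class softmax output), but all weights are random placeholders rather than a trained triage model, so the resulting probabilities are meaningless except as a demonstration of the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Layer shapes are (neurons_out, neurons_in); 5 hidden neurons per layer
# is an arbitrary choice for this sketch.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer 1
W2, b2 = rng.normal(size=(5, 5)), np.zeros(5)   # hidden layer 2
W3, b3 = rng.normal(size=(4, 5)), np.zeros(4)   # output layer

def forward(x):
    a1 = relu(W1 @ x + b1)        # first-level feature detectors
    a2 = relu(W2 @ a1 + b2)       # higher-level combinations
    return softmax(W3 @ a2 + b3)  # diagnosis probabilities

x = np.array([39.5, 1.0, 6.0, 94.0])  # illustrative patient features
y_hat = forward(x)
print(y_hat, y_hat.sum())  # four probabilities summing to 1
```

Every prediction the visualization shows is exactly this pipeline: matrix multiply, add bias, apply ReLU, repeat, then softmax at the end.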

The triage neural network

The network below has four inputs, two hidden layers, and a 4-class softmax output. Select scenarios to inspect activation flow.

[Interactive visualization: "Triage Neural Network: Forward Pass". Watch activations propagate from patient features to diagnosis probabilities. Panels: Patient Scenarios, Layer Activations, Output Layer.]

Relation to earlier lessons

By this point in the course, the contrast to lesson 10 should feel intuitive. Decision trees reach a prediction through a sequence of explicit yes/no splits. Neural networks reach a prediction by combining many small learned feature detectors through weighted nonlinear layers. Both still live inside the machine-learning frame from lesson 9, but the internal structure now looks much less like a checklist and much more like a distributed pattern detector.

That difference matters in practice: trees are often easier to inspect step by step, while neural nets are usually better at combining many weak signals into one strong prediction.

What comes next

In lesson 12, we cover backpropagation: how gradients are computed and propagated so every network weight can be updated to reduce prediction error.

Notation quick reference

| Symbol | Meaning | Detailed link |
| --- | --- | --- |
| xi | input feature i | The neuron equation in the triage setting |
| wji | weight from input i to neuron j | The neuron equation in the triage setting |
| bj | bias for neuron j | The neuron equation in the triage setting |
| ϕ | activation function | The neuron equation in the triage setting |
| aj | activation of neuron j | The neuron equation in the triage setting |
| zk | pre-softmax score for class k | From hidden activations to diagnosis probabilities |
| ŷk | predicted probability for class k | From hidden activations to diagnosis probabilities |
| K | number of output classes | From hidden activations to diagnosis probabilities |


References and Further Reading

  • Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
  • Nielsen, M. Neural Networks and Deep Learning. Determination Press, 2015.

This is Lesson 11 of 18 in the AI Starter Course.