In lesson 12, we learned how gradients move through layers so networks can train. This lesson asks a new question: if neural networks are trainable in general, what kind of network structure should we use for image data? In clinical AI, this is the difference between classifying vitals and interpreting chest X-rays or pathology slides.
Why CNNs exist
Before talking about pixels and filters, it helps to understand the problem CNNs were invented to solve.
A standard feedforward network from lesson 11 treats its input as a flat list of numbers. That works reasonably well for structured tabular data like temperature, blood pressure, and oxygen level. But images are different. In an image, nearby pixels matter together. An edge is created by neighboring pixels changing sharply. A lesion is visible because a whole local patch forms a meaningful shape. The exact spatial arrangement is the signal.
If we fed an image into a plain feedforward network, two problems appear immediately:
- the network would need an enormous number of weights,
- and it would ignore the fact that neighboring pixels belong together.
CNNs solve exactly these two problems. They are neural networks designed specifically for image-like data. Their guiding idea is: look locally, reuse the same detector everywhere, and gradually build larger visual concepts from small ones.
What is an image to a computer?
Before the architecture details, it helps to understand what a neural network actually receives when given an image.
A digital image is nothing more than a large grid of numbers. Each square in the grid is called a pixel, and each pixel holds a number representing brightness (for a grayscale image, 0 = black, 255 = white). A chest X-ray at 512×512 resolution is therefore a grid of 512 × 512 = 262,144 numbers.
If we simply flattened this into a vector and fed it to a standard neural network (as in lesson 11), two problems arise. First, 262,144 inputs would require an enormous number of weights in the first layer alone, which is computationally very expensive. Second, and more importantly, a standard network would not know that pixel (100, 100) and pixel (101, 100) are neighbors. It would treat all pixels as equally unrelated, missing the most important property of images: local spatial patterns matter more than global coincidences.
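The first problem is easy to quantify. A rough sketch (the hidden-layer width of 256 is an assumed illustrative number, not from any particular model):

```python
# Parameter count for a fully connected first layer on a 512x512 grayscale image.
pixels = 512 * 512          # 262,144 input values after flattening
hidden_units = 256          # assumed width of the first hidden layer
weights = pixels * hidden_units + hidden_units  # weights plus one bias per unit
print(f"{weights:,} parameters in the first layer alone")
```

Over 67 million parameters before the network has even seen a second layer, and none of them encode the fact that adjacent pixels are related.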
A rib boundary is visible because adjacent pixels form an edge. An opacity in the lung is recognizable because a cluster of nearby pixels is brighter than usual. Convolutional neural networks (CNNs) hard-code the notion of “look at a local neighborhood” directly into the architecture.
What is a filter, and what does convolution mean?
The key tool in a CNN is a filter (also called a kernel): a small grid of learnable numbers, typically 3×3 or 5×5 in size.
Convolution means: slide this small filter across the entire image, one step at a time. At each position, multiply each filter number by the corresponding pixel number in the patch underneath, sum all those products, and write the result to an output grid. Repeat this for every position in the image.
The result is a feature map: a new grid of numbers where each value represents how strongly the filter’s pattern was detected at that location. A filter with edge-detecting weights will produce high values wherever sharp edges appear, and low values everywhere else.
What makes this powerful is weight sharing: the same filter is applied at every position across the image. The network does not learn separate weights for “top-left edge” and “bottom-right edge”; it learns what an edge looks like once, then finds it everywhere. This dramatically reduces the number of parameters.
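The slide-multiply-sum procedure fits in a few lines of code. Below is a minimal NumPy sketch of a valid convolution as deep learning libraries compute it (technically cross-correlation, since the kernel is not flipped); the vertical-edge filter and the toy half-bright image are illustrative:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide the kernel over the image one step at a time; at each
    position, multiply elementwise, sum, and add the bias."""
    k = kernel.shape[0]                      # assume a square k x k kernel
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

# A simple vertical-edge detector: fires where brightness rises left to right.
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

image = np.zeros((6, 6))
image[:, 3:] = 255.0                         # dark left half, bright right half
feature_map = conv2d(image, edge_filter)
print(feature_map)
```

The feature map is high exactly where the dark-to-bright boundary sits and zero in the flat regions, because the same shared filter was applied at every position.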
The key insight is that image data has local spatial structure. Neighboring pixels are related; distant pixels often are not. CNNs hard-code that prior into the architecture, which makes learning dramatically more data-efficient than fully connected designs for visual tasks.
Convolution as structured feature extraction
A 2D convolution computes each feature-map value from a local patch:

S(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n) + b

In plain language, one output entry S(i, j) is produced by placing the filter over one small image patch, multiplying each filter weight by the pixel underneath it, summing those products, and then adding the bias term b.
Index roles in plain language:
- i and j pick one output location in the feature map.
- m and n iterate over kernel positions inside the local patch, each running from 0 to k − 1.
- k is the kernel width/height (for a square filter).
In this formula, I is the input image (for example, a chest X-ray patch), K is one learned filter, and S is the resulting feature map. The indices (i, j) denote one spatial location in the output. In clinical terms, one filter might activate strongly for edge-like rib boundaries, another for diffuse opacity texture, another for shape transitions around the mediastinum.
So what does this tell us in practice? CNNs do not inspect every pixel independently; they repeatedly apply the same detector across locations, turning raw pixels into clinically meaningful local patterns.
Two more CNN terms appear throughout the rest of the lesson. A receptive field is the part of the input image that can influence one activation in the network. Striding means moving the filter by more than one pixel at a time, which reduces output size and computation.
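The effect of striding on output size follows a simple formula. A quick sketch, assuming a square input, a square filter, and no padding:

```python
def conv_output_size(n, k, stride=1):
    """Spatial size of a valid convolution output for an n x n input,
    a k x k filter, and a given stride (no padding assumed)."""
    return (n - k) // stride + 1

# A 3x3 filter over a 512x512 X-ray:
print(conv_output_size(512, 3, stride=1))  # 510: almost full resolution
print(conv_output_size(512, 3, stride=2))  # 255: roughly half the size
```

Doubling the stride roughly halves each spatial dimension, which quarters the work done by every subsequent layer.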
Core learnings about CNNs and deep architectures
- Convolutions use local receptive fields and shared weights to extract spatial features efficiently.
- Pooling and striding build hierarchical representations from edges to complex patterns.
- Residual connections stabilize very deep training by preserving gradient flow.
- Transfer learning is the practical default in medical imaging because pretrained low-level features transfer well.
Why pooling and depth matter
After convolution, a pooling step reduces spatial resolution. The most common type is max pooling: divide the feature map into small blocks (for example 2×2), take the maximum number from each block, and discard the rest. This produces a smaller feature map that still retains the strongest activations.
Why is this useful? First, it makes the representation smaller and faster to process in subsequent layers. Second, it makes the detection slightly position-invariant: if the edge was at pixel (100,100) or pixel (101,101), the max over the neighborhood still fires. This mirrors how humans recognize patterns regardless of their exact position.
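Max pooling is equally simple to sketch; the 4×4 feature map below is an illustrative toy example:

```python
import numpy as np

def max_pool2x2(feature_map):
    """2x2 max pooling: keep only the strongest activation in each block."""
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return out

fm = np.array([[1, 3, 0, 0],
               [2, 4, 0, 1],
               [0, 0, 7, 5],
               [0, 2, 6, 8]], dtype=float)
print(max_pool2x2(fm))
# Each 2x2 block collapses to its maximum: [[4, 1], [2, 8]]
```

Notice that shifting the strong activation one pixel within its 2×2 block would not change the pooled output, which is exactly the position tolerance described above.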
As depth increases, the model composes local motifs into larger concepts: filters in early layers detect simple gradients and color contrasts, middle layers combine those into contours and shapes, and later layers recognize complex structures like lesion boundaries or tissue patterns. This hierarchy is exactly why CNNs work well for imaging pipelines.
Residual learning for deep stability
Before residual blocks became standard, very deep networks often performed worse than shallower ones, even on training data. Residual learning rewrites a block as:

y = F(x) + x

Here, x is the block input, F(x) is the learned residual transform, and the skip path copies x through directly. In an imaging model, this means each block learns refinements rather than rebuilding the full representation from scratch. The direct path also improves optimization by preserving gradient flow across depth.
So what does this tell us in practice? Residual connections make deep models trainable and reliable enough for high-capacity clinical vision tasks.
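The y = F(x) + x structure can be sketched directly. The two-layer transform F and the tiny weight scale below are illustrative assumptions, chosen so the block starts out close to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(0.0, v)

def residual_block(x, w1, w2):
    """y = F(x) + x: the block learns a refinement F, while the
    skip path copies x through unchanged."""
    f = relu(x @ w1) @ w2   # a tiny two-layer residual transform F(x)
    return f + x            # skip connection adds the input back

x = rng.normal(size=4)
w1 = rng.normal(size=(4, 4)) * 0.01   # small initial weights (assumed)
w2 = rng.normal(size=(4, 4)) * 0.01
y = residual_block(x, w1, w2)

# With near-zero weights, F(x) is tiny, so the block begins near the identity:
print(np.allclose(y, x, atol=0.05))
```

Because the block defaults to passing x through, stacking many of them cannot easily destroy the representation, and gradients always have the skip path to flow along.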
Practical walkthrough: deploying a CNN workflow
Transfer learning is a technique that dramatically reduces the data and compute needed to build a working CNN for a specialized task. The idea is:
- Someone else has already trained a CNN on millions of general images (for example ImageNet, a database of 1.2 million labeled photos of cats, cars, furniture, etc.).
- The early layers of that network have learned to detect universal visual patterns: edges, textures, gradients, shapes.
- Instead of training from scratch, you take that pretrained network and replace only the final layers with new layers suited to your task.
- You then train (“fine-tune”) just those final layers on your smaller labeled dataset (for example, a few thousand chest X-rays).
Why does this work? Because the low-level features that detect edges in photographs of dogs also detect edges in X-rays. You don’t need millions of medical images to learn what an edge is; you borrow that knowledge from the pretrained model.
Apply this sequence for a triage imaging classifier:
- Start from a pretrained CNN backbone and freeze early layers.
- Replace the final head with task-specific outputs (for example, pneumonia risk, edema risk).
- Fine-tune later layers on your labeled clinical dataset with strict validation splits.
- Evaluate not only AUC but calibration, subgroup drift, and failure clusters.
- Integrate model outputs as decision support signals, not autonomous diagnosis.
What this means in practice: architecture quality and validation discipline matter as much as headline accuracy.
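The freeze-and-replace pattern above can be illustrated with a toy parameter count. Every layer name and size here is invented for illustration, not taken from a real backbone; a real pipeline would load pretrained weights from a deep learning library:

```python
# Toy model of the transfer-learning recipe: freeze early layers,
# replace the head, and count what is actually fine-tuned.
backbone = {
    "conv1_edges":    {"params": 1_792,     "frozen": True},   # generic edge filters
    "conv2_textures": {"params": 73_856,    "frozen": True},   # generic textures
    "conv3_shapes":   {"params": 295_168,   "frozen": False},  # fine-tuned on X-rays
    "head_imagenet":  {"params": 4_097_000, "frozen": False},  # to be replaced
}

# Replace the ImageNet head with a 2-output clinical head (pneumonia, edema).
del backbone["head_imagenet"]
backbone["head_clinical"] = {"params": 1_026, "frozen": False}

trainable = sum(l["params"] for l in backbone.values() if not l["frozen"])
total = sum(l["params"] for l in backbone.values())
print(f"fine-tuning {trainable:,} of {total:,} parameters")
```

Only the later, task-specific layers are updated on the small clinical dataset, which is why a few thousand labeled X-rays can suffice.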
Relation to earlier lessons
- Lesson 11 introduced layered neural representations in feedforward form.
- Lesson 12 explained how backpropagation trains those layers.
- Lesson 13 adds domain-aware structure so visual problems become tractable.
Concrete bridge: previous neural lessons answered “How do networks learn?” This lesson answers “What structure should a network have for image-like data?”
Notation quick reference
| Symbol | Meaning | Detailed link |
|---|---|---|
| I | input image tensor | Convolution as structured feature extraction |
| K | convolution kernel/filter | Convolution as structured feature extraction |
| S | output feature map | Convolution as structured feature extraction |
| b | bias term in convolution | Convolution as structured feature extraction |
| (i, j) | output spatial index | Convolution as structured feature extraction |
| k | kernel size | Convolution as structured feature extraction |
| F(x) | residual transform inside a block | Residual learning for deep stability |
| x | residual-block input | Residual learning for deep stability |
| y | residual-block output | Residual learning for deep stability |
What comes next
In lesson 14, we pivot from representation learning in weights to explicit symbolic knowledge structures such as semantic networks, frames, and ontologies.
References and Further Reading
- LeCun, Y., Bengio, Y., and Hinton, G. “Deep Learning.” Nature 521, 2015.
- He, K. et al. “Deep Residual Learning for Image Recognition.” CVPR, 2016.
- Rajpurkar, P. et al. “CheXNet.” arXiv, 2017.
This is Lesson 13 of 18 in the AI Starter Course.