
In lesson 9, we introduced machine learning as optimization over a hypothesis class. This lesson makes that concrete with decision trees: models that remain fully interpretable while still learning from data.

In a triage setting, this matters because each prediction can be explained as a path of thresholded clinical checks. Instead of “the model says so,” we can inspect the exact branch sequence that produced the diagnosis.

Two quick vocabulary notes before the core ideas: feature space is the abstract space whose axes are the input features, so each patient becomes one point in that space. Pruning means trimming back branches that fit the training data too specifically, which usually reduces overfitting on new patients.

Core learnings about decision trees

  • Decision trees split feature space recursively using impurity reduction.
  • Information gain determines which feature and threshold appear at each node.
  • Trees are highly interpretable but can overfit without pruning constraints.
  • Ensemble variants improve stability while preserving much of the tree-based intuition.

What a decision tree actually does

Before the math, here is the plain idea. A decision tree is a sequence of yes/no questions. At the top you ask one question about a feature (for example, “Is temperature above 38.5°C?”). Depending on the answer, you move left or right and ask the next question. You keep going until you reach a terminal box called a leaf, which gives the prediction.

The tree is “learned from data” in the sense that an algorithm figures out, from thousands of past patients, which questions to ask and in what order. It does this entirely by looking at which questions best separate the diagnoses.
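A learned tree is ultimately just nested yes/no questions, so a fitted one can be written out by hand. The sketch below mirrors the lesson's example path; the thresholds and the non-meningitis leaf labels are illustrative assumptions, not the actual fitted model:

```python
# Hand-coded version of a small triage tree: each branch is one yes/no question.
# Thresholds and leaf labels other than "bacterial meningitis" are illustrative.

def triage_predict(temperature, headache_severity, stiff_neck):
    """Walk from root to leaf, answering one question per level."""
    if temperature > 38.5:              # root question
        if headache_severity > 6:       # second question
            if stiff_neck:              # third question
                return "bacterial meningitis"
            return "severe flu"         # assumed leaf label
        return "flu"                    # assumed leaf label
    return "common cold"                # assumed leaf label

print(triage_predict(39.2, 8, True))   # -> bacterial meningitis
```

Learning replaces the hand-chosen questions and thresholds here with ones selected automatically from data, as the next section describes.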

Learning splits from impurity reduction

The algorithm needs a way to measure how “mixed” a group of patients is at any point. This is called impurity.

  • Zero impurity means all patients at this node have the same diagnosis: the group is pure. That is the best outcome for a leaf.
  • High impurity means the patients at this node are scattered across multiple diagnoses: the group is mixed. That is a sign we should ask another question.

The most common impurity measure is Gini impurity. For class proportions $p_1, \dots, p_k$ at a node, Gini impurity is:

$$G = 1 - \sum_{c=1}^{k} p_c^2$$

In this formula, $k$ is the number of possible diagnoses, $p_c$ is the fraction of patients at this node who have diagnosis $c$, and $c$ indexes each possible diagnosis one by one. The sum $\sum_{c=1}^{k} p_c^2$ adds up all the squared fractions. Subtracting from 1 inverts the result so that a pure node scores 0 and a maximally mixed node scores highest.
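The formula above is short enough to sketch directly. Here is a minimal Python version of Gini impurity, checked against the two extremes described in the bullets:

```python
def gini(proportions):
    """Gini impurity: G = 1 - sum(p_c^2) over class proportions p_c."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([1.0, 0.0]))   # pure node -> 0.0
print(gini([0.5, 0.5]))   # maximally mixed two-class node -> 0.5
```

A pure node scores 0, and a 50/50 two-class node scores 0.5, the maximum for two classes.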

The model chooses the split with the highest information gain:

$$\operatorname{Gain} = G(\text{parent}) - \sum_{b \in \{\text{left},\, \text{right}\}} \frac{n_b}{n} G(b)$$

In plain terms: at every node, the algorithm tries every possible question (“Is temperature above 38.5? Above 39? Above 37?”) and picks the one that most reduces impurity in the resulting groups. $G(\text{parent})$ is the impurity before the split. The sum on the right is the weighted average impurity of the two child groups after the split. The difference, the gain, is how much purity improvement we get by asking this particular question. The algorithm always picks the question with the highest gain.

In the triage tree below, candidate splits include thresholds such as temperature $> 38.5$ or headache severity $> 6.5$. Here, $n$ is the number of samples at the current node and $n_b$ is the count in one child branch after the split.
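The gain computation can be sketched directly from the formula. The example split below, which sends five patients of one class left and five of the other right, is an invented best case, not data from the lesson's triage tree:

```python
def gini(proportions):
    """Gini impurity: G = 1 - sum(p_c^2)."""
    return 1.0 - sum(p * p for p in proportions)

def information_gain(parent_props, left_props, n_left, right_props, n_right):
    """Gain = G(parent) minus the size-weighted average impurity of the children."""
    n = n_left + n_right
    weighted_child = (n_left / n) * gini(left_props) + (n_right / n) * gini(right_props)
    return gini(parent_props) - weighted_child

# A 50/50 parent split into two pure children: the best possible two-class gain.
print(information_gain([0.5, 0.5], [1.0, 0.0], 5, [0.0, 1.0], 5))  # 0.5
```

A split that leaves both children as mixed as the parent has a gain of 0, so the algorithm would never prefer it.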

Interpreting tree paths as clinical rules

Each root-to-leaf path is a rule extracted from data. If a patient follows the path:

  • Temperature > 38.5,
  • Headache Severity > 6,
  • Stiff Neck Positive,

then the model lands in the leaf predicting bacterial meningitis. The value is not only classification accuracy but transparent reasoning structure that clinicians can inspect and challenge.

Practical walkthrough: reading one prediction path

Use the interactive model once with the “Our triage patient” scenario:

  1. Select the patient scenario in the sidebar.
  2. Follow the highlighted path from root to leaf.
  3. At each internal node, map the split question to the patient feature value.
  4. Confirm the terminal leaf and compare prediction vs. actual label.
  5. Click each node in the path to inspect sample counts and impurity context.

What this means in practice: a tree prediction is a small, auditable decision protocol, not a black-box score.
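That auditability can be made literal in code: a prediction function can return the full sequence of checks alongside the label. The tree structure and thresholds below are illustrative assumptions, not the lesson's interactive model:

```python
# A tree as a plain data structure; prediction returns the label AND the
# audit trail of questions answered along the way. Thresholds are illustrative.

TREE = {
    "question": ("temperature", 38.5),
    "yes": {
        "question": ("headache_severity", 6.0),
        "yes": {"leaf": "bacterial meningitis"},
        "no": {"leaf": "flu"},             # assumed leaf label
    },
    "no": {"leaf": "common cold"},         # assumed leaf label
}

def predict_with_path(node, patient):
    """Follow the highlighted path from root to leaf, recording each check."""
    path = []
    while "leaf" not in node:
        feature, threshold = node["question"]
        answer = patient[feature] > threshold
        path.append(f"{feature} > {threshold}? {'yes' if answer else 'no'}")
        node = node["yes"] if answer else node["no"]
    return node["leaf"], path

label, path = predict_with_path(TREE, {"temperature": 39.0, "headache_severity": 8})
print(label)            # -> bacterial meningitis
for step in path:
    print(step)
```

Every prediction thus comes with the exact branch sequence a clinician could inspect or challenge.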

The triage decision tree

The interactive tree below uses a synthetic triage dataset. The root asks about temperature; downstream nodes refine by headache severity and stiff neck evidence.

[Interactive: Triage Decision Tree — select a patient scenario to highlight the active decision path; click any node to inspect its sample counts and impurity.]

From single trees to robust ensembles

Single trees are easy to inspect, but they can be unstable: small training changes can produce different split structures. Ensemble methods reduce this instability by combining many trees.

  • Random forests average over many de-correlated trees to reduce variance.
  • Gradient boosting builds trees sequentially to correct earlier errors.
  • Both often improve accuracy and robustness on tabular clinical data.
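The variance-reduction idea behind averaging can be illustrated with a toy model: each “tree” is just a weak classifier that errs some fraction of the time, and a majority vote over many of them is far more reliable than any single one. Everything below is a simplified sketch, not a real random forest:

```python
import random

def noisy_stump(x, rng):
    """A weak 'tree': thresholds x at 0, but flips its answer 20% of the time."""
    answer = 1 if x > 0 else 0
    return answer if rng.random() > 0.2 else 1 - answer

def ensemble_predict(x, n_trees, seed=0):
    """Majority vote over n_trees independent noisy stumps."""
    rng = random.Random(seed)
    votes = sum(noisy_stump(x, rng) for _ in range(n_trees))
    return 1 if votes > n_trees / 2 else 0

# One stump is wrong 20% of the time; a vote over 101 of them almost never is.
print(ensemble_predict(1.5, n_trees=101))
```

Real random forests add de-correlation via bootstrap samples and random feature subsets, but the stabilizing mechanism is the same vote over many imperfect trees.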

Relation to earlier lessons

  1. Lesson 9 introduced empirical risk minimization and generalization.
  2. Lesson 10 instantiates that framework in a model family where decisions are human-readable.
  3. This reconnects to lesson 3’s rule logic, but with rules induced from data rather than hand-authored.

Concrete bridge: lesson 9 asked “How do we learn a mapping?” Lesson 10 shows one of the most interpretable mappings you can learn.

Notation quick reference

| Symbol | Meaning | Detailed link |
| --- | --- | --- |
| $p_c$ | class proportion for class $c$ in a node | Learning splits from impurity reduction |
| $G$ | Gini impurity of a node | Learning splits from impurity reduction |
| $n$ | sample count at parent node | Learning splits from impurity reduction |
| Leaf | terminal prediction node | Interpreting tree paths as clinical rules |
| Feature importance | aggregate split contribution signal | From single trees to robust ensembles |


What comes next

In lesson 11, we move from explicit split rules to distributed representations with neural networks, where decisions emerge from weighted nonlinear layers rather than one path through a tree.


References and Further Reading

  • Breiman, L. et al. Classification and Regression Trees. Wadsworth, 1984.

  • Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.

This is Lesson 10 of 18 in the AI Starter Course.