
In lesson 9, we introduced machine learning as optimization over a hypothesis class. This lesson makes that concrete with decision trees: models that remain fully interpretable while still learning from data.

In a triage setting, this matters because each prediction can be explained as a path of thresholded clinical checks. Instead of “the model says so,” we can inspect the exact branch sequence that produced the diagnosis.

Two quick vocabulary notes before the core ideas: feature space is the abstract space whose axes are the input features, so each patient becomes one point in that space. Pruning means trimming back branches that fit the training data too specifically, which usually reduces overfitting on new patients.

Core learnings about decision trees

  • Decision trees split feature space recursively using impurity reduction.
  • Information gain determines which feature and threshold appear at each node.
  • Trees are highly interpretable but can overfit without pruning constraints.
  • Ensemble variants improve stability while preserving much of the tree-based intuition.

What a decision tree actually does

Before the math, here is the plain idea. A decision tree is a sequence of yes/no questions. At the top you ask one question about a feature (for example, “Is temperature above 38.5°C?”). Depending on the answer, you move left or right and ask the next question. You keep going until you reach a terminal box called a leaf, which gives the prediction.

The tree is “learned from data” in the sense that an algorithm figures out, from thousands of past patients, which questions to ask and in what order. It does this entirely by looking at which questions best separate the diagnoses.
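A learned tree is ultimately just nested yes/no questions, so a fitted one can be written out by hand. The sketch below mirrors the lesson's example path; the thresholds and the non-meningitis leaf labels are illustrative assumptions, not the actual fitted model:

```python
# Hand-coded version of a small triage tree: each branch is one yes/no question.
# Thresholds and leaf labels other than "bacterial meningitis" are illustrative.

def triage_predict(temperature, headache_severity, stiff_neck):
    """Walk from root to leaf, answering one question per level."""
    if temperature > 38.5:              # root question
        if headache_severity > 6:       # second question
            if stiff_neck:              # third question
                return "bacterial meningitis"
            return "severe flu"         # assumed leaf label
        return "flu"                    # assumed leaf label
    return "common cold"                # assumed leaf label

print(triage_predict(39.2, 8, True))   # -> bacterial meningitis
```

Learning replaces the hand-chosen questions and thresholds here with ones selected automatically from data, as the next section describes.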

Learning splits from impurity reduction

The algorithm needs a way to measure how “mixed” a group of patients is at any point. This is called impurity.

  • Zero impurity means all patients at this node have the same diagnosis: the group is pure. That is the best outcome for a leaf.
  • High impurity means the patients at this node are scattered across multiple diagnoses: the group is mixed. That is a sign we should ask another question.

The most common impurity measure is Gini impurity. For class proportions $p_1, \dots, p_k$ at a node, Gini impurity is:

$$G = 1 - \sum_{c=1}^{k} p_c^2$$

In this formula, $k$ is the number of possible diagnoses, $p_c$ is the fraction of patients at this node who have diagnosis $c$, and $c$ indexes each possible diagnosis one by one. The sum $\sum_{c=1}^{k} p_c^2$ adds up all the squared fractions. Subtracting from 1 inverts the result so that a pure node scores 0 and a maximally mixed node scores highest.
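The formula above is short enough to sketch directly. Here is a minimal Python version of Gini impurity, checked against the two extremes described in the bullets:

```python
def gini(proportions):
    """Gini impurity: G = 1 - sum(p_c^2) over class proportions p_c."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([1.0, 0.0]))   # pure node -> 0.0
print(gini([0.5, 0.5]))   # maximally mixed two-class node -> 0.5
```

A pure node scores 0, and a 50/50 two-class node scores 0.5, the maximum for two classes.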

The model chooses the split with the highest information gain:

$$\operatorname{Gain} = G(\text{parent}) - \sum_{b \in \{\text{left},\, \text{right}\}} \frac{n_b}{n} G(b)$$

In plain terms: at every node, the algorithm tries every possible question (“Is temperature above 38.5? Above 39? Above 37?”) and picks the one that most reduces impurity in the resulting groups. $G(\text{parent})$ is the impurity before the split. The sum on the right is the weighted average impurity of the two child groups after the split. The difference, the gain, is how much purity improvement we get by asking this particular question. The algorithm always picks the question with the highest gain.

In the triage tree below, candidate splits include thresholds such as temperature $> 38.5$ or headache severity $> 6.5$. Here, $n$ is the number of samples at the current node and $n_b$ is the count in one child branch after the split.
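The gain computation can be sketched directly from the formula. The example split below, which sends five patients of one class left and five of the other right, is an invented best case, not data from the lesson's triage tree:

```python
def gini(proportions):
    """Gini impurity: G = 1 - sum(p_c^2)."""
    return 1.0 - sum(p * p for p in proportions)

def information_gain(parent_props, left_props, n_left, right_props, n_right):
    """Gain = G(parent) minus the size-weighted average impurity of the children."""
    n = n_left + n_right
    weighted_child = (n_left / n) * gini(left_props) + (n_right / n) * gini(right_props)
    return gini(parent_props) - weighted_child

# A 50/50 parent split into two pure children: the best possible two-class gain.
print(information_gain([0.5, 0.5], [1.0, 0.0], 5, [0.0, 1.0], 5))  # 0.5
```

A split that leaves both children as mixed as the parent has a gain of 0, so the algorithm would never prefer it.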

Interpreting tree paths as clinical rules

Each root-to-leaf path is a rule extracted from data. If a patient follows the path:

  • Temperature > 38.5,
  • Headache Severity > 6,
  • Stiff Neck Positive,

then the model lands in the leaf predicting bacterial meningitis. The value is not only classification accuracy but transparent reasoning structure that clinicians can inspect and challenge.

Practical walkthrough: reading one prediction path

Use the interactive model once with the “Our triage patient” scenario:

  1. Select the patient scenario in the sidebar.
  2. Follow the highlighted path from root to leaf.
  3. At each internal node, map the split question to the patient feature value.
  4. Confirm the terminal leaf and compare prediction vs. actual label.
  5. Click each node in the path to inspect sample counts and impurity context.

What this means in practice: a tree prediction is a small, auditable decision protocol, not a black-box score.
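That auditability can be made literal in code: a prediction function can return the full sequence of checks alongside the label. The tree structure and thresholds below are illustrative assumptions, not the lesson's interactive model:

```python
# A tree as a plain data structure; prediction returns the label AND the
# audit trail of questions answered along the way. Thresholds are illustrative.

TREE = {
    "question": ("temperature", 38.5),
    "yes": {
        "question": ("headache_severity", 6.0),
        "yes": {"leaf": "bacterial meningitis"},
        "no": {"leaf": "flu"},             # assumed leaf label
    },
    "no": {"leaf": "common cold"},         # assumed leaf label
}

def predict_with_path(node, patient):
    """Follow the highlighted path from root to leaf, recording each check."""
    path = []
    while "leaf" not in node:
        feature, threshold = node["question"]
        answer = patient[feature] > threshold
        path.append(f"{feature} > {threshold}? {'yes' if answer else 'no'}")
        node = node["yes"] if answer else node["no"]
    return node["leaf"], path

label, path = predict_with_path(TREE, {"temperature": 39.0, "headache_severity": 8})
print(label)            # -> bacterial meningitis
for step in path:
    print(step)
```

Every prediction thus comes with the exact branch sequence a clinician could inspect or challenge.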

The triage decision tree

The interactive tree below uses a synthetic triage dataset. The root asks about temperature; downstream nodes refine by headache severity and stiff neck evidence.

[Interactive: Triage Decision Tree — select a patient scenario to highlight the active decision path; click any node to inspect its sample counts and impurity.]

From single trees to robust ensembles

Single trees are easy to inspect, but they can be unstable: small training changes can produce different split structures. Ensemble methods reduce this instability by combining many trees.

  • Random forests average over many de-correlated trees to reduce variance.
  • Gradient boosting builds trees sequentially to correct earlier errors.
  • Both often improve accuracy and robustness on tabular clinical data.
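The variance-reduction idea behind averaging can be illustrated with a toy model: each “tree” is just a weak classifier that errs some fraction of the time, and a majority vote over many of them is far more reliable than any single one. Everything below is a simplified sketch, not a real random forest:

```python
import random

def noisy_stump(x, rng):
    """A weak 'tree': thresholds x at 0, but flips its answer 20% of the time."""
    answer = 1 if x > 0 else 0
    return answer if rng.random() > 0.2 else 1 - answer

def ensemble_predict(x, n_trees, seed=0):
    """Majority vote over n_trees independent noisy stumps."""
    rng = random.Random(seed)
    votes = sum(noisy_stump(x, rng) for _ in range(n_trees))
    return 1 if votes > n_trees / 2 else 0

# One stump is wrong 20% of the time; a vote over 101 of them almost never is.
print(ensemble_predict(1.5, n_trees=101))
```

Real random forests add de-correlation via bootstrap samples and random feature subsets, but the stabilizing mechanism is the same vote over many imperfect trees.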

Relation to earlier lessons

  1. Lesson 9 introduced empirical risk minimization and generalization.
  2. Lesson 10 instantiates that framework in a model family where decisions are human-readable.
  3. This reconnects to lesson 3’s rule logic, but with rules induced from data rather than hand-authored.

Concrete bridge: lesson 9 asked “How do we learn a mapping?” Lesson 10 shows one of the most interpretable mappings you can learn.

Notation quick reference

| Symbol | Meaning | Detailed link |
| --- | --- | --- |
| $p_c$ | class proportion for class $c$ in a node | Learning splits from impurity reduction |
| $G$ | Gini impurity of a node | Learning splits from impurity reduction |
| $n$ | sample count at parent node | Learning splits from impurity reduction |
| Leaf | terminal prediction node | Interpreting tree paths as clinical rules |
| Feature importance | aggregate split contribution signal | From single trees to robust ensembles |


What comes next

In lesson 11, we move from explicit split rules to distributed representations with neural networks, where decisions emerge from weighted nonlinear layers rather than one path through a tree.


References and Further Reading

  • Breiman, L. et al. Classification and Regression Trees. Wadsworth, 1984.

  • Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.

This is Lesson 10 of 18 in the AI Starter Course.