In lesson 9, we introduced machine learning as optimization over a hypothesis class. This lesson makes that concrete with decision trees: models that remain fully interpretable while still learning from data.
In a triage setting, this matters because each prediction can be explained as a path of thresholded clinical checks. Instead of “the model says so,” we can inspect the exact branch sequence that produced the diagnosis.
Two quick vocabulary notes before the core ideas: feature space is the abstract space whose axes are the input features, so each patient becomes one point in that space. Pruning means trimming back branches that fit the training data too specifically, which usually reduces overfitting on new patients.
Core learnings about decision trees
- Decision trees split feature space recursively using impurity reduction.
- Information gain determines which feature and threshold appear at each node.
- Trees are highly interpretable but can overfit without pruning constraints.
- Ensemble variants improve stability while preserving much of the tree-based intuition.
What a decision tree actually does
Before the math, here is the plain idea. A decision tree is a sequence of yes/no questions. At the top you ask one question about a feature (for example, “Is temperature above 38.5°C?”). Depending on the answer, you move left or right and ask the next question. You keep going until you reach a terminal box called a leaf, which gives the prediction.
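The yes/no-question idea above can be written directly as nested conditionals. This is only a hand-built sketch: the thresholds mirror the lesson's examples, and every leaf label except the meningitis one is a hypothetical placeholder, not part of the lesson's tree.

```python
def triage(temperature, headache_severity, stiff_neck):
    """Walk one question at a time until a leaf gives the prediction."""
    if temperature > 38.5:
        if headache_severity > 6:
            if stiff_neck:
                return "bacterial meningitis"
            return "severe viral infection"  # hypothetical leaf
        return "flu-like illness"            # hypothetical leaf
    return "low-risk"                        # hypothetical leaf

print(triage(39.1, 8, True))  # bacterial meningitis
```

The learning algorithm's job, described next, is to discover these questions and thresholds from data instead of having a person write them.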
The tree is “learned from data” in the sense that an algorithm figures out, from thousands of past patients, which questions to ask and in what order. It does this entirely by looking at which questions best separate the diagnoses.
Learning splits from impurity reduction
The algorithm needs a way to measure how “mixed” a group of patients is at any point. This is called impurity.
- Zero impurity means all patients at this node have the same diagnosis: the group is pure. That is the best outcome for a leaf.
- High impurity means the patients at this node are scattered across multiple diagnoses: the group is mixed. That is a sign we should ask another question.
The most common impurity measure is Gini impurity. For class proportions $p_1, \dots, p_K$ at a node, Gini impurity is:

$$G = 1 - \sum_{k=1}^{K} p_k^2$$

In this formula, $K$ is the number of possible diagnoses, $p_k$ is the fraction of patients at this node who have diagnosis $k$, and $k$ indexes each possible diagnosis one by one. The sum adds up all squared fractions. Subtracting the sum from 1 inverts the result so that a pure node scores 0 and a maximally mixed node scores highest.
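The formula translates into a few lines of code. This is a minimal sketch that assumes the diagnoses at one node are given as a plain list of class labels:

```python
from collections import Counter

def gini(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node scores 0; a 50/50 mix of two diagnoses scores 0.5.
print(gini(["flu", "flu", "flu"]))                       # 0.0
print(gini(["flu", "flu", "meningitis", "meningitis"]))  # 0.5
```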
The model chooses the split with the highest information gain:

$$\text{Gain} = G_{\text{parent}} - \sum_{\text{child}} \frac{n_{\text{child}}}{n}\, G_{\text{child}}$$

In plain terms: at every node, the algorithm tries every possible question (“Is temperature above 38.5? Above 39? Above 37?”) and picks the one that most reduces impurity in the resulting groups. $G_{\text{parent}}$ is the impurity before the split. The sum on the right is the weighted average impurity of the two child groups after the split. The difference, Gain, is how much purity improvement we get by asking this particular question. The algorithm always picks the question with the highest gain.
In the triage tree below, candidate splits include thresholds such as temperature $> 38.5$ or headache severity $> 6$. Here, $n$ is the number of samples at the current node and $n_{\text{child}}$ is the count in one child branch after the split.
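The split search can be sketched for a single numeric feature. The temperatures and diagnoses below are hypothetical values for illustration only; the `gini` helper is the standard Gini computation, not code from the lesson's interactive model:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each candidate threshold on one feature and return the one with
    the highest gain: parent impurity minus the weighted child impurities."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split that leaves one side empty is not a real question
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical patients: two viral, two bacterial, separable by temperature.
temps = [37.0, 37.4, 38.9, 39.2]
diags = ["viral", "viral", "bacterial", "bacterial"]
print(best_threshold(temps, diags))  # (37.4, 0.5): splitting at 37.4 yields two pure children
```

A full tree learner simply repeats this search over every feature at every node, recursing into each child until the nodes are pure or a stopping rule fires.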
Interpreting tree paths as clinical rules
Each root-to-leaf path is a rule extracted from data. If a patient follows the path:
- Temperature > 38.5,
- Headache Severity > 6,
- Stiff Neck Positive,
then the model lands in the leaf predicting bacterial meningitis. The value is not only classification accuracy but transparent reasoning structure that clinicians can inspect and challenge.
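Because a path is just a conjunction of threshold checks, it can be written down as an executable predicate. This sketch encodes the three-step path above; the dictionary keys are assumed names for the patient features:

```python
def meningitis_path(patient):
    """True when the patient follows the highlighted root-to-leaf path."""
    return (
        patient["temperature"] > 38.5
        and patient["headache_severity"] > 6
        and patient["stiff_neck"]
    )

patient = {"temperature": 39.1, "headache_severity": 8, "stiff_neck": True}
print(meningitis_path(patient))  # True
```

This is what makes the model auditable: a clinician can read the predicate and challenge any individual threshold.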
Practical walkthrough: reading one prediction path
Use the interactive model once with our triage patient:
- Select the patient scenario in the sidebar.
- Follow the highlighted path from root to leaf.
- At each internal node, map the split question to the patient feature value.
- Confirm the terminal leaf and compare prediction vs. actual label.
- Click each node in the path to inspect sample counts and impurity context.
What this means in practice: a tree prediction is a small, auditable decision protocol, not a black-box score.
The triage decision tree
The interactive tree below uses a synthetic triage dataset. The root asks about temperature; downstream nodes refine by headache severity and stiff neck evidence.
Triage Decision Tree
Classify a patient by selecting a scenario. The active decision path is highlighted.
From single trees to robust ensembles
Single trees are easy to inspect, but they can be unstable: small training changes can produce different split structures. Ensemble methods reduce this instability by combining many trees.
- Random forests average over many de-correlated trees to reduce variance.
- Gradient boosting builds trees sequentially to correct earlier errors.
- Both often improve accuracy and robustness on tabular clinical data.
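The two ingredients of a random forest can be sketched in a few lines: each tree trains on a bootstrap resample of the data, and the forest predicts by majority vote. This is a schematic illustration of those mechanics, not a full forest implementation:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample with replacement: each tree in a forest sees a different view."""
    return [rng.choice(data) for _ in data]

def majority_vote(tree_predictions):
    """The forest's prediction is the most common vote across its trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
print(bootstrap_sample(["p1", "p2", "p3", "p4"], rng))
print(majority_vote(["bacterial", "viral", "bacterial"]))  # bacterial
```

Because each tree sees a slightly different dataset, their individual instabilities tend to cancel out in the vote, which is why the ensemble's variance drops.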
Relation to earlier lessons
- Lesson 9 introduced empirical risk minimization and generalization.
- Lesson 10 instantiates that framework in a model family where decisions are human-readable.
- This reconnects to lesson 3’s rule logic, but with rules induced from data rather than hand-authored.
Concrete bridge: lesson 9 asked “How do we learn a mapping?” Lesson 10 shows one of the most interpretable mappings you can learn.
Notation quick reference
| Symbol | Meaning | Detailed link |
|---|---|---|
| $p_k$ | class proportion for class $k$ in a node | Learning splits from impurity reduction |
| $G$ | Gini impurity of a node | Learning splits from impurity reduction |
| $n$ | sample count at parent node | Learning splits from impurity reduction |
| Leaf | terminal prediction node | Interpreting tree paths as clinical rules |
| Feature importance | aggregate split contribution signal | From single trees to robust ensembles |
What comes next
In lesson 11, we move from explicit split rules to distributed representations with neural networks, where decisions emerge from weighted nonlinear layers rather than one path through a tree.
References and Further Reading
- Breiman, L. et al. Classification and Regression Trees. Wadsworth, 1984.
- Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.
This is Lesson 10 of 18 in the AI Starter Course.