In lessons 1 through 8, every piece of reasoning was written by a human. A developer or domain expert decided which rules to check, which conditions to combine, and which conclusions to draw. The AI was as smart as whoever wrote the rules, and no smarter.
This lesson introduces a fundamentally different idea: what if the system worked out the rules itself, by looking at examples?
That is machine learning in one sentence. Instead of telling the system “if fever and stiff neck, then suspect meningitis,” you give it a thousand past patient records, each labelled with what the correct diagnosis turned out to be, and let it discover the pattern on its own.
For hospital triage, this is a major shift. Rather than manually coding every diagnostic combination, we provide many past patient cases and ask the model to infer a mapping from clinical features to outcomes. The question is no longer “Which rule should fire?” It becomes “Which pattern in the data generalises from past patients to new ones safely?”
What kinds of problems machine learning solves
Machine learning has three main settings. Each describes a different type of data problem:
Supervised learning: You have examples with correct labels. You train the model to predict the right label for new examples it has never seen. This is the setting used in this lesson and in lessons 10-13. In triage, supervised learning means: given thousands of past patients with known diagnoses, train a model that can diagnose new patients.
Unsupervised learning: You have examples but no labels. The model tries to discover structure on its own, for example, grouping patients into phenotype clusters based on symptom patterns, without being told what the clusters should be.
Reinforcement learning: There are no fixed examples at all. The model learns by taking actions and receiving feedback (rewards or penalties). This is how systems learn to play games, control robots, or manage treatment dosing schedules over time.
The rest of this lesson focuses on supervised learning, because it is the foundation for everything in lessons 10-15.
Core learnings about machine learning foundations
- Supervised learning fits a mapping from inputs to labels by minimizing a loss over training examples.
- Model design is a bias-variance tradeoff: simple models underfit, highly flexible models overfit without constraints.
- Generalization on unseen data matters more than training accuracy.
- Data quality and evaluation design are first-order concerns because learned behavior reflects dataset structure.
What a model actually is
The word “model” in machine learning just means a mathematical function. It takes an input (for example, a patient’s four measurements) and produces an output (for example, a diagnosis probability for each class). The model has internal adjustable numbers, called parameters, and training means finding the right values for those parameters so the model’s outputs match the correct answers in the training data.
Think of it like tuning a radio. You have a dial (one parameter) and you turn it until the signal is clear. Training a model is the same idea, except the “radio” has millions of dials, and the “signal” is measured as average prediction error across thousands of examples.
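The radio analogy can be made concrete with a minimal sketch: a model with a single parameter (one "dial"), trained by turning the dial until average error over past cases is lowest. All numbers below are made up for illustration, not real clinical values.

```python
# Past cases: (temperature_celsius, correct_label), 1 = flag for urgent workup.
# Illustrative data only.
training_data = [(36.8, 0), (37.1, 0), (38.6, 1), (39.2, 1), (37.9, 0), (40.0, 1)]

def model(temperature, threshold):
    """Predict 1 (flag) if temperature exceeds the threshold parameter."""
    return 1 if temperature > threshold else 0

def average_error(threshold):
    """Fraction of training cases the model gets wrong at this dial setting."""
    wrong = sum(model(t, threshold) != label for t, label in training_data)
    return wrong / len(training_data)

# "Training": turn the dial over candidate settings and keep the clearest signal.
candidates = [36.5, 37.0, 37.5, 38.0, 38.5, 39.0]
best = min(candidates, key=average_error)
print(best, average_error(best))  # the dial setting with the lowest average error
```

A real model differs only in scale: millions of dials adjusted simultaneously by an optimizer instead of one dial checked by exhaustive search.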
Learning from data as optimization
Before the equations, let us be concrete about what “learning from data” actually means in mechanical terms. A learning algorithm is just an optimization program. It is given a set of past examples, each a pair of (input, correct answer), and it searches for parameter values that make the model’s outputs match the correct answers as well as possible across those examples. It does not understand the problem the way a human does; it just finds a pattern that fits the numbers.
The set of candidate functions the algorithm is allowed to consider is called the hypothesis class . Think of it as the shape library. If the hypothesis class is “all straight lines,” the learning algorithm can only produce linear predictions. If the hypothesis class is “all decision trees with at most 5 splits,” it can produce tree-shaped rules. Choosing the hypothesis class is the designer’s job; the algorithm then searches inside it for the best fit.
Formally, supervised learning chooses a function from a hypothesis class $\mathcal{H}$ to minimize empirical risk:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

Read the symbols left to right before interpreting the whole line:
- $\arg\min_{f \in \mathcal{H}}$ means “the argument (model choice) that makes the objective smallest.”
- $i$ indexes training examples from $1$ to $n$.
- $f(x_i)$ is the model prediction for example $i$, and $\ell(f(x_i), y_i)$ is the error on that example.
- The average $\frac{1}{n}\sum$ means we optimize for overall dataset performance, not a single case.
Here, $x_i$ is one patient feature vector (temperature, stiff-neck indicator, headache severity, oxygen status), $y_i$ is the observed label (for example, low acuity vs. urgent workup), and $\ell$ is the penalty for being wrong. In our triage example, this formula means: among all candidate decision functions in $\mathcal{H}$, pick the one with the lowest average prediction error on recorded patient cases.
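Empirical risk minimization can be sketched by brute force: enumerate a small hypothesis class of threshold rules and keep the candidate with the lowest average loss. The features, labels, and rule family below are illustrative assumptions, not a real triage model.

```python
# Each x_i: (temperature, stiff_neck 0/1); each y_i: 1 = urgent workup.
# Illustrative data only.
X = [(39.5, 1), (37.0, 0), (38.2, 0), (40.1, 1), (36.9, 1), (38.8, 1)]
y = [1, 0, 0, 1, 0, 1]

def make_rule(temp_cutoff, require_stiff_neck):
    """One hypothesis: flag if temperature exceeds a cutoff, optionally
    also requiring a stiff neck."""
    def rule(x):
        temp, stiff = x
        if require_stiff_neck and not stiff:
            return 0
        return 1 if temp > temp_cutoff else 0
    return rule

# The hypothesis class: every combination of cutoff and stiff-neck requirement.
hypothesis_class = [make_rule(c, s)
                    for c in (37.5, 38.0, 38.5, 39.0)
                    for s in (False, True)]

def empirical_risk(f):
    """Average 0-1 loss of candidate f over the training examples."""
    return sum(f(x) != label for x, label in zip(X, y)) / len(y)

# The argmin over the hypothesis class: the best-fitting candidate.
f_hat = min(hypothesis_class, key=empirical_risk)
print(empirical_risk(f_hat))
```

Real learning algorithms replace this exhaustive search with gradient-based or greedy optimization, but the objective being minimized is the same average loss.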
So what does this tell us in practice? The model is not “learning magic”; it is repeatedly testing candidate mappings against known outcomes and keeping the mapping that minimizes error under the chosen loss.
Why model class and data both matter
The hypothesis class $\mathcal{H}$ defines what the model can represent. If $\mathcal{H}$ is too restricted, even perfect optimization cannot recover the true clinical pattern. If $\mathcal{H}$ is too flexible, the model may memorize random quirks in the training data.
This is the root of the bias-variance tradeoff. We often summarize expected test error as:

$$\text{Expected test error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}$$
In the triage setting, high bias appears when a model is too rigid to capture nonlinear symptom interactions. High variance appears when a model overreacts to random details of the training cohort and changes predictions unpredictably for similar new patients. The decomposition reminds us that architecture choice, regularization, and data volume all push on different parts of the final error.
Two practical terms now matter before we use them in the workflow below. Regularization means deliberately discouraging the model from fitting noise too closely, usually by limiting complexity or penalizing overly large parameter values. Calibration means that predicted probabilities should match real-world frequencies, so a model that outputs 70% confidence should be right about 70% of the time on similar cases.
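A calibration check can be sketched as a reliability table: bin the model's predicted probabilities and compare each bin's average prediction with the observed frequency of the positive label. The predictions and outcomes below are made-up examples.

```python
# Illustrative predicted probabilities and true outcomes (1 = positive label).
predictions = [0.1, 0.2, 0.15, 0.7, 0.75, 0.65, 0.9, 0.95, 0.3, 0.8]
outcomes    = [0,   0,   0,    1,   1,    0,    1,   1,    1,   1]

def calibration_table(preds, labels, n_bins=3):
    """Group cases into equal-width probability bins and report, per bin,
    (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, label))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        pos_rate = sum(label for _, label in members) / len(members)
        rows.append((round(mean_pred, 2), round(pos_rate, 2), len(members)))
    return rows

for mean_pred, pos_rate, count in calibration_table(predictions, outcomes):
    print(f"predicted ~{mean_pred}, observed {pos_rate}, n={count}")
```

A well-calibrated model produces rows where the predicted and observed columns roughly match; large gaps in any bin signal miscalibration even when overall accuracy looks good.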
Practical walkthrough: diagnosing model behavior
Use this workflow on the triage domain to distinguish underfitting from overfitting:
- Start with a simple model (for example, shallow tree or linear baseline) and record train/validation error.
- Increase model capacity and compare both errors again.
- If both errors stay high, interpret as high bias and enrich representation or features.
- If training error drops but validation error rises, interpret as high variance and add regularization or more data.
- Track calibration and subgroup metrics (age bands, comorbidity groups) before deployment decisions.
What this means in practice: machine learning quality is not one number. It is a profile of fit, generalization, calibration, and subgroup reliability.
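The capacity-sweep part of this workflow can be sketched with k-nearest-neighbour regression on synthetic data, where small k means high capacity (memorization) and large k means low capacity (smoothing). The data-generating function and all values are assumptions for illustration.

```python
import random

random.seed(0)

def true_risk_score(temp):
    """Hidden ground-truth relationship plus noise (synthetic, for illustration)."""
    return 0.1 * (temp - 37.0) ** 2 + random.gauss(0, 0.05)

train = [(36.0 + 0.2 * i, true_risk_score(36.0 + 0.2 * i)) for i in range(20)]
valid = [(36.1 + 0.2 * i, true_risk_score(36.1 + 0.2 * i)) for i in range(20)]

def knn_predict(x, data, k):
    """k-nearest-neighbour regression: average the k closest training targets.
    Small k = high capacity; large k = low capacity."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    return sum(t for _, t in nearest) / k

def mse(data, model_data, k):
    """Mean squared error of the k-NN model on a dataset."""
    return sum((knn_predict(x, model_data, k) - t) ** 2 for x, t in data) / len(data)

for k in (1, 3, 10, 20):
    print(f"k={k:2d}  train={mse(train, train, k):.4f}  valid={mse(valid, train, k):.4f}")
# Reading the table: k=1 gives zero training error (memorization, high variance),
# while very large k underfits both sets (high bias).
```

The diagnostic signatures match the workflow above: a train/validation gap that widens as capacity grows signals variance, and errors that stay high on both sets signal bias.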
Relation to earlier lessons
- In lessons 2-8, knowledge and transitions were explicitly authored by humans.
- In lesson 9, the decision mapping is inferred from data rather than written rule by rule.
- The tradeoff remains familiar: we still balance expressiveness against tractability, but now through model class and optimization.
Concrete bridge: planning asked “Which valid sequence reaches the goal?” Machine learning asks “Which learned mapping predicts the right goal from features?”
Notation quick reference
| Symbol | Meaning | Detailed link |
|---|---|---|
| $x_i$ | feature vector for example $i$ | Learning from data as optimization |
| $y_i$ | target label for example $i$ | Learning from data as optimization |
| $f$ | candidate prediction function | Learning from data as optimization |
| $\mathcal{H}$ | hypothesis class of allowed models | Why model class and data both matter |
| $\ell$ | loss function | Learning from data as optimization |
| $n$ | number of training examples | Learning from data as optimization |
| $\hat{f}$ | empirically optimized model | Learning from data as optimization |
| Bias | systematic approximation error | Why model class and data both matter |
| Variance | sensitivity to training sample fluctuations | Why model class and data both matter |
What comes next
In lesson 10, we instantiate these learning ideas with decision trees, where every root-to-leaf path is a transparent clinical rule learned from data.
References and Further Reading
- Mitchell, T. Machine Learning. McGraw-Hill, 1997.
- Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009.
- Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
This is Lesson 9 of 18 in the AI Starter Course.