
The neural networks and decision trees in recent lessons produce point predictions: one label, one number. But a physician making a triage decision is not working with point predictions. She is reasoning over a distribution of possibilities, updating her beliefs as each new finding arrives: “fever raises my meningitis suspicion; now I’ve confirmed stiff neck, so I’m much more concerned; the headache is severe and sudden-onset, so I’m ordering a lumbar puncture.”

This structured, belief-updating mode of reasoning is exactly what probabilistic AI formalises.

Core learnings about probabilistic AI

  • Probabilistic AI represents uncertainty explicitly with probability distributions instead of single hard labels.
  • Bayesian networks compress large joint distributions through conditional independence structure.
  • Inference updates beliefs as evidence arrives, producing posterior probabilities suitable for risk-aware decisions.
  • Calibration quality determines whether predicted probabilities are trustworthy in clinical workflows.

Probability as Degrees of Belief

Before the equations: in everyday language, probability is a number between 0 and 1 that expresses how likely something is. 0 means “impossible,” 1 means “certain,” 0.5 means “equally likely to happen or not.” A probability distribution assigns these numbers to all possible outcomes, and they must sum to 1. For diagnoses, P(meningitis) = 0.004 means we believe meningitis is possible but rare in the general population.

In the Bayesian interpretation, we distinguish two kinds of probability:

  • A prior probability, written as P(h), is our belief before we see any new evidence from the current patient. It reflects background knowledge like disease prevalence.
  • A posterior probability, written as P(h | E), is our updated belief after incorporating observed evidence E. It is the prior adjusted by what we just learned.

For example: before examining the patient, the prior probability of meningitis might be 0.004, meaning rare. After seeing fever, stiff neck, and severe headache, the posterior probability of meningitis may jump dramatically, perhaps toward 0.40. In words, Bayes’s rule says: posterior = likelihood × prior ÷ probability of the evidence.

Variable mapping for this course context:

  • h (hypothesis): a candidate diagnosis statement (for example, “patient has meningitis”).
  • E (evidence): observed findings (for example, fever + stiff neck + severe headache).
  • P(h): prior belief before seeing current patient evidence.
  • P(E | h): likelihood of seeing evidence E if hypothesis h is true.
  • P(h | E): updated posterior belief after incorporating evidence.
  • P(E): normalizing probability of the evidence under all hypotheses.

Putting these together gives the fundamental update equation, Bayes’s rule: P(h | E) = P(E | h) × P(h) / P(E). It says: the posterior is proportional to the likelihood of the evidence given the hypothesis, times the prior on the hypothesis. The denominator P(E) is a normalising constant that ensures the posterior is a valid probability distribution.
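The update can be sketched numerically. This is a minimal illustration, assuming hypothetical likelihood values (the 0.70 and 0.004 figures below are invented for the example, not clinical estimates):

```python
# A minimal sketch of Bayes's rule with illustrative numbers.
# The likelihood values below are hypothetical, chosen only to show the mechanics.

def posterior(prior: float, likelihood: float, evidence_prob: float) -> float:
    """P(h | E) = P(E | h) * P(h) / P(E)."""
    return likelihood * prior / evidence_prob

p_meningitis = 0.004                  # prior P(h): rare in the general population
p_findings_given_meningitis = 0.70    # likelihood P(E | h), hypothetical
p_findings_given_not = 0.004          # P(E | not h): these findings are rare otherwise

# Normalising constant P(E) via the law of total probability.
p_findings = (p_findings_given_meningitis * p_meningitis
              + p_findings_given_not * (1 - p_meningitis))

p_post = posterior(p_meningitis, p_findings_given_meningitis, p_findings)
print(f"P(meningitis | findings) = {p_post:.2f}")  # jumps from 0.004 to about 0.41
```

Note how a tiny prior becomes a large posterior once the evidence is far more likely under the hypothesis than under its negation.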

The Problem with Full Joint Distributions

For a clinical scenario with 20 binary variables, the full joint probability distribution has 2^20 = 1,048,576 entries. It is intractable to specify and store. We need a compact representation that captures the important dependencies.

Conditional independence is the key insight. P(stiff neck | meningitis, influenza) = P(stiff neck | meningitis): given meningitis status, stiff neck is independent of whether the patient has influenza. Most real-world variables are conditionally independent of most others given a small set of direct causes. Exploiting this structure compresses exponential tables into manageable pieces.
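The arithmetic of this compression is easy to check. A short sketch, assuming each variable has at most three parents (an assumption for illustration):

```python
# Table sizes for 20 binary variables: full joint vs. Bayesian network CPTs.
n = 20
full_joint = 2 ** n          # one entry per complete assignment
print(full_joint)            # 1048576

# If each variable has at most k parents, its CPT needs one probability per
# parent assignment (the complement is implied), i.e. 2**k entries.
k = 3                        # hypothetical bound on parents per node
bn_entries = n * 2 ** k
print(bn_entries)            # 160
```

A million-entry table shrinks to a few hundred numbers once the dependency structure is exploited.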

Bayesian Networks: Encoding Structure

A Bayesian network (BN) is built from random variables: quantities whose exact values are uncertain and described with probabilities. In this lesson, examples include fever status, headache severity, and whether meningitis is present.

Formally, a Bayesian network is a directed acyclic graph (DAG) where:

  • Nodes represent random variables.
  • Edges represent direct causal or statistical dependencies (parent causes child).
  • Each node has a conditional probability table (CPT): P(variable | parents).

The full joint distribution factors as the product of CPTs:

P(X1, X2, ..., Xn) = product over all i of P(Xi | Parents(Xi))

Read this factorization term by term:

  • X_i is variable i in the model.
  • Parents(X_i) are the direct causes/dependencies of X_i in the graph.
  • The product over all i multiplies each local conditional table into one consistent joint distribution.

This product is correct because of the conditional independence assumptions encoded in the DAG structure.
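The factorization can be made concrete on a tiny three-node network, Meningitis → StiffNeck and Meningitis → Fever. All CPT numbers below are hypothetical, for illustration only:

```python
# Chain-rule factorization for a toy network: Meningitis -> {StiffNeck, Fever}.
# CPT numbers are hypothetical, for illustration only.

p_men = {True: 0.004, False: 0.996}                # P(Meningitis)
p_stiff = {True: {True: 0.8, False: 0.2},          # P(StiffNeck | Meningitis)
           False: {True: 0.01, False: 0.99}}
p_fever = {True: {True: 0.9, False: 0.1},          # P(Fever | Meningitis)
           False: {True: 0.05, False: 0.95}}

def joint(men: bool, stiff: bool, fever: bool) -> float:
    """P(M, S, F) = P(M) * P(S | M) * P(F | M), per the DAG's factorization."""
    return p_men[men] * p_stiff[men][stiff] * p_fever[men][fever]

# The eight joint entries must sum to 1 for this to be a valid distribution.
total = sum(joint(m, s, f)
            for m in (True, False) for s in (True, False) for f in (True, False))
print(round(total, 10))  # 1.0
```

Three small tables (2 + 4 + 4 = 10 numbers, half of them implied) fully determine an 8-entry joint distribution; that gap widens exponentially as variables are added.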

The Triage Bayesian Network

The network below encodes the causal structure of our triage scenario. Influenza and Meningitis are the root causes. Fever can be caused by either. Stiff neck is caused only by Meningitis. Headache can be caused by either. The final node, Meningitis Diagnosis, depends on Meningitis (ground truth), Stiff Neck, and Fever.

Set evidence by selecting observed findings in the sidebar, and watch how the query, P(Meningitis Diagnosis), updates as evidence accumulates. Then click any node to inspect its CPT.

[Interactive: Triage Bayesian Network — set observed findings to update the posterior probability of meningitis diagnosis; click a node to view its CPT.]

Inference in Bayesian Networks

Given a set of observed variables E and a query variable Q, Bayesian inference computes P(Q | E). The naive approach is enumeration: sum the joint probability over all hidden variables. This is exact but exponential in the number of hidden variables.
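Enumeration can be sketched directly on a toy version of the network (Meningitis → StiffNeck, Meningitis → Fever, with hypothetical CPT numbers): condition on the evidence, sum the joint over the remaining variables, and normalise.

```python
# Exact inference by enumeration on a toy network: Meningitis -> {StiffNeck, Fever}.
# CPT numbers are hypothetical, for illustration only.

p_men = {True: 0.004, False: 0.996}
p_stiff = {True: {True: 0.8, False: 0.2}, False: {True: 0.01, False: 0.99}}
p_fever = {True: {True: 0.9, False: 0.1}, False: {True: 0.05, False: 0.95}}

def joint(m: bool, s: bool, f: bool) -> float:
    return p_men[m] * p_stiff[m][s] * p_fever[m][f]

def posterior_meningitis(stiff: bool, fever: bool) -> float:
    """P(Meningitis=True | StiffNeck, Fever): sum the joint, then normalise."""
    unnorm = {m: joint(m, stiff, fever) for m in (True, False)}
    return unnorm[True] / (unnorm[True] + unnorm[False])

print(f"{posterior_meningitis(stiff=True, fever=True):.2f}")  # about 0.85
```

Here the query variable is the only unobserved one, so the "sum over hidden variables" is trivial; with many hidden variables the same loop becomes exponential, which motivates the algorithms below.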

Practical algorithms include:

Variable elimination: Systematically marginalise out hidden variables in an order chosen to minimise intermediate table sizes. Here, marginalise out means “sum over the possible values of a variable so it no longer appears explicitly in the final probability expression.” This is much more efficient than full enumeration when the structure is sparse.

Belief propagation (message passing): For networks with a tree structure, marginal probabilities can be computed exactly in linear time by passing “messages” between adjacent nodes. Loopy belief propagation extends this to graphs with cycles, trading exactness for efficiency.

MCMC sampling (Gibbs sampling): Sample from the joint distribution by iterating through each variable, sampling it from its conditional distribution given the current values of all other variables. Converges to the true posterior for any network structure but requires many samples.
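The Gibbs step can be sketched on a slightly richer toy network with two hidden causes (Flu and Men as roots, Fever depending on both, StiffNeck on Men). All CPT numbers are hypothetical; the point is the mechanic of resampling each hidden variable from its conditional given everything else.

```python
import random

# Gibbs sampling sketch on a toy triage network (hypothetical CPTs):
# Flu, Men are roots; Fever depends on both; StiffNeck depends on Men.
P_FLU = 0.05
P_MEN = 0.004
P_FEVER = {(True, True): 0.95, (True, False): 0.90,   # P(Fever=T | Flu, Men)
           (False, True): 0.90, (False, False): 0.02}
P_STIFF = {True: 0.80, False: 0.01}                   # P(Stiff=T | Men)

def bernoulli(w_true: float, w_false: float) -> bool:
    """Sample True with probability w_true / (w_true + w_false)."""
    return random.random() < w_true / (w_true + w_false)

def gibbs_posterior_men(n_sweeps: int = 20000, burn_in: int = 2000) -> float:
    """Estimate P(Men=T | Fever=T, Stiff=T) by Gibbs sampling over (Flu, Men)."""
    random.seed(0)
    flu, men = False, False
    hits = 0
    for sweep in range(n_sweeps):
        # Resample Flu from P(Flu | men, Fever=T): prior times fever likelihood.
        flu = bernoulli(P_FLU * P_FEVER[(True, men)],
                        (1 - P_FLU) * P_FEVER[(False, men)])
        # Resample Men from P(Men | flu, Fever=T, Stiff=T).
        men = bernoulli(P_MEN * P_FEVER[(flu, True)] * P_STIFF[True],
                        (1 - P_MEN) * P_FEVER[(flu, False)] * P_STIFF[False])
        if sweep >= burn_in:
            hits += men
    return hits / (n_sweeps - burn_in)

print(f"Gibbs estimate: {gibbs_posterior_men():.2f}")  # exact enumeration gives ~0.82
```

Each variable's conditional needs only its Markov blanket (parents, children, and the children's other parents), which is why Gibbs scales to networks far too large for enumeration.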

Naive Bayes: The Simple Special Case

When all features are assumed conditionally independent given the class label, the Bayesian network collapses to Naive Bayes:

P(class | features) proportional to P(class) * product of P(feature_i | class)

Despite the strong (often wrong) independence assumption, Naive Bayes classifiers perform surprisingly well on high-dimensional sparse data like text and work reliably even with tiny training sets. A clinical Naive Bayes model for triage classification would estimate from the training data: P(fever | meningitis), P(stiff neck | meningitis), P(severe headache | meningitis), and combine them with the meningitis prior to produce a posterior probability.
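Such a classifier fits in a few lines. The conditional probabilities below are hypothetical, standing in for estimates from training data:

```python
# A minimal Naive Bayes sketch for the triage example.
# All probabilities are hypothetical stand-ins for training-data estimates.

prior = {"meningitis": 0.004, "no_meningitis": 0.996}
# P(feature present | class), one table per class.
likelihood = {
    "meningitis":    {"fever": 0.90, "stiff_neck": 0.80, "severe_headache": 0.70},
    "no_meningitis": {"fever": 0.10, "stiff_neck": 0.01, "severe_headache": 0.05},
}

def naive_bayes_posterior(observed: list[str]) -> dict[str, float]:
    """P(class | features) proportional to P(class) * product of P(feature_i | class)."""
    scores = {}
    for cls in prior:
        score = prior[cls]
        for feature in observed:
            score *= likelihood[cls][feature]
        scores[cls] = score
    z = sum(scores.values())              # normalising constant
    return {cls: s / z for cls, s in scores.items()}

post = naive_bayes_posterior(["fever", "stiff_neck", "severe_headache"])
print(f"P(meningitis | findings) = {post['meningitis']:.3f}")
```

In practice one works in log space to avoid underflow when multiplying many small probabilities, but the structure is exactly this: prior times per-feature likelihoods, then normalise.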

Why Calibration Matters Clinically

A neural network that predicts “meningitis probability = 0.7” is useful only if that 0.7 is calibrated: across all patients where the model said 0.7, approximately 70% should actually have meningitis.

Well-specified probabilistic models tend to be better calibrated than discriminative classifiers, but calibration is never automatic: Naive Bayes, for example, is often overconfident precisely because its independence assumption is violated. Neural networks trained with cross-entropy loss are also frequently poorly calibrated (overconfident), requiring a post-hoc calibration step such as Platt scaling or temperature scaling.
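Calibration can be measured by binning predictions by confidence and comparing each bin's mean predicted probability with its empirical positive rate. A minimal sketch on synthetic data (the predictions and labels below are invented, and deliberately perfectly calibrated):

```python
# Calibration check sketch: compare predicted probability with observed frequency
# per confidence bin. The data below is synthetic, for illustration only.

def calibration_bins(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            report.append((round(mean_p, 2), round(freq, 2), len(b)))
    return report

# Synthetic, perfectly calibrated toy data: of ten 0.7 predictions, seven positive.
probs  = [0.7] * 10 + [0.1] * 10
labels = [1] * 7 + [0] * 3 + [1] * 1 + [0] * 9
for mean_p, freq, n in calibration_bins(probs, labels):
    print(f"predicted {mean_p:.2f} vs observed {freq:.2f} (n={n})")
```

A well-calibrated model shows the two columns tracking each other; a gap of predicted 0.92 against observed 0.55 is exactly the failure mode described below.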

A triage AI that says “I am 92% confident” when it should say “55% confident” will cause physicians to anchor incorrectly on the wrong diagnosis.

Key Takeaways

  • Bayesian networks represent joint distributions compactly via conditional independence structure.
  • Each node’s CPT encodes P(variable | its parent set).
  • Inference updates priors to posteriors as evidence accumulates, using exact or approximate algorithms.
  • The triage scenario naturally maps to a causal BN with disease nodes causing symptom nodes.
  • Calibration is a clinical safety requirement: predicted probabilities must reflect actual frequencies.
  • Naive Bayes provides a tractable special case when conditional independence can be assumed.

Relation to earlier lessons

  1. Lesson 15 showed how foundation models can map unstructured evidence to predictions.
  2. Lesson 16 adds explicit uncertainty math so outputs are belief distributions, not just top labels.
  3. Domain continuity is preserved: we still reason about meningitis risk, but now in probability space.

Concrete bridge: lesson 15 emphasized expressive prediction. This lesson emphasizes calibrated belief updating under uncertainty.

Notation quick reference

  • P(h): prior probability of hypothesis h — see Probability as Degrees of Belief.
  • P(h | E): posterior probability after evidence E — see Probability as Degrees of Belief.
  • P(E | h): likelihood of evidence under hypothesis — see Probability as Degrees of Belief.
  • CPT: conditional probability table — see Bayesian Networks: Encoding Structure.
  • BN: Bayesian network — see Bayesian Networks: Encoding Structure.
  • Enumeration: exact inference by summing hidden variables — see Inference in Bayesian Networks.
  • Calibration: agreement between confidence and empirical frequency — see Why Calibration Matters Clinically.
  • Naive Bayes: conditional-independence classifier special case — see Naive Bayes: The Simple Special Case.

What comes next

In lesson 17, we extend static uncertainty models to temporal and structured uncertainty with Markov chains, HMMs, DBNs, and explicit decision-making under uncertainty.

For continued reading flow, revisit Lesson 15: Modern Architectures and then continue to Lesson 17: Uncertainty and Graphical Models.


References and Further Reading

  • Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
  • Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020. Chapters 12-14.
  • Murphy, K. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.

This is Lesson 16 of 18 in the AI Starter Course.