
In lesson 14, we made knowledge explicit through symbolic structures and ontologies. This lesson shifts back to high-capacity neural sequence modeling: the Transformer architecture that now underpins modern language and multimodal foundation models.

For triage AI, this matters because clinical evidence often starts as unstructured text. A model that can connect findings across long notes and reason over token relationships at scale can dramatically improve extraction, summarization, and decision support.

One phrase appears early in this lesson: multi-head attention means the model runs several attention calculations in parallel, with each head free to focus on a different kind of relationship in the same sequence.

Core learnings about Transformers and foundation models

  • Self-attention lets each token condition on all other tokens in context.
  • Multi-head attention captures different relation types in parallel.
  • Transformer blocks combine attention, feedforward layers, residual paths, and normalization for stable depth.
  • Foundation models are pretrained broadly, then adapted to tasks via prompting, fine-tuning, and alignment methods.

What is a token?

When a language model processes text, it first splits the text into tokens, small pieces of text that are the model’s unit of input. In English, a token is roughly one word or a short word fragment. “Patient has fever and stiff neck” might become 7 tokens: ["Patient", "has", "fever", "and", "stiff", "ne", "ck"], where a less common word like “neck” is split into fragments.

Each token is then converted to a vector (a list of numbers) called an embedding. This embedding encodes the token’s meaning in a high-dimensional space, where similar meanings cluster nearby. The Transformer then processes this sequence of vectors.
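The two steps above can be sketched in a few lines of NumPy. The vocabulary, the token ids, and the 4-dimensional random embeddings below are toy assumptions for illustration; real models use learned subword tokenizers and learned embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy vocabulary mapping each token to an integer id (made up for this sketch).
vocab = {"Patient": 0, "has": 1, "fever": 2, "and": 3, "stiff": 4, "neck": 5}

# One embedding row per token id; real embeddings are learned, not random.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

tokens = "Patient has fever and stiff neck".split()
token_ids = [vocab[t] for t in tokens]          # text -> token ids
embeddings = embedding_table[token_ids]         # token ids -> vectors
print(embeddings.shape)                         # (6, 4): 6 tokens, 4 dims each
```

The Transformer never sees raw text: only this (sequence length, embedding dimension) array of vectors enters the attention layers.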

Why did attention replace recurrent models?

Before Transformers, the dominant approach was recurrent neural networks (RNNs): they read a sentence one token at a time, maintaining a “memory” that was updated at each step. The problem is that RNNs struggle with long sentences: information from early in the sentence gets diluted by the time the model reaches the end. Imagine trying to remember the first word of a 500-word paragraph by the time you reach the last word, keeping only a fixed-size summary.

Attention solves this by letting every token look directly at every other token in the sequence, regardless of distance. “Fever” at position 3 can directly compare itself to “neck stiffness” at position 47 without having to pass through every intermediate token. This is both faster to compute (parallelizable) and better at capturing long-range dependencies.

What are query, key, and value vectors?

The attention mechanism uses three vectors computed from each token: a query, a key, and a value. The intuition comes from information retrieval:

  • Query (Q_i): token i asks “what information do I need from the rest of the sequence?” This is the question being posed.
  • Key (K_j): token j broadcasts “here is what kind of information I contain.” This is the label on each potential answer.
  • Value (V_j): the actual information that token j contributes if selected as relevant.

The attention score Q_i · K_j (dot product) measures how well token i’s question matches token j’s label. If “stiff neck” is the query and “meningitis risk” is a nearby key, the dot product will be high, indicating that “stiff neck” should pay close attention to “meningitis risk.” The softmax converts all these scores into weights that sum to 1, so each token gets a weighted blend of all other tokens’ values.
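A minimal numeric sketch of that intuition, using made-up 3-dimensional vectors: one query is compared against three keys, the dot products are softmax-normalized, and the result is a weighted blend of values.

```python
import numpy as np

q = np.array([1.0, 0.0, 1.0])            # query for token i
keys = np.array([[1.0, 0.0, 1.0],        # key closely matching the query
                 [0.0, 1.0, 0.0],        # unrelated key
                 [0.5, 0.0, 0.5]])       # partially related key
values = np.array([[10.0], [20.0], [30.0]])

scores = keys @ q                        # dot product Q_i . K_j for each j
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
blended = weights @ values               # weighted blend of the values
```

The best-matching key receives the largest weight, the unrelated key the smallest, and every weight contributes something: attention is soft selection, not a hard lookup.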

The attention formula, in full

Putting it together, the formal attention score can be read in safe notation as:

alpha_ij = softmax((Q_i · K_j) / sqrt(d_k))

Read the indices and denominator explicitly:

  • i is the token being updated (“who is asking?”).
  • j is a candidate context token (“who might provide relevant information?”).
  • Q_i · K_j is a similarity score between token i’s query and token j’s key.
  • The softmax normalizes scores over all j so attention weights sum to 1.
  • Dividing by sqrt(d_k) keeps scores numerically stable as the key dimension d_k grows.

Here, i is the token currently being updated, j is a context token, Q_i is the query vector, K_j is the key vector, and d_k is the key dimension. In a triage note, if token i corresponds to “stiffness” and token j corresponds to “neck,” a high alpha_ij indicates strong relevance in composing the updated representation.
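The full formula, vectorized over all tokens at once, can be written as a short NumPy function (a sketch, not a production implementation; the random Q, K, V matrices stand in for learned projections of real embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """alpha = softmax(Q K^T / sqrt(d_k)); output = alpha V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over j: rows sum to 1
    return alpha @ V, alpha                       # each row: blend of values

rng = np.random.default_rng(0)
n, d_k = 5, 8                                     # 5 tokens, key dimension 8
Q, K, V = rng.normal(size=(3, n, d_k))
out, alpha = scaled_dot_product_attention(Q, K, V)
```

Row i of alpha holds the weights alpha_ij for token i over every context token j, and row i of the output is token i’s updated representation.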

So what does this tell us in practice? Attention is a learned relevance map over context, allowing the model to directly connect distant but clinically related phrases.

From token relevance to model capability

One attention head can focus on local syntax, another on temporal references, another on symptom-cause phrasing. Stacking heads and layers builds increasingly abstract contextual features. Residual connections preserve gradient flow, while layer normalization stabilizes optimization across depth.
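The data flow of one such block can be sketched as follows. All weight matrices are random placeholders, so this shows only the wiring of multi-head attention, residual additions, layer normalization, and the feedforward sublayer, not trained behavior; real blocks also use masking, dropout, and learned normalization parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, heads, W_o, W_ff1, W_ff2):
    # Each head has its own Q/K/V projections and attends independently.
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(alpha @ V)
    attn = np.concatenate(outs, axis=-1) @ W_o   # merge heads
    x = layer_norm(x + attn)                     # residual + normalization
    ff = np.maximum(x @ W_ff1, 0.0) @ W_ff2      # feedforward with ReLU
    return layer_norm(x + ff)                    # residual + normalization

rng = np.random.default_rng(0)
d, d_h, n = 8, 4, 6                              # model dim, head dim, tokens
heads = [tuple(rng.normal(size=(d, d_h)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(2 * d_h, d))
W_ff1, W_ff2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
y = transformer_block(rng.normal(size=(n, d)), heads, W_o, W_ff1, W_ff2)
```

Note that the block maps a (tokens, model dimension) array to another of the same shape, which is what lets dozens of blocks stack cleanly.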

This architecture scales efficiently on modern hardware and has become the default across language, vision, and biology workloads.

Foundation model lifecycle

A foundation model is pretrained on large corpora using generic objectives (for example next-token prediction or masked token prediction). Task adaptation then happens by:

  • prompt design,
  • supervised fine-tuning,
  • preference-based alignment.

In clinical contexts, adaptation typically also requires domain grounding and strict output constraints to reduce hallucination risk.

Practical walkthrough: triage note extraction with a Transformer

Use this workflow for a clinically grounded NLP pipeline:

  1. Provide a de-identified triage note plus a structured extraction schema.
  2. Ask the model for slot outputs (symptoms, onset, vitals mentions, risk flags) with evidence spans.
  3. Validate extracted fields against ontology constraints and allowable value sets.
  4. Route uncertain or conflicting outputs for human review.
  5. Persist only validated structured entities into downstream reasoning modules.
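Steps 3 and 4 above can be sketched as a validation gate. The schema, the allowable value sets, and the model output below are hypothetical placeholders; a real pipeline would validate against the ontology from lesson 14.

```python
# Hypothetical allowable value sets per extraction field (stand-in for an
# ontology-backed schema; field names and values are made up for this sketch).
ALLOWED = {
    "symptoms": {"fever", "neck stiffness", "headache", "rash"},
    "risk_flags": {"possible_meningitis", "sepsis_watch"},
}

def validate(extracted):
    """Split model-extracted fields into validated entries and items for review."""
    validated, review = {}, {}
    for field, items in extracted.items():
        allowed = ALLOWED.get(field, set())
        ok = [v for v in items if v in allowed]
        bad = [v for v in items if v not in allowed]
        if ok:
            validated[field] = ok        # step 5: safe to persist downstream
        if bad:
            review[field] = bad          # step 4: route to human review
    return validated, review

model_output = {"symptoms": ["fever", "neck stiffness", "photophobic"],
                "risk_flags": ["possible_meningitis"]}
validated, review = validate(model_output)
```

Only values in the allowable sets pass through; anything off-schema (here, the invented term "photophobic") is held for human review rather than silently persisted.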

What this means in practice: Transformers are strongest when coupled with schema constraints and verification layers, not treated as autonomous truth generators.

Relation to earlier lessons

  1. Lessons 11-13 built neural representation and optimization foundations.
  2. Lesson 14 added explicit symbolic structure for consistency and interoperability.
  3. Lesson 15 combines scale-driven neural sequence modeling with those safety and structure concerns.

Concrete bridge: CNNs exploited spatial locality in images; Transformers exploit learned relational locality in sequences.

Notation quick reference

  • Q_i: query vector for token i (see: Why did attention replace recurrent models?)
  • K_j: key vector for token j (see: Why did attention replace recurrent models?)
  • alpha_ij: attention weight from token i to token j (see: The attention formula, in full)
  • d_k: key vector dimensionality (see: Why did attention replace recurrent models?)
  • Head: one parallel attention channel (see: From token relevance to model capability)
  • Residual: skip connection around a sublayer (see: From token relevance to model capability)
  • Foundation model: large pretrained adaptable model (see: Foundation model lifecycle)
  • Alignment: post-pretraining behavior shaping (see: Foundation model lifecycle)


What comes next

In lesson 16, we turn to probabilistic AI and explicit uncertainty handling, moving from sequence representation to calibrated belief updates under evidence.


References and Further Reading

  • Vaswani, A. et al. “Attention Is All You Need.” NeurIPS, 2017.
  • Devlin, J. et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL, 2019.
  • Brown, T. et al. “Language Models are Few-Shot Learners.” NeurIPS, 2020.

This is Lesson 15 of 18 in the AI Starter Course.