On attention and transformers

This is part #9 of my notes on CS231n. The course is openly available, including the video lectures and assignments.

These notes are based on this lecture by Justin Johnson.


These notes build on top of RNNs. They introduce the attention mechanism as it first emerged in recurrent encoder-decoder architectures.

Below we have an encoder RNN that processes an input sequence $x_1, \dots, x_i$ via $h_i = f_W(x_i, h_{i-1})$. We take the last hidden state $h_i$ and call it the context $c = h_i$. Then the decoder RNN $s_t = g_U(y_{t-1}, s_{t-1}, c)$ produces the output sequence $y_1, \dots, y_t$.

[Figure: encoder RNN mapping $x_1 \dots x_i$ to hidden states $h_1 \dots h_i$, the context $c = h_i$, and a decoder RNN unrolled from [START] to [STOP]]

As the input sequence grows, the context $c$ becomes a bottleneck: there is a limit to what a single fixed-size vector can "compress".

To address this problem, attention emerges from asking: what if we could look back at the entire input sequence at each step $t$ of the output sequence?

And perhaps more concretely: what if we recomputed the context vector $c$ at every step of the output sequence, selectively attending to the input that conditions that output step?

To look back at the input sequence, we compute a scalar score for each step of the input sequence that tells us how related that encoder hidden state is to the decoder hidden state. Mathematically: $e_{t,i} = f_{att}(s_{t-1}, h_i)$, where $f_{att}$ could be a linear layer. We apply a softmax to convert the scores into a probability distribution $a$, i.e. each weight is between 0 and 1 and they sum to 1. And we take the weighted sum of the hidden states $c_t = \sum_i a_{t,i} h_i$ to compute the context vector $c_t$. This way we have a different $c$ at each step $t$. Just like before, $c_t$ is used to compute the next decoder hidden state $s_t = g_U(y_{t-1}, s_{t-1}, c_t)$. $g_U$ here is any RNN unit (e.g. LSTM, GRU, ...).

[Figure: decoder step 1 — scores $e_{1,i} = f_{att}(s_0, h_i)$ are softmaxed into weights $a_{1,i}$, which multiply the encoder states $h_i$ and sum into the context $c_1$]

With this mechanism, and via gradient descent, the network learns to make the context vector attend to the relevant part of the input sequence. (Bahdanau, Cho & Bengio, 2014)
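As a concrete sketch of one decoder step (NumPy, with toy shapes; for simplicity $f_{att}$ is a plain dot product here, though the lecture allows any small learned network):

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# toy shapes: 4 encoder steps, hidden size 8
H = np.random.randn(4, 8)        # encoder hidden states h_1..h_4
s_prev = np.random.randn(8)      # previous decoder state s_{t-1}

e = H @ s_prev                   # scores e_{t,i} = f_att(s_{t-1}, h_i), dot product here
a = softmax(e)                   # attention weights: between 0 and 1, sum to 1
c_t = a @ H                      # context c_t = sum_i a_{t,i} h_i, shape (8,)
```

The decoder would then consume `c_t` together with $y_{t-1}$ and $s_{t-1}$ to produce the next state $s_t$.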

At step 2, note we use $s_1$ instead of $s_0$ to attend over the input sequence and compute $c_2$:

[Figure: decoder step 2 — scores $e_{2,i}$ computed from $s_1$, softmaxed into weights $a_{2,i}$, which weight the $h_i$ and sum into $c_2$]

We repeat this process for each step of the output sequence.

This removes the single-vector bottleneck, and on top of that, the context vector at each decoder timestep can attend to different parts of the input sequence.

This ability to attend selectively is very powerful, so let's abstract it into its own primitive and cut the RNN out.

Let's put some names on what the attention mechanism is actually doing:

  • Let's call data vectors $X \in \mathbb{R}^{N_X\times D_Q}$ what used to be the encoder RNN states $h$.

  • Let's call query vectors $Q \in \mathbb{R}^{N_Q\times D_Q}$ what used to be the decoder RNN states $s$.

  • Let's call output vectors $Y \in \mathbb{R}^{N_Q\times D_X}$ what used to be our context vector $c$. $Y$ is computed

    $Y = AX$

    $Y_i = \sum_j A_{ij} X_j$

    as a result of:

    • Computing data-query similarities, what used to be $e$:

      $E = QX^T / \sqrt{D_Q} \in \mathbb{R}^{N_Q\times N_X}$

      $E_{ij} = Q_i \cdot X_j / \sqrt{D_Q}$

    • Computing the attention-weight probability distribution, what used to be $a$:

      $A = \mathrm{softmax}(E) \in \mathbb{R}^{N_Q\times N_X}$, where each row of $A$ sums to 1
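In matrix form, the definitions above amount to three lines of NumPy (toy shapes; random vectors stand in for real data):

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=-1, keepdims=True)   # stability: subtract per-row max
    P = np.exp(E)
    return P / P.sum(axis=-1, keepdims=True)

N_X, N_Q, D_Q = 5, 3, 8
X = np.random.randn(N_X, D_Q)   # data vectors (the old encoder states)
Q = np.random.randn(N_Q, D_Q)   # query vectors (the old decoder states)

E = Q @ X.T / np.sqrt(D_Q)      # similarities, shape (N_Q, N_X)
A = softmax_rows(E)             # each row sums to 1
Y = A @ X                       # outputs, shape (N_Q, D_Q)
```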

Because the data vectors $X$ are used both to compute the query similarities $E$ and to provide the content that gets aggregated into the output $Y$, we project $X$ into:

  • Keys $K = XW_K \in \mathbb{R}^{N_X\times D_Q}$, which we will use to compute the similarities:

    $E = QK^T / \sqrt{D_Q} \in \mathbb{R}^{N_Q\times N_X}$

  • Values $V = XW_V \in \mathbb{R}^{N_X\times D_V}$, used in the output vector:

    $Y = AV \in \mathbb{R}^{N_Q\times D_V}$

This lets the model learn one representation for where to attend (keys) and another for what information to return (values). $W_K \in \mathbb{R}^{D_X\times D_Q}$ and $W_V \in \mathbb{R}^{D_X\times D_V}$ are another set of learnable matrices.

With the RNN stripped out and this new naming convention, the Cross-Attention Layer looks like:

$\colorbox{#b2f2bb}{Q}$ is the given query vector input

$\colorbox{#a5d8ff}{X}$ is the given data vector input

$\colorbox{#ffd8a8}{K} = XW_K$

$\colorbox{#d0bfff}{V} = XW_V$

$E = QK^T / \sqrt{D_Q}$

$A = \mathrm{softmax}(E, \text{dim}=1)$

$\colorbox{#ffec99}{Y} = AV$
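Putting the pieces together, a minimal cross-attention layer in NumPy (the weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=-1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=-1, keepdims=True)

def cross_attention(Q, X, W_K, W_V):
    D_Q = Q.shape[-1]
    K = X @ W_K                    # keys    (N_X, D_Q)
    V = X @ W_V                    # values  (N_X, D_V)
    E = Q @ K.T / np.sqrt(D_Q)     # similarities (N_Q, N_X)
    A = softmax_rows(E)            # attention weights, rows sum to 1
    return A @ V                   # output  (N_Q, D_V)

N_X, N_Q, D_X, D_Q, D_V = 5, 3, 6, 8, 4
Q = np.random.randn(N_Q, D_Q)
X = np.random.randn(N_X, D_X)
Y = cross_attention(Q, X, np.random.randn(D_X, D_Q), np.random.randn(D_X, D_V))
```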

[Figure: cross-attention layer — queries $Q_i$ and keys $K_j$ produce scores $E_{i,j}$, softmaxed into weights $A_{i,j}$, which multiply the values $V_j$ and sum into outputs $Y_i$]

Another variant is the Self-Attention Layer, where the only input is $X$:

$\colorbox{#a5d8ff}{X} \in \mathbb{R}^{N\times D_{in}}$ is the given data vector input

$\colorbox{#b2f2bb}{Q} = XW_Q \in \mathbb{R}^{N\times D_{out}}$, $W_Q \in \mathbb{R}^{D_{in}\times D_{out}}$ being a new learnable matrix

$\colorbox{#ffd8a8}{K} = XW_K \in \mathbb{R}^{N\times D_{out}}$, $W_K \in \mathbb{R}^{D_{in}\times D_{out}}$

$\colorbox{#d0bfff}{V} = XW_V \in \mathbb{R}^{N\times D_{out}}$, $W_V \in \mathbb{R}^{D_{in}\times D_{out}}$

$E = QK^T / \sqrt{D_{out}} \in \mathbb{R}^{N\times N}$

$A = \mathrm{softmax}(E, \text{dim}=1) \in \mathbb{R}^{N\times N}$

$\colorbox{#ffec99}{Y} = AV \in \mathbb{R}^{N\times D_{out}}$

[Figure: self-attention layer — same computation as cross-attention, but $Q$, $K$, and $V$ are all projected from $X$]

In practice we usually use multi-head attention, where the model computes several attention heads ($H$) in parallel, each with its own learned projections:

$\colorbox{#a5d8ff}{X} \in \mathbb{R}^{N\times D}$

$\colorbox{#b2f2bb}{Q} = XW_Q \in \mathbb{R}^{H \times N\times D_{H}}$, $W_Q \in \mathbb{R}^{D\times HD_{H}}$

$\colorbox{#ffd8a8}{K} = XW_K \in \mathbb{R}^{H \times N\times D_{H}}$, $W_K \in \mathbb{R}^{D\times HD_{H}}$

$\colorbox{#d0bfff}{V} = XW_V \in \mathbb{R}^{H \times N\times D_{H}}$, $W_V \in \mathbb{R}^{D\times HD_{H}}$

$E = \colorbox{#b2f2bb}{Q}\,\colorbox{#ffd8a8}{K}^T / \sqrt{D_H} \in \mathbb{R}^{H\times N \times N}$

$A = \mathrm{softmax}(E, \text{dim}=1) \in \mathbb{R}^{H\times N \times N}$

$\colorbox{#ffec99}{Y} = A\,\colorbox{#d0bfff}{V} \in \mathbb{R}^{N\times HD_{H}}$

$\colorbox{#ff8787}{O} = \colorbox{#ffec99}{Y}W_O \in \mathbb{R}^{N\times D}$, $W_O \in \mathbb{R}^{HD_H\times D}$ being a new learnable matrix to fuse the output of each head.

$D_{H}$ is the head dimension; usually $D_{H} = D / H$.

[Figure: multi-head attention with e.g. $H = 3$ — three self-attention heads run on $X$ in parallel, and their outputs $Y_h$ are concatenated and fused into $O$]

Perhaps surprisingly, it can all be computed with 4 big matmuls:

  1. QKV Projection

    $[\,\colorbox{#b2f2bb}{Q} \;|\; \colorbox{#ffd8a8}{K} \;|\; \colorbox{#d0bfff}{V}\,] = \colorbox{#a5d8ff}{X}\,[\,W_Q \;|\; W_K \;|\; W_V\,]$

    $[\,Q \;|\; K \;|\; V\,] \in \mathbb{R}^{N \times 3HD_H} = (N \times D)\,(D \times 3HD_H)$

  2. QK Similarity

    $E = \colorbox{#b2f2bb}{Q}\,\colorbox{#ffd8a8}{K}^T$

    $E \in \mathbb{R}^{H \times N \times N} = (H\times N\times D_H)\,(H\times D_H\times N)$

  3. V-Weighting

    $\colorbox{#ffec99}{Y} = A\,\colorbox{#d0bfff}{V}$

    $Y \in \mathbb{R}^{H\times N\times D_H} = (H\times N\times N)\,(H\times N\times D_H)$

    Reshape to $\colorbox{#ffec99}{Y} \in \mathbb{R}^{N \times HD_H}$

  4. Output Projection

    $\colorbox{#ff8787}{O} = \colorbox{#ffec99}{Y}\,W_O$

    $O \in \mathbb{R}^{N \times D} = (N \times HD_H)\,(HD_H \times D)$
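Those four matmuls can be sketched in NumPy (toy sizes; the fused QKV weight and output weight are random stand-ins for learned parameters):

```python
import numpy as np

def softmax_last(E):
    E = E - E.max(axis=-1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=-1, keepdims=True)

N, D, H = 4, 12, 3
D_H = D // H
X = np.random.randn(N, D)
W_qkv = np.random.randn(D, 3 * H * D_H)   # [W_Q | W_K | W_V] fused
W_O = np.random.randn(H * D_H, D)

# 1. QKV projection: one big matmul, then split and reshape each to (H, N, D_H)
QKV = X @ W_qkv
Q, K, V = (M.reshape(N, H, D_H).transpose(1, 0, 2) for M in np.split(QKV, 3, axis=-1))

# 2. QK similarity, batched over heads
E = Q @ K.transpose(0, 2, 1) / np.sqrt(D_H)   # (H, N, N)
A = softmax_last(E)

# 3. V-weighting, then reshape back to (N, H*D_H)
Y = (A @ V).transpose(1, 0, 2).reshape(N, H * D_H)

# 4. Output projection
O = Y @ W_O                                   # (N, D)
```

NumPy's `@` broadcasts over the leading head dimension, so steps 2 and 3 are a single batched matmul each.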


The Transformer

A Transformer block first applies multi-head self-attention then adds a residual connection and layer-normalizes; it then applies a position-wise feed-forward network (FFN/MLP) independently to each vector, followed by another residual addition and a final layer normalization:

[Figure: a Transformer block — inputs $x_1 \dots x_4$ pass through self-attention with a residual add and Layer Normalization, then a per-vector MLP with another residual add and Layer Normalization, producing $y_1 \dots y_4$]

A Transformer (Vaswani et al., 2017) is built by stacking multiple Transformer blocks sequentially:

[Figure: three Transformer blocks stacked sequentially]

That is the post-LN (LayerNorm after the sublayer) description. Many modern GPT-style models are pre-LN (LayerNorm before the sublayer).
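The two orderings can be sketched as follows (NumPy; the attention and MLP sublayers are random stand-ins, just to show the wiring of residuals and LayerNorm):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance (learnable scale/shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, attn, mlp):
    # original Transformer ordering: sublayer -> residual add -> LayerNorm
    x = layer_norm(x + attn(x))
    x = layer_norm(x + mlp(x))
    return x

def pre_ln_block(x, attn, mlp):
    # GPT-style ordering: LayerNorm -> sublayer -> residual add
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# stand-ins for the real sublayers
attn = lambda x: x @ (np.random.randn(x.shape[-1], x.shape[-1]) * 0.1)
mlp = lambda x: np.maximum(0, x @ (np.random.randn(x.shape[-1], x.shape[-1]) * 0.1))
out = post_ln_block(np.random.randn(4, 8), attn, mlp)
```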

Their size has only grown:

Original: 12 blocks; D=1024, H=16, N=512, 213M params

GPT-2: 48 blocks; D=1600, H=25, N=1024, 1.5B params

GPT-3: 96 blocks; D=12288, H=96, N=2048, 175B params


Transformers are being successfully applied beyond text to images with ViT (Dosovitskiy et al., 2020), to video with ViViT, to multimodal image-text learning with CLIP, and also to audio, protein sequences, time series...

Since its introduction, the Transformer has been extended with many architectural and domain-specific refinements, but the core ideas presented here remain the essential conceptual starting point.

One variation worth mentioning is Mixture of Experts (MoE). Instead of a single dense feed-forward network in each block, we learn E different MLPs, called experts. A routing function sends each token to a subset of them. This allows the model to scale total parameter count significantly while keeping per-token compute much lower than if all experts were active.
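A toy sketch of top-1 routing (NumPy; the router and experts are random stand-ins for learned parameters, and real MoE layers add load-balancing details omitted here):

```python
import numpy as np

N, D, num_experts = 6, 8, 4
X = np.random.randn(N, D)
# each expert is its own tiny MLP; the default-arg trick gives each its own weights
experts = [lambda x, W=np.random.randn(D, D): np.maximum(0, x @ W)
           for _ in range(num_experts)]
W_router = np.random.randn(D, num_experts)

scores = X @ W_router              # router logits, (N, num_experts)
choice = scores.argmax(axis=-1)    # top-1 routing: one expert per token

Y = np.empty_like(X)
for e in range(num_experts):
    mask = choice == e
    if mask.any():
        Y[mask] = experts[e](X[mask])   # only the chosen expert runs per token
```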

Other important refinements include causal masking, which restricts token $t$ from attending to future tokens and is central to the autoregressive decoder in the original Transformer, and Rotary Position Embeddings (RoPE), which encode positional information directly into attention.
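Causal masking amounts to setting the scores above the diagonal to $-\infty$ before the softmax, so the attention weights on future positions become exactly zero. A NumPy sketch:

```python
import numpy as np

N = 4
E = np.random.randn(N, N)                          # raw similarity scores
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal (future)
E[mask] = -np.inf                                  # block attention to the future

A = np.exp(E - E.max(axis=-1, keepdims=True))      # softmax over each row
A = A / A.sum(axis=-1, keepdims=True)              # rows sum to 1; A[i, j] == 0 for j > i
```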
