- Q = Queries (shape: seq_len × d_k)
- K = Keys (shape: seq_len × d_k)
- V = Values (shape: seq_len × d_v)
- Compute raw attention scores between each query and key using the dot product:
  $\text{score} = QK^T$
- Scale the scores to stabilize gradients:
  $\text{score}_{\text{scaled}} = \frac{QK^T}{\sqrt{d_k}}$
- Apply softmax to normalize the scaled scores into a probability distribution:
  $\text{weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$
- Compute the weighted sum of values using these attention weights:
  $\text{output} = \text{weights} \cdot V$
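The four steps above can be sketched in NumPy. This is a minimal illustration, not code from the source; the function names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # raw scores: (seq_len, seq_len)
    scaled = scores / np.sqrt(d_k)      # scale to stabilize gradients
    weights = softmax(scaled, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of values: (seq_len, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each output row is a convex combination of the value vectors, with mixing proportions set by how well that query matches each key.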
- Project the input (the same vectors used for Q, K, V) into multiple lower-dimensional spaces (heads).
- Perform attention independently in each head.
- Concatenate the outputs from all heads.
- Apply a final linear projection to combine them into a single output.
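The project–attend–concatenate–project pipeline can be sketched as follows. This is an assumption-laden sketch (single sequence, no masking, no biases), and the matrix names `W_q`, `W_k`, `W_v`, `W_o` are my own labels for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (seq_len, d_model); each projection matrix: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project the input, then split the result into n_heads lower-dim heads.
    def project_and_split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = (project_and_split(W) for W in (W_q, W_k, W_v))
    # Scaled dot-product attention independently in each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # (n_heads, seq, d_head)
    # Concatenate heads, then apply the final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 16)
```

Splitting d_model into heads keeps total computation roughly constant while letting each head attend to a different subspace of the representation.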
- Encoder Self-Attention:
- Decoder Self-Attention:
- Encoder-Decoder Attention:
- Language is sequential → requires modeling dependencies between tokens.
- Traditional models used recurrence (RNNs/LSTMs), which are slow and hard to parallelize.
- Attention computes contextual relevance between tokens, regardless of position.
- The Transformer uses only attention, structured hierarchically in layers.
- This architecture learns deep contextual embeddings for sequences more efficiently.
- Attention modulates the gain (signal strength) of neurons representing specific features, locations, or tasks.
- It does so using top-down signals from higher-order regions (like prefrontal cortex) to modulate lower sensory areas (like V1, V4, or MT).
- This increases the signal-to-noise ratio and enables priority encoding in working-memory and decision-making circuits.
- Prefrontal Cortex (PFC):
- Posterior Parietal Cortex (PPC):
- Sensory Cortices (e.g., V1, V4, A1):
- Thalamus (especially the pulvinar nucleus):
  - Acts as a gatekeeper, regulating which sensory signals reach the cortex and are prioritized.
  - May be functionally analogous to the softmax mechanism, filtering what passes through.
- Acetylcholine: Enhances signal precision in primary sensory cortices; sharpens attention.
- Norepinephrine: Increases alertness and arousal; modulates responsiveness.
- Dopamine: Modulates salience and reward prediction, often influencing which stimuli gain attention.
- Computes contextual relevance between tokens.
- Uses Q–K–V triplets to determine which tokens matter.
- Dynamically weights and aggregates representations.
- Computes contextual salience of stimuli.
- Uses PFC/PPC to direct attention to relevant sensory inputs.
- Dynamically modulates neural firing and connectivity to enhance relevant information.
- Use target-driven modulation (task or prompt).
- Rely on contextual comparison to filter and weight input.
- Are resource-limited, optimizing processing by allocating computation efficiently.
- Limitations of Recurrent Neural Networks (RNNs):
- Success of Attention Mechanisms in Seq2Seq Models:
- Prior work (e.g., Bahdanau et al., 2015) added attention on top of RNNs, showing significant performance gains.
- Attention enabled the decoder to dynamically "look back" over input tokens, which proved both more effective and more interpretable.
- Hypothesis: If attention is so powerful on top of RNNs, why not remove recurrence entirely and use attention everywhere?
- Makes no reference to neuroscience, cognitive science, or biological attention.
- Frames attention purely in mathematical and engineering terms.
- Focuses on efficiency, scalability, and empirical performance, not on brain-like architecture.
- Neural machine translation,
- Seq2seq modeling,
- Positional encoding for sequence order (since recurrence was removed),
- And multi-head attention to increase representational capacity.
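Since removing recurrence discards order information, the paper injects it with sinusoidal positional encodings. A minimal sketch of that scheme (the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(50, 64)
print(pe.shape)  # (50, 64)
```

The encoding is simply added to the token embeddings; each dimension oscillates at a different frequency, so every position gets a unique, order-aware signature.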
- The design is an example of convergent evolution: both biological and artificial systems independently discovered context-sensitive weighting of inputs as a superior solution to sparse, serial processing.
- The authors were solving for parallelization, long-range dependency handling, and modularity, not cognitive plausibility.
- Comparing multi-head attention to distributed attention systems (e.g., dorsal/ventral streams).
- Mapping attention layers to cortical hierarchies.
- Investigating shared properties like sparsity, locality, and top-down modulation.
- Problem: RNNs were inefficient and struggled with context.
- Prior Success: Attention boosted RNN performance.
- Solution: Eliminate recurrence entirely and rely solely on attention.
- Result: The Transformer, which proved empirically superior, parallelizable, and general-purpose.
- Neurobiological similarity: Emerged post hoc as an interesting equivalency, not an original design goal.