Transformers & Attention: The Death of Sequence

Before 2017, the dominant neural language models read text the way a person reads a book: one word at a time, from left to right. This was the era of RNNs and LSTMs. The 2017 paper "Attention Is All You Need" changed that by introducing the Transformer, an architecture that processes every word in a sentence in parallel, effectively "killing" the sequence.

"The Transformer allows the model to weigh the importance of different words in a sentence regardless of their distance from each other, enabling true global context."

1. The Attention Mechanism

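In recurrent models, information about early words has to survive inside a fixed-size hidden state, so by the end of a long sentence the model has often "forgotten" the beginning. Training makes this worse: gradients shrink as they are propagated back through many time steps (the vanishing gradient problem), so long-range dependencies are hard to learn. Self-attention sidesteps both issues by letting every word in a sequence "attend" directly to every other word.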

Global Context

Consider the sentence: "The animal didn't cross the street because it was too tired."
What does "it" refer to? The animal or the street?
A trained Transformer can assign a high attention weight between "it" and "animal," resolving the reference directly instead of passing information through a linear chain of intermediate words.

2. The Mathematical Engine: QKV

To implement attention, a Transformer derives three vectors for every word (more precisely, every token) by multiplying its embedding by three learned weight matrices: Query (Q), Key (K), and Value (V). Think of this like a database search.

The Retrieval Analogy

1. Query (Q): "What am I looking for?" (The current word's request).
2. Key (K): "What do I contain?" (The label other words use to find this word).
3. Value (V): "What information do I actually hold?" (The content to be extracted).

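The model calculates a score for each pair of words by taking the dot product of one word's Query with the other word's Key. The scores are divided by the square root of the key dimension and passed through a softmax, which turns them into weights that sum to 1. The higher the weight, the more "attention" is paid to that word's Value, and the output for each word is the weighted sum of all the Values. This calculation is repeated for every pair of tokens, in every attention head and every layer, and it is the core operation of a GPT model.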
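The score-and-weight calculation can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular model's implementation: the sizes, the random embeddings `X`, and the projection matrices `W_q`, `W_k`, and `W_v` are placeholders chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len): every token vs. every token
    weights = softmax(scores, axis=-1)  # each row becomes a distribution that sums to 1
    output = weights @ V                # weighted sum of the Value vectors
    return output, weights

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
W_q = rng.normal(size=(d_model, d_k))    # stand-ins for learned projections
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.shape)  # (5, 5)
```

Row i of `weights` shows how strongly token i attends to every token in the sequence, including itself.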

3. The Problem of Order: Positional Encoding

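Because Transformers process all words at once, self-attention on its own is blind to order: shuffle the input words and each word receives exactly the same output as before. Without extra information, the model would behave much like a "bag-of-words" model, with no idea which word comes first or last. To fix this, Transformers use Positional Encoding.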
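The lack of built-in order can be checked directly. The sketch below, which assumes a single attention head with random placeholder weights and no positional information, shuffles the input rows and confirms that the outputs are the same vectors in shuffled order.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Single-head self-attention with no positional information.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy inputs (illustrative assumptions, not real embeddings).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))  # 4 "words", 6 dimensions each
W_q, W_k, W_v = (rng.normal(size=(6, 6)) for _ in range(3))

perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the 4 words
out_original = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Each word gets the same output vector regardless of where it sits.
order_blind = np.allclose(out_shuffled, out_original[perm])
print(order_blind)  # True
```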

Feature    | RNN / LSTM                                       | Transformer
Processing | Sequential (one by one)                          | Parallel (all at once)
Memory     | Short-term / fading                              | Global (within the context window)
Speed      | Slow (cannot be parallelized across time steps)  | Fast (GPU optimized)
Ordering   | Implicit (from processing order)                 | Explicit (via Positional Encoding)

The original Transformer does this by adding a fixed pattern of sine and cosine waves, each at a different frequency, to the word vectors. Every position gets a unique signature, so the model can "feel" where each word sits while still benefiting from parallel processing.
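A minimal sketch of the sine-and-cosine scheme follows, using the formulation from "Attention Is All You Need": even dimensions take a sine, odd dimensions a cosine, and the wavelengths grow geometrically with base 10000. The function name and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed positional encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos dimensions")
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Toy usage: add the encoding to placeholder word vectors.
seq_len, d_model = 10, 16
word_vectors = np.zeros((seq_len, d_model))  # stand-in for real embeddings
pe = sinusoidal_positional_encoding(seq_len, d_model)
encoded = word_vectors + pe
print(pe.shape)  # (10, 16)
```

Because the encoding is added rather than concatenated, the vector size stays at `d_model` and the attention layers need no changes.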

4. The Scaling Laws & Emergence

The true power of the Transformer is that it is highly scalable. Because its training parallelizes well across thousands of GPUs, researchers have been able to grow parameter counts from millions to hundreds of billions, with some models reported in the trillions.
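To get a feel for how quickly parameter counts grow, a common back-of-the-envelope estimate puts the non-embedding parameters of a Transformer stack at roughly 12 × layers × width². The sketch below assumes a feed-forward width of 4 × `d_model` and ignores embeddings, biases, and layer norms. The two configurations are illustrative and not tied to any specific model.

```python
def estimate_non_embedding_params(n_layers, d_model):
    """Rough parameter count for a stack of Transformer blocks.

    Assumes per block: 4 * d_model^2 for the attention projections (Q, K, V, output)
    and 8 * d_model^2 for a feed-forward network with hidden width 4 * d_model.
    """
    attention_params = 4 * d_model ** 2
    feed_forward_params = 8 * d_model ** 2
    return n_layers * (attention_params + feed_forward_params)

# Hypothetical configurations, chosen only to show the growth.
small = estimate_non_embedding_params(n_layers=6, d_model=512)
large = estimate_non_embedding_params(n_layers=96, d_model=12288)

print(f"small: {small:,}")
print(f"large: {large:,}")
```

Width enters the estimate quadratically, so doubling `d_model` roughly quadruples the parameter count, while doubling the number of layers only doubles it.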

Emergent Properties

Researchers have described a phenomenon called "emergence." As these models scaled, they began to show abilities they weren't explicitly trained for, sometimes abruptly, such as writing code, solving logic puzzles, and translating between languages they had seen very little of. These abilities are commonly attributed to the Transformer's capacity to find complex, high-dimensional patterns across massive datasets.
