There’s a very clean line in AI history.
Before 2017. And after 2017.
Before 2017, models struggled through language the way a tired intern reads legal fine print. One word at a time. Sequential. Slow. Painful. These were Recurrent Neural Networks, or RNNs, and while they were clever, they were fundamentally bottlenecked by time.
Then eight researchers at Google published a paper with a deceptively calm title: Attention Is All You Need.
That paper didn’t just introduce a new architecture. It flipped the table.
And if you’re using GPT, Claude, DeepSeek, Gemini, or anything remotely modern, you’re living inside that decision.
From Sequential Struggle to Parallel Power
Here’s what most people miss.
RNNs processed language sequentially. That means if the model wanted to understand the last word in a sentence, it had to grind through every word before it. That design made it nearly impossible to fully exploit modern GPUs, which are built for parallel computation.
The Transformer asked a simple question:
What if we processed everything at once?
Instead of crawling through text word by word, the Transformer looks at the entire sequence simultaneously. All tokens. All relationships. All at once.
Brilliant.
But that creates a problem.
If you remove order, how do you distinguish between:
“The dog bit the man”
and
“The man bit the dog”
That’s not a small detail.
The Positional Encoding Hack
The solution wasn’t a simple counter like 1, 2, 3, 4.
That would fail the moment a sequence exceeded the lengths seen during training.
Instead, the authors used sine and cosine wave functions to encode position. Yes, trigonometry. Not because they were trying to impress anyone, but because periodic functions allow extrapolation beyond the training range.
That’s the hack.
By injecting sinusoidal patterns into word embeddings, the model gains a sense of distance and order without ever reading sequentially.
No recurrence. No loops. Just mathematical signals layered on top of embeddings.
And suddenly, order exists again.
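The trick is simple enough to sketch in a few lines. Here's a minimal pure-Python version of the sinusoidal encoding from the paper (assuming an even embedding dimension; real implementations vectorize this across all positions at once):

```python
import math

def sinusoidal_encoding(position, d_model):
    """Positional encoding from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    vec = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

# Each position gets a unique fingerprint; because the functions are
# periodic and bounded, positions beyond anything seen in training
# still produce well-behaved vectors.
pe0 = sinusoidal_encoding(0, 8)
```

Because the waves never run out, position 100,000 is just as encodable as position 10, which is exactly why a plain counter was rejected.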
This was the first sign that the Transformer wasn’t just an architecture. It was a series of elegant engineering shortcuts that bent constraints without breaking them.
Q, K, V: The Asymmetric Search Engine
Let’s talk about attention.
Scaled Dot-Product Attention sounds academic. But the intuition is clean.
Think of it as an internal search engine.
- Query: What am I trying to understand right now?
- Key: What signals are available in the rest of the sentence?
- Value: What meaning do those signals carry?
For every word, the model computes how strongly the Query aligns with each Key. That alignment score determines how much of each Value flows into the output representation.
Here’s the subtle part most people gloss over.
All three matrices start from the same input. But they are trained to behave differently.
The Query seeks.
The Key invites.
The Value delivers.
That broken symmetry is the magic.
It’s how the model knows that in:
“The animal didn’t cross the street because it was too tired.”
the word “it” should align with “animal” and not “street.”
It’s not reading. It’s weighting relationships across the entire context window simultaneously.
That’s attention.
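The whole mechanism fits in a short sketch. Here's a pure-Python version of scaled dot-product attention for a single query vector, a toy illustration rather than a production kernel:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attention for one query. Q is a query vector;
    K and V are lists of key and value vectors."""
    d_k = len(Q)
    # Alignment score: how strongly the query matches each key,
    # scaled by sqrt(d_k) to keep dot products in a stable range.
    scores = [sum(q * k for q, k in zip(Q, key)) / math.sqrt(d_k) for key in K]
    # Softmax turns scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output: a weighted blend of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))]

Q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(Q, K, V)
# The query matches the first key most strongly, so the output
# leans toward the first value vector.
```

In a real Transformer this runs for every token against every other token, in parallel, across multiple heads, but the core arithmetic is exactly this.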
Yes, The Beatles Were Involved
For something that now powers global AI infrastructure, the origins are almost playful.
“Attention Is All You Need” is a nod to the Beatles’ “All You Need Is Love.”
The name “Transformer” was not selected through a corporate brand workshop. Jakob Uszkoreit simply liked how it sounded.
An early internal document was even illustrated with characters from the 1980s Transformers franchise.
They leaned into it. They called themselves Team Transformer.
And now the name sits at the core of every frontier model on Earth.
Sometimes revolutions start casually.
The Working Memory Cheat Code: KV Caching
Here’s something that surprises most executives.
When you’re chatting with a model, it doesn’t remember in the human sense. It reprocesses the entire conversation every time it generates a new token.
Without optimization, inference cost would explode as conversations grow.
Enter KV caching.
Instead of recomputing all Keys and Values for the full history at every step, the model stores them in memory. When generating the next token, it only computes the newest Key and Value and reuses the rest.
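The difference is easy to see in a toy model. In the sketch below, `kv_for` is a hypothetical stand-in for the learned key/value projections; the point is only how much work each strategy does:

```python
def kv_for(token):
    # Hypothetical stand-in for the learned key/value projection layers.
    return (len(token), len(token) * 2)

def attend_no_cache(history):
    """Without caching: recompute K and V for the entire history
    at every generation step."""
    work = 0
    for step in range(1, len(history) + 1):
        kvs = [kv_for(t) for t in history[:step]]  # recomputed each step
        work += len(kvs)
    return work

def attend_with_cache(history):
    """With a KV cache: compute K and V once per token, reuse forever."""
    cache, work = [], 0
    for token in history:
        cache.append(kv_for(token))  # only the newest token
        work += 1
    return work

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Without a cache, projection work grows quadratically with length;
# with one, it grows linearly.
```

Six tokens is trivial either way. At a hundred thousand tokens, the gap between quadratic and linear recomputation is the difference between usable and unusable.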
This can make inference roughly three times faster.
But there’s a trade-off.
Compute drops. Memory demand rises.
The cost of one additional token per layer scales roughly as:
12·d_emb² + 2·n·d_emb
That n term is the killer. As context length grows, memory consumption grows linearly.
Which means those million-token context windows people crave are constrained less by compute and more by VRAM.
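Plugging numbers into that formula makes the point concrete. A sketch, using an illustrative d_emb of 4096:

```python
def cost_per_token(d_emb, n):
    """Rough per-layer cost of one additional token, using the
    article's estimate: 12*d_emb^2 (fixed projection work) plus
    2*n*d_emb (work that scales with context length n)."""
    return 12 * d_emb**2 + 2 * n * d_emb

d = 4096  # illustrative embedding width, not any specific model's
short_ctx = cost_per_token(d, 4_096)
long_ctx = cost_per_token(d, 1_000_000)
# The fixed 12*d^2 term is identical in both cases; only the 2*n*d
# term grows, and at a million tokens of context it dominates.
```

At short context, the fixed term dominates. At a million tokens, the n term is roughly 40 times larger than everything else, which is why long context is a memory problem before it is a compute problem.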
And if you’ve been following Kuware, you already know what I’m going to say.
VRAM is king.
This is exactly why production systems increasingly rely on retrieval-based architectures instead of endlessly expanding context windows.
The Quadratic Crisis
There’s a deeper structural issue.
Every token attends to every other token.
That means attention scales at O(n²).
Double the context. Quadruple the work.
That’s the Quadratic Crisis.
For years, we’ve tried to engineer around it.
FlashAttention reduced memory bottlenecks by chunking computations more efficiently. Reformer used locality-sensitive hashing to thin the matrix. Longformer applied local attention patterns to reduce compute.
But fundamentally, the quadratic nature remains.
Which is why a new class of architectures is gaining traction.
Mamba and the Rise of Selection
Mamba introduces something different.
It’s a Selective State Space Model.
Instead of letting everything attend to everything, it compresses information into an internal state. And it decides what to keep and what to discard.
That’s the key word. Selection.
By mastering forgetting, Mamba scales linearly, O(n), rather than quadratically.
It can process sequences approaching a million tokens without the same explosion in cost.
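The selection idea can be sketched in miniature. In the toy below, a hand-written `keep` predicate stands in for the learned gates a real selective state space model trains; the point is the shape of the computation, not fidelity to Mamba:

```python
def selective_scan(tokens, keep):
    """Toy selective recurrence: a single scalar state that each step
    either updates with the new token or leaves untouched. Real models
    learn these gates per input; `keep` is a hand-written stand-in."""
    state = 0.0
    for t in tokens:
        if keep(t):                # gate open: write into the state
            state = 0.5 * state + t
        # gate closed: the token is forgotten, the state is untouched
    return state

state = selective_scan([1.0, 100.0, 2.0], keep=lambda t: t < 10)
# The outlier 100.0 is gated out entirely; only 1.0 and 2.0 shape the state.
```

One pass over the sequence, constant-size state. That's the structural reason the cost is O(n) instead of O(n²): nothing ever looks back at the full history, because the history has already been compressed, or discarded.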
That’s not just an optimization. That’s a philosophical shift.
The Transformer era was about seeing everything.
The next era may be about choosing wisely.
DeepSeek, MoE, and the Illusion of Scale
Even frontier models are feeling the pressure of scaling laws.
Take DeepSeek V3.
It uses a Mixture-of-Experts architecture with 671 billion parameters. But only around 37 billion activate per token during inference.
That’s how it navigates the scaling curve efficiently. It looks massive on paper. But during runtime, it’s selective.
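Top-k routing is easy to sketch. The scores below are illustrative, not DeepSeek's actual router:

```python
def top_k_experts(router_scores, k=2):
    """Sketch of MoE routing: of all available experts, only the k
    with the highest router scores run for a given token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

# Eight experts, one token: the router scores each expert's relevance.
scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.15, 0.4]
active = top_k_experts(scores, k=2)
# Only 2 of the 8 experts activate for this token; the other six
# contribute parameters on paper but no compute at runtime.
```

Scale that pattern up and you get a model whose parameter count and inference cost have been deliberately decoupled.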
Still, it pays the fundamental Transformer tax.
Attention is still quadratic.
And as data grows, that tax compounds.
The Chinchilla Rule and What Comes Next
The Chinchilla Rule suggests roughly 20 tokens of training data per parameter for optimal scaling.
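The arithmetic is blunt:

```python
def chinchilla_tokens(params):
    """Chinchilla heuristic: roughly 20 training tokens per parameter
    for compute-optimal training."""
    return 20 * params

# A 70-billion-parameter model would want about 1.4 trillion tokens.
needed = chinchilla_tokens(70_000_000_000)
```

Run that calculation for frontier-scale parameter counts and the required token budgets collide with the finite supply of high-quality human text.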
We are running out of high-quality human text.
That’s the real bottleneck.
It’s not just compute. It’s not just memory. It’s data quality.
So what happens next?
If the last decade was defined by Attention, the ability to see everything at once, maybe the next decade will be defined by Selection.
Architectures that decide what matters.
Architectures that compress intelligently.
Architectures that don’t try to remember everything.
Because in intelligence, biological or artificial, knowing what to forget is just as important as knowing what to remember.
And that might be the next pivot.
If you’re building AI systems today, this isn’t academic trivia.
It determines:
- Hardware strategy
- Inference cost
- Context limits
- Model selection
- And ultimately, competitive advantage
We are at an inflection point again.
The Beatles gave us Attention.
The next breakthrough might come from forgetting.
And that shift will define the next generation of AI.