Data Representation


Tokens & Embeddings:
The Bridge to Mathematics

Artificial Intelligence cannot read. It cannot hear, and it certainly cannot "understand" a word in the way humans do. To an AI, a sentence is not a string of meaning, but a sequence of numbers. The process of converting human language into these numbers happens in two critical stages: Tokenization and Embedding.

"Tokens are the atoms of AI; Embeddings are the coordinates that tell the AI where those atoms live in the universe of meaning."

1. Tokenization: Slicing the Language

Tokenization is the process of breaking a raw string of text into smaller units called "tokens." A token is not necessarily a word; it can be a character, a sub-word, or a punctuation mark.

Sub-word Tokenization (BPE)

Many modern LLMs use Byte Pair Encoding (BPE) or a close variant. Instead of storing every possible word (which would make the vocabulary too large) or every single character (which would make the sequences too long), BPE starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size.

Example: The word "unhappiness" might be tokenized as:
["un", "happi", "ness"]
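Because "un" gets its own token, the model can learn that it often acts as a prefix meaning "not," whether it is attached to "happy" or "fortunate." The actual split depends on the merges learned from the training data and will not always line up with linguistic prefixes and suffixes.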
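To make the merging idea concrete, here is a minimal sketch of BPE training on a handful of words. It is a toy version written for this article, not the tokenizer of any particular model: the function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) and the tiny word list are assumptions chosen for illustration, and real tokenizers work on bytes and far larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical mini-corpus: "un" is the most frequent adjacent pair.
merges = learn_bpe_merges(["unhappy", "unhappy", "unfair", "undo"], num_merges=1)
print(merges)
print(apply_merges("untie", merges))
```

With only one merge learned, the unseen word "untie" still starts with a single "un" token, while the rest of the word falls back to individual characters.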

The Vocabulary Gap

Every model has a fixed Vocabulary Size (e.g., 50,257 for GPT-2). Any word not in this vocabulary is broken down into smaller and smaller pieces until it can be represented by the available tokens. This is why AI sometimes struggles with very complex medical terms or obscure code: it has to "spell" them out in tiny fragments, which uses more tokens and gives the model less familiar units to work with.
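The "spelling out" effect can be sketched with a greedy longest-match tokenizer over a fixed vocabulary. This is a simplification: real BPE tokenizers replay learned merges rather than searching for the longest match, and the vocabulary below is hypothetical, invented only for this example.

```python
def greedy_tokenize(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to the single character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical vocabulary: a few common pieces plus single letters.
VOCAB = {"un", "happi", "ness", "happy",
         "a", "e", "i", "l", "n", "o", "t", "x", "y"}

print(greedy_tokenize("unhappiness", VOCAB))  # a few familiar pieces
print(greedy_tokenize("xylitol", VOCAB))      # spelled out letter by letter
```

A common word costs three tokens here, while the rarer word costs seven, one per character.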

2. Embeddings: The Geometry of Meaning

Once a word is turned into a token ID (e.g., "apple" $\rightarrow$ 4512), it is still just a number. The number 4512 has no inherent relationship to 4513. To solve this, we use Embeddings.

High-Dimensional Vector Space

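An embedding is a dense vector (a list of numbers, often 768 or 1536 dimensions long) that represents a token's meaning. Imagine a simplified 3D map: one axis represents "Royalty," one represents "Gender," and one represents "Age." In a real model the dimensions are learned during training and rarely correspond to such clean, human-readable labels.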

In this space, the vector for "King" would be very close to "Queen" because they both share high "Royalty" scores. However, they would be separated along the "Gender" axis. This is the "Geometry of Thought."
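The three-axis picture can be tried out with hand-written vectors. The coordinates below are invented for illustration (they are not taken from any trained model), and cosine similarity is used as one common way to measure closeness.

```python
import math

# Hypothetical coordinates on the three illustrative axes:
# (royalty, gender, age). All values are made up for this example.
EMBEDDINGS = {
    "king":    [0.9,  0.3,  0.6],
    "queen":   [0.9, -0.3,  0.6],
    "toddler": [0.0,  0.0, -0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def difference(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

king, queen, toddler = (EMBEDDINGS[w] for w in ("king", "queen", "toddler"))

print(cosine_similarity(king, queen))    # high: the two vectors point the same way
print(cosine_similarity(king, toddler))  # negative: they point in opposing directions
print(difference(king, queen))           # non-zero only on the gender axis
```

In this toy space, "king" and "queen" are close overall, and subtracting one from the other isolates the single axis on which they differ.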

| Method | Representation | Meaning | Efficiency |
|---|---|---|---|
| One-Hot Encoding | [0, 0, 1, 0, 0] | None (Isolated) | Very Low |
| Dense Embedding | [0.12, -0.5, 0.88...] | Relative Distance | Very High |
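A short sketch shows why one-hot vectors carry no notion of similarity: any two different tokens have a dot product of exactly zero, so every token is equally unrelated to every other. The helper names here are hypothetical, and the five-entry vocabulary matches the size used in the comparison above.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(vocab_size)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

VOCAB_SIZE = 5  # toy size; a real vocabulary has tens of thousands of entries

token_a = one_hot(2, VOCAB_SIZE)
token_b = one_hot(3, VOCAB_SIZE)

print(token_a)
print(dot(token_a, token_b))  # 0 for any two different tokens
print(dot(token_a, token_a))  # 1 only when a token is compared with itself
```

Each one-hot vector is also as long as the whole vocabulary, which is why the representation scales poorly compared with a dense vector of a few hundred numbers.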

3. The Full Processing Pipeline

When you type a prompt into an AI, the following sequence occurs in milliseconds:

Input Text $\rightarrow$ Tokenizer $\rightarrow$ Token IDs $\rightarrow$ Embedding Layer $\rightarrow$ Transformer Blocks $\rightarrow$ Output

1. The Slice: "Hello world" becomes ["Hello", " world"].
2. The ID: These become [15496, 995].
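3. The Projection: The ID 15496 is looked up in a large table (the embedding matrix, with one row per vocabulary entry) and replaced with a vector of, say, 1,536 numbers; the exact width depends on the model.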
4. The Context: These vectors are passed to the Attention mechanism to be modified based on the surrounding words.
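Steps 1 to 3 above can be sketched in a few lines. Everything here is a stand-in: the two-entry vocabulary reuses the IDs from the example, the regular-expression tokenizer only mimics the way the space attaches to the following word, the embedding width is cut to 4 for readability, and the vector values are random rather than learned. Step 4 (attention) is left out.

```python
import random
import re

# Toy vocabulary reusing the IDs from the example above.
TOKEN_TO_ID = {"Hello": 15496, " world": 995}

EMBED_DIM = 4  # assumption: tiny width so the vectors are easy to read

# Stand-in embedding table: one random vector per token ID.
# In a real model these values are learned during training.
_rng = random.Random(0)
EMBEDDING_TABLE = {
    token_id: [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for token_id in TOKEN_TO_ID.values()
}

def tokenize(text):
    """Step 1: split text so a leading space stays with the next word."""
    return re.findall(r" ?\S+", text)

def encode(text):
    """Step 2: map each token to its ID (unknown tokens raise KeyError)."""
    return [TOKEN_TO_ID[token] for token in tokenize(text)]

def embed(token_ids):
    """Step 3: look up one vector per token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]

tokens = tokenize("Hello world")
ids = encode("Hello world")
vectors = embed(ids)

print(tokens)
print(ids)
print(len(vectors), "vectors of width", len(vectors[0]))
```

The lookup in `embed` is a plain table read: no arithmetic happens until the vectors reach the transformer blocks.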

4. The Context Window & Token Limits

You have likely seen "Token Limits" (e.g., 128k context). This is not a random number; it is a hardware and mathematical constraint. Because the standard Attention mechanism compares every token to every other token, its computational cost grows quadratically with the number of tokens.

If you double the number of tokens in a prompt, the attention step doesn't do double the work: it does roughly four times the work. This is why the "Context Window" is the most valuable real estate in AI engineering; it determines how much of a book or codebase the AI can "keep in its head" at one time. Anything that does not fit in the window is simply not visible to the model.
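The quadratic growth is easy to check with a little arithmetic. The sketch below counts token-to-token comparisons only, assuming full attention in which every token is compared with every token; it ignores all other costs in a real model, and the function name is hypothetical.

```python
def attention_comparisons(num_tokens):
    """Token-to-token comparisons when every token attends to every token."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    ratio = attention_comparisons(n) / attention_comparisons(1_000)
    print(f"{n} tokens: {attention_comparisons(n)} comparisons ({ratio:.0f}x the 1,000-token case)")
```

Each doubling of the prompt length multiplies the comparison count by four, so a prompt four times as long needs sixteen times as many comparisons.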
