AI models like ChatGPT don't process complete words as humans do—they break text into smaller units called tokens. A token can be a whole word (like "apple"), part of a word (like "blue" and "berries" for "blueberries"), a single character, or punctuation. This tokenization approach solves three critical problems: it prevents the vocabulary from becoming impossibly large with millions of entries, it enables the AI to handle new words that didn't exist in training data by breaking them into recognizable fragments, and it allows related words like "walk," "walked," and "walking" to share common patterns rather than being learned as completely separate entities. The process of splitting text is called tokenization, performed by a tool like OpenAI's tiktoken using Byte-Pair Encoding (BPE). Capitalization and spacing matter—"red," " red," and "Red" are all different tokens. When generating text, the AI chooses from approximately 100,000 tokens rather than selecting from every possible English word, predicting one token at a time through autoregressive generation. This explains why AI struggles with tasks like counting letters or spelling backwards—it never sees complete word structures. Token limits like "8,000 tokens" measure these chunks, not words, with one token roughly equaling 0.75 words in English. You can experiment with tokenization using OpenAI's online tokenizer tool at platform.openai.com/tokenizer.
The Payoff: Understanding tokenization reveals why AI behaves the way it does—its limitations with letter-counting and spelling, its varying performance across languages, and how context limits are measured in tokens rather than words.
Next Challenge: While you now understand how AI breaks text into tokens, the question remains: how does the AI actually use these tokenized inputs to predict what should come next in a response?
You've learned that AI generates text one word at a time through autoregressive generation, predicting each next word based on your prompt and all previously generated words. This sequential process happens in real-time as you watch ChatGPT's response stream across your screen.
When you type a sentence into ChatGPT, you might assume the AI reads it the same way you do—as a sequence of complete words. But that's not how it works.
AI models like ChatGPT, Claude, and Gemini don't process complete words. Instead, they break text down into smaller units called tokens. A token can be a whole word, part of a word, a single character, or even punctuation.
For example, the word "apple" is common enough that it's treated as one token. But "blueberries" gets split into two tokens: "blue" and "berries". The word "lollipop" breaks down into three tokens: "l", "oll", and "ipop".
This explains why AI sometimes struggles with tasks that seem simple to humans, like counting letters in a word or spelling backwards—it's not seeing the complete word structure that you see.
You might wonder why AI doesn't just work with complete words like humans do. The answer comes down to three practical problems that tokens solve.
Problem 1: The vocabulary would be impossibly large. If AI tried to memorize every possible word as a single unit, it would need to store millions of entries—every word in every language, plus technical terms, names, made-up words, typos, and more. This would make the model too large and slow to be practical.
Problem 2: New words would break the system. Language constantly evolves. New words appear daily—brand names, slang, technical terms. If the AI only knew complete words, it would have no way to handle "ChatGPT" or "cryptocurrency" when these terms didn't exist in its training data.
Problem 3: Related words wouldn't share meaning. Words like "walk", "walked", "walking", and "walker" all share the same root concept. If the AI treated each as a completely separate word, it would have to learn their relationship from scratch rather than recognizing they share the common base "walk".
Tokens solve all three problems by breaking text into reusable pieces. Common words stay whole (efficient). Rare words break into smaller parts (handles anything). Related words share common fragments (learns patterns faster).
The process of splitting text into tokens is called tokenization, and it happens before the AI can do anything with your input.
When you submit a prompt to ChatGPT, the system first runs your text through a tokenizer—a separate tool that breaks your sentence into the specific tokens that the AI model understands. OpenAI's models use a tokenizer called tiktoken, which implements an algorithm called Byte-Pair Encoding (BPE).
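If you want to see this yourself, tiktoken is available as an open-source Python package. The sketch below splits the example words from earlier. It assumes the cl100k_base encoding (the one used by GPT-3.5-turbo and GPT-4), so the exact splits you get may differ slightly from the examples above.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5-turbo and GPT-4.
# Other models use other encodings, so splits can differ.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["apple", "blueberries", "lollipop"]:
    token_ids = enc.encode(word)
    # Show the text fragment behind each token ID.
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")
```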
Here's how tokenization works in practice:
The sentence "All cat ladies love ginger tabby cats" gets broken into 20 tokens. Some words become single tokens, while others split into multiple pieces depending on how common they are in the AI's training data.
Capitalization matters: The word "red" at the start of a sentence becomes a different token than "red" in the middle of a sentence, and both are different from "Red" with a capital R mid-sentence. The AI sees these as distinct tokens because the exact characters it receives, including capitalization and any leading space, are different.
Spaces count: Tokens often include the space before a word as part of the token itself. This is why " red" (with a leading space) is a different token from "red" (without one).
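A quick way to see both effects is to encode the variants and compare the token IDs that come back. This sketch uses the same cl100k_base encoding as before.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Same letters, but different capitalization and spacing,
# so the tokenizer returns different token IDs.
for variant in ["red", " red", "Red", " Red"]:
    print(repr(variant), "->", enc.encode(variant))
```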
You can experiment with tokenization yourself using OpenAI's online tokenizer tool at platform.openai.com/tokenizer. Type any text and watch it highlight how the text gets split into individual tokens.
Now you understand that AI doesn't generate complete "words"—it generates tokens. This distinction is important.
When ChatGPT predicts what comes next in a response, it's not choosing from every English word. It's choosing from a fixed vocabulary of approximately 100,000 tokens (the exact number varies by model). Some of these tokens are whole words, but many are fragments like "ing", "un", "tion", or even single letters for rare words.
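You can read the vocabulary size directly from an encoding object. Assuming cl100k_base again, the number it reports is close to the roughly 100,000 mentioned above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # total number of distinct tokens the model can choose from
```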
This token-based approach is why AI sometimes has unexpected behaviors. It can't easily spell a word backwards because it never saw the complete word as a single unit—it only saw the tokens. It struggles with counting letters in a word for the same reason. It's also why AI handles some languages better than others: languages that were common in training data get efficient tokenization (fewer tokens per word), while less common languages require more tokens for the same text, filling up the model's context limit faster.
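One way to see the language effect is to tokenize roughly equivalent sentences in different languages and compare the counts. The sentences below are rough translations chosen just for illustration, and the exact numbers depend on the encoding (cl100k_base here).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences; languages that were less common in the
# training data usually need more tokens for the same meaning.
samples = {
    "English":  "Hello, how are you doing today?",
    "Spanish":  "Hola, ¿cómo estás hoy?",
    "Japanese": "こんにちは、今日はお元気ですか？",
}

for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")
```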
The length limits you see in AI tools—like "8,000 tokens" or "32,000 tokens"—aren't measuring words. They're measuring these tokenized chunks. In English, one token roughly equals 0.75 words on average, so 1,000 tokens is approximately 750 words. But this ratio changes depending on the language and vocabulary you use.
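That 0.75 ratio gives you a quick back-of-the-envelope conversion from a token limit to an approximate English word budget. Keep in mind it is only an average; code, other languages, and unusual vocabulary shift it.

```python
def approx_word_budget(token_limit: int, words_per_token: float = 0.75) -> int:
    """Rough English word count for a given token limit."""
    return int(token_limit * words_per_token)

print(approx_word_budget(8_000))   # about 6,000 words
print(approx_word_budget(32_000))  # about 24,000 words
```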
You now know that AI processes tokens instead of words, and why this approach makes sense for handling any text the AI encounters. But this raises the next question: once the AI has your tokenized input, how does it actually use these tokens to predict what should come next?