
When people interact with AI models like ChatGPT or Claude, it often feels like magic – as if the AI truly understands our words and thoughts.
Spoiler alert: they don't. Yet the reality of how these models handle language, while more mundane than magic, is deeply fascinating.
Large Language Models (LLMs) are pattern recognition engines operating on an unprecedented scale. Of course they don't understand language the way humans do – with comprehension of meaning, context and real-world grounding.
Instead, they process language as intricate mathematical relationships between words and phrases, having analyzed staggering volumes of text to learn the statistical patterns of how human writing flows.
This mathematical approach to language has led some critics to describe LLMs as "stochastic parrots" – able to reproduce human-like text with astonishing fidelity but without authentic human understanding.
Regardless of philosophical debates, the technical wizardry enabling these models is revolutionizing how humans and machines collaborate. To grasp how these systems work, we need to understand two fundamental concepts: tokens and embeddings.
These building blocks transform human language into something computers can process, and they're the key to how modern AI makes sense of our words.
The Journey from Text to Tokens
At its most basic, tokenization is the process of breaking text into pieces an AI can process.
Imagine you're teaching someone a new language. You wouldn't start with entire paragraphs – you'd begin with basic building blocks like common words and phrases.
AI models need a similar approach, but choosing the right size for these building blocks is crucial, and the choice might not be obvious.
Character-level tokenization would be like reading a book one letter at a time. While thorough, it's incredibly inefficient. The AI would need to process enormous sequences to understand even simple words, making it hard to capture meaningful patterns.
Word-level tokenization might seem logical – after all, we think in words. However, human language is remarkably complex and constantly evolving.
New words emerge daily (think "doomscrolling" or "metaverse"), names vary across cultures, and words shape-shift with grammar (run, running, ran). A word-based system would need an impossibly large vocabulary to keep up.
Subword tokenization offers an elegant middle ground. This approach breaks words into meaningful pieces, allowing AI models to understand both common and rare words efficiently.
It's similar to how we naturally break down unfamiliar words – if you encounter "cryptocurrency," you might recognize "crypto" and "currency" separately.
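To see subword tokenization in practice, here's a small sketch using OpenAI's tiktoken library (one tokenizer choice among many – any BPE tokenizer would illustrate the same idea):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the BPE vocabulary used by several OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "cryptocurrency", "doomscrolling"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")
```

Common words like "the" come back as a single token, while rarer words are split into several subword pieces; the exact splits depend on the text the tokenizer was trained on.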
The Science of Subword Tokenization
One popular method for creating subword tokens is Byte Pair Encoding (BPE). While the name might sound intimidating, the core concept is surprisingly intuitive: find the most common patterns in language and use them as building blocks. Let's see how this works in practice.
BPE in Action: A Folksong Example
To understand BPE, let's examine how it processes a simple folk song:
```
heres_good_luck_to_the_pint_pot_
good_luck_to_the_barley_mow_
jolly-good_luck_to_the_pint_pot_
good_luck_to_the_barley_mow_
```
The algorithm starts by breaking everything down into individual characters and counting them. This initial character inventory forms the starting vocabulary, and it already reveals patterns we might not notice at first glance.

Next, the encoding step counts every pair of adjacent characters and looks for the most frequent one.
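Here's a minimal sketch of that counting step in Python (my own illustration, treating the underscore as a word boundary so that pairs are only counted inside words):

```python
from collections import Counter

corpus = [
    "heres_good_luck_to_the_pint_pot_",
    "good_luck_to_the_barley_mow_",
    "jolly-good_luck_to_the_pint_pot_",
    "good_luck_to_the_barley_mow_",
]

# Count adjacent character pairs inside each word ("_" marks a word boundary)
pairs = Counter()
for line in corpus:
    for word in line.split("_"):
        for a, b in zip(word, word[1:]):
            pairs[a + b] += 1

print(pairs.most_common(3))  # "he" tops the list with 5 occurrences
```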

The most common pair "he" (appearing 5 times) becomes a new token in our vocabulary. After adding this new token, the algorithm recounts pairs and repeats the process. Over multiple iterations, larger patterns emerge:
Common pairs become tokens ("he", "th")
Frequent syllables form ("ing", "ed")
Common words emerge ("the", "and")
Even word combinations might become tokens ("of_the", "in_the")
This simple example illustrates how BPE naturally discovers useful language patterns. When applied to massive amounts of text, this process creates a vocabulary of tokens that efficiently represents human language.
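The whole procedure fits comfortably in a few lines of Python. The sketch below is a toy implementation of my own (real tokenizers add many refinements) that repeatedly merges the most frequent adjacent pair into a new token:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge tokens from a list of words."""
    # Start with each word as a sequence of single characters
    sequences = [list(word) for word in words]
    learned = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol everywhere
        (a, b), _ = pairs.most_common(1)[0]
        learned.append(a + b)
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i : i + 2] = [a + b]
                i += 1
    return learned

song = (
    "heres_good_luck_to_the_pint_pot_"
    "good_luck_to_the_barley_mow_"
    "jolly-good_luck_to_the_pint_pot_"
    "good_luck_to_the_barley_mow_"
)
print(bpe_merges(song.split("_"), num_merges=10))
# The first merge is "he", just as in the walkthrough above,
# and familiar chunks like "the" and "luck" soon follow.
```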
This approach offers several advantages:
Common words stay whole ("the", "and", "in")
Frequent word parts become reusable pieces ("-ing", "un-", "-ation")
Rare words can be built from smaller pieces
New words can often be handled using existing subword tokens
Modern AI models typically use vocabularies of tens of thousands of tokens – BERT's has about 30,000 entries, while newer GPT models use 50,000 to 100,000 or more – carefully balanced to handle both common and rare words effectively.
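If you have tiktoken installed, you can inspect these vocabulary sizes directly (the figures in the comments are approximate):

```python
import tiktoken

print(tiktoken.get_encoding("gpt2").n_vocab)         # ~50k tokens (GPT-2)
print(tiktoken.get_encoding("cl100k_base").n_vocab)  # ~100k tokens (newer models)
```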
From Tokens to Understanding: Embeddings
Once text is broken into tokens, how does an AI model understand what they mean?
This is where embeddings come in – they're the secret sauce that turns tokens into mathematical representations that capture meaning.
Think of embeddings as giving each token coordinates in a vast multidimensional space. It's similar to how you might plot points on a map, except instead of just latitude and longitude, embeddings typically use hundreds or thousands of dimensions to capture the nuances of language.
This mathematical representation is powerful because similar words cluster together in this space (dog, puppy, canine), related concepts maintain consistent relationships, and the distance between embeddings can represent meaningful semantic relationships.
One famous example demonstrates this mathematical representation of meaning: if you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," you end up very close to the embedding for "queen." For a deeper dive on the inner workings of this, I recommend looking at this blog post.
The model has effectively captured the conceptual relationship between these words.
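You can reproduce this yourself with the gensim library and a small set of pretrained GloVe vectors (my choice of model for illustration; the original result was demonstrated with word2vec):

```python
# pip install gensim  (downloads ~65 MB of GloVe vectors on first use)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # "queen" is expected to top the list
```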
But embeddings go beyond simple word analogies. They can capture subtle contextual meanings – understanding that "bank" means something different in "river bank" versus "bank account," or that "light" could be about illumination, weight, or even making a fire.
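This is something you can observe directly in a contextual model. The sketch below uses the Hugging Face transformers library with BERT (one possible choice; it assumes torch and transformers are installed) to compare the vector that "bank" receives in different sentences:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    ids = inputs["input_ids"][0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return hidden[tokens.index("bank")]

river = bank_vector("she sat on the river bank")
muddy = bank_vector("the river bank was muddy")
money = bank_vector("he deposited cash at the bank")

cos = torch.nn.functional.cosine_similarity
print(float(cos(river, muddy, dim=0)))  # similar contexts: higher similarity
print(float(cos(river, money, dim=0)))  # different sense: lower similarity
```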
Why This Matters: The Transformational Picture
The combination of clever tokenization and meaningful embeddings isn't just a technical achievement – it's transforming how we interact with technology in our daily lives.
Consider machine translation, which has improved dramatically in recent years. When you use Google Translate or DeepL, these systems aren't just matching words between languages; they're working with tokens and embeddings to understand the deeper meaning of your text, producing translations that were impossible just a few years ago.
Search engines have evolved beyond simple keyword matching to understand the intent behind your queries, even when phrased in natural language.
When you ask "what's that movie with the guy who played Batman in the 90s," search engines can connect concepts like "Batman," "1990s," and "actor" to understand you're probably looking for Michael Keaton.
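A toy version of this kind of semantic matching fits in a few lines with sentence embeddings. This sketch assumes the sentence-transformers package and the all-MiniLM-L6-v2 model (both my choice for illustration, not what any particular search engine uses):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "what's that movie with the guy who played Batman in the 90s"
candidates = [
    "Michael Keaton starred as Batman in the 1989 and 1992 films.",
    "Christian Bale played Batman in the Dark Knight trilogy.",
    "The Lego Batman Movie was released in 2017.",
]

# Rank candidates by cosine similarity to the query embedding
scores = util.cos_sim(model.encode(query), model.encode(candidates))[0]
for text, score in sorted(zip(candidates, scores), key=lambda p: -float(p[1])):
    print(f"{float(score):.2f}  {text}")
```

Even without exact keyword overlap between "90s" and "1989 and 1992", embedding similarity lets a system rank sentences by meaning rather than by shared words.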
This semantic understanding extends to content moderation systems, which now detect subtle forms of harmful content by understanding context and connotations rather than relying on simple word blacklists.
However, challenges remain. Technical vocabulary poses a particular problem, as specialized fields often use common words in unique ways or create entirely new terms. The computational resources required for these systems present another ongoing challenge, as training and running large language models demands significant energy and processing power.
As we push models toward more sophisticated understanding of language, balancing capability with computational efficiency becomes increasingly critical.
Looking Forward: New Possibilities
This round trip from human language to mathematics and back again is what enables machines' seemingly magical facility with our words. The impact extends beyond software applications toward a broader cultural shift.
Looking to the future, new possibilities are emerging:
Researchers are developing more efficient approaches that could make sophisticated language AI accessible on personal devices, not just in cloud data centers.
Multilingual models are becoming increasingly adept at understanding content across languages, breaking down language barriers in real-time communication.
We're also seeing improvements in contextual understanding that could lead to AI systems that better grasp implied meaning, humor, and cultural references.
As machines become better at processing human language, we're moving away from having to learn specific commands or interfaces toward simply expressing our needs in natural language, a topic my colleagues have discussed.
This shift makes technology more accessible to everyone, regardless of their technical expertise, fundamentally changing how humans and machines interact.