Large Concept Models: What Happens When AI Stops Thinking in Words

Meta’s Large Concept Models reason over sentence-level embeddings instead of tokens — a fundamentally different architecture that trades token-by-token generation for prediction and planning in a concept space.

Every large language model you’ve used — GPT-4, Claude, Gemini, LLaMA — thinks in tokens. Fragments of words. Byte-pair encodings that split “understanding” into “under” + “stand” + “ing” and process each piece through the same attention mechanism, one at a time, left to right.

This works remarkably well. But it imposes a constraint so fundamental that most people don’t notice it: the model’s unit of reasoning is smaller than the model’s unit of meaning. You think in ideas. The model thinks in syllables. Every concept must be assembled from parts, token by token, with no guarantee that the whole adds up to what you intended.

In December 2024, Meta AI published a paper that asks the obvious question: what if we just skipped the tokens entirely?

Their answer is the Large Concept Model — an architecture that reasons over sentence-level semantic embeddings rather than individual tokens. Instead of predicting the next word, it predicts the next idea. The results are surprising, the limitations are instructive, and the implications are worth thinking through carefully.

The Token Bottleneck

To understand why LCMs matter, you need to understand what token-level processing actually costs you.

When a transformer processes text, it converts each token into a high-dimensional vector and computes attention across all tokens in the context. The computational cost scales quadratically with sequence length — O(n²) for n tokens. A 1,000-word essay is roughly 1,300 tokens. A 10,000-word document is 13,000 tokens. The attention mechanism must compute 169 million pairwise relationships for the longer document, compared to 1.7 million for the shorter one.
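A quick back-of-the-envelope script makes that scaling concrete (the 1.3 tokens-per-word ratio is a rough rule of thumb for English, not an exact figure):

# Rough cost of full self-attention: pairwise comparisons grow quadratically with length.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for words in (1_000, 10_000):
    tokens = int(words * 1.3)   # ~1.3 tokens per English word, a common rule of thumb
    print(f"{words:>6,} words ≈ {tokens:>6,} tokens → {attention_pairs(tokens):>12,} attention pairs")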

But here’s the deeper issue: those 13,000 tokens don’t represent 13,000 ideas. They represent maybe 200-400 distinct concepts, encoded redundantly across thousands of subword fragments. The model is spending most of its computation processing the syntax and surface form of language rather than its meaning. It’s like trying to understand a symphony by analyzing individual sound waves instead of listening to the melody.

I call this the granularity mismatch: the gap between the level at which humans reason about text (sentences, paragraphs, arguments) and the level at which transformers process it (tokens, subwords, characters). Every token-level model pays this tax. The question is whether there’s a way around it.

How Large Concept Models Work

Meta’s LCM paper, “Large Concept Models: Language Modeling in a Sentence Representation Space” (December 2024), proposes a direct solution: move the entire reasoning process into a semantic embedding space where each point represents a complete sentence.

The architecture has three stages:

┌─────────────────────────────────────────────────────┐
│  1. ENCODE: Text → Sentence Embeddings              │
│     "The market crashed" → vector [0.23, -0.71, ...] │
│     Using SONAR encoder (1,024-dim, 200+ languages)  │
├─────────────────────────────────────────────────────┤
│  2. REASON: Embedding → Embedding                    │
│     Predict next concept from sequence of concepts   │
│     Three approaches: Base / Diffusion / Quantized   │
├─────────────────────────────────────────────────────┤
│  3. DECODE: Embedding → Text                         │
│     vector [0.41, 0.18, ...] → "Investors panicked"  │
│     Using SONAR decoder (any target language)         │
└─────────────────────────────────────────────────────┘

Stage 1 uses SONAR, Meta’s multilingual sentence embedding model that maps sentences from over 200 languages into a shared 1,024-dimensional vector space. The key property: sentences with similar meanings land near each other in this space, regardless of the language they were written in. “The sky is blue” in English, Japanese, and Arabic all map to approximately the same point.

Stage 2 is where the LCM does its actual reasoning. Instead of predicting the next token, it predicts the next sentence embedding — a complete concept — given the sequence of previous concepts. This is fundamentally different from autoregressive token generation. The model operates on a sequence of maybe 30-50 concepts instead of 1,300 tokens, with each concept carrying the full semantic weight of an entire sentence.

Stage 3 decodes the predicted embedding back into natural language text using SONAR’s decoder. Because the embedding space is language-agnostic, the same concept can be decoded into any of the 200+ supported languages — without the model ever being trained on translation explicitly.
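As a concrete illustration of stages 1 and 3, here is a sketch built on Meta’s open-source SONAR package; the pipeline and model names follow the project’s README, so treat the exact identifiers as assumptions rather than a verified recipe:

# Round-tripping text through SONAR's language-agnostic embedding space.
# Requires Meta's open-source `sonar` package; names follow its README and are assumptions here.
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
)
decoder = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder"
)

# Stage 1: each sentence becomes a single 1,024-dimensional concept vector.
concepts = encoder.predict(
    ["The market crashed.", "Investors panicked."], source_lang="eng_Latn"
)
print(concepts.shape)  # two sentences → two 1,024-dim vectors

# Stage 3: the same vectors can be decoded into any supported language.
print(decoder.predict(concepts, target_lang="fra_Latn", max_seq_len=64))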

The researchers explored three architectures for Stage 2:

Base LCM — a standard autoregressive transformer operating on continuous embeddings. It predicts the next concept vector by attending to all previous concepts. Simple but effective for short sequences.

Diffusion-based LCM — borrows the iterative refinement process from image generation models. Instead of predicting the next concept in one shot, it starts with noise and gradually denoises it into a coherent concept embedding. This captures richer distributions over possible next concepts, which matters when the continuation is genuinely ambiguous.

Quantized LCM (QLCM) — discretizes the continuous embedding space into a vocabulary of concept tokens using a residual vector quantizer. Each sentence becomes a short sequence of discrete codes (typically 8 codes per sentence), and the model operates on these codes using standard sequence-to-sequence methods. This bridges the LCM approach with conventional language modeling techniques.
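To make the contrast with token prediction concrete, here is a minimal sketch of the Base LCM idea: a causal transformer that regresses the next 1,024-dimensional concept vector under an MSE objective, the loss the paper uses for this variant. The layer sizes are illustrative, not Meta’s configuration.

import torch
import torch.nn as nn

class BaseLCM(nn.Module):
    """Causal transformer over continuous concept vectors (illustrative sizes)."""
    def __init__(self, concept_dim=1024, d_model=2048, n_layers=8, n_heads=16, max_len=2048):
        super().__init__()
        self.in_proj = nn.Linear(concept_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, concept_dim)

    def forward(self, concepts):                                     # (batch, seq, 1024)
        seq = concepts.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)  # causal mask
        h = self.backbone(self.in_proj(concepts) + self.pos[:, :seq], mask=mask)
        return self.out_proj(h)                                      # predicted next concepts

model = BaseLCM()
doc = torch.randn(4, 32, 1024)                      # 4 documents of 32 concepts each (random stand-ins)
pred = model(doc[:, :-1])                           # predict concept t+1 from concepts 1..t
loss = nn.functional.mse_loss(pred, doc[:, 1:])     # regression on vectors, not a softmax over a vocabulary
loss.backward()

The last two lines are the whole difference from a token LLM: the target is a continuous vector, so the loss is mean squared error rather than cross-entropy. The diffusion and quantized variants replace that final regression step with iterative denoising and discrete code prediction, respectively.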

The Three Levels of Abstraction

To place LCMs in the broader landscape, it helps to think about language models as operating at different levels of abstraction:

┌─────────────────────────────────────┐
│  Level 3: Concept-Level (LCMs)      │
│  Unit: sentence embedding           │
│  Sequence: ~30-50 concepts          │
│  Captures: meaning, intent, ideas   │
├─────────────────────────────────────┤
│  Level 2: Token-Level (LLMs)        │
│  Unit: subword token                │
│  Sequence: ~1,000-100,000 tokens    │
│  Captures: syntax, grammar, style   │
├─────────────────────────────────────┤
│  Level 1: Character-Level           │
│  Unit: character/byte               │
│  Sequence: ~5,000-500,000 chars     │
│  Captures: spelling, morphology     │
└─────────────────────────────────────┘

Most of the field’s energy has gone into Level 2. LCMs propose that for certain tasks — summarization, planning, cross-lingual transfer — Level 3 is simply a better place to work. You lose fine-grained control over word choice and syntax, but you gain the ability to reason about ideas directly.

This isn’t a replacement for LLMs. It’s a complement. Different tasks live at different abstraction levels. Generating poetry requires Level 2 (word-level craft matters). Summarizing a legal brief requires Level 3 (concept-level reasoning matters). The choice of abstraction level should match the task, not the model’s architecture.

What the Results Actually Show

The Meta paper’s results are both impressive and humbling. The 7-billion-parameter LCM, trained on SONAR embeddings of sentences extracted from a large text corpus, produced these findings:

Summarization is the sweet spot. On the summarization benchmarks, the diffusion-based LCM (specifically the “Two-Tower” variant) achieved the strongest performance. The model didn’t just extract and rearrange text — it genuinely paraphrased and restructured ideas, producing summaries that read like a human wrote them rather than highlighted them. This makes architectural sense: summarization is fundamentally a concept-level operation. You’re compressing ideas, not rearranging words.

Multilingual transfer is remarkable. Despite being trained primarily on English data, the LCM outperformed specialized multilingual models on most of the 45 languages tested. This result flows directly from SONAR’s architecture: because the embedding space is language-agnostic, reasoning learned in one language transfers automatically to all others. The model doesn’t need to “learn” each language separately — it learns to reason about concepts, and SONAR handles the translation.

The numbers are concrete. The 7B model was trained on 2.3 billion documents (2.7 trillion tokens, 142.4 billion sentence concepts) with a context window of 2,048 concepts. On CNN/DailyMail and XSum summarization benchmarks, the Two-Tower diffusion variant was competitive with instruction-tuned models including T5-3B, Gemma-7B, and LLaMA-3.1-8B — while operating on sequences an order of magnitude shorter.

Expansion is the weak spot. When asked to expand a summary into longer text, the model struggled. It repeated ideas, produced circular reasoning, and lost coherence beyond 5-7 sentences. This is the generation horizon problem: concept-level models are good at compressing information but lack the fine-grained control needed for sustained, coherent long-form generation. Token-level models don’t have this problem because they control output one token at a time.

The concept space is fragile. Small perturbations in the embedding space can cause large, unpredictable changes in the decoded text. A tiny shift in the vector representing “the company’s revenue grew” might decode to “the company’s reputation suffered” — semantically distant but geometrically close. This embedding instability is the most fundamental limitation of the approach, and it gets worse for technical or precise content where exact wording matters.
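A toy example, using synthetic vectors rather than real SONAR embeddings, shows why this is dangerous: when two semantically distant sentences happen to sit close together in the space, a small nudge flips which one a vector decodes to.

import numpy as np
rng = np.random.default_rng(0)

base = rng.normal(size=1024)
base /= np.linalg.norm(base)

# A small direction orthogonal to `base`; the two reference sentences end up geometrically close.
delta = rng.normal(size=1024)
delta -= (delta @ base) * base
delta *= 0.05 / np.linalg.norm(delta)

references = {
    "the company's revenue grew": base,
    "the company's reputation suffered": (base + delta) / np.linalg.norm(base + delta),
}

def decode(vec):
    # Stand-in for a decoder: return the reference sentence with the highest cosine similarity.
    vec = vec / np.linalg.norm(vec)
    return max(references, key=lambda s: float(references[s] @ vec))

print(decode(base))                # "the company's revenue grew"
print(decode(base + 1.5 * delta))  # a ~7% nudge now decodes to "the company's reputation suffered"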

The Planning Problem

Perhaps the most interesting section of Meta’s paper is their proposal for the Large Planning Concept Model (LPCM) — a version that generates an explicit plan before producing content.

The idea is compelling: before generating a sequence of concept embeddings, first generate a high-level plan — a sequence of abstract intentions — and then fill in the concepts that satisfy that plan. This is how skilled writers actually work. You don’t write an essay one sentence at a time, left to right. You outline the argument, identify the key points, and then flesh out each section.

The LPCM adds a planning layer above the concept layer:

Plan:     [introduce problem] → [present data] → [propose solution] → [acknowledge limits]
           ↓                    ↓                  ↓                    ↓
Concepts: [3-5 sentences]      [3-5 sentences]    [3-5 sentences]     [2-3 sentences]
           ↓                    ↓                  ↓                    ↓
Text:      Paragraph 1          Paragraph 2        Paragraph 3         Paragraph 4

This hierarchical approach addresses the generation horizon problem by constraining the concept sequence to follow a coherent structure. Early results show improved coherence on longer documents, though the planning module itself introduces new challenges around plan faithfulness — the model sometimes generates concepts that satisfy the local context but deviate from the original plan.
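The control flow is easy to sketch, even though the learned components are not public; everything below is a hypothetical stand-in for the modules the paper describes, stubbed out so the two-level loop runs end to end.

def plan_model(prompt: str) -> list[str]:
    # Stand-in planner: in an LPCM this would be a learned module producing abstract intentions.
    return ["introduce problem", "present data", "propose solution", "acknowledge limits"]

def concept_model(plan_step: str, history: list[str], n: int = 3) -> list[str]:
    # Stand-in concept generator: would predict `n` sentence embeddings conditioned on
    # the current plan step and every concept generated so far.
    return [f"<concept {i + 1} for '{plan_step}'>" for i in range(n)]

def decode_concepts(concepts: list[str]) -> str:
    # Stand-in for the SONAR decoder: would turn each concept embedding back into a sentence.
    return " ".join(concepts)

def generate_document(prompt: str) -> str:
    history, paragraphs = [], []
    for step in plan_model(prompt):                   # level 1: the plan fixes the structure
        concepts = concept_model(step, history)       # level 2: concepts fill in each step
        history.extend(concepts)
        paragraphs.append(decode_concepts(concepts))
    return "\n\n".join(paragraphs)

print(generate_document("Explain concept-level language models."))

Plan faithfulness, the failure mode mentioned above, lives entirely in the concept-generation step: nothing in the loop forces those concepts to actually satisfy the plan step they were conditioned on.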

Where This Breaks

LCMs have real limitations, and understanding them clarifies when this architecture is and isn’t the right choice.

Sentence-level granularity is a ceiling, not just a feature. The model can’t reason about individual words. It can’t write poetry where word choice matters. It can’t generate code where every character is syntactically meaningful. It can’t produce dialogue where rhythm and word-level pacing create voice. For these tasks, token-level models are strictly superior.

Embedding instability compounds over sequences. Each concept prediction has some error. Over a sequence of 30 concepts, these errors accumulate. By concept 20, the model may have drifted significantly from the intended meaning. This is the concept-level equivalent of the “hallucination” problem in LLMs, but harder to detect because the output can be semantically plausible while being factually wrong.
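A crude random-walk toy, with synthetic vectors and an arbitrary per-step error, is enough to show the shape of the problem: small per-concept errors compound into a large drift from the intended meaning.

import numpy as np
rng = np.random.default_rng(1)

intended = rng.normal(size=1024)
intended /= np.linalg.norm(intended)

generated = intended.copy()
eps = 0.15                                            # per-step prediction error, purely illustrative
for step in range(1, 31):
    noise = rng.normal(size=1024)
    generated = generated + eps * noise / np.linalg.norm(noise)
    generated /= np.linalg.norm(generated)
    if step in (1, 10, 20, 30):
        print(f"concept {step:2d}: cosine similarity to intended meaning ≈ {intended @ generated:.2f}")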

SONAR is a bottleneck. The quality of the entire system depends on SONAR’s ability to faithfully encode and decode sentences. Any information that SONAR loses during encoding — nuance, emphasis, technical precision — is irrecoverable. The model can’t generate what it can’t represent. And SONAR’s 1,024-dimensional embedding space, while powerful for general semantics, may not capture the distinctions that matter for specialized domains like legal text, medical records, or mathematical proofs.

Training data requirements are different. Token-level LLMs can be trained on raw text. Concept-level LCMs require text that has been segmented into sentences and encoded into SONAR embeddings — an additional preprocessing step that introduces its own errors and biases. Sentence boundary detection is imperfect, especially for languages with non-standard punctuation or for informal text like social media posts.
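The segmentation step sounds trivial but is a real error source. A deliberately naive splitter shows where it goes wrong; anything that survives segmentation then gets encoded into a SONAR vector, errors included.

import re

def naive_sentences(text: str) -> list[str]:
    # Split after ., ! or ? followed by whitespace. Good enough for clean prose, wrong for much else.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "Dr. Smith joined in 2021. Revenue grew 40%! Was it sustainable?"
print(naive_sentences(doc))
# ['Dr.', 'Smith joined in 2021.', 'Revenue grew 40%!', 'Was it sustainable?']
# "Dr. Smith" has been cut in half, and each fragment would become its own concept vector.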

The diffusion approach is slow. Diffusion-based concept generation requires multiple denoising steps per concept (typically 10-100 steps). For a 30-concept summary, that’s 300-3,000 forward passes — significantly slower than autoregressive generation. The QLCM variant is faster but sacrifices the richer distributional properties that make diffusion effective.

What This Means for the Field

LCMs matter not because they’ll replace LLMs — they won’t, at least not for general-purpose language generation — but because they demonstrate that the token is not the only viable unit of reasoning for neural language models.

This insight has implications beyond the specific LCM architecture:

Hierarchical reasoning. The most capable future systems will likely operate at multiple abstraction levels simultaneously — planning at the concept level, generating at the token level, and verifying at both. The LCM paper provides evidence that concept-level reasoning is learnable and useful, even if the current implementation is limited.

Multilingual AI without multilingual training. SONAR’s language-agnostic embedding space suggests a path toward AI systems that can reason in any language after being trained in just one. This has enormous implications for languages with limited training data — the “low-resource language” problem that token-level models still struggle with.

Compression as reasoning. LCMs perform a kind of lossy compression: they encode rich, nuanced text into fixed-size vectors, reason over those vectors, and then decode them back. The quality of this compression determines the quality of the reasoning. This frames the problem differently than token-level models, and may lead to architectures that are fundamentally better at tasks that require abstraction — summarization, planning, conceptual reasoning — even if they’re worse at tasks that require precision.

The granularity mismatch between human reasoning and token-level processing is real. LCMs are the first serious attempt to close it. The attempt is imperfect — fragile embeddings, limited generation horizons, slow diffusion — but the direction is right. We should be building models that reason about ideas, not just words. The question is how to do it without losing the precision that makes token-level models so useful.

The follow-up work is already proving the direction right. SONAR-LLM (August 2025) eliminates the diffusion sampler entirely, building a decoder-only transformer that “thinks in sentence embeddings and speaks in tokens” — significantly outperforming both diffusion and MSE variants while achieving near-linear computational scaling up to 1M tokens. Dynamic Large Concept Models (December 2025) go further, learning semantic boundaries from latent representations rather than using fixed sentence boundaries, achieving +2.69% across 12 zero-shot benchmarks under matched compute.

Meta’s original LCM doesn’t answer the granularity question. But the trajectory from LCM → SONAR-LLM → Dynamic LCM shows the field converging on an answer: the right unit of reasoning isn’t fixed — it’s learned.


Sharad Jain builds AI systems in Bengaluru. He previously worked on data infrastructure at Meta and founded autoscreen.ai, a production voice AI platform. He writes about emerging AI architectures at sharadja.in.

#AI #LCM #concept-models #SONAR #Meta #language-models #embeddings #architecture