Since ChatGPT’s debut in late 2022, we have been living in a Generative AI era, with the term “LLM” becoming part of everyday vocabulary.
But recently, several industry leaders have suggested that LLM growth is plateauing.
So, what’s next? Meta has an answer.
Meta recently introduced Large Concept Models (LCMs), which look to be the next big step: a major upgrade to LLMs.
What are Large Concept Models?
Meta’s Large Concept Models (LCMs) represent a novel approach to language modelling that operates at a higher level of abstraction compared to traditional Large Language Models (LLMs).
Instead of processing text at the token level, LCMs work with concepts, which are language- and modality-agnostic representations of higher-level ideas or actions.
In Meta’s LCM framework, a concept is defined as an abstract, atomic idea. In practice, a concept often corresponds to a sentence in text or an equivalent speech utterance. This allows the model to reason at a higher semantic level, independent of the specific language or modality (e.g., text, speech, or images).
What does this even mean?
Let’s look at an example.
Traditional Language Models (LLMs): Word-by-Word Prediction
Imagine you’re writing a story, and you’re using a traditional language model like ChatGPT. It works by predicting the next word (or “token”) based on the words you’ve already written. For example:
You write: “The cat sat on the…”
The model predicts: “mat.”
It’s like filling in the blanks one word at a time. This works well, but it’s very focused on individual words and doesn’t always think about the bigger picture or the overall meaning of the sentence.
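The word-by-word loop can be sketched in a few lines. This is a toy bigram model, nothing like ChatGPT’s actual architecture, and the corpus is made up for illustration — but it captures the core idea of predicting the next token from the tokens so far:

```python
from collections import Counter, defaultdict

# Toy corpus: an LLM learns statistics over token sequences like these.
corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count which token follows which (a bigram model -- the simplest
# possible "predict the next token" model).
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(token):
    """Return the most likely next token -- one word at a time."""
    return next_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" -- the model fills in one blank at a time
```

A real LLM replaces the bigram counts with a transformer over thousands of context tokens, but the output is the same kind of object: one token.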
Meta’s Large Concept Models (LCMs): Idea-by-Idea Prediction
Now, imagine instead of predicting the next word, the model predicts the next idea or concept. A concept is like a complete thought or sentence, not just a single word. For example:
You write: “The cat sat on the mat. It was a sunny day. Suddenly…”
The model predicts: “a loud noise came from the kitchen.”
Here, the model isn’t just guessing the next word; it’s thinking about the entire idea that should come next. It’s like planning the next part of the story in chunks, not word by word.
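The concept-level version of that loop can be sketched the same way. Here the unit of prediction is a whole-sentence embedding, and decoding is simplified to a nearest-neighbour lookup; the 3-dimensional vectors are hand-made stand-ins for real sentence embeddings:

```python
import math

# Toy "concept" embeddings: in a real LCM these come from a sentence
# encoder like SONAR; here they are hand-made 3-d vectors.
concepts = {
    "The cat sat on the mat.":             [0.9, 0.1, 0.0],
    "It was a sunny day.":                 [0.1, 0.9, 0.0],
    "A loud noise came from the kitchen.": [0.1, 0.1, 0.9],
}

def decode(predicted, concepts):
    """Map a predicted embedding back to the nearest known sentence
    (the role a decoder plays, simplified to nearest neighbour)."""
    return min(concepts, key=lambda s: math.dist(concepts[s], predicted))

# Pretend the LCM, given the story so far, emitted this embedding:
predicted = [0.2, 0.0, 0.85]
print(decode(predicted, concepts))  # "A loud noise came from the kitchen."
```

The model’s output is one vector per *idea*, and the text only appears at the very end, when the vector is decoded.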
This is just crazy!
Why is this cool?
Language-Independent:
The model doesn’t care if the input is in English, French, or any other language. It works with the meaning of the sentence, not the specific words. For example:
Input in English: “The cat is hungry.”
Input in French: “Le chat a faim.”
Both sentences mean the same thing, so the model treats them as the same concept.
Multimodal (Works with Text, Speech, etc.):
The model can also work with speech or even images. For example:
If you say, “The cat is hungry,” or show a picture of a hungry cat, the model understands the same concept: “A cat needs food.”
Better for Long-Form Content:
When writing a long story or essay, the model can plan the flow of ideas instead of getting stuck on individual words. For example:
If you’re writing a research paper, the model can help you outline the main points (concepts) and then expand on them.
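The language-independence point is easy to demonstrate numerically. The vectors below are made up for illustration: a real multilingual encoder such as SONAR maps same-meaning sentences to nearby points, which we can check with cosine similarity:

```python
import math

# Hypothetical encoder outputs: a multilingual sentence encoder maps
# sentences with the same meaning to nearby vectors, regardless of
# language. These numbers are invented for illustration.
embeddings = {
    "The cat is hungry.":   [0.70, 0.10, 0.70],
    "Le chat a faim.":      [0.68, 0.12, 0.71],
    "It is raining today.": [0.05, 0.95, 0.10],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction (same meaning here)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Same meaning across languages -> high similarity; different meaning -> low.
print(cosine(embeddings["The cat is hungry."], embeddings["Le chat a faim."]))
print(cosine(embeddings["The cat is hungry."], embeddings["It is raining today."]))
```

Because the model only ever sees these vectors, English and French inputs are effectively interchangeable.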
How do LCMs work?
The pipeline has three stages.
Input Processing:
- The input text is first segmented into sentences, and each sentence is encoded into a fixed-size embedding using a pre-trained sentence encoder (e.g., SONAR). SONAR supports up to 200 languages and can handle both text and speech inputs.
- These embeddings represent the concepts in the input sequence.
Large Concept Model (LCM):
- The LCM processes the sequence of concept embeddings and predicts the next concept in the sequence. The model is trained to perform autoregressive sentence prediction in the embedding space.
- The output of the LCM is a sequence of concept embeddings, which can then be decoded back into text or speech using the SONAR decoder.
Output Generation:
- The generated concept embeddings are decoded into text or speech, producing the final output. Since the LCM operates at the concept level, the same reasoning process can be applied to different languages or modalities without retraining.
- The LCM supports zero-shot generalization, meaning it can be applied to languages or modalities it was not explicitly trained on, as long as the SONAR encoder and decoder support them.
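The three stages can be wired together in a short sketch. Every function here is a stand-in: the hash-based `encode`, the echoing `lcm_predict_next`, and the nearest-neighbour `decode` are placeholders for the real SONAR encoder, the trained LCM transformer, and the SONAR decoder, kept trivial so the pipeline shape is visible and runnable:

```python
import math
import re

def segment_into_sentences(text):
    """Stage 1a: split raw text into sentences (one concept each).
    A naive regex splitter stands in for a proper segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def encode(sentence):
    """Stage 1b: stand-in for the SONAR encoder, sentence -> fixed-size
    embedding. Real SONAR vectors are high-dimensional; this toy hash
    just gives each sentence a deterministic 8-d vector."""
    vec = [0.0] * 8
    for i, ch in enumerate(sentence.lower()):
        vec[i % 8] += ord(ch) / 1000.0
    return vec

def lcm_predict_next(concept_embeddings):
    """Stage 2: stand-in for the LCM itself, which autoregressively
    predicts the next concept embedding. A real LCM is a trained
    transformer; we echo the last concept to keep the sketch runnable."""
    return concept_embeddings[-1]

def decode(embedding, known_sentences):
    """Stage 3: stand-in for the SONAR decoder, embedding -> text,
    simplified to nearest neighbour over known sentences."""
    return min(known_sentences, key=lambda s: math.dist(encode(s), embedding))

text = "The cat sat on the mat. It was a sunny day."
concepts = segment_into_sentences(text)        # stage 1a: segment
embeddings = [encode(s) for s in concepts]     # stage 1b: encode
next_embedding = lcm_predict_next(embeddings)  # stage 2: predict next concept
print(decode(next_embedding, concepts))        # stage 3: decode to text
```

The key architectural point survives the simplification: only the encoder and decoder touch language or modality; the LCM in the middle sees nothing but embeddings, which is why swapping languages requires no retraining of the model itself.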
A few key points to understand here are:
SONAR Embedding Space:
SONAR is a multilingual and multimodal sentence embedding space that supports 200 languages for text and 76 languages for speech.
SONAR’s embeddings are fixed-size vectors that capture the semantic meaning of sentences, making them suitable for concept-level reasoning.
Diffusion- and Quantization-Based Generation:
Meta explored several approaches for training the LCM, including diffusion-based generation. Diffusion models are used to predict the next concept embedding by learning a conditional probability distribution over the continuous embedding space.
Another approach involves quantizing the SONAR embeddings into discrete units and training the LCM to predict the next quantized concept. This allows for more controlled generation and sampling, similar to how LLMs sample tokens from a vocabulary.
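The difference between the two objectives can be shown with a minimal sketch. The 3-d target vector and the codebook entries are invented for illustration; the point is that the continuous variant regresses the embedding directly (an MSE-style loss), while the quantized variant first snaps the target to a discrete codebook unit and predicts that unit, LLM-style:

```python
import math

# Target next-concept embedding (toy 3-d vector; real ones are much larger).
target = [0.1, 0.1, 0.9]

# --- Continuous objective (regression / diffusion-style variants) ---
# The model outputs an embedding directly; the loss is MSE in embedding space.
predicted = [0.2, 0.0, 0.8]
mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# --- Quantized objective ---
# The embedding space is discretized into a codebook; the model predicts a
# discrete unit, and can sample over units the way an LLM samples tokens.
# Codebook values are made up for illustration.
codebook = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.1, 0.1, 0.9]]
target_unit = min(range(len(codebook)),
                  key=lambda i: math.dist(codebook[i], target))

print(round(mse, 4), target_unit)  # the regression loss, and the discrete label
```

Quantization trades some fidelity (the embedding is snapped to the nearest codebook entry) for the controllable, temperature-style sampling that discrete vocabularies make possible.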
LCM vs LLMs
1. Level of Abstraction
- LLM: Works at the token level, predicting the next word or subword in a sequence.
- LCM: Works at the concept level, predicting the next sentence or idea in a sequence.
2. Input Representation
- LLM: Processes individual tokens (words or subwords) in a specific language.
- LCM: Processes sentence embeddings (concepts) that are language- and modality-agnostic.
3. Output Generation
- LLM: Generates text word by word, focusing on local coherence.
- LCM: Generates text sentence by sentence, focusing on global coherence and higher-level reasoning.
4. Language and Modality Support
- LLM: Typically trained for specific languages and modalities (e.g., text), though multimodal LLMs can support multiple modalities.
- LCM: Designed to handle multiple languages and modalities (e.g., text, speech, images) through a shared concept space.
5. Training Objective
- LLM: Trained to minimize token prediction error (e.g., cross-entropy loss).
- LCM: Trained to minimize concept prediction error (e.g., mean squared error in embedding space).
6. Reasoning and Planning
- LLM: Implicitly learns hierarchical reasoning but operates locally (token by token).
- LCM: Explicitly models hierarchical reasoning, planning at the sentence or idea level.
7. Zero-Shot Generalization
- LLM: Struggles with zero-shot tasks in languages or modalities it wasn’t trained on.
- LCM: Excels at zero-shot generalization across languages and modalities due to its concept-based approach.
8. Efficiency with Long Contexts
- LLM: Struggles with long contexts due to quadratic complexity in attention mechanisms.
- LCM: More efficient with long contexts as it processes sequences of sentence embeddings, which are shorter than token sequences.
9. Applications
- LLM: Best for token-level tasks like text completion, translation, and question answering.
- LCM: Best for sentence-level tasks like summarization, story generation, and multimodal reasoning.
10. Flexibility
- LLM: Limited to text-based tasks and requires retraining for new languages or modalities.
- LCM: Flexible across languages and modalities without retraining, thanks to its concept-based design.
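The efficiency point (item 8) is worth a quick back-of-the-envelope check. The figure of 20 tokens per sentence is an assumed average for illustration; the only fact used is that self-attention cost grows with the square of the sequence length:

```python
# Back-of-the-envelope: self-attention cost grows quadratically with
# sequence length. If a 2,000-token document averages ~20 tokens per
# sentence (an assumption), an LCM attends over 100 sentence
# embeddings instead of 2,000 tokens.
doc_tokens = 2000
tokens_per_sentence = 20                  # assumed average
sentences = doc_tokens // tokens_per_sentence

llm_attention_cost = doc_tokens ** 2      # pairwise token interactions
lcm_attention_cost = sentences ** 2       # pairwise sentence interactions

print(sentences, llm_attention_cost // lcm_attention_cost)  # 100 sentences, 400x fewer
```

The ratio scales with the square of the tokens-per-sentence average, so the longer the document and the longer its sentences, the bigger the win.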
In conclusion, Meta’s Large Concept Models (LCMs) represent a significant leap forward in language modelling. By operating at the concept level, LCMs offer a more abstract, language-agnostic, and multimodal approach to reasoning and generation. While LLMs excel at word-by-word tasks, LCMs shine in higher-level applications like summarization, story generation, and cross-modal understanding. As AI continues to evolve, LCMs could pave the way for more intuitive, human-like interactions with machines, transforming industries from education to entertainment.
The future of AI is not just about predicting the next word; it’s about understanding the next idea!
Source: www.medium.com