Small Language Models: Edge AI Innovation From AI21

While most of the AI world is racing to build ever-bigger language models like OpenAI’s GPT-5 and Anthropic’s Claude Sonnet 4.5, the Israeli AI startup AI21 is taking a different path.

AI21 has just unveiled Jamba Reasoning 3B, a 3-billion-parameter model. This compact, open-source model can handle a massive context window of 250,000 tokens (meaning it can “remember” and reason over far more text than typical language models) and runs at high speed, even on consumer devices. The launch highlights a growing shift: smaller, more efficient models could shape the future of AI just as much as raw scale.

“We believe in a more decentralized future for AI—one where not everything runs in massive data centers,” says Ori Goshen, Co-CEO of AI21, in an interview with IEEE Spectrum. “Large models will still play a role, but small, powerful models running on devices will have a significant impact” on both the future and the economics of AI, he says. Jamba is built for developers who want to create edge-AI applications and specialized systems that run efficiently on-device.

AI21’s Jamba Reasoning 3B is designed to handle long sequences of text and challenging tasks like math, coding, and logical reasoning—all while running with impressive speed on everyday devices like laptops and mobile phones. Jamba Reasoning 3B can also work in a hybrid setup: simple jobs are handled locally by the device, while heavier problems get sent to powerful cloud servers. According to AI21, this smarter routing could dramatically cut AI infrastructure costs for certain workloads—potentially by an order of magnitude.
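AI21 hasn’t published its routing logic, but the hybrid setup described above can be sketched in a few lines. Everything in the snippet below, including the token-budget heuristic, the function names, and the 8,000-token threshold, is a hypothetical illustration rather than AI21’s implementation:

```python
# Hypothetical sketch of local/cloud routing in a hybrid edge-AI setup.
# The heuristic, names, and threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_deep_reasoning: bool = False  # e.g., set by an upstream classifier

def run_local(prompt: str) -> str:
    # Placeholder for on-device inference with a small model
    # (say, a 3B-parameter model served by a local runtime).
    return f"[local] {prompt[:40]}"

def run_cloud(prompt: str) -> str:
    # Placeholder for a call to a large hosted model.
    return f"[cloud] {prompt[:40]}"

def route(req: Request, local_token_budget: int = 8_000) -> str:
    """Keep simple jobs on the device; escalate heavy ones to the cloud."""
    approx_tokens = len(req.prompt.split())  # crude token estimate
    if req.needs_deep_reasoning or approx_tokens > local_token_budget:
        return run_cloud(req.prompt)
    return run_local(req.prompt)

print(route(Request("Summarize this short email: ...")))  # handled locally
```

The cost argument follows directly: if most requests clear the cheap local path, cloud capacity is reserved for the minority of jobs that genuinely need it.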

A Small but Mighty LLM

With 3 billion parameters, Jamba Reasoning 3B is tiny by today’s AI standards. Models like GPT-5 or Claude run well past 100 billion parameters, and even smaller models, such as Llama 3 (8B) or Mistral (7B), are more than twice the size of AI21’s model, Goshen notes.

That compact size makes it all the more remarkable that AI21’s model can handle a context window of 250,000 tokens on consumer devices. Some proprietary models, like GPT-5, offer even longer context windows, but Jamba sets a new high-water mark among open-source models. The previous open-model record of 128,000 tokens was shared by Meta’s Llama 3.2 (3B), Microsoft’s Phi-4 Mini, and DeepSeek R1, the last of which is a far larger model. Jamba Reasoning 3B can process more than 17 tokens per second even at full capacity, that is, with inputs long enough to fill its entire 250,000-token window. Many other models slow down or struggle once their input length exceeds 100,000 tokens.
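To put that 17-tokens-per-second figure in rough perspective, here is a back-of-the-envelope calculation; the 500-token answer length is an assumption chosen for illustration, not a benchmark figure:

```python
# Rough arithmetic on the reported decode speed: ~17 tokens/second with the
# full 250,000-token context loaded. The answer length is an assumption.
decode_rate_tps = 17   # reported tokens generated per second
answer_tokens = 500    # assumed length of a typical response

seconds = answer_tokens / decode_rate_tps
print(f"~{seconds:.0f} s to generate a {answer_tokens}-token answer")
# -> roughly 29 seconds, even with a quarter-million tokens of input
```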

Goshen explains that the model is built on an architecture called Jamba, which combines two types of neural network designs: transformer layers, familiar from other large language models, and Mamba layers, which are designed to be more memory-efficient. This hybrid design enables the model to handle long documents, large codebases, and other extensive inputs directly…
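The article doesn’t spell out the model’s exact layer layout, but a Jamba-style hybrid stack can be sketched schematically. The snippet below uses PyTorch and the open-source mamba-ssm package; the layer ratio (one attention layer per four blocks) and the dimensions are illustrative assumptions, not AI21’s published configuration:

```python
# Schematic of a hybrid Jamba-style stack: mostly memory-efficient Mamba
# (state-space) layers, with an occasional transformer attention layer.
# Ratio and sizes here are assumptions, not AI21's published configuration.
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class HybridStack(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=8, attn_every=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                # Self-attention layer: expressive, but its key-value cache
                # grows linearly with context length.
                layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, batch_first=True))
            else:
                # Mamba layer: keeps a fixed-size recurrent state, so memory
                # use does not grow with the number of tokens processed.
                layers.append(Mamba(d_model=d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        for layer in self.layers:
            x = layer(x)
        return x
```

Because most layers carry only a constant-size state, such a stack’s memory footprint grows far more slowly with input length than a pure transformer’s, which is, in broad strokes, what makes a 250,000-token window plausible on laptop-class hardware.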

The post “Small Language Models: Edge AI Innovation From AI21” by Kate Park was published on 10/08/2025 by spectrum.ieee.org