Highlights

  • Where Llama 2’s and Llama 3’s releases were arguably some of the top few events in AI for their respective release years, Llama 4 feels entirely lost. Meta has attempted to reinvent its formula with substantial changes in size, architecture, and personality, but a coherent narrative is lacking. Meta has fallen into the trap of taking too long to ship, so the bar for a successful release has become nearly impossible to clear.
  • The time between major versions is growing, and the number of releases seen as exceptional by the community is dropping.
  • Where Llama models used to be scaled across different sizes with almost identical architectures, these new models are designed for very different classes of use-cases (a minimal sketch of the shared mixture-of-experts idea follows this list):
    • Llama 4 Scout is similar to a Gemini Flash model or any ultra-efficient inference MoE.
    • Llama 4 Maverick’s architecture is very similar to DeepSeek V3, with extreme sparsity and many active experts.
    • Llama 4 Behemoth is likely similar to Claude Opus or Gemini Ultra, but we don’t have substantial information on those models.
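    To make these architecture descriptions concrete, here is a minimal sparse mixture-of-experts layer in PyTorch. The expert count and top-k are illustrative placeholders rather than Llama 4’s actual configuration; the point is only that each token activates a small subset of the experts:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoELayer(nn.Module):
        """Toy sparse MoE layer: each token runs through only its top-k experts."""

        def __init__(self, d_model: int, n_experts: int = 16, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # routing logits per token
            self.experts = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
            top_p, top_idx = probs.topk(self.top_k, dim=-1)  # keep k experts/token
            top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize weights
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = top_idx[:, k] == e
                    if mask.any():  # only routed tokens pay this expert's FLOPs
                        out[mask] += top_p[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

    layer = SparseMoELayer(d_model=64)
    print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
    ```

    Scout-style designs keep the expert count modest for cheap inference, while Maverick-style designs push the expert count much higher (extreme sparsity) so total capacity grows without growing per-token compute.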
  • This release came on a Saturday, which is utterly bizarre for a major company launching one of its highest-profile products of the year. The consensus was that Llama 4 was going to come at Meta’s LlamaCon later this month.
  • One of the flagship features is the 10M-token context window on Scout, the smallest model (Maverick’s is 1M), but even that didn’t ship with any released evaluations beyond Needle in a Haystack (NIAH), which is seen as a necessary condition for long-context quality but not a sufficient one. A minimal version of such a probe is sketched below.
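    NIAH itself is simple enough to sketch: bury one fact (the “needle”) in long filler text and ask the model to retrieve it. A minimal harness, assuming only a generic `generate(prompt)` completion function rather than any specific provider’s API:

    ```python
    import random

    def needle_in_a_haystack(generate, n_filler: int = 5_000) -> bool:
        """Minimal NIAH probe: bury one fact in filler text, test retrieval."""
        needle = "The magic number for the audit is 4170217."
        filler = ["The sky was a calm and unremarkable shade of blue that day."] * n_filler
        filler.insert(random.randrange(len(filler)), needle)  # random depth
        prompt = (
            " ".join(filler)
            + "\n\nWhat is the magic number for the audit? Answer with digits only."
        )
        return "4170217" in generate(prompt)

    # Smoke test with a fake "model" that happens to know the answer:
    print(needle_in_a_haystack(lambda prompt: "4170217"))  # True
    ```

    Passing a probe like this demonstrates basic retrieval and nothing more, which is exactly why it is a necessary but not sufficient signal of long-context quality.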
  • Many, many people have commented on how Llama 4’s behavior is drastically different on LMArena (which was their flagship result of the release) than through other providers, even when following Meta’s recommended system prompt. It turns out, from the blog post, that it is just a different model:

    Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.

  • Sneaky. Those results are fake, and it is a major slight to Meta’s community not to release the model they used to create their major marketing push. We’ve seen many open models tuned to maximize ChatBotArena scores while destroying performance on important skills like math or code. We’ll see where the released models land.
  • ArtificialAnalysis rated the models as “some of the best non-reasoning models,” beating leading frontier models. This is complicated because we shouldn’t separate reasoning from non-reasoning models; we should just evaluate on reasoning and non-reasoning domains separately, as discussed in the Gemini 2.5 post. So-called “reasoning models” often top non-reasoning benchmarks, but the opposite is rarely true.
  • Other independent evaluation results range from middling to bad and confusing; I suspect the strangest results come down to hosting issues with the very-long-context models.
  • At the same time, the Behemoth model is outclassed by Gemini 2.5 Pro. To list some of the major technical breakthroughs Meta made (i.e. new to Llama, not new to the industry):
    • Mixture-of-experts architectures, enabling Llama 4 to be trained with less compute than Llama 3 even though the models have far more total parameters (a rough back-of-the-envelope comparison follows this list).
    • Very long context, up to 10M tokens.
    • Solid multimodal input performance on release day (not in a later model).
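    A rough way to see the compute claim is the common ~6·N·D training-FLOPs approximation, where N is active (not total) parameters and D is training tokens. The 17B active-parameter figure for Maverick matches Meta’s reported specs; the shared token budget below is an illustrative placeholder, not a disclosed number:

    ```python
    # Back-of-the-envelope training compute: FLOPs ~= 6 * N_active * D.
    def train_flops(active_params: float, tokens: float) -> float:
        return 6 * active_params * tokens

    D = 15e12                                # illustrative token budget
    llama3_405b = train_flops(405e9, D)      # dense: all 405B params active
    maverick = train_flops(17e9, D)          # MoE: 17B active of ~400B total

    print(f"Llama 3 405B : {llama3_405b:.2e} FLOPs")
    print(f"L4 Maverick  : {maverick:.2e} FLOPs")
    print(f"ratio        : {llama3_405b / maverick:.0f}x")  # ~24x cheaper per token
    ```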
  • Sadly this post is barely about the technical details. Meta nuked their release vibes with weird timing and by making the easiest version to find and talk to an off-puttingly chatty model. The release process, timing, and big picture raise more questions for Meta. Did they panic and feel like this was their one shot at being state of the art?
  • The evaluation scores for the models are solid; they clear a fairly high bar. With these highly varied MoE architectures, it’s super hard to feel confident in an assessment of the model based on benchmarks, especially when compared to dense models or teacher-student distilled models. The very-long-context base models will be extremely useful for research. The question here is: why is Meta designing its models the same way as other frontier labs when its audience is open-source AI communities and businesses, not an API-serving business or a ChatGPT competitor?
  • The model that becomes the “open standard” doesn’t need to be the best overall model; it needs to be a family of models in many shapes and sizes that is solid across many different deployment settings. Qwen 2.5, with models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, is the closest to this right now. There’s actually far less competition in that space than in the one Meta chose to enter (taking on DeepSeek)!
  • One of these communities historically has been the LocalLlama subreddit, which named the entire at-home model-running community after the Llama series of models; they’re not happy with Llama 4. Another community is academia, where a series of models across different size ranges is wonderful for understanding language models and improving methods. Both groups are GPU-poor, so memory-intensive models like these sparse mixtures of experts price out even more participants in the open community.
  • This is all on top of an onerous license that requires every artifact built with Llama to carry the “Llama-” prefix in its name, to inherit the Llama license, to display “Built with Llama” branding if used commercially, and to abide by use-case restrictions. This comes at the same time as competitors like DeepSeek are releasing their latest flagship models under the MIT license, which has no downstream restrictions.
  • A third group is businesses looking to use open models on-premises as open models close the gap to their closed counterparts. This group seems especially sensitive to the extra legal risk that Llama’s license exposes them to.
  • On top of all of this weirdness, many of Meta’s “open-source” efforts are restricted in the European Union. Where the Llama 3.2 models blocked you if you tried to access them from Europe, Llama 4 is available for download but prohibits use of the vision capabilities in its acceptable use policy. This is not entirely Meta’s fault, as many companies are dealing with side effects of the EU AI Act, but regulatory exposure needs to be considered in Meta’s strategy.
  • Sadly, the evaluations for this release aren’t even the central story. The vibes were off from the start with the choice of a weird release date. Over the coming weeks, more and more people will find reliable uses for Llama 4, but in a competitive landscape, that may not be good enough. Llama is no longer the open standard.
  • With the macro pressure coming to Meta’s business and the increasing commoditization of open models, how is Zuckerberg going to keep up in the face of shareholder pushback against the cost of the Llama project? This isn’t the first time he’s had to defend that cost, but he needs to reevaluate the lowest-level principles of Meta’s approach to open AI.