From Hype to Humble: Meta’s Llama 4 Lands at 32nd in AI Rankings
Tech giant Meta landed in hot water for using an experimental, unreleased
variant of Llama 4 Maverick to earn higher scores on a crowdsourced
benchmark. The episode prompted the maintainers of LM Arena to
acknowledge their mistake, update their policies, and revise the scores.
The fact that the unmodified model
scores below its leading rivals says a great deal. Simply put, it is not
competitive, ranking well behind the likes of OpenAI’s GPT-4o,
Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5
Pro. Keep in mind that the models it was measured against are months old.
The release version of Llama 4 Maverick was added to LM Arena after the cheating episode came to light. If you haven’t noticed it on the leaderboard, that’s likely because it sits in 32nd place. The question is why the performance is so poor.
Meta defended itself by explaining that the experimental variant
was optimized for conversation. That kind of optimization plays
well on LM Arena, where human raters compare the outputs
of AI systems and pick whichever they prefer.
LM Arena hasn’t
been the most reliable indicator of an AI model’s performance for a
while now. Even so, tailoring a model to a benchmark is not only
misleading, it also creates a real problem for developers, who
find it hard to predict how the model will actually perform in
different contexts.
The latest release is the open-source variant, which developers can customize for their own use cases. Meta says it is excited to see what they build and is looking forward to their feedback.
