LLM-as-a-Judge has emerged as a popular way to grade natural language outputs from LLM applications, but how do we know which models make the best judges?
We’re excited to launch Judge Arena - a platform that lets anyone easily compare models as judges side-by-side. Just run the judges on a test sample and vote which judge you agree with most. The results will be organized into a leaderboard that displays the best judges.
Crowdsourced, randomized battles have proven effective at benchmarking LLMs. LMSys’s Chatbot Arena has collected over 2M votes and is highly regarded as a field test for identifying the best language models. Since LLM evaluations aim to capture human preferences, direct human feedback is also key to determining which AI judges are most helpful.
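The post doesn’t spell out how votes become a leaderboard, but Chatbot Arena-style rankings are typically computed from pairwise outcomes with an Elo or Bradley-Terry model. The snippet below is only a rough illustration of that idea, not Judge Arena’s actual method; the model names, K-factor, and starting rating are all assumptions.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that judge A beats judge B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both judges' ratings toward the observed pairwise outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Hypothetical vote log: each entry is (preferred judge, other judge).
votes = [("gpt-4-turbo", "llama-3.1-70b"), ("llama-3.1-70b", "qwen-2.5-7b")]
ratings = defaultdict(lambda: 1000.0)  # every judge starts at the same baseline
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```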
How it works
Choose your sample for evaluation:
• Let the system randomly generate a 👩 User Input / 🤖 AI Response pair
• OR input your own custom sample
Two LLM judges will:
• Score the response
• Provide their reasoning for the score
Review both judges’ evaluations and vote for the one that best aligns with your judgment
(We recommend reviewing the scores first before comparing critiques)
After each vote, you can:
• Regenerate judges: Get new evaluations of the same sample
• Start a 🎲 New round: Randomly generate a new sample to be evaluated
• OR, input a new custom sample to be evaluated
To avoid bias and potential abuse, the model names are only revealed after a vote is submitted.
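To make the workflow above concrete, here is a minimal sketch of what a single round looks like under the hood. It uses the OpenAI chat API purely as an example backend; the prompt wording, scoring scale, and judge model names are assumptions for illustration, not Judge Arena’s actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluator. Score the AI response to the user input
on a scale of 1 to 5, then explain your reasoning.

User input:
{user_input}

AI response:
{ai_response}

Reply with the score on the first line, followed by your critique."""

def run_judge(judge_model: str, user_input: str, ai_response: str) -> str:
    """Ask one judge model to score and critique a single sample."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input, ai_response=ai_response)}],
    )
    return completion.choices[0].message.content

# Two anonymised judges evaluate the same sample; the human reviews both and votes.
sample = ("Explain HTTP caching in one paragraph.", "HTTP caching stores copies ...")
evaluation_a = run_judge("gpt-4-turbo", *sample)
evaluation_b = run_judge("gpt-4o-mini", *sample)
```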
Judge Arena focuses on the LLM-as-a-Judge approach, and therefore only includes generative models (excluding classifier models that solely output a score). We formalize our selection criteria for AI judges as follows:
• The model should be able to score AND critique other models’ outputs effectively.
• The model should be promptable to evaluate against different criteria and in different scoring formats.
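The second criterion, that a judge can be prompted into different scoring formats and criteria, is easiest to see with a parameterised prompt template. The template below is an illustrative sketch with made-up criterion and format names, not the prompts Judge Arena actually uses.

```python
# Illustrative only: a judge prompt parameterised by criterion and scoring format.
EVAL_TEMPLATE = """Evaluate the AI response for {criterion}.
{format_instruction}
Then justify your verdict in 2-3 sentences.

User input:
{user_input}

AI response:
{ai_response}"""

SCORING_FORMATS = {
    "likert_1_5": "Give a score from 1 (poor) to 5 (excellent).",
    "binary": "Answer PASS or FAIL.",
}

def build_eval_prompt(criterion: str, scoring_format: str,
                      user_input: str, ai_response: str) -> str:
    """Assemble a judge prompt for an arbitrary criterion and scoring format."""
    return EVAL_TEMPLATE.format(
        criterion=criterion,
        format_instruction=SCORING_FORMATS[scoring_format],
        user_input=user_input,
        ai_response=ai_response,
    )

prompt = build_eval_prompt("factual accuracy", "binary",
                           "Who wrote Hamlet?", "Hamlet was written by Shakespeare.")
```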
A mix of top performers across proprietary and open-source models: GPT-4 Turbo leads by a narrow margin, but the Llama and Qwen models are extremely competitive, surpassing the majority of proprietary models.
Smaller models show impressive performance: Qwen 2.5 7B and Llama 3.1 8B are performing remarkably well and competing with much larger models. As we gather more data, we hope to better understand the relationship between model scale and judging ability.
Preliminary empirical support for emerging research: LLM-as-a-Judge literature suggests that Llama models are well-suited as base models, demonstrating strong out-of-the-box performance on evaluation benchmarks. Several approaches, including Lynx, Auto-J, and SFR-LLaMA-3.1-Judge, opted to start with Llama models before post-training for evaluation capabilities. Our provisional results align with this trend, showing Llama 3.1 70B and 405B ranking 2nd and 3rd, respectively.