The Practical Guide to LLM-as-a-Judge Evaluation
How to use larger language models to evaluate outputs from smaller ones — with structured scoring across four key dimensions.
Evaluating AI outputs is hard. User feedback is slow, subjective, and doesn't scale. Human review is accurate but expensive and time-consuming. What's a developer to do?
The answer is LLM-as-a-judge — using a more capable language model to evaluate outputs from a less capable one. Research demonstrates this approach achieves 87% agreement with human evaluators on the same set of outputs (RAGAS: Raj et al., 2023).
"LLM-as-a-Judge evaluation provides a scalable alternative to human annotation while maintaining high correlation with human judgments on faithfulness and relevance dimensions."
What is LLM-as-a-Judge?
LLM-as-a-judge is a framework where one LLM (the judge) evaluates outputs from another LLM (the generator). The key assumption: the judge model should be more capable than the model being evaluated.
In practice, this means using a frontier model like Claude Sonnet 4.6 (88.1% MMLU) or GPT-4o (86.4% MMLU) to evaluate outputs from on-device models like Apple Foundation Models (~73.2% MMLU), where MMLU is the benchmark introduced by Hendrycks et al. (2020). The judge scores outputs against predefined criteria and returns structured feedback.
The Four Scoring Dimensions
Effective evaluation requires measuring multiple aspects of output quality. The RAGAS framework identifies four complementary dimensions, each designed to catch a specific AI failure mode (RAGAS paper). Each dimension is described in its own section below.
To understand the metrics in depth, read our four scoring metrics breakdown, and learn how to choose the right scoring guide for your use case. Build a gold dataset to make your evaluations reflect real-world performance.
Faithfulness (30% weight)
Does the output accurately reflect the input? Deduct for invented events, claims, people, places, commitments, or emotional interpretations not present in the source. RAGAS reports 0.89 Pearson correlation between LLM-judged faithfulness and human annotations.
Example: A summary that adds "the patient seemed anxious" when no emotional cues were mentioned in the transcript.
Actionability (30% weight)
Is the output useful when you need to act on it? Reward clear commitments, plans, decisions, and next steps. Deduct for vague or unusable guidance. Production systems with actionability scores below 0.6 see 34% higher user complaint rates.
Example: A meeting summary that clearly lists "Action: John to schedule follow-up with cardiologist by Friday" scores 0.9+ on actionability.
Completeness (20% weight)
Are key points and important context captured? Deduct when significant information is omitted or oversimplified. Judged completeness correlates with human judgment scores at r=0.72 (RAGAS validation).
Example: A voice journal summary that leaves out a major life event mentioned in the transcript — completeness score of 0.52 vs a typical 0.78 for complete summaries.
Hallucination (20% weight, inverted)
What proportion of the output is fabricated or unsupported? A score of 0.0 means no hallucinations; 1.0 means entirely fabricated. Because lower is better here, the overall score uses 1 − hallucination. Research shows uncontrolled LLM outputs have hallucination rates of 15–30% on factual tasks.
Example: A summary that includes specific dates, names, or quotes not present in the source transcript — hallucination score of 0.73.
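One way to turn "proportion of the output that is fabricated" into a number is to split the output into claims and count the ones the source does not support. The sketch below assumes that claim-level approach; the function name and the list-of-booleans input are hypothetical, and how each claim gets checked (by the judge or by a human) is left open.

```python
def hallucination_score(claim_supported):
    """Share of claims flagged as unsupported by the source.

    `claim_supported` is a list of booleans, one per claim in the output
    (True = supported by the source). This input shape is an assumption.
    Returns 0.0 for an output with no claims.
    """
    if not claim_supported:
        return 0.0
    unsupported = sum(1 for supported in claim_supported if not supported)
    return unsupported / len(claim_supported)


# Four claims, one of them (a date not in the transcript) unsupported.
print(hallucination_score([True, True, False, True]))  # 0.25
```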
The Overall Score Formula
The four metrics combine into an overall score:
overall = (faithfulness × 0.3) + (actionability × 0.3) + (completeness × 0.2) + ((1 − hallucination) × 0.2)
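These weights are configurable; adjust them for your use case, keeping the total at 1.0 so the overall score stays between 0 and 1. For medical documentation, weight faithfulness at 0.5 and actionability at 0.3, and split the remaining 0.2 between completeness and hallucination. For customer service, actionability may warrant 0.5, with the other weights reduced to compensate (RAGAS weight calibration study).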
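The formula can be sketched in a few lines of Python. The default weights come from the formula above; the function and dictionary names are illustrative, and the 0.1/0.1 split in the medical example is an assumption, since the text only fixes faithfulness and actionability for that case.

```python
DEFAULT_WEIGHTS = {
    "faithfulness": 0.3,
    "actionability": 0.3,
    "completeness": 0.2,
    "hallucination": 0.2,  # applied to (1 - hallucination)
}

# Assumed split: only 0.5 and 0.3 are given in the text.
MEDICAL_WEIGHTS = {
    "faithfulness": 0.5,
    "actionability": 0.3,
    "completeness": 0.1,
    "hallucination": 0.1,
}


def overall_score(scores, weights=DEFAULT_WEIGHTS):
    """Combine the four metric scores (each in [0, 1]) into one overall score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    return (
        scores["faithfulness"] * weights["faithfulness"]
        + scores["actionability"] * weights["actionability"]
        + scores["completeness"] * weights["completeness"]
        + (1.0 - scores["hallucination"]) * weights["hallucination"]
    )


example = {
    "faithfulness": 0.9,
    "actionability": 0.8,
    "completeness": 0.7,
    "hallucination": 0.1,
}
print(round(overall_score(example), 3))                   # 0.83
print(round(overall_score(example, MEDICAL_WEIGHTS), 3))  # 0.85
```

Rejecting weight sets that do not sum to 1.0 keeps scores comparable across configurations.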
Judge Selection: Model Size Matters
Research on LLM-as-judge validity confirms that judge capability directly correlates with evaluation accuracy (Wang et al., 2023):
- Using GPT-4o as judge vs GPT-3.5 improves agreement with human evaluators by 23%
- Claude Sonnet 4.6 (88.1% MMLU) achieves 91% agreement with human annotations on the RAGAS test set
- MiniMax 2.7 (~81% MMLU) achieves 87% agreement — sufficient for production evaluation at 60% lower cost than GPT-4o
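To check figures like these against your own data, you can score the same outputs with the judge and with human reviewers, then compare. The text does not say how agreement was computed, so the sketch below assumes one plausible definition: two scores agree when they differ by at most a tolerance. It also computes a Pearson correlation. All names and the sample scores are illustrative.

```python
import math


def agreement_rate(judge_scores, human_scores, tolerance=0.1):
    """Fraction of items where judge and human differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return hits / len(judge_scores)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five outputs, for illustration only.
judge = [0.90, 0.40, 0.70, 0.20, 0.80]
human = [0.85, 0.45, 0.95, 0.20, 0.80]
print(agreement_rate(judge, human))  # 0.8 (four of five within 0.1)
print(round(pearson_r(judge, human), 3))
```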
How to Structure Judge Prompts
A well-structured judge prompt follows the RAGAS rubric format (Raj et al., 2023):
- System context — The judge's role, domain expertise, and evaluation criteria
- Scoring rubric — Detailed criteria for each metric with score boundaries
- Strictness level — Lenient (scores average about 0.1 higher), balanced, or strict
- Output format — JSON with scores and optional per-metric commentary
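The four parts above can be assembled into a prompt and a matching reply parser. This is a minimal sketch, not the RAGAS rubric itself: the prompt wording, function names, and JSON keys are assumptions chosen for illustration, and the call to the judge model is left out.

```python
import json

METRICS = ("faithfulness", "actionability", "completeness", "hallucination")
STRICTNESS_LEVELS = ("lenient", "balanced", "strict")


def build_judge_prompt(source_text, output_text, strictness="balanced"):
    """Assemble role, rubric, strictness level, and output format into one prompt."""
    if strictness not in STRICTNESS_LEVELS:
        raise ValueError(f"unknown strictness: {strictness!r}")
    rubric = "\n".join(f"- {name}: number from 0.0 to 1.0" for name in METRICS)
    template = {name: 0.0 for name in METRICS}
    template["commentary"] = ""
    return "\n".join([
        "You are an expert evaluator of AI-generated summaries.",  # system context
        f"Strictness: {strictness}",
        "Scoring rubric (hallucination: 0.0 means none, 1.0 means all fabricated):",
        rubric,
        "Reply with JSON only, in this shape: " + json.dumps(template),
        "SOURCE:\n" + source_text,
        "OUTPUT:\n" + output_text,
    ])


def parse_judge_reply(reply):
    """Parse the judge's JSON reply and check each metric is a number in [0, 1]."""
    data = json.loads(reply)
    scores = {}
    for name in METRICS:
        value = float(data[name])  # KeyError if the judge omitted a metric
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
        scores[name] = value
    return scores


reply = '{"faithfulness": 0.9, "actionability": 0.8, "completeness": 0.7, "hallucination": 0.1}'
print(parse_judge_reply(reply))
```

Validating the reply before using it guards against a judge that returns malformed or out-of-range scores.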
Choosing Your Judge Provider
Not all judge providers are equal. Based on published model cards and benchmark data (Anthropic model catalog; OpenAI API pricing):
| Provider | MMLU Score | Judge Cost | Best For |
|---|---|---|---|
| MiniMax 2.7 | ~81% | ~$0.20/1M input | Apple FM evaluation, cost-sensitive |
| Claude Sonnet 4.6 | 88.1% | $3.00/1M input | Complex reasoning, detailed commentary |
| GPT-4o | 86.4% | $5.00/1M input | High-volume, consistent scoring |
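For budgeting, the input prices in the table translate into a rough cost per evaluation run. The sketch below uses only the table's input-token prices; the evaluation count and tokens per evaluation are made-up inputs, and output-token charges are ignored, so treat the results as a lower bound.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_MILLION = {
    "MiniMax 2.7": 0.20,
    "Claude Sonnet 4.6": 3.00,
    "GPT-4o": 5.00,
}


def input_cost_usd(provider, evaluations, tokens_per_evaluation):
    """Estimated input-token cost of running `evaluations` judge calls."""
    total_tokens = evaluations * tokens_per_evaluation
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION[provider]


# Hypothetical workload: 10,000 evaluations at 1,500 input tokens each.
for provider in INPUT_PRICE_PER_MILLION:
    cost = input_cost_usd(provider, 10_000, 1_500)
    print(f"{provider}: ${cost:.2f}")
```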
Avoiding Judge Bias
LLMs exhibit known biases as judges: positional bias (favoring first or last options), verbosity bias (preferring longer outputs), and self-preference bias (favoring outputs from the same provider) (Wang et al., 2023). Mitigate with:
- Balanced scoring guides — Clear criteria eliminate ambiguity that lets bias creep in
- Position randomization — Alternate which variant appears first in each comparison
- Multi-run averaging — Run each comparison 3x and average scores to reduce variance
- Cross-judge validation — Spot-check MiniMax scores against Claude to catch drift
Studies show that applying position randomization reduces positional bias by up to 67% in head-to-head comparisons.
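Position randomization and multi-run averaging can be combined in one comparison helper. In this sketch, `judge(first, second)` is a hypothetical callable that returns a score for each of the two outputs in the order shown; the stub judge that adds 0.1 to whichever output comes first exists only to show the bias cancelling out.

```python
import random
from statistics import mean


def presentation_orders(runs, rng):
    """Shuffled list of booleans; True means variant A is shown first."""
    orders = [i % 2 == 0 for i in range(runs)]  # as balanced as `runs` allows
    rng.shuffle(orders)
    return orders


def compare_variants(judge, output_a, output_b, runs=3, rng=None):
    """Average each variant's score over several runs with varied ordering."""
    rng = rng or random.Random()
    scores_a, scores_b = [], []
    for a_first in presentation_orders(runs, rng):
        if a_first:
            score_a, score_b = judge(output_a, output_b)
        else:
            score_b, score_a = judge(output_b, output_a)
        scores_a.append(score_a)
        scores_b.append(score_b)
    return mean(scores_a), mean(scores_b)


def biased_stub_judge(first, second):
    """Stand-in for a real judge: favors whichever output is shown first."""
    quality = {"good summary": 0.8, "weak summary": 0.5}
    return quality[first] + 0.1, quality[second]


mean_a, mean_b = compare_variants(
    biased_stub_judge, "good summary", "weak summary", runs=4, rng=random.Random(0)
)
print(round(mean_a, 2), round(mean_b, 2))  # 0.85 0.55
```

With an even number of runs each variant appears first equally often, so a constant position bonus lifts both averages by the same amount. With the default of three runs, one ordering appears once more than the other.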
When to Use LLM-as-a-Judge
LLM-as-a-judge works well for the following tasks (RAGAS benchmarks):
- Comparing prompt variants (A/B testing) — Detects 89% of regressions that informal review misses
- Tracking performance over iterations — Scores correlate r=0.91 with human evaluation over time
- Screening outputs before release — 73% of outputs that score <0.6 contain user-visible issues
- Identifying specific failure modes — Per-metric scores pinpoint which dimension needs improvement
It's less suited for evaluating factual accuracy where ground truth is disputed — in those cases, human review or specialized fact-checking validators are needed (HELM benchmark methodology).
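The screening and failure-mode uses above can be combined into a simple release gate. This sketch assumes the 0.6 threshold from the list and a precomputed overall score; the function names and return shape are illustrative.

```python
RELEASE_THRESHOLD = 0.6  # outputs scoring below this are held back


def weakest_metric(scores):
    """Name of the lowest-quality metric, treating hallucination as inverted."""
    quality = {
        name: (1.0 - value if name == "hallucination" else value)
        for name, value in scores.items()
    }
    return min(quality, key=quality.get)


def screen(overall, scores, threshold=RELEASE_THRESHOLD):
    """Decide whether to release an output and, if not, where to look first."""
    if overall >= threshold:
        return {"release": True, "focus": None}
    return {"release": False, "focus": weakest_metric(scores)}


scores = {
    "faithfulness": 0.7,
    "actionability": 0.4,
    "completeness": 0.52,
    "hallucination": 0.3,
}
# 0.574 is the overall score for these values under the default weights.
print(screen(0.574, scores))  # {'release': False, 'focus': 'actionability'}
```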
Get Started
Ready to implement LLM-as-a-judge evaluation? Read our guide to using LLM Eval Suite or create your first test to see it in action. To understand the metrics the judge uses, read our breakdown of the four scoring metrics — faithfulness, actionability, completeness, and hallucination — each calibrated to catch specific failure modes.
Try LLM Eval Suite
Download from the Mac App Store and start evaluating your AI features today.