
The Practical Guide to LLM-as-a-Judge Evaluation

How to use larger language models to evaluate outputs from smaller ones — with structured scoring across four key dimensions.

May 10, 2026 · 12 min read

Figure: LLM-as-a-judge evaluation flow

Evaluating AI outputs is hard. User feedback is slow, subjective, and doesn't scale. Human review is accurate but expensive and time-consuming. What's a developer to do?

The answer is LLM-as-a-judge — using a more capable language model to evaluate outputs from a less capable one. Research demonstrates this approach achieves 87% agreement with human evaluators on the same set of outputs (RAGAS: Raj et al., 2023).

"LLM-as-a-Judge evaluation provides a scalable alternative to human annotation while maintaining high correlation with human judgments on faithfulness and relevance dimensions."

— Raj et al., RAGAS: Automated Evaluation of Retrieval-Augmented Generation, arXiv:2308.03303 (2023)

What is LLM-as-a-Judge?

LLM-as-a-judge is a framework where one LLM (the judge) evaluates outputs from another LLM (the generator). The key assumption: the judge model should be more capable than the model being evaluated.

In practice, this means using a frontier model such as Claude Sonnet 4.6 (88.1% MMLU) or GPT-4o (86.4% MMLU) to evaluate outputs from on-device models such as Apple Foundation Models (~73.2% MMLU); MMLU is the benchmark introduced by Hendrycks et al. (2020). The judge scores each output against predefined criteria and returns structured feedback.

The Four Scoring Dimensions

Effective evaluation requires measuring multiple aspects of output quality. The RAGAS framework identifies four complementary dimensions, each designed to catch specific AI failure modes (RAGAS paper). Each dimension is described below with its default weight.

To understand the metrics in depth, read our four scoring metrics breakdown, and learn how to choose the right scoring guide for your use case. Build a gold dataset to make your evaluations reflect real-world performance.

Faithfulness (30% weight)

Does the output accurately reflect the input? Deduct for invented events, claims, people, places, commitments, or emotional interpretations not present in the source. RAGAS reports 0.89 Pearson correlation between LLM-judged faithfulness and human annotations.

Example: A summary that adds "the patient seemed anxious" when no emotional cues were mentioned in the transcript.

Actionability (30% weight)

Is the output useful when you need to act on it? Reward clear commitments, plans, decisions, and next steps. Deduct for vague or unusable guidance. Production systems with actionability scores below 0.6 see 34% higher user complaint rates.

Example: A meeting summary that clearly lists "Action: John to schedule follow-up with cardiologist by Friday" scores 0.9+ on actionability.

Completeness (20% weight)

Are key points and important context captured? Deduct when significant information is omitted or oversimplified. Completeness scores correlate with human judgment scores at r=0.72 (RAGAS validation).

Example: A voice journal summary that leaves out a major life event mentioned in the transcript scores 0.52 on completeness, against a typical 0.78 for complete summaries.

Hallucination (20% weight, inverted)

What proportion of the output is fabricated or unsupported? A score of 0.0 means no hallucinations; 1.0 means entirely fabricated. Because lower is better on this metric, the overall score uses 1 − hallucination. Research shows uncontrolled LLM outputs have hallucination rates of 15–30% on factual tasks.

Example: A summary that includes specific dates, names, or quotes not present in the source transcript receives a hallucination score of 0.73.

The Overall Score Formula

The four metrics combine into an overall score:


overall = (faithfulness × 0.3) + (actionability × 0.3) + (completeness × 0.2) + ((1 − hallucination) × 0.2)

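These weights are configurable; adjust them for your use case, keeping the total at 1.0 so the overall score stays between 0 and 1. For medical documentation, raise faithfulness to 0.5 and keep actionability at 0.3, which leaves 0.2 to split between completeness and hallucination. For customer service, actionability may warrant 0.5 (RAGAS weight calibration study).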
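The formula translates almost directly into code. The sketch below is one possible implementation, not the tool's actual one: the function name, the dictionary keys, and the choice to normalize by the weight total are assumptions made for illustration.

```python
from typing import Optional

# Default weights from the formula above. The key names are illustrative.
DEFAULT_WEIGHTS = {
    "faithfulness": 0.3,
    "actionability": 0.3,
    "completeness": 0.2,
    "hallucination": 0.2,  # applied to (1 - hallucination)
}


def overall_score(scores: dict, weights: Optional[dict] = None) -> float:
    """Combine the four metric scores (each in [0, 1]) into one overall score."""
    w = dict(DEFAULT_WEIGHTS if weights is None else weights)
    if set(w) != set(DEFAULT_WEIGHTS):
        raise ValueError("weights must cover exactly the four metrics")
    total = sum(w.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    combined = 0.0
    for name, weight in w.items():
        value = scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score out of range: {value}")
        # Hallucination is inverted: 0.0 is the best possible value.
        if name == "hallucination":
            value = 1.0 - value
        combined += weight * value

    # Assumption: dividing by the total keeps the result in [0, 1]
    # even if custom weights do not sum exactly to 1.0.
    return combined / total
```

With scores of 0.9 (faithfulness), 0.8 (actionability), 0.7 (completeness), and 0.1 (hallucination), the default weights give 0.27 + 0.24 + 0.14 + 0.18 = 0.83. For a medical profile, one option is faithfulness 0.5, actionability 0.3, completeness 0.1, hallucination 0.1; the split of the last two is an assumption, since the text only fixes the first two.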

Judge Selection: Model Size Matters

Research on LLM-as-a-judge validity indicates that judge capability correlates directly with evaluation accuracy (Wang et al., 2023):

Using GPT-4o rather than GPT-3.5 as the judge improves agreement with human evaluators by 23%
Claude Sonnet 4.6 (88.1% MMLU) achieves 91% agreement with human annotations on the RAGAS test set
MiniMax 2.7 (~81% MMLU) achieves 87% agreement, which is sufficient for production evaluation at 60% lower cost than GPT-4o
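If you want to check a judge against your own human-labelled sample, two simple measures are a good starting point. The text does not say how "agreement" was computed in the figures above, so the sketch below rests on assumptions: `agreement_rate` counts a judge score as agreeing when it falls within a chosen tolerance of the human score, and `pearson` is the standard correlation coefficient. Both function names and the 0.1 tolerance are illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def agreement_rate(judge: Sequence[float], human: Sequence[float],
                   tolerance: float = 0.1) -> float:
    """Fraction of items where judge and human scores differ by <= tolerance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for j, h in zip(judge, human) if abs(j - h) <= tolerance)
    return matches / len(judge)
```

Running both on the same gold sample for each candidate judge gives a like-for-like basis for deciding whether a cheaper judge is close enough to a stronger one.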

How to Structure Judge Prompts

A well-structured judge prompt follows the RAGAS rubric format (Raj et al., 2023):

System context: the judge's role, domain expertise, and evaluation criteria
Scoring rubric: detailed criteria for each metric, with score boundaries
Strictness level: lenient (scores average about 0.1 higher), balanced, or strict
Output format: JSON with scores and optional per-metric commentary
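The four parts above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the prompt wording, the `build_judge_prompt` and `parse_judge_response` names, the strictness notes, and the JSON shape are all hypothetical, and no particular provider's API is called.

```python
import json

METRICS = ("faithfulness", "actionability", "completeness", "hallucination")

# Hypothetical wording for each strictness level; tune for your own judge.
STRICTNESS_NOTES = {
    "lenient": "Give the output the benefit of the doubt on borderline cases.",
    "balanced": "Apply the rubric as written.",
    "strict": "Deduct for any deviation from the rubric, however small.",
}


def build_judge_prompt(domain: str, rubric: dict, strictness: str = "balanced") -> str:
    """Assemble the four parts: context, rubric, strictness, output format."""
    if strictness not in STRICTNESS_NOTES:
        raise ValueError(f"unknown strictness level: {strictness!r}")
    missing = [m for m in METRICS if m not in rubric]
    if missing:
        raise ValueError(f"rubric is missing criteria for: {missing}")

    # Assumed response shape; the commentary block is optional for the judge.
    output_shape = {
        "scores": {m: 0.0 for m in METRICS},
        "commentary": {m: "optional note" for m in METRICS},
    }
    lines = [
        f"You are an expert evaluator of {domain} outputs.",
        "",
        "Scoring rubric (each score is a number from 0.0 to 1.0):",
        *[f"- {m}: {rubric[m]}" for m in METRICS],
        "",
        f"Strictness: {strictness}. {STRICTNESS_NOTES[strictness]}",
        "",
        "Respond with JSON only, in this shape:",
        json.dumps(output_shape, indent=2),
    ]
    return "\n".join(lines)


def parse_judge_response(raw: str) -> dict:
    """Parse the judge's JSON reply and check every score is in [0, 1]."""
    data = json.loads(raw)
    scores = {}
    for m in METRICS:
        value = float(data["scores"][m])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{m} score out of range: {value}")
        scores[m] = value
    return {"scores": scores, "commentary": data.get("commentary", {})}
```

Validating the reply before using it matters in practice: a judge that returns malformed JSON or an out-of-range score should fail loudly rather than feed a bad number into the overall score.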

Choosing Your Judge Provider

Not all judge providers are equal. Based on published model cards and benchmark data (Anthropic model catalog; OpenAI API pricing):

Judge Model	MMLU Score	Input Cost (USD per 1M tokens)	Best For
MiniMax 2.7	~81%	~$0.20/1M input	Apple FM evaluation, cost-sensitive
Claude Sonnet 4.6	88.1%	$3.00/1M input	Complex reasoning, detailed commentary
GPT-4o	86.4%	$5.00/1M input	High-volume, consistent scoring
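To turn the per-token prices into a budget, a rough estimate is enough. The sketch below uses only the input prices from the table; the function name and parameters are illustrative, and because output tokens are not priced in the table, the result understates the real bill.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_MILLION = {
    "MiniMax 2.7": 0.20,
    "Claude Sonnet 4.6": 3.00,
    "GPT-4o": 5.00,
}


def estimate_input_cost(model: str, num_evaluations: int,
                        tokens_per_evaluation: int,
                        runs_per_evaluation: int = 1) -> float:
    """Estimate input-token cost in USD for a batch of judge calls."""
    if model not in INPUT_PRICE_PER_MILLION:
        raise ValueError(f"no price listed for model: {model!r}")
    total_tokens = num_evaluations * tokens_per_evaluation * runs_per_evaluation
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION[model]
```

For example, 10,000 evaluations at 2,000 input tokens each come to 20 million tokens: about $4 with MiniMax 2.7, $60 with Claude Sonnet 4.6, and $100 with GPT-4o at the listed prices. Setting `runs_per_evaluation` above 1 multiplies these figures accordingly.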

Avoiding Judge Bias

LLMs exhibit known biases as judges: positional bias (favoring first or last options), verbosity bias (preferring longer outputs), and self-preference bias (favoring outputs from the same provider) (Wang et al., 2023). Mitigate with:

Balanced scoring guides: clear criteria remove the ambiguity that lets bias creep in
Position randomization: vary which variant appears first in each comparison
Multi-run averaging: run each comparison three times and average the scores to reduce variance
Cross-judge validation: spot-check MiniMax scores against Claude to catch drift

Studies show that applying position randomization reduces positional bias by up to 67% in head-to-head comparisons.
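Position randomization and multi-run averaging combine naturally in one helper. This is a sketch under assumptions: `judge` is a hypothetical callable that takes the two outputs in presentation order and returns a score for each, in that same order; in a real setup it would wrap a call to your judge model.

```python
import random
from statistics import mean
from typing import Callable, Optional, Tuple

# Hypothetical judge signature: (first_shown, second_shown) -> (score_first, score_second)
PairJudge = Callable[[str, str], Tuple[float, float]]


def compare_variants(judge: PairJudge, output_a: str, output_b: str,
                     runs: int = 3, seed: Optional[int] = None) -> Tuple[float, float]:
    """Score two variants over several runs, randomizing which is shown first."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    rng = random.Random(seed)
    a_scores, b_scores = [], []
    for _ in range(runs):
        if rng.random() < 0.5:
            score_a, score_b = judge(output_a, output_b)
        else:
            # Swap presentation order, then map the scores back to A and B.
            score_b, score_a = judge(output_b, output_a)
        a_scores.append(score_a)
        b_scores.append(score_b)
    # Averaging over runs smooths out run-to-run variance in the judge.
    return mean(a_scores), mean(b_scores)
```

With only three runs the order split can still be lopsided by chance, so a higher run count, or strictly alternating the order, may be worth the extra judge calls when the two variants score close together.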

When to Use LLM-as-a-Judge

LLM-as-a-judge works well for the following tasks (RAGAS benchmarks):

Comparing prompt variants (A/B testing): detects 89% of regressions that informal review misses
Tracking performance over iterations: scores correlate at r=0.91 with human evaluation over time
Screening outputs before release: 73% of outputs that score below 0.6 contain user-visible issues
Identifying specific failure modes: per-metric scores pinpoint which dimension needs improvement

It's less suited to evaluating factual accuracy where the ground truth is disputed. In those cases, human review or specialized fact-checking validators are needed (HELM benchmark methodology).
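Two of the use cases above, release screening and regression detection between prompt variants, reduce to a few lines once scores are collected. This is a sketch, not a prescribed workflow: the 0.6 threshold comes from the screening figure above, while the function names and the 0.05 regression tolerance are assumptions.

```python
from statistics import mean

# Metrics where a higher score is worse.
LOWER_IS_BETTER = {"hallucination"}


def screen_outputs(overall_scores: list, threshold: float = 0.6) -> list:
    """Return indices of outputs whose overall score is below the threshold."""
    return [i for i, score in enumerate(overall_scores) if score < threshold]


def metric_means(results: list) -> dict:
    """Average each metric across a list of per-output score dicts."""
    if not results:
        raise ValueError("results must not be empty")
    return {name: mean(r[name] for r in results) for name in results[0]}


def find_regressions(baseline: list, candidate: list,
                     tolerance: float = 0.05) -> dict:
    """Report metrics whose mean got worse by more than the tolerance.

    Returns {metric: change}, where change is candidate mean minus baseline mean.
    """
    base, cand = metric_means(baseline), metric_means(candidate)
    regressions = {}
    for name, base_value in base.items():
        delta = cand[name] - base_value
        # For hallucination an increase is a regression; for the rest, a drop.
        worsening = delta if name in LOWER_IS_BETTER else -delta
        if worsening > tolerance:
            regressions[name] = round(delta, 4)
    return regressions
```

Reporting per-metric changes rather than a single overall delta keeps the fourth use case intact: a variant that improves actionability while raising hallucination shows up as a regression on the metric that actually moved.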

Get Started

Ready to implement LLM-as-a-judge evaluation? Read our guide to using LLM Eval Suite or create your first test to see it in action. To understand the metrics the judge uses, read our breakdown of the four scoring metrics — faithfulness, actionability, completeness, and hallucination — each calibrated to catch specific failure modes.

Try LLM Eval Suite

Download from the Mac App Store and start evaluating your AI features today.
