
How to Choose the Right Scoring Guide for Your AI Use Case

A scoring guide shapes how strictly your judge evaluates outputs. Learn which guide fits your domain and how strictness levels affect your scores.

May 10, 2026 · 8 min read

[Figure: Scoring guide selection interface in LLM Eval Suite]

A scoring guide is the rubric that tells your judge how to evaluate outputs. Beyond the four metrics themselves, it supplies the specific criteria, examples, and boundaries that make those metrics meaningful for your domain.

Choose the wrong guide, and your scores won't reflect real-world performance. Choose the right one, and you'll get actionable insights that translate directly to better user experiences.

Built-in Scoring Guides

LLM Eval Suite ships with three scoring guides designed for common use cases:

voice-transcript-standard

Optimized for voice journaling, voice memos, and transcript summarization tasks.

Use when:

• Summarizing voice recordings
• Transcribing and condensing spoken content
• Processing voice notes or journal entries

Key criteria:

• Emphasizes factual grounding in the transcript
• Penalizes invented emotional interpretations
• Rewards capturing speaker intent and key events

meeting-notes-standard

Optimized for meeting notes, action item extraction, and multi-party conversations.

Use when:

• Extracting action items from meetings
• Summarizing multi-speaker conversations
• Creating follow-up notes from calls

Key criteria:

• Emphasizes actionability and next steps
• Rewards correct attribution to speakers
• Penalizes missing decisions or commitments

minimal

Generic fallback guide for general-purpose text generation.

Use when:

• No domain-specific guide fits your use case
• You're doing exploratory evaluation
• General text quality matters most

Key criteria:

• Basic relevance and coherence
• Minimal domain-specific rules
• Broad applicability

Strictness Levels

Beyond the guide itself, strictness levels adjust how boundaries are applied (Raj et al., RAGAS, 2023):

"Strictness levels should be calibrated against human disagreement rates to avoid over-penalization — applying strict mode to outputs that human evaluators rate as acceptable introduces systematic bias."

— Raj et al., RAGAS: Automated Evaluation of Retrieval-Augmented Generation (2023)

Lenient

Scores tend higher. Use when your baseline is underperforming and you want to see gradual improvement.

Balanced

Default setting. Good for most evaluation scenarios.

Strict

Scores tend lower. Use when you need high accuracy and can't tolerate false positives.

Lenient and strict scoring reportedly differ by 15–25% on identical outputs, so scores are only comparable across runs when every run uses the same strictness level.
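If you track runs outside the app, a small guard can catch accidental strictness mismatches before you compare scores. This is a minimal sketch, not LLM Eval Suite's API: the `RunRecord` type, its fields, and both helper functions are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one evaluation run (not an LLM Eval Suite type)."""
    name: str
    guide: str
    strictness: str  # assumed values: "lenient", "balanced", "strict"
    mean_score: float


def comparable(runs):
    """Return True when every run used the same guide and strictness level."""
    settings = {(run.guide, run.strictness) for run in runs}
    return len(settings) <= 1


def relative_gap(score_a, score_b):
    """Difference between two scores as a fraction of the larger score."""
    larger = max(score_a, score_b)
    if larger == 0:
        return 0.0
    return abs(score_a - score_b) / larger
```

Calling `comparable(runs)` before building a trend chart is usually enough. `relative_gap` is there for when you deliberately score the same outputs at two strictness levels and want to see how far apart they land.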

Matching Guide to Use Case

Here's a quick reference for matching guides to common AI features:

AI Feature	Recommended Guide	Strictness
Voice journal summarization	voice-transcript-standard	Balanced
Voice memos to notes	voice-transcript-standard	Balanced
Meeting transcription	meeting-notes-standard	Strict
Call center summarization	meeting-notes-standard	Balanced
Medical visit notes	voice-transcript-standard	Strict
General text generation	minimal	Balanced
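
The table above can also be kept as a lookup in your own tooling. The sketch below assumes you want unknown features to fall back to the minimal guide at Balanced strictness, which matches how the article describes both defaults. The `RECOMMENDED` dictionary and `recommend` function are illustrative names, not part of LLM Eval Suite.

```python
# Recommendations copied from the quick-reference table.
RECOMMENDED = {
    "voice journal summarization": ("voice-transcript-standard", "balanced"),
    "voice memos to notes": ("voice-transcript-standard", "balanced"),
    "meeting transcription": ("meeting-notes-standard", "strict"),
    "call center summarization": ("meeting-notes-standard", "balanced"),
    "medical visit notes": ("voice-transcript-standard", "strict"),
    "general text generation": ("minimal", "balanced"),
}

# Assumption: anything not in the table gets the generic guide and default strictness.
FALLBACK = ("minimal", "balanced")


def recommend(feature: str) -> tuple[str, str]:
    """Return (guide, strictness) for a feature name, ignoring case and padding."""
    return RECOMMENDED.get(feature.strip().lower(), FALLBACK)
```

Treat the fallback as a starting point for exploratory runs rather than a final choice.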

Custom Scoring Guides

For specialized domains like medical documentation, legal notes, or financial summaries, you may need a custom scoring guide. A custom guide lets you:

• Define domain-specific criteria and acceptable terminology
• Add examples that reflect real edge cases in your field
• Adjust weights based on what's most important for your users
• Include safety checks relevant to your compliance requirements

When creating custom guides, include 5-10 example outputs with expected scores to calibrate the judge.
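
Before you enter a custom guide into the app, it can help to draft it as data and check that nothing is missing. The structure below is a hypothetical sketch: the class names, field names, and the rule that weights sum to 1 are assumptions for illustration, not LLM Eval Suite's actual guide format.

```python
from dataclasses import dataclass, field


@dataclass
class CalibrationExample:
    """One example output with the score you expect the judge to give it."""
    output: str
    expected_score: float
    note: str = ""


@dataclass
class CustomGuide:
    """Hypothetical draft of a custom scoring guide."""
    name: str
    criteria: list[str]
    weights: dict[str, float]  # metric name -> weight (assumed to sum to 1)
    safety_checks: list[str] = field(default_factory=list)
    examples: list[CalibrationExample] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List what is missing or inconsistent; an empty list means the draft looks complete."""
        issues = []
        if not self.criteria:
            issues.append("no criteria defined")
        if not 5 <= len(self.examples) <= 10:
            issues.append(
                f"expected 5-10 calibration examples, found {len(self.examples)}"
            )
        if self.weights and abs(sum(self.weights.values()) - 1.0) > 1e-6:
            issues.append("weights do not sum to 1")
        return issues
```

A draft with only three calibration examples, for instance, would come back from `problems()` with a single entry naming the shortfall.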

Testing Your Guide

Before running full evaluations, test your scoring guide on a small sample:

1. Run the evaluation on 5–10 samples
2. Review the judge's commentary for each score
3. Check whether the scores match your expectations
4. Adjust the guide criteria if the judge is misaligned
5. Repeat until the scores are consistent and meaningful
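
One way to make the check against your expectations concrete is to write down the score you expect for each sample and compare it with what the judge returned. The sketch below assumes you copy both sets of scores into plain lists yourself. The `alignment_report` function and its `tolerance` parameter are hypothetical, and the tolerance is in whatever units your scores use.

```python
from statistics import mean


def alignment_report(expected, judged, tolerance=0.5):
    """Compare expected scores with judge scores for the same samples.

    Returns the mean absolute gap, the indexes of samples whose gap exceeds
    the tolerance, and whether the judge counts as aligned on this sample set.
    """
    if len(expected) != len(judged):
        raise ValueError("expected and judged must have the same length")
    if not expected:
        raise ValueError("at least one sample is required")

    gaps = [abs(e - j) for e, j in zip(expected, judged)]
    flagged = [index for index, gap in enumerate(gaps) if gap > tolerance]
    return {
        "mean_abs_gap": mean(gaps),
        "flagged": flagged,
        "aligned": not flagged,
    }
```

The flagged indexes tell you which judge commentaries to reread first before you adjust the guide criteria.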

Get Started

Ready to evaluate with the right scoring guide? Create your first test using LLM Eval Suite and select the guide that matches your domain.

Try LLM Eval Suite

Download from the Mac App Store and start evaluating with domain-specific scoring guides.
