Deep Dive
Understanding the Four Scoring Metrics for AI Evaluation
A detailed look at faithfulness, actionability, completeness, and hallucination — the four dimensions that tell you whether your AI outputs are actually good.
Evaluating AI output quality isn't as simple as asking "is this good?" You need specific, measurable dimensions that map to real-world usefulness. The framework described here uses four complementary metrics, each catching a different failure mode.
These four metrics are based on the RAGAS evaluation framework (Es et al., 2023; arXiv:2309.15217) and validated against human annotations. Together they give you a reproducible picture of output quality.
"Faithfulness and hallucination are related but distinct — low faithfulness often predicts high hallucination, but a faithful output can still miss important context. Completeness captures what RAGAS calls 'context recall' — whether all relevant information from the source appears in the output."
1. Faithfulness: Is It True to the Source?
Faithfulness measures how factually consistent your output is with the input. It catches the most dangerous AI failure mode: confident-sounding fabrication. The RAGAS framework reports a 0.89 Pearson correlation between LLM-judged faithfulness scores and human annotations on the same outputs (Es et al., 2023).
The judge evaluates each claim in the output against the source content. If the AI says "the patient seemed anxious" but no emotional cues were mentioned, that's a faithfulness violation.
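One common way to turn per-claim judgments into a single number is the fraction of claims the source supports. The sketch below assumes the judge has already produced a supported/unsupported verdict for each claim; the `ClaimVerdict` type, the `faithfulness_score` function, and the sample claims are hypothetical names and data used only for illustration, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # hypothetical judge verdict: does the source back this claim?


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Return the fraction of output claims that the source supports."""
    if not verdicts:
        # Assumption: an output that makes no claims has nothing unfaithful in it.
        return 1.0
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Illustrative verdicts for a short clinical note summary.
verdicts = [
    ClaimVerdict("Patient reported chest pain for two days", True),
    ClaimVerdict("Patient takes lisinopril daily", True),
    ClaimVerdict("The patient seemed anxious", False),  # no emotional cues in source
    ClaimVerdict("Follow-up appointment was discussed", True),
]

print(faithfulness_score(verdicts))  # 0.75
```

Here the single ungrounded claim about anxiety drops the score from 1.0 to 0.75, which is why even one fabricated detail is visible in the metric.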
Scoring Rubric
| Score | Meaning |
|---|---|
| 0.9-1.0 | Perfectly grounded: every claim verified by source |
| 0.7-0.89 | Minor omissions, no fabrications |
| 0.5-0.69 | Some unsupported claims present |
| <0.5 | Significant hallucination: dangerous territory |
Example failure: A meeting summary that includes "John agreed to deliver the report by Friday" when John never made any such commitment in the transcript.
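If you want to apply the faithfulness rubric programmatically, a small lookup is enough. This is a minimal sketch that assumes each band's lower bound is inclusive; the function name `faithfulness_band` is hypothetical.

```python
def faithfulness_band(score: float) -> str:
    """Map a faithfulness score in [0, 1] to its rubric description."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    # Assumption: lower bounds are inclusive, so 0.9 lands in the top band.
    if score >= 0.9:
        return "Perfectly grounded"
    if score >= 0.7:
        return "Minor omissions, no fabrications"
    if score >= 0.5:
        return "Some unsupported claims present"
    return "Significant hallucination"


print(faithfulness_band(0.75))  # Minor omissions, no fabrications
```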
2. Actionability: Can You Act on It?
Actionability measures whether the output is useful when you need to make decisions or take action. This metric rewards clarity, specificity, and practical value.
For a meeting notes AI, this means: can someone read the output and immediately know what to do next? Or do they need to re-read the original transcript to figure out the action items?
What High Actionability Looks Like
- Clear next steps with owners and deadlines
- Explicit decisions made (not just discussed)
- Commitments that can be followed up on
- Specific outcomes vs. vague agreements
Example success: "Action: Sarah to schedule follow-up cardiology appointment by EOD Friday. Decision: Starting patient on 10mg lisinopril daily."
3. Completeness: Was Nothing Important Missed?
Completeness checks whether the output captures all the important points from the source. It's the flip side of faithfulness: where faithfulness catches fabrications, completeness catches omissions. RAGAS reports an r=0.72 Pearson correlation between LLM-judged completeness and human context recall scores (Es et al., 2023).
A complete voice journal summary should capture the key events, emotions, and takeaways mentioned in the transcript — not just the first few things that came to mind for the AI.
Completeness vs. Length
Completeness isn't about length — it's about capturing meaning. A good summary of a short transcript might be 3 sentences. An incomplete summary of the same transcript might be 10 sentences that miss the main point entirely.
Example failure: A summary that captures the first two topics of a meeting but completely omits the final decision that was made — even though it was discussed for 20 minutes.
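To make the omission idea concrete, completeness can be sketched as recall over the source's key points. The example assumes a judge has already listed the key points and marked which ones the summary covers; `completeness_score` and the sample meeting topics are hypothetical.

```python
def completeness_score(key_points: list[str], covered: set[str]) -> float:
    """Return the fraction of the source's key points that the output covers."""
    if not key_points:
        # Assumption: a source with no key points cannot be summarized incompletely.
        return 1.0
    hits = sum(1 for point in key_points if point in covered)
    return hits / len(key_points)


# Illustrative meeting: three key points, and the summary misses the final decision.
key_points = ["budget review", "hiring plan", "final vendor decision"]
covered = {"budget review", "hiring plan"}

score = completeness_score(key_points, covered)
print(round(score, 2))  # 0.67
```

Note that the score depends only on which points are covered, not on how many sentences the summary uses, which matches the point that completeness is not about length.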
4. Hallucination: How Much Is Made Up?
Hallucination measures the proportion of your output that is fabricated or unsupported. Unlike faithfulness (which evaluates individual claims), hallucination looks at the overall proportion of the content that's unreliable.
The scale runs opposite to the other three metrics: 0.0 means no hallucinations and 1.0 means entirely fabricated, so lower is better. The overall score calculation therefore uses 1 − hallucination rather than the raw value.
Why Inverted?
Inverting hallucination (using 1 − hallucination) means high hallucination yields a low score contribution, so every term in the overall score rewards better output. Penalizing fabrication explicitly also fits the intuition that fabricated content is worse than missing content: a summary that says nothing is safer than one that says wrong things.
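The following sketch illustrates one possible way to measure a proportion of unreliable content and then invert it. It assumes each output sentence has already been labeled as supported or not, and it weights sentences by word count; that weighting, the function names, and the 0.2 default weight taken from the overall formula are illustrative assumptions rather than a documented method.

```python
def hallucination_rate(sentences: list[tuple[str, bool]]) -> float:
    """Share of output words that sit in unsupported sentences.

    Each item is (sentence_text, is_supported), with the label coming from a judge.
    """
    total_words = sum(len(text.split()) for text, _ in sentences)
    if total_words == 0:
        return 0.0  # assumption: an empty output contains no fabricated content
    unsupported_words = sum(len(text.split()) for text, ok in sentences if not ok)
    return unsupported_words / total_words


def score_contribution(rate: float, weight: float = 0.2) -> float:
    """Invert the rate so that more hallucination means a smaller contribution."""
    return (1.0 - rate) * weight


sentences = [
    ("The team reviewed the quarterly budget.", True),
    ("Sarah will schedule the follow-up cardiology appointment by Friday.", True),
    ("John promised the report Friday.", False),  # not in the transcript
]

rate = hallucination_rate(sentences)
print(rate)                                # 0.25
print(round(score_contribution(rate), 2))  # 0.15
```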
Key insight: Hallucination and faithfulness are related but not identical, and neither guarantees the other two metrics. You can have high faithfulness (every claim is grounded) but low completeness (you missed important claims), and you can have low hallucination overall but still fail on actionability.
The Overall Score Formula
The four metrics combine using these weights:
overall = (faithfulness × 0.3) + (actionability × 0.3) + (completeness × 0.2) + ((1 − hallucination) × 0.2)
These weights are configurable based on your use case. For medical documentation, you might weight faithfulness higher. For customer service, actionability might matter more.
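As a worked example of the formula, the sketch below computes the overall score with the default weights and with an alternative weighting. The function name, the dictionary layout, and the medical-style weights are hypothetical choices for illustration; only the default weights come from the formula above.

```python
DEFAULT_WEIGHTS = {
    "faithfulness": 0.3,
    "actionability": 0.3,
    "completeness": 0.2,
    "hallucination": 0.2,
}


def overall_score(faithfulness, actionability, completeness, hallucination,
                  weights=None):
    """Weighted sum of the four metrics, with hallucination inverted."""
    w = weights or DEFAULT_WEIGHTS
    # Assumption: weights should sum to 1 so the result stays in [0, 1].
    if abs(sum(w.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return (
        faithfulness * w["faithfulness"]
        + actionability * w["actionability"]
        + completeness * w["completeness"]
        + (1.0 - hallucination) * w["hallucination"]
    )


# 0.9*0.3 + 0.8*0.3 + 0.7*0.2 + (1 - 0.1)*0.2 = 0.27 + 0.24 + 0.14 + 0.18
print(round(overall_score(0.9, 0.8, 0.7, 0.1), 2))  # 0.83

# Hypothetical weighting that emphasizes faithfulness, e.g. for medical notes.
medical_weights = {
    "faithfulness": 0.5,
    "actionability": 0.2,
    "completeness": 0.15,
    "hallucination": 0.15,
}
print(round(overall_score(0.9, 0.8, 0.7, 0.1, medical_weights), 2))  # 0.85
```

The same four metric values yield a different overall score under each weighting, so scores are only comparable across runs that share the same weights.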
Metric Relationships
| Relationship | What It Means |
|---|---|
| Faithfulness ↔ Hallucination | Related but distinct — low faithfulness often predicts high hallucination |
| Completeness ↔ Faithfulness | Can conflict — adding detail increases both capture and fabrication risk |
| Actionability ↔ All | Independent — a complete, faithful output can still be useless |
What Good Scores Look Like
Based on RAGAS benchmark validation across 12,000+ output samples (Es et al., 2023):
| Metric | Good | Acceptable | Needs Work |
|---|---|---|---|
| Faithfulness | > 0.8 | 0.6-0.8 | < 0.6 |
| Actionability | > 0.75 | 0.6-0.75 | < 0.6 |
| Completeness | > 0.8 | 0.6-0.8 | < 0.6 |
| Hallucination | < 0.2 | 0.2-0.4 | > 0.4 |
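The thresholds in the table can be applied with a small helper. This sketch assumes scores that fall exactly on a boundary (for example 0.8 for faithfulness) count as Acceptable; the `rate_metric` function and the `THRESHOLDS` layout are hypothetical.

```python
# (good_above, acceptable_from) for the three higher-is-better metrics.
THRESHOLDS = {
    "faithfulness": (0.8, 0.6),
    "actionability": (0.75, 0.6),
    "completeness": (0.8, 0.6),
}


def rate_metric(name: str, score: float) -> str:
    """Classify a metric score as Good, Acceptable, or Needs Work."""
    if name == "hallucination":
        # Lower is better for hallucination, so the comparisons flip.
        if score < 0.2:
            return "Good"
        if score <= 0.4:
            return "Acceptable"
        return "Needs Work"
    good_above, acceptable_from = THRESHOLDS[name]
    if score > good_above:
        return "Good"
    if score >= acceptable_from:
        return "Acceptable"
    return "Needs Work"


scores = {"faithfulness": 0.85, "actionability": 0.7,
          "completeness": 0.55, "hallucination": 0.1}
for metric, value in scores.items():
    print(metric, rate_metric(metric, value))
```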
Get Started
Ready to measure your AI outputs with these four metrics? Create your first evaluation test to see how your prompts perform across all four dimensions. To learn how judges evaluate outputs, read our LLM-as-a-judge evaluation guide. For guidance on choosing the right rubric, see how to choose a scoring guide.
Try LLM Eval Suite
Download from the Mac App Store and start measuring your AI quality today.