Guide
Everything you need to know to run structured evaluations on Apple Intelligence Foundation Models — from setup to interpreting results.
LLM Eval Suite evaluates Apple Intelligence Foundation Models, which require Apple Intelligence to be available on your device.
LLM Eval Suite is a macOS app that runs structured evaluations on Apple Intelligence Foundation Models. On first launch, the app seeds two example tests — voice journaling and meeting notes — so you can see how evaluation works immediately.
The core workflow involves three concepts:
Configure your API keys in the app's Settings view. You'll need a key from at least one supported provider to use cloud-based judge services.
When you run an evaluation, LLM Eval Suite performs these steps for each sample in your dataset:
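The sample's input text replaces the {input} placeholder in the prompt template.
The judge then scores the output on four metrics:
Faithfulness: Does the output stay true to the source input? Deduct for invented events, claims, or interpretations.
Actionability: Is it useful when you need to act later? Reward clear next steps and commitments.
Completeness: Are the key points and context captured?
Hallucination: How much is fabricated or unsupported? 0.0 = nothing, 1.0 = everything.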
Overall Score Formula:
(faithfulness × 0.3) + (actionability × 0.3) + (completeness × 0.2) + ((1 − hallucination) × 0.2)
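As a worked example of the formula above, here is a minimal Python sketch. The function name `overall_score` and the range check are illustrative assumptions, not the app's implementation. The guide states a 0.0 to 1.0 range only for hallucination; the sketch assumes the other three metrics use the same scale.

```python
def overall_score(faithfulness, actionability, completeness, hallucination):
    """Combine the four metric scores using the weights from the formula above."""
    scores = {
        "faithfulness": faithfulness,
        "actionability": actionability,
        "completeness": completeness,
        "hallucination": hallucination,
    }
    # Assumption: every metric is reported on a 0.0 to 1.0 scale.
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    # Hallucination is inverted so that less fabrication raises the score.
    return (
        faithfulness * 0.3
        + actionability * 0.3
        + completeness * 0.2
        + (1.0 - hallucination) * 0.2
    )


# 0.27 + 0.24 + 0.14 + 0.18
print(round(overall_score(0.9, 0.8, 0.7, 0.1), 2))  # 0.83
```

Because hallucination enters as (1 − hallucination), an output with perfect faithfulness, actionability, and completeness still loses up to 0.2 if the judge finds it fully fabricated.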
The judge evaluates your outputs. Choose one of three cloud providers as the judge. On-device Apple Foundation Models can generate outputs but cannot act as the judge.
MiniMax
API: https://api.minimax.io/anthropic
Default model: MiniMax-M2.7
Anthropic
API: https://api.anthropic.com
Default model: claude-sonnet-4-20250514
OpenAI
API: https://api.openai.com/v1
Default model: gpt-5-mini
Apple Foundation Models
On-device: SystemLanguageModel.default
Requires macOS 26.0+. No API key needed, but only available for output generation, not judging.
Not all features are available with all providers. This table shows which features require an API key.
| Provider | Judging | Generation | API Key Required |
|---|---|---|---|
| MiniMax | Yes | Yes | Yes |
| Anthropic | Yes | Yes | Yes |
| OpenAI | Yes | Yes | Yes |
| Apple Foundation Models | No | Yes (on-device) | No |
Important: If you want to evaluate using Apple Foundation Models for output generation and MiniMax as the judge, you only need a MiniMax API key. Configure your keys in the app's Settings view.
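The table above can be read as a lookup for deciding which API keys a given run needs. The following Python sketch encodes the table rows as data; the names `PROVIDERS` and `required_api_keys` are hypothetical and only illustrate the rule, they are not part of the app.

```python
# Capability rows copied from the provider table above.
PROVIDERS = {
    "MiniMax": {"judging": True, "generation": True, "api_key_required": True},
    "Anthropic": {"judging": True, "generation": True, "api_key_required": True},
    "OpenAI": {"judging": True, "generation": True, "api_key_required": True},
    "Apple Foundation Models": {
        "judging": False,
        "generation": True,
        "api_key_required": False,
    },
}


def required_api_keys(generator, judge):
    """Return the sorted provider names whose API keys a run would need."""
    if not PROVIDERS[generator]["generation"]:
        raise ValueError(f"{generator} cannot be used for generation")
    if not PROVIDERS[judge]["judging"]:
        raise ValueError(f"{judge} cannot be used as a judge")
    # A set avoids listing the same provider twice when it fills both roles.
    needed = {
        name for name in (generator, judge) if PROVIDERS[name]["api_key_required"]
    }
    return sorted(needed)


print(required_api_keys("Apple Foundation Models", "MiniMax"))  # ['MiniMax']
```

Passing "Apple Foundation Models" as the judge raises a `ValueError`, mirroring the "No" in the Judging column.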
For a deeper look at how LLM-as-a-judge evaluation works, read our LLM-as-a-judge framework guide.
Tests are designed to be repeatable. Once a test is created, you can run it multiple times to track performance across iterations.
See it in action: Read our step-by-step tutorial on creating your first test, based on how we improved AI Doctor Notes with systematic evaluation.
Prompt variants define how inputs are transformed into outputs. Each variant contains:
A template with an {input} placeholder that gets replaced with your sample's input text
Example template:
Summarize this voice journal entry in 2-3 sentences:
{input}
Keep the summary factual and capture the key emotions.

Datasets are collections of samples used for evaluation. Each sample (GoldSample) contains:
Import datasets by loading a JSON file. The app includes two bundled datasets:
Scoring guides define the rubric the judge uses to evaluate outputs. They contain domain-specific instructions that shape how strictly each metric is scored.
For voice journaling and transcript summarization tasks
For meeting notes and action item extraction
Generic fallback scoring guide
Strictness levels (lenient, balanced, strict) further adjust the judge's scoring boundaries. Pick a level based on how much latitude you want in scores, and choose the scoring guide that matches your use case.
When building your evaluation dataset, follow our guide to building a gold dataset with representative samples and proper metadata.
The four scoring dimensions — faithfulness, actionability, completeness, and hallucination — are designed to catch the most common AI output failure modes.
After an evaluation completes, results are stored and aggregated in several ways:
Evaluation runs are queued. If you start a new run while one is active, it goes into the queue and runs automatically when the current one finishes. Interrupted runs are recovered on app restart.
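The queueing behavior described above can be pictured with a small Python sketch. It assumes first-in, first-out ordering, which the guide does not state explicitly, and it does not model recovery of interrupted runs after a restart. The class `RunQueue` and its methods are hypothetical names for illustration, not the app's internals.

```python
from collections import deque


class RunQueue:
    """Toy model: one active run at a time, later runs wait in order."""

    def __init__(self):
        self.active = None
        self.pending = deque()

    def submit(self, run_name):
        """Start the run immediately if idle; otherwise queue it."""
        if self.active is None:
            self.active = run_name
        else:
            self.pending.append(run_name)

    def finish_active(self):
        """Mark the active run finished and start the next queued run, if any."""
        finished = self.active
        # Assumption: queued runs start in the order they were submitted.
        self.active = self.pending.popleft() if self.pending else None
        return finished


queue = RunQueue()
queue.submit("voice-journal-run")
queue.submit("meeting-notes-run")  # waits, because a run is already active
queue.finish_active()
print(queue.active)  # meeting-notes-run
```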
Download LLM Eval Suite from the Mac App Store and start running structured evaluations today.