How to use LLM Eval Suite

Everything you need to know to run structured evaluations on Apple Intelligence Foundation Models — from setup to interpreting results.

Requirements

LLM Eval Suite evaluates Apple Intelligence Foundation Models, which require Apple Intelligence to be available on your device.

macOS Requirements

  • macOS 26.0 or later — Required for Apple Foundation Models access
  • Apple Intelligence enabled — Turn it on in System Settings > Apple Intelligence
  • Mac with Apple Silicon — M-series chip required for on-device Foundation Models

Getting Started

LLM Eval Suite is a macOS app that runs structured evaluations on Apple Intelligence Foundation Models. On first launch, the app seeds two example tests — voice journaling and meeting notes — so you can see how evaluation works immediately.

The core workflow involves three concepts:

  • Datasets — Collections of input samples (transcripts) with expected outputs or entities to check
  • Prompt Variants — Different prompt templates and generation configurations you want to compare
  • Tests — Combine a dataset + variants + judge configuration for repeatable evaluation runs

Configure your API keys in the app's Settings view. Judging always runs on a cloud provider, so you need a key from at least one supported provider before you can run an evaluation.

How Evaluation Works

When you run an evaluation, LLM Eval Suite performs these steps for each sample in your dataset:

  1. Merge the prompt template with the sample's input text using the {input} placeholder
  2. Send the rendered prompt to your chosen generation provider
  3. Send both the input and generated output to the judge provider
  4. Parse the judge's JSON response for four metric scores
  5. Calculate the overall score and store results
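The first four steps can be sketched in a few lines of Python. This is an illustration only, not the app's code: `evaluate_sample`, `fake_generate`, and `fake_judge` are hypothetical names, the two stub functions stand in for real provider calls, and the JSON key names are assumed to match the metric names used in this guide.

```python
import json
from typing import Callable, Dict

# Assumed JSON keys in the judge's response (hypothetical, named after the metrics).
METRICS = ("faithfulness", "actionability", "completeness", "hallucination")

def evaluate_sample(template: str, input_text: str,
                    generate: Callable[[str], str],
                    judge: Callable[[str, str], str]) -> Dict[str, object]:
    prompt = template.replace("{input}", input_text)   # step 1: merge template and input
    output = generate(prompt)                          # step 2: generation provider
    raw_verdict = judge(input_text, output)            # step 3: judge sees input and output
    verdict = json.loads(raw_verdict)                  # step 4: parse the judge's JSON
    scores = {name: float(verdict[name]) for name in METRICS}
    return {"prompt": prompt, "output": output, "scores": scores}

# Stub providers (hypothetical) so the sketch runs without any API key.
def fake_generate(prompt: str) -> str:
    return "SUMMARY OF: " + prompt

def fake_judge(input_text: str, output: str) -> str:
    return json.dumps({"faithfulness": 0.9, "actionability": 0.8,
                       "completeness": 0.7, "hallucination": 0.1})

result = evaluate_sample("Summarize:\n{input}", "Met with Sam about the launch.",
                         fake_generate, fake_judge)
print(result["scores"])
```

Step 5 then applies the weighted formula from The Four Scoring Metrics to these four scores and stores the result.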

The Four Scoring Metrics

Faithfulness (30% weight)

Does the output stay true to the source input? Deduct for invented events, claims, or interpretations.

Actionability (30% weight)

Is it useful when you need to act later? Reward clear next steps and commitments.

Completeness (20% weight)

Are the key points and context captured?

Hallucination (20% weight, inverted)

How much of the output is fabricated or unsupported? 0.0 = nothing, 1.0 = everything. Because lower is better here, the overall score uses (1 − hallucination).

Overall Score Formula:
(faithfulness × 0.3) + (actionability × 0.3) + (completeness × 0.2) + ((1 − hallucination) × 0.2)
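As a worked example, here is the formula as a small Python function. The function name is hypothetical, and the sketch assumes all four metrics are on a 0.0 to 1.0 scale (the guide states this explicitly only for hallucination).

```python
def overall_score(faithfulness: float, actionability: float,
                  completeness: float, hallucination: float) -> float:
    # Weights from the guide; hallucination is inverted so that lower is better.
    return (faithfulness * 0.3
            + actionability * 0.3
            + completeness * 0.2
            + (1 - hallucination) * 0.2)

# 0.27 + 0.24 + 0.14 + 0.18
example = overall_score(faithfulness=0.9, actionability=0.8,
                        completeness=0.7, hallucination=0.1)
print(round(example, 2))  # 0.83

# Same output quality, but half of the content is unsupported:
# the last term drops from 0.18 to 0.10.
penalized = overall_score(faithfulness=0.9, actionability=0.8,
                          completeness=0.7, hallucination=0.5)
print(round(penalized, 2))  # 0.75
```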

Judge Providers

The judge evaluates your outputs. Three cloud providers can act as the judge. On-device Apple Foundation Models can generate outputs but cannot judge them.

MiniMax (Recommended)

API: https://api.minimax.io/anthropic

Default model: MiniMax-M2.7

Anthropic (Claude)

API: https://api.anthropic.com

Default model: claude-sonnet-4-20250514

OpenAI (GPT)

API: https://api.openai.com/v1

Default model: gpt-5-mini

Apple Foundation Models (Generation only)

On-device SystemLanguageModel.default

Requires macOS 26.0+. No API key needed, but only available for output generation, not judging.

Cloud Provider Requirements

Not every provider supports every role. This table shows which providers can judge, which can generate, and which need an API key.

| Provider | Judging | Generation | API Key Required |
|---|---|---|---|
| MiniMax | Yes | Yes | Yes |
| Anthropic | Yes | Yes | Yes |
| OpenAI | Yes | Yes | Yes |
| Apple Foundation Models | No | Yes (on-device) | No |

Important: If you want to evaluate using Apple Foundation Models for output generation and MiniMax as the judge, you only need a MiniMax API key. Configure your keys in the app's Settings view.
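The same rule can be expressed as a lookup over the provider table. The following Python sketch is only an illustration: `PROVIDERS` and `required_keys` are hypothetical names, and the data simply mirrors the table above.

```python
# Capabilities copied from the provider table in this guide.
PROVIDERS = {
    "MiniMax": {"judging": True, "generation": True, "api_key": True},
    "Anthropic": {"judging": True, "generation": True, "api_key": True},
    "OpenAI": {"judging": True, "generation": True, "api_key": True},
    "Apple Foundation Models": {"judging": False, "generation": True, "api_key": False},
}

def required_keys(generation: str, judge: str) -> set:
    """Return the providers whose API keys a generation/judge pairing needs."""
    if not PROVIDERS[generation]["generation"]:
        raise ValueError(f"{generation} cannot generate outputs")
    if not PROVIDERS[judge]["judging"]:
        raise ValueError(f"{judge} cannot act as the judge")
    return {name for name in (generation, judge) if PROVIDERS[name]["api_key"]}

print(required_keys("Apple Foundation Models", "MiniMax"))  # {'MiniMax'}
```

Choosing Apple Foundation Models as the judge raises an error here, which matches the "generation only" restriction.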

For a deeper look at how LLM-as-a-judge evaluation works, read our LLM-as-a-judge framework guide.

Creating Your First Test

  1. Select a dataset — Choose from bundled datasets (voice journaling, meeting notes) or import your own JSON dataset.
  2. Choose or create prompt variants — Each variant has a prompt template and generation configuration (provider, model, temperature, max tokens).
  3. Configure the judge — Select your judge provider and scoring guide. The scoring guide determines how strictly the judge evaluates outputs.
  4. Run the evaluation — The app processes each sample, shows real-time progress, and stores results when complete.

Tests are designed to be repeatable. Once created, you can run the same test multiple times to track performance over iterations.

See it in action: Read our step-by-step tutorial on creating your first test, based on how we improved AI Doctor Notes with systematic evaluation.

Prompt Variants & Templates

Prompt variants define how inputs are transformed into outputs. Each variant contains:

  • Prompt template — A string with an {input} placeholder that gets replaced with your sample's input text
  • Generation provider — Which LLM generates outputs (MiniMax, Anthropic, OpenAI, or Apple Foundation Models)
  • Generation config — Temperature, max tokens, sampling mode (greedy, random top-k, or probability threshold)

Example template

Summarize this voice journal entry in 2-3 sentences:
{input}
Keep the summary factual and capture the key emotions.
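To show what the substitution does, here is a minimal Python sketch that renders the example template. `render_prompt` is a hypothetical helper, not the app's API, and the check for a missing placeholder is an added assumption rather than documented behavior.

```python
PLACEHOLDER = "{input}"

def render_prompt(template: str, input_text: str) -> str:
    # Assumption: a template without the placeholder is treated as an error.
    if PLACEHOLDER not in template:
        raise ValueError("template is missing the {input} placeholder")
    return template.replace(PLACEHOLDER, input_text)

template = (
    "Summarize this voice journal entry in 2-3 sentences:\n"
    "{input}\n"
    "Keep the summary factual and capture the key emotions."
)

prompt = render_prompt(template, "Long day. The demo went well, but I'm worn out.")
print(prompt)
```

The sketch uses plain string replacement rather than `str.format`, so curly braces that happen to appear in a transcript or elsewhere in the template do not raise errors.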

Dataset Management

Datasets are collections of samples used for evaluation. Each sample (GoldSample) contains:

  • inputText — The source content (e.g., a voice transcript)
  • idealOutput — Optional expected output for reference
  • expectedEntities — Named entities the output should mention
  • metadata — Quality classification (length, word count, clarity, speakers, complexity, specialty)

Import datasets by loading a JSON file. The app includes two bundled datasets:

  • Voice Journaling — Voice journal transcripts with varying clarity and speakers
  • Meeting Notes — Meeting transcripts with multiple speakers and technical content
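To make the sample fields concrete, the Python sketch below parses a small JSON dataset and checks a generated output against `expectedEntities`. The four field names come from this guide. Everything else is an assumption for illustration: the top-level JSON layout, the metadata keys, and the `load_samples` and `missing_entities` helpers. Check the bundled datasets for the format the app actually expects.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GoldSample:
    inputText: str
    idealOutput: Optional[str] = None
    expectedEntities: List[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def load_samples(raw_json: str) -> List[GoldSample]:
    # Assumed layout: a JSON array with one object per sample.
    return [GoldSample(**item) for item in json.loads(raw_json)]

def missing_entities(sample: GoldSample, output: str) -> List[str]:
    # Simple case-insensitive substring check (an illustrative heuristic only).
    lowered = output.lower()
    return [e for e in sample.expectedEntities if e.lower() not in lowered]

raw = """
[
  {
    "inputText": "Sam said he'd send the budget by Friday.",
    "idealOutput": "Sam will send the budget by Friday.",
    "expectedEntities": ["Sam", "Friday", "budget"],
    "metadata": {"clarity": "clear", "speakers": 1}
  },
  {"inputText": "Quick note to self: call the dentist."}
]
"""

samples = load_samples(raw)
print(missing_entities(samples[0], "Sam will follow up."))  # ['Friday', 'budget']
```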

Scoring Guides

Scoring guides define the rubric the judge uses to evaluate outputs. They contain domain-specific instructions that shape how strictly each metric is scored.

voice-transcript-standard

For voice journaling and transcript summarization tasks

meeting-notes-standard

For meeting notes and action item extraction

minimal

Generic fallback scoring guide

Strictness levels (lenient, balanced, strict) further adjust the judge's scoring boundaries. Pick the scoring guide that matches your use case, then pick a strictness level based on how much latitude you want the judge to allow.

When building your evaluation dataset, follow our guide to building a gold dataset with representative samples and proper metadata.

The four scoring dimensions — faithfulness, actionability, completeness, and hallucination — are designed to catch the most common AI output failure modes.

Understanding Results

After an evaluation completes, results are stored and aggregated in several ways:

  • Per-sample results — Each sample shows its four individual metric scores and overall score
  • Run summary — Average scores across all samples, plus progress and status
  • Leaderboard — Variants ranked by overall score, tracked across all runs
  • A/B comparison — Compare baseline vs champion variants to see what improved
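The aggregation behind the run summary, leaderboard, and A/B comparison can be sketched as follows. This is a simplified Python illustration with made-up scores. The record layout and the function names are hypothetical, and the app may aggregate differently (for example, per run rather than across all runs).

```python
from collections import defaultdict
from statistics import mean

# Illustrative per-sample results; the numbers are invented for this example.
results = [
    {"variant": "baseline", "overall": 0.70},
    {"variant": "baseline", "overall": 0.80},
    {"variant": "champion", "overall": 0.85},
    {"variant": "champion", "overall": 0.95},
]

def average_by_variant(results):
    """Average overall score per variant (the run-summary view)."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["variant"]].append(record["overall"])
    return {variant: mean(scores) for variant, scores in grouped.items()}

def leaderboard(results):
    """Variants ranked by average overall score, best first."""
    averages = average_by_variant(results)
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

averages = average_by_variant(results)
improvement = averages["champion"] - averages["baseline"]  # A/B comparison

for rank, (variant, score) in enumerate(leaderboard(results), start=1):
    print(rank, variant, round(score, 2))
print("champion vs baseline:", round(improvement, 2))
```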

Evaluation runs are queued. If you start a new run while one is active, it goes into the queue and runs automatically when the current one finishes. Interrupted runs are recovered on app restart.
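The queueing behavior amounts to a first-in, first-out queue with at most one active run. The Python sketch below models only that part. `RunQueue` is a hypothetical name, and recovery of interrupted runs after a restart is not modeled.

```python
from collections import deque

class RunQueue:
    """At most one active run; later submissions wait in FIFO order."""

    def __init__(self):
        self.active = None
        self.pending = deque()

    def submit(self, run_id: str) -> None:
        if self.active is None:
            self.active = run_id          # nothing running: start immediately
        else:
            self.pending.append(run_id)   # otherwise wait in line

    def finish_active(self):
        # When the current run finishes, the next queued run starts automatically.
        self.active = self.pending.popleft() if self.pending else None
        return self.active

queue = RunQueue()
queue.submit("run-1")
queue.submit("run-2")
print(queue.active, list(queue.pending))  # run-1 ['run-2']
print(queue.finish_active())              # run-2
```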

Ready to evaluate?

Download LLM Eval Suite from the Mac App Store and start running structured evaluations today.
