macOS · Apple Intelligence

Apple Intelligence performance, finally visible.

Measure Foundation Model outputs with structured scoring. Run evals, compare variants, and know what actually improved.

Download on the App Store

$19.99 · One-time purchase

Evaluations

Run. Score. Iterate.

Execute evals against Foundation Models and get structured scores on every output.

[Screenshot: LLM Eval Suite showing evaluation results with a scoring breakdown]

Scoring

Four dimensions of quality.

Every output is judged on four scores. Together they tell you what's working and what needs attention.

1.0

Faithfulness

Does the output stay true to the source? Invented events, claims, or interpretations lower the score. 1.0 = perfectly grounded.

1.0

Actionability

Is it useful when you need to act on it later? Clear next steps and commitments raise the score. 1.0 = immediately useful.

1.0

Completeness

Are the key points and context captured? 1.0 = nothing important left out.

0.0

Hallucination

How much is fabricated or unsupported? 0.0 = nothing, 1.0 = everything.
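
If you're wiring these scores into your own tooling, the rubric maps onto a small value type. A minimal Swift sketch (the names are illustrative, not LLM Eval Suite's internal API):

```swift
// One judged output. Score names and scales mirror the rubric above;
// this type is illustrative, not LLM Eval Suite's internal API.
struct EvalScores {
    var faithfulness: Double   // 1.0 = perfectly grounded in the source
    var actionability: Double  // 1.0 = immediately useful next steps
    var completeness: Double   // 1.0 = nothing important left out
    var hallucination: Double  // 0.0 = nothing fabricated, 1.0 = everything

    // One simple aggregate: average the three "higher is better" scores,
    // then subtract the hallucination penalty. The weighting is a choice,
    // not a prescription.
    var composite: Double {
        let positives = (faithfulness + actionability + completeness) / 3.0
        return positives - hallucination
    }
}
```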

Leaderboard

Track what actually improved.

Every eval run is logged. Compare prompt variants and generation configs side-by-side. See which changes moved the needle and which ones didn't — no more guessing.

[Screenshot: LLM Eval Suite leaderboard showing prompt variant rankings]
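
Mechanically, a leaderboard is just an aggregation over logged runs. A minimal sketch, reusing the illustrative EvalScores type from the Scoring section and assuming every variant has at least one logged run:

```swift
// Rank prompt variants by mean composite score across their logged runs.
// EvalScores is the illustrative type sketched in the Scoring section;
// this assumes each variant has at least one run.
func leaderboard(runs: [String: [EvalScores]]) -> [(variant: String, mean: Double)] {
    runs.map { variant, scores in
        (variant: variant,
         mean: scores.map(\.composite).reduce(0, +) / Double(scores.count))
    }
    .sorted { $0.mean > $1.mean }  // best-performing variant first
}
```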

Local evals

Foundation Models run on your Mac. No data leaves your device.

Bring your own judge

Use OpenAI, Claude, or MiniMax. Your keys, your account.

Structured scoring

Faithfulness, actionability, completeness, hallucination risk.

Repeatable results

Log every run. Track what changed. Know what improved.

How it works

01

Configure

Pick your dataset, choose a judge provider, set your scoring guide.

02

Run

Execute evals against Apple Intelligence Foundation Models on macOS.

03

Compare

Review scores, inspect individual outputs, and iterate on your prompts.
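
Step 02 comes down to a few lines with Apple's FoundationModels framework on macOS. A hedged sketch, with placeholder instructions and options, and a fresh session per case so earlier outputs can't leak into the next one's context:

```swift
import FoundationModels

// Run one dataset row against the on-device Foundation Model.
// Sketch only: the instructions and prompt here are placeholders,
// not LLM Eval Suite's internal API.
func evaluate(prompt: String, options: GenerationOptions) async throws -> String {
    // A fresh session per eval case keeps runs independent.
    let session = LanguageModelSession(
        instructions: "Answer using only information given in the prompt."
    )
    let response = try await session.respond(to: prompt, options: options)
    return response.content
}
```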

Use Cases

Real workflow, real results.

Voice transcript summarization

We used LLM Eval Suite to iterate on the AI summarization feature in AI Doctor Notes. By testing prompt variants and generation configs against Foundation Models, we found that raising the maximum response tokens and switching to top-k sampling produced a meaningful improvement in both completeness and hallucination control.

Read the full story
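
In FoundationModels terms, that finding is a small change to GenerationOptions. The exact numbers below are illustrative, not the ones from our runs:

```swift
import FoundationModels

// Baseline vs. the direction that won in the AI Doctor Notes iteration:
// a higher response-token cap plus top-k sampling. Values are illustrative.
let baseline = GenerationOptions(temperature: 0.7)
let candidate = GenerationOptions(
    sampling: .random(top: 50),      // top-k sampling
    temperature: 0.7,
    maximumResponseTokens: 1024      // higher cap: fewer truncated summaries
)
```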

Any on-device AI feature

The same workflow works for any app using Apple Intelligence Foundation Models. Text generation, classification, extraction, summarization — if the output quality matters, you can evaluate it systematically. Define your metrics, run your evals, and make decisions backed by structured scores instead of gut feel.

Judge Providers

Bring your own API key.

Use any of these as your judge. Your keys stay on your machine.

OpenAI
Claude
MiniMax
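
A judge call is an ordinary HTTPS request made from your machine with your key. A minimal sketch against OpenAI's Chat Completions API (the rubric prompt and model name are illustrative; swap in the provider you chose):

```swift
import Foundation

// Ask a remote judge to score one output against its source.
// The key is sent only to the provider you picked; the rubric
// prompt below is illustrative, not LLM Eval Suite's exact prompt.
func judge(output: String, source: String, apiKey: String) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = [
        "model": "gpt-4o-mini",
        "messages": [
            ["role": "system",
             "content": "Score the summary from 0.0 to 1.0 for faithfulness, actionability, completeness, and hallucination. Reply as JSON."],
            ["role": "user",
             "content": "Source:\n\(source)\n\nSummary:\n\(output)"]
        ]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)
    return String(decoding: data, as: UTF8.self)  // parse scores out of the JSON reply
}
```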

Stop guessing. Start evaluating.

Download on the App Store