AI Evals & LLM Testing
Measuring whether your prompts, agents, and RAG actually work.
Evals are how you turn "the demo looked great" into "we shipped this and it stays good as the model / prompt / data changes."
Eval frameworks (run-and-score)
- ★ Promptfoo — declarative YAML / JS evals; matrix of prompts × providers × test cases; CI-friendly. Default for prompt regression tests.
- ★ Inspect AI (UK AI Safety Institute) — Python framework for sophisticated agent evals; popular for safety / capability research; great for complex evaluators.
- DeepEval — Python; pytest-shaped; rich built-in metrics.
- OpenAI Evals — OpenAI's reference framework; YAML-driven.
@anthropic-ai/sdk-evals/ Anthropic Console evals — provider-native.- Vercel AI SDK Evals — TS-first; integrates with the AI SDK.
- Mastra Evals — bundled with Mastra agents.
- Braintrust Eval framework — open-source primitive that pairs with Braintrust dashboards.
Hosted / observability + evals
- ★ Langfuse — open source + hosted; tracing + evals + prompt management; generous free tier. The default OSS pick.
- ★ Braintrust — hosted; great DX for "compare prompt v1 vs. v2 on 200 examples"; free tier.
- LangSmith — LangChain's hosted observability + evals; works with any framework via the SDK.
- Helicone — proxy + dashboards + evals; OSS + hosted.
- Arize Phoenix — open source; OpenTelemetry-native LLM observability.
- Weights & Biases Weave — W&B's LLM-specific tracing.
- Galileo — paid; production monitoring + evals.
Specific eval techniques
- String / regex match — for deterministic output (extracted JSON, classifier labels).
- LLM-as-judge — another model grades the output; cheap, surprisingly effective; calibrate against humans periodically.
- Embedding similarity — semantic similarity vs. reference; cosine over embeddings.
- Pairwise preference (A vs. B) — humans or LLM-judge picks the better response; cheap to scale.
- Functional / behavioral — does the agent successfully complete the task end-to-end?
- Adversarial / red-team —
prompt-injectiontest sets, jailbreak suites.
Prompt management
- Langfuse Prompts, Braintrust Prompts, PromptLayer, Helicone Prompts — version, A/B test, and roll back prompts without redeploying code.
@vercel/ai-flagsintegration — feature-flag prompts.- Self-host: store prompts in Postgres / Git with a small wrapper.
Adversarial / safety eval suites
- promptfoo redteam — generates adversarial test cases.
- Garak (NVIDIA) — LLM vulnerability scanner.
- HarmBench, SafetyBench, AILuminate — public safety benchmarks.
- TruthfulQA, MMLU, GPQA, SWE-bench, TAU-bench — public capability benchmarks.
Patterns to know
- Build the eval set with real production traffic — sample logs, label, curate. Synthetic test cases miss real distribution.
- Run evals in CI — block bad prompt regressions like you'd block bad tests.
- Human-in-the-loop calibration — LLM-judge accuracy needs periodic human spot-checks.
- Cost discipline — evals can blow the budget; cache LLM-judge calls; sample down.
- Track per-model metrics — keep separate baselines for Claude / GPT / Gemini so you can swap providers safely.
Pick this if…
- Default prompt regression in CI: Promptfoo.
- Hosted dashboards + tracing + evals, OSS: Langfuse.
- Hosted, best DX, will pay: Braintrust.
- Agent / safety research: Inspect AI.
- Already on the AI SDK / Mastra: their bundled eval frameworks.
- Just need OpenTelemetry traces of LLM calls: Arize Phoenix.