AI Evals & LLM Testing

Evals are how you turn "the demo looked great" into "we shipped this and it stays good as the model / prompt / data changes."

Eval frameworks (run-and-score)

★ Promptfoo — declarative YAML / JS evals; matrix of prompts × providers × test cases; CI-friendly. Default for prompt regression tests.
★ Inspect AI (UK AI Safety Institute) — Python framework for sophisticated agent evals; popular for safety / capability research; great for complex evaluators.
DeepEval — Python; pytest-shaped; rich built-in metrics.
OpenAI Evals — OpenAI's reference framework; YAML-driven.
@anthropic-ai/sdk-evals / Anthropic Console evals — provider-native.
Vercel AI SDK Evals — TS-first; integrates with the AI SDK.
Mastra Evals — bundled with Mastra agents.
Braintrust Eval framework — open-source primitive that pairs with Braintrust dashboards.

★ Langfuse — open source + hosted; tracing + evals + prompt management; generous free tier. The default OSS pick.
★ Braintrust — hosted; great DX for "compare prompt v1 vs. v2 on 200 examples"; free tier.
LangSmith — LangChain's hosted observability + evals; works with any framework via the SDK.
Helicone — proxy + dashboards + evals; OSS + hosted.
Arize Phoenix — open source; OpenTelemetry-native LLM observability.
Weights & Biases Weave — W&B's LLM-specific tracing.
Galileo — paid; production monitoring + evals.

String / regex match — for deterministic output (extracted JSON, classifier labels).
LLM-as-judge — another model grades the output; cheap, surprisingly effective; calibrate against humans periodically.
Embedding similarity — semantic similarity vs. reference; cosine over embeddings.
Pairwise preference (A vs. B) — humans or LLM-judge picks the better response; cheap to scale.
Functional / behavioral — does the agent successfully complete the task end-to-end?
Adversarial / red-team — prompt-injection test sets, jailbreak suites.

Langfuse Prompts, Braintrust Prompts, PromptLayer, Helicone Prompts — version, A/B test, and roll back prompts without redeploying code.
@vercel/ai-flags integration — feature-flag prompts.
Self-host: store prompts in Postgres / Git with a small wrapper.

promptfoo redteam — generates adversarial test cases.
Garak (NVIDIA) — LLM vulnerability scanner.
HarmBench, SafetyBench, AILuminate — public safety benchmarks.
TruthfulQA, MMLU, GPQA, SWE-bench, TAU-bench — public capability benchmarks.

Build the eval set with real production traffic — sample logs, label, curate. Synthetic test cases miss real distribution.
Run evals in CI — block bad prompt regressions like you'd block bad tests.
Human-in-the-loop calibration — LLM-judge accuracy needs periodic human spot-checks.
Cost discipline — evals can blow the budget; cache LLM-judge calls; sample down.
Track per-model metrics — keep separate baselines for Claude / GPT / Gemini so you can swap providers safely.