A collection of LLM evaluation tools for testing models and improving prompts. Submit new tools here.
Defining and measuring model performance can help you understand how to improve your model, prompts, and RAG flow.
By continuously testing against guardrail metrics, you can move beyond trial-and-error when developing, improving, and debugging your LLM app.
The goal: deploy changes with confidence that you're improving the overall system!
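As a concrete (if simplified) illustration of the guardrail idea, here is a minimal Python sketch: it runs a tiny test set against a model call and fails if accuracy drops below a threshold. The `call_model` stub, test cases, and threshold are hypothetical stand-ins for your own model or RAG pipeline, dataset, and metrics.

```python
# Minimal guardrail-style eval sketch: score a small test set and fail the
# run if accuracy regresses below a threshold. All names here are placeholders.

TEST_CASES = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

GUARDRAIL_ACCURACY = 0.9  # only ship changes if accuracy stays at or above this


def call_model(prompt: str) -> str:
    # Placeholder: swap in your real model or RAG pipeline call.
    # This stub just echoes the prompt, so it will fail the guardrail.
    return prompt


def run_eval() -> float:
    # Fraction of test cases whose output contains the expected answer.
    correct = 0
    for case in TEST_CASES:
        output = call_model(case["prompt"])
        if case["expected"].lower() in output.lower():
            correct += 1
    return correct / len(TEST_CASES)


if __name__ == "__main__":
    accuracy = run_eval()
    print(f"accuracy: {accuracy:.2f}")
    if accuracy < GUARDRAIL_ACCURACY:
        raise SystemExit("Guardrail failed: accuracy below threshold, do not deploy")
```

In practice, the tools listed below replace this hand-rolled loop with richer metrics, datasets, and CI integrations, but the core pattern of "measure, compare to a threshold, block regressions" stays the same.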