A collection of LLM evaluation tools for testing models and improving prompts. Submit new tools here.
Defining and measuring model performance can help you understand how to improve your model, prompts, and RAG flow.
By continuously testing against guardrail metrics, you can move beyond trial-and-error when developing, improving, and debugging your LLM app.
The goal: deploy changes with confidence that you're improving the overall system!
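As a concrete (if simplified) illustration of the guardrail idea, here is a minimal Python sketch: it runs a tiny test set against a model call and fails if accuracy drops below a threshold. The `call_model` stub, test cases, and threshold are hypothetical stand-ins for your own model or RAG pipeline, dataset, and metrics.

```python
# Minimal guardrail-style eval sketch: score a small test set and fail the
# run if accuracy regresses below a threshold. All names here are placeholders.

TEST_CASES = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

GUARDRAIL_ACCURACY = 0.9  # only ship changes if accuracy stays at or above this


def call_model(prompt: str) -> str:
    # Placeholder: swap in your real model or RAG pipeline call.
    # This stub just echoes the prompt, so it will fail the guardrail.
    return prompt


def run_eval() -> float:
    # Fraction of test cases whose output contains the expected answer.
    correct = 0
    for case in TEST_CASES:
        output = call_model(case["prompt"])
        if case["expected"].lower() in output.lower():
            correct += 1
    return correct / len(TEST_CASES)


if __name__ == "__main__":
    accuracy = run_eval()
    print(f"accuracy: {accuracy:.2f}")
    if accuracy < GUARDRAIL_ACCURACY:
        raise SystemExit("Guardrail failed: accuracy below threshold, do not deploy")
```

In practice, the tools listed below replace this hand-rolled loop with richer metrics, datasets, and CI integrations, but the core pattern of "measure, compare to a threshold, block regressions" stays the same.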