Open-source LLM testing and evaluation framework for systematically testing prompts, models, and AI agent behaviors with automated red-teaming.
Test your AI prompts systematically — run hundreds of test cases to find the best prompt before going live.
Promptfoo is an open-source testing and evaluation framework designed to help developers systematically test LLM applications, prompts, and AI agent behaviors. It provides a CLI-driven workflow for defining test cases, running evaluations across multiple models and prompt variants, and comparing results with automated scoring — essential for building reliable AI agents that behave predictably in production.
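For a concrete sense of that workflow, here is a minimal sketch using the promptfoo Node package's evaluate entry point (the same structure can live in promptfooconfig.yaml and be run with promptfoo eval). The prompts, model IDs, and assertion values below are illustrative placeholders, not examples taken from the Promptfoo docs.

```typescript
// Minimal sketch: compare two prompt variants across two providers.
// Model IDs, prompt text, and assertion values are illustrative placeholders.
import promptfoo from 'promptfoo';

async function main() {
  const results = await promptfoo.evaluate({
    prompts: [
      'Summarize this support ticket in one sentence: {{ticket}}',
      'You are a support triage agent. Give a one-sentence summary of: {{ticket}}',
    ],
    providers: ['openai:gpt-4o-mini', 'anthropic:claude-3-5-sonnet-latest'],
    tests: [
      {
        vars: { ticket: 'I was charged twice for my subscription this month.' },
        assert: [
          { type: 'icontains', value: 'charge' }, // deterministic substring check
          { type: 'llm-rubric', value: 'Identifies a billing problem' }, // model-graded check
        ],
      },
    ],
  });

  // The returned summary includes per-test results and aggregate pass/fail stats.
  console.log(JSON.stringify(results, null, 2));
}

main();
```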
The framework supports a wide range of assertion types including exact matching, semantic similarity, model-graded evaluations, and custom JavaScript/Python assertions. Developers can test across multiple LLM providers simultaneously, comparing how different models handle the same prompts and scenarios. This is particularly valuable for agent development where choosing the right model for each task is critical.
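As a hedged sketch of a custom check: a file-based JavaScript/TypeScript assertion exports a function that receives the model output and test context and returns a pass/fail result, optionally with a score and reason. The file name and the word-count threshold here are arbitrary choices for illustration.

```typescript
// assert-concise.ts, wired into a test case with something like:
//   assert: [{ type: 'javascript', value: 'file://assert-concise.ts' }]
// The word-count threshold is arbitrary; adjust it to your own criteria.
export default function assertConcise(
  output: string,
  context: { vars: Record<string, unknown> },
) {
  const words = output.trim().split(/\s+/).length;
  return {
    pass: words <= 50,                    // hard pass/fail
    score: Math.max(0, 1 - words / 100),  // optional graded score
    reason: `output is ${words} words`,   // surfaced in the results view
  };
}
```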
Promptfoo's automated red-teaming capability is a standout feature for agent security. It can automatically generate adversarial inputs to test agent robustness against prompt injection, jailbreaking, data exfiltration, and other attack vectors. This helps developers identify and fix agent vulnerabilities before deployment.
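Red-teaming is driven by configuration. The sketch below shows roughly what that section can look like, written as a TypeScript object for consistency with the other examples here; in practice it usually lives in promptfooconfig.yaml, scaffolded with promptfoo redteam init and executed with promptfoo redteam run. The plugin and strategy names are assumptions and should be checked against the current red-team catalog.

```typescript
// Illustrative red-team settings; the plugin and strategy names below are
// assumptions, not a verified or exhaustive list.
const redteamConfig = {
  targets: ['openai:gpt-4o-mini'], // the system under test
  redteam: {
    purpose: 'Customer support agent with read access to order data',
    numTests: 25, // adversarial cases to generate per plugin
    plugins: ['pii', 'harmful', 'excessive-agency'], // what to probe for
    strategies: ['jailbreak', 'prompt-injection'],   // how attacks are delivered
  },
};

export default redteamConfig;
```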
The framework integrates with CI/CD pipelines, enabling automated testing of agent behaviors on every code change. Results are displayed in an interactive web UI that makes it easy to compare outputs, identify regressions, and track improvements over time. Promptfoo supports all major LLM providers including OpenAI, Anthropic, Google, AWS Bedrock, and local models via Ollama. With its focus on practical testing workflows, Promptfoo has become one of the most widely adopted open-source tools for LLM evaluation.
Test the same prompts across multiple LLM providers and models simultaneously, comparing outputs side-by-side to find the best model for each agent task.
Generate adversarial inputs automatically to test agent robustness against prompt injection, jailbreaking, PII leakage, and other security vulnerabilities.
Use exact matching, regex, semantic similarity, model-graded evaluation, cost thresholds, and custom JavaScript/Python assertions for comprehensive testing.
Run evaluations in GitHub Actions, GitLab CI, and other pipelines with pass/fail thresholds to catch agent regressions before they reach production.
Web-based interface for exploring test results, comparing outputs, drilling into failures, and tracking evaluation metrics over time.
Supports OpenAI, Anthropic, Google, AWS Bedrock, Azure, Ollama, and any OpenAI-compatible API for comprehensive cross-provider testing.
Promptfoo is best suited for:
Pre-deployment testing of AI agent behaviors
Security red-teaming for agent vulnerability discovery
Model selection through comparative evaluation
Continuous regression testing in CI/CD pipelines
How does Promptfoo differ from LangSmith?
Promptfoo focuses on systematic testing and evaluation with assertions and red-teaming, while LangSmith focuses on tracing and observability. They're complementary: use Promptfoo for pre-deployment testing and LangSmith for production monitoring.
Can Promptfoo test an agent's tool and function calls?
Yes. You can test whether agents call the right tools with correct parameters by asserting on function call outputs and tool selection patterns.
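As a rough illustration of such an assertion, the sketch below parses the tool call and checks which tool was selected and with what arguments. The expected tool name, the argument check, and the assumption that the provider returns an OpenAI-style tool-call array are all hypothetical.

```typescript
// assert-tool-choice.ts, referenced from a test as:
//   assert: [{ type: 'javascript', value: 'file://assert-tool-choice.ts' }]
// Assumes the provider output is an OpenAI-style array of tool calls; the
// expected tool name ("get_weather") and argument check are hypothetical.
export default function assertToolChoice(output: string | any[]) {
  const toolCalls = typeof output === 'string' ? JSON.parse(output) : output;
  const call = Array.isArray(toolCalls) ? toolCalls[0] : toolCalls;

  if (call?.function?.name !== 'get_weather') {
    return { pass: false, reason: `expected get_weather, got ${call?.function?.name}` };
  }

  const args = JSON.parse(call.function.arguments ?? '{}');
  return {
    pass: typeof args.location === 'string' && args.location.length > 0,
    reason: `tool called with location=${args.location}`,
  };
}
```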
Does automated red-teaming work with any LLM provider?
Yes. Promptfoo generates adversarial inputs that work against any LLM provider. It uses a separate model to generate attacks and evaluates the target model's responses.
Can Promptfoo run in CI/CD pipelines?
Yes. Promptfoo provides a CLI that exits with appropriate status codes based on pass/fail thresholds, making it easy to integrate into any CI/CD pipeline.
People who use this tool also find these helpful
Comprehensive testing and evaluation framework for AI agent performance and reliability.
Open-source LLM application development platform for prompt engineering, evaluation, and deployment with a collaborative UI.
Comprehensive AI agent testing and evaluation platform with automated test generation and behavior validation.
AI-powered visual testing platform that uses Visual AI to automatically detect visual bugs and regressions across web and mobile applications.
Open-source LLM evaluation framework for testing AI agents with 14+ metrics including hallucination detection, tool use correctness, and conversational quality.
Open-source LLM evaluation and testing platform by Comet for tracing, scoring, and benchmarking AI applications.
See how Promptfoo compares to Braintrust and other alternatives
Analytics & Monitoring
LLM evaluation and regression testing platform.
Analytics & Monitoring
Tracing, evaluation, and observability for LLM apps and agents.
Analytics & Monitoring
LLMOps platform for prompt engineering, evaluation, and optimization with collaborative workflows for AI product development teams.
Testing & Quality
Open-source LLM evaluation framework for testing AI agents with 14+ metrics including hallucination detection, tool use correctness, and conversational quality.