Tutorials · 15 min read

AI Agent Prompt Engineering: System Prompts That Actually Work in Production

By AI Agent Tools Team

Why Agent Prompts Are Different From Chat Prompts

Prompt engineering for AI agents is fundamentally different from writing prompts for chatbots or content generation. A chat prompt needs to produce one good response. An agent prompt needs to produce reliable behavior across thousands of runs, with tool calling, state management, and error recovery.

The difference is like writing a job description versus having a conversation. A job description needs to produce consistent, predictable behavior from someone you won't be supervising for every decision. Agent system prompts work the same way — they define behavior patterns that run autonomously.

The best agent system prompts in production share common structural patterns that have emerged from real deployments. Companies building on CrewAI, LangGraph, and AutoGen have converged on similar approaches through trial and error.

The Anatomy of an Effective Agent System Prompt

Every production agent prompt needs five sections, in this order:

1. Identity and Role

Start with a clear statement of who the agent is and what it does. This anchors all subsequent behavior.


You are a Financial Data Analyst agent. Your job is to analyze quarterly 
earnings reports, identify significant trends, and produce structured 
summaries for investment analysts.

You are precise with numbers, conservative with predictions, and always
cite the specific data points that support your conclusions.

What makes this work:
  • Specific role (not "helpful assistant")
  • Clear scope of responsibility
  • Behavioral constraints built into the identity

2. Capabilities and Constraints

Explicitly state what the agent CAN and CANNOT do. LLMs tend to try everything unless you set boundaries.


CAPABILITIES:
  • You can search the web for recent earnings data
  • You can analyze numerical data and identify statistical trends
  • You can compare current results against historical performance

CONSTRAINTS:
  • You NEVER fabricate financial data or statistics
  • You DO NOT make buy/sell recommendations
  • You DO NOT access or process data older than 2 years unless specifically asked
  • If you cannot find reliable data, you say "Insufficient data" rather than guessing


3. Tool Instructions

This is where most agent prompts fail. Agents need explicit instructions on WHEN to use each tool, HOW to format tool calls, and what to do with results.


TOOLS AVAILABLE:
  1. web_search(query: str) - Search the web for information
     USE WHEN: You need current data not in your training data
     DO NOT USE: For general knowledge questions you can answer directly
  2. calculate(expression: str) - Evaluate mathematical expressions
     USE WHEN: Computing financial metrics, ratios, or percentages
     ALWAYS: Verify the result makes sense in context before reporting
  3. format_report(data: dict) - Generate a formatted PDF report
     USE WHEN: The user requests a formal report
     BEFORE CALLING: Ensure all required fields are populated

The critical pattern: for each tool, specify:
  • When to use it (positive trigger)
  • When NOT to use it (negative trigger)
  • Pre-conditions (what must be true before calling)
  • Post-conditions (what to do with the result)
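One way to keep these four conditions and the prompt itself in sync is to render the TOOLS AVAILABLE section from a small registry. A minimal sketch, assuming hypothetical tool names and fields (`use_when`, `do_not_use`, `before`, `after` mirror the pattern above):

```python
# Hypothetical tool registry: one entry per tool, mirroring the four-part
# pattern (positive trigger, negative trigger, pre- and post-conditions).
TOOL_SPECS = {
    "web_search": {
        "signature": "web_search(query: str)",
        "use_when": "You need current data not in your training data",
        "do_not_use": "For general knowledge questions you can answer directly",
        "before": "Check the answer is not already in context",
        "after": "Cross-check surprising results against a second source",
    },
    "calculate": {
        "signature": "calculate(expression: str)",
        "use_when": "Computing financial metrics, ratios, or percentages",
        "do_not_use": "For trivial arithmetic you can do inline",
        "before": "Confirm all inputs are numeric",
        "after": "Verify the result makes sense in context before reporting",
    },
}

def render_tool_section(specs: dict) -> str:
    """Render the TOOLS AVAILABLE block of the system prompt from the registry."""
    lines = ["TOOLS AVAILABLE:"]
    for i, (name, spec) in enumerate(specs.items(), start=1):
        lines.append(f"  {i}. {spec['signature']}")
        lines.append(f"     USE WHEN: {spec['use_when']}")
        lines.append(f"     DO NOT USE: {spec['do_not_use']}")
        lines.append(f"     BEFORE CALLING: {spec['before']}")
        lines.append(f"     AFTER: {spec['after']}")
    return "\n".join(lines)

section = render_tool_section(TOOL_SPECS)
```

Because the prompt text is generated, adding or changing a tool cannot silently leave the instructions stale.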

4. Output Format

Define exactly what your agent's output should look like. Ambiguity in output format is the #1 source of downstream failures in multi-agent systems.


OUTPUT FORMAT:
Always respond with a JSON object:
{
  "summary": "2-3 sentence overview of findings",
  "key_metrics": [
    {"metric": "name", "value": number, "change_pct": number, "trend": "up|down|flat"}
  ],
  "risks": ["list of identified risk factors"],
  "confidence": "high|medium|low",
  "data_sources": ["list of sources used"]
}

If you cannot complete the analysis, return:
{
  "error": "description of what went wrong",
  "partial_results": {any data you did collect},
  "recommendation": "what the user should do next"
}

Using Instructor or Pydantic AI to enforce structured outputs adds a validation layer on top of prompt-based formatting.
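A sketch of what that validation layer might look like with plain Pydantic (model and field names simply mirror the schema above; the raw JSON string stands in for a model response):

```python
from typing import Literal

from pydantic import BaseModel, Field

class KeyMetric(BaseModel):
    metric: str
    value: float
    change_pct: float
    trend: Literal["up", "down", "flat"]

class AnalysisReport(BaseModel):
    summary: str = Field(description="2-3 sentence overview of findings")
    key_metrics: list[KeyMetric]
    risks: list[str]
    confidence: Literal["high", "medium", "low"]
    data_sources: list[str]

# Validating the model's response rejects malformed output before it
# propagates to downstream agents (raises ValidationError on mismatch).
raw = (
    '{"summary": "Revenue grew.", '
    '"key_metrics": [{"metric": "revenue", "value": 1.2, '
    '"change_pct": 4.0, "trend": "up"}], '
    '"risks": [], "confidence": "medium", "data_sources": ["10-Q"]}'
)
report = AnalysisReport.model_validate_json(raw)
```

With Instructor or Pydantic AI, the same model class is passed to the client so retries on validation failure happen automatically.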

5. Guardrails and Error Handling

Tell the agent what to do when things go wrong. Without explicit error handling instructions, agents either hallucinate their way through failures or stop dead.


ERROR HANDLING:
  • If a tool call fails, retry once with a modified query
  • If a tool is unavailable, skip it and note the limitation in your output
  • If you encounter contradictory data from different sources, report both values and flag the discrepancy
  • If a task is ambiguous, state your interpretation and proceed (don't ask for clarification in automated pipelines)
  • NEVER make up data to fill gaps. Missing data is always preferable to fabricated data.
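The first two rules (retry once with a modified query, then skip and note the limitation) can be enforced in code rather than left to the model. A minimal sketch with an illustrative flaky tool; the `modify` callback and the failure condition are hypothetical:

```python
def call_with_retry(tool, args: dict, modify):
    """Call a tool once; on failure, retry a single time with modified args.

    `modify` is a caller-supplied function that adjusts the arguments
    (e.g. broadens a search query). Returns (result, note); a None result
    with a note means the tool was skipped, matching the prompt's rule to
    surface the limitation instead of guessing.
    """
    try:
        return tool(**args), None
    except Exception as first_err:
        try:
            return tool(**modify(args)), f"retried after: {first_err}"
        except Exception:
            return None, f"tool unavailable: {first_err}"

# Illustrative flaky tool: fails on over-specific queries.
def web_search(query: str):
    if len(query.split()) > 6:
        raise RuntimeError("query too specific")
    return [f"result for '{query}'"]

result, note = call_with_retry(
    web_search,
    {"query": "Q3 2024 earnings revenue growth acme corp detailed"},
    modify=lambda a: {"query": " ".join(a["query"].split()[:4])},
)
```

Handling retries outside the prompt keeps the model's context free of failure noise and makes the behavior deterministic.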

Advanced Patterns for Production Agents

The Reasoning Chain Pattern

For complex reasoning tasks, structure the prompt to enforce step-by-step thinking:


REASONING PROCESS:
For each analysis task, follow this process:
  1. GATHER: Collect all relevant data using available tools
  2. VALIDATE: Cross-check data points against multiple sources
  3. ANALYZE: Identify patterns, trends, and anomalies
  4. CONCLUDE: Draw conclusions supported by specific data
  5. FORMAT: Present findings in the required output format

Think through each step explicitly. Show your reasoning.

This pattern is especially effective with Claude models, which naturally support extended thinking.

The State-Aware Pattern

For agents in LangGraph workflows that need to behave differently based on workflow state:


WORKFLOW STATE AWARENESS:
  • If this is the FIRST run on this topic: Do comprehensive research
  • If previous_research exists in context: Build on existing findings, don't repeat searches
  • If review_feedback exists: Address the specific feedback points
  • If error_count > 2: Simplify your approach and use only the most reliable tools
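In practice this branching usually lives in the node that assembles the prompt, not in the prompt text itself. A sketch of that selection logic; the state keys (`previous_research`, `review_feedback`, `error_count`) are illustrative, since real LangGraph state schemas are application-defined:

```python
def state_directive(state: dict) -> str:
    """Pick the state-aware instruction to append to the system prompt.

    Checks run from most to least restrictive, so a degraded run
    (error_count > 2) overrides the other branches.
    """
    if state.get("error_count", 0) > 2:
        return "Simplify your approach and use only the most reliable tools."
    if state.get("review_feedback"):
        return "Address these specific feedback points: " + state["review_feedback"]
    if state.get("previous_research"):
        return "Build on existing findings; do not repeat searches."
    return "This is the first run on this topic: do comprehensive research."
```

The base system prompt stays constant across the workflow; only this appended directive changes with state.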

The Multi-Agent Coordination Pattern

When agents need to work together in systems like CrewAI or AutoGen, each agent's prompt must acknowledge the multi-agent context:


COLLABORATION:
  • You are one agent in a team. Other agents handle different aspects of the task.
  • Your output will be consumed by the Editor agent, so format it for machine readability.
  • Do NOT attempt tasks assigned to other agents. If you encounter something outside your role, flag it in your output under "delegation_notes".
  • Trust input from the Research agent — it has already been validated.

The Self-Correction Pattern

Build self-checking into the prompt to catch errors before they propagate:


SELF-CHECK:
Before finalizing your output:
  1. Re-read the original task description
  2. Verify your output addresses every requirement
  3. Check that all cited data points actually appear in your tool results
  4. Ensure numerical calculations are consistent
  5. Verify output format matches the specification exactly
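Step 3 of the checklist (cited data points must actually appear in tool results) is also cheap to verify programmatically before accepting the agent's draft. A minimal sketch using naive substring matching, which is an assumption; production checks would normalize numbers and formatting:

```python
def ungrounded_citations(cited: list[str], tool_results: list[str]) -> list[str]:
    """Return cited data points that do NOT appear in any tool result.

    A non-empty return means the draft fails step 3 of the self-check
    and should be regenerated or flagged.
    """
    blob = "\n".join(tool_results).lower()
    return [c for c in cited if c.lower() not in blob]

missing = ungrounded_citations(
    cited=["revenue $4.2B", "margin 31%"],
    tool_results=["Q3 report: revenue $4.2B, up 8% YoY"],
)
```

Here `missing` contains the margin figure, which no tool result supports, so the draft would be rejected rather than shipped.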

Common Prompt Engineering Mistakes

Mistake 1: Vague Role Definitions

Bad: "You are a helpful AI assistant that answers questions."

Good: "You are a customer support agent for an e-commerce platform. You can access order status, process returns, and escalate billing issues. You cannot modify prices, issue refunds over $100, or access customer payment information."

Mistake 2: No Negative Instructions

Telling the agent what to do isn't enough — you need to tell it what NOT to do. LLMs are eager to please and will attempt things outside their scope unless explicitly told not to.

Mistake 3: Missing Error Paths

Every prompt should define what happens when things go wrong. "If you can't find the answer, say so" is better than nothing, but "If you can't find the answer, return {error: 'not_found', searched_sources: [...], suggestion: '...'}" is production-ready.

Mistake 4: Prompt Drift in Multi-Agent Systems

In multi-agent systems, agents can gradually drift from their roles as conversation history grows. Combat this by:


  • Repeating key instructions at the end of long system prompts
  • Using structured output to force consistent formatting
  • Including role anchoring: "Remember, you are the ANALYST, not the writer."

Mistake 5: Not Testing with Adversarial Inputs

Your prompt works with normal inputs. But what happens with edge cases, empty inputs, inputs in unexpected languages, or deliberately misleading inputs? Test these.

Testing Agent Prompts

Use Evaluation Frameworks

  • PromptFoo: Test prompts against datasets of inputs and expected outputs. Run regression tests when you change prompts.
  • DeepEval: Unit testing for LLM outputs with built-in metrics.
  • Braintrust: Evaluate prompt quality with scoring and comparison tools.
  • Ragas: Specialized evaluation for RAG-based agent prompts.

Build a Test Suite

Create a test suite with:

  • Happy path cases: Normal inputs that should work perfectly
  • Edge cases: Unusual but valid inputs
  • Error cases: Invalid inputs, missing data, tool failures
  • Adversarial cases: Prompt injection attempts, contradictory instructions

A/B Test Prompt Changes

Never change a production prompt without comparing it against the current version on your test suite. Small prompt changes can have unexpected effects on agent behavior.
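Tools like PromptFoo handle this comparison declaratively, but the core loop is simple enough to sketch. Everything here is illustrative: `run_agent` stands in for a real agent call, and the prompt versions and test cases are placeholders:

```python
# Hypothetical regression harness: run both prompt versions over the same
# suite and refuse to ship a candidate that scores worse than current.
TEST_SUITE = [
    {"input": "Summarize ACME Q3 earnings", "must_contain": "summary"},
    {"input": "", "must_contain": "error"},  # error case: empty input
]

def run_agent(system_prompt: str, user_input: str) -> str:
    """Stand-in for the real agent invocation (illustrative only)."""
    if not user_input:
        return '{"error": "empty input"}'
    return '{"summary": "..."}'

def pass_rate(system_prompt: str) -> float:
    """Fraction of test cases whose output contains the expected marker."""
    passed = sum(
        case["must_contain"] in run_agent(system_prompt, case["input"])
        for case in TEST_SUITE
    )
    return passed / len(TEST_SUITE)

current, candidate = "PROMPT_V1", "PROMPT_V2"
assert pass_rate(candidate) >= pass_rate(current), "candidate prompt regressed"
```

In a real pipeline the two pass rates would come from many sampled runs per case, since agent outputs are stochastic.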

Prompt Templates for Common Agent Types

Research Agent


You are a Research Agent specializing in [DOMAIN]. Your job is to find 
accurate, current information using web search and document analysis.

SEARCH STRATEGY:
  1. Start with broad queries to understand the landscape
  2. Narrow to specific data points
  3. Cross-reference across at least 2 sources
  4. Prefer primary sources over aggregators

OUTPUT: Structured findings with source URLs and confidence ratings.
NEVER: Cite sources you haven't actually accessed. If a search returns
no results, report that honestly.

Tool-Using Agent


You are a [ROLE] Agent with access to the following tools:
[TOOL_LIST with usage instructions]

TOOL USAGE RULES:
  • Only call tools when you need information not in your context
  • Parse tool outputs carefully — they may contain errors
  • If a tool returns an error, try once with a modified input
  • Chain tool calls logically — use output from one as input to another
  • Never call the same tool with the same parameters twice in one run
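The last rule is another one worth enforcing in the runtime rather than trusting to the model. A minimal sketch of a per-run deduplication guard (the guard function and its interface are assumptions, not any framework's API):

```python
def make_dedupe_guard():
    """Return a per-run guard that blocks exact repeat tool calls.

    Tracks (tool_name, frozen-args) pairs; the first call with a given
    pair is allowed, any identical repeat in the same run is rejected.
    """
    seen = set()

    def allowed(tool_name: str, args: dict) -> bool:
        key = (tool_name, tuple(sorted(args.items())))
        if key in seen:
            return False  # identical call already made this run
        seen.add(key)
        return True

    return allowed

guard = make_dedupe_guard()
first = guard("web_search", {"query": "acme earnings"})    # allowed
repeat = guard("web_search", {"query": "acme earnings"})   # blocked
```

A fresh guard is created per run, so legitimate repeats across separate tasks are unaffected.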


Quality Review Agent


You are a Quality Review Agent. You receive output from other agents 
and evaluate it for accuracy, completeness, and format compliance.

REVIEW CRITERIA:
  1. Factual accuracy: Are claims supported by cited data?
  2. Completeness: Does the output address all requirements?
  3. Format: Does it match the specified output schema?
  4. Consistency: Are there contradictions within the output?

OUTPUT: Pass/fail with specific issues listed. For each issue,
reference the exact text that's problematic and suggest a fix.

Key Takeaways

  1. Structure matters. Use the five-section pattern: Identity, Capabilities, Tools, Output Format, Guardrails.
  2. Be explicit about tool usage. When to use, when not to use, pre-conditions, and post-conditions.
  3. Define error paths. Tell agents what to do when things go wrong.
  4. Test systematically. Use PromptFoo or similar tools to regression-test prompt changes.
  5. Iterate based on production failures. Every production failure should result in a prompt improvement.
  6. Negative instructions are as important as positive ones. Tell agents what NOT to do.
Tags: prompts, engineering, system-prompts, reliability, multi-agent, production, tool-calling, guardrails

🔧 Tools Featured in This Article

Ready to get started? Here are the tools we recommend:

CrewAI

AI Agent Builders

CrewAI is an open-source Python framework for orchestrating autonomous AI agents that collaborate as a team to accomplish complex tasks. You define agents with specific roles, goals, and tools, then organize them into crews with defined workflows. Agents can delegate work to each other, share context, and execute multi-step processes like market research, content creation, or data analysis. CrewAI supports sequential and parallel task execution, integrates with popular LLMs, and provides memory systems for agent learning. It's one of the most popular multi-agent frameworks with a large community and extensive documentation.


LangGraph

AI Agent Builders

Graph-based stateful orchestration runtime for agent loops.


AutoGen

Multi-Agent Builders

Open-source framework for creating multi-agent AI systems where multiple AI agents collaborate to solve complex problems through structured conversations, role-based interactions, and autonomous task execution.


Phidata

AI Agent Builders

Framework for building agentic apps with memory, tools, and vector DBs.


Langfuse

Analytics & Monitoring

Open-source LLM engineering platform for traces, prompts, and metrics.


Promptfoo

Testing & Quality

Open-source LLM testing and evaluation framework for systematically testing prompts, models, and AI agent behaviors with automated red-teaming.

