Judge Agent

Overview

The Judge Agent is an LLM-powered evaluator that automatically determines whether your agent under test meets defined success criteria. Instead of writing complex assertion logic, you describe what success looks like in natural language, and the judge evaluates each conversation turn to decide whether to continue, succeed, or fail the test.

After each agent response, the judge:

  1. Reviews the entire conversation history
  2. Evaluates against your defined criteria
  3. Decides whether to continue, succeed, or fail, as sketched below
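
Conceptually, each evaluation is a three-way decision over the criteria. The sketch below is purely illustrative (`judge_turn`, `llm_evaluate`, and the verdict names are hypothetical, not part of the Scenario API):

```python
from enum import Enum

class Verdict(Enum):
    CONTINUE = "continue"  # not enough evidence yet; keep the conversation going
    SUCCESS = "success"    # every criterion is satisfied
    FAILURE = "failure"    # at least one criterion is violated

def judge_turn(history: list[dict], criteria: list[str], llm_evaluate) -> Verdict:
    """One judge pass after an agent response. `llm_evaluate` stands in for
    the LLM call and returns 'met', 'unmet', or 'violated' per criterion."""
    results = [llm_evaluate(history, criterion) for criterion in criteria]
    if any(r == "violated" for r in results):
        return Verdict.FAILURE  # a negative constraint was broken: fail immediately
    if all(r == "met" for r in results):
        return Verdict.SUCCESS  # all requirements observed: end the test early
    return Verdict.CONTINUE     # otherwise, let the simulation run another turn
```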

Use Case Example

Let's test a customer support agent handling billing inquiries (a sketch of the `CustomerSupportAgent` adapter follows the test):

```python
import pytest
import scenario

@pytest.mark.asyncio
async def test_billing_inquiry_quality():
    result = await scenario.run(
        name="billing inquiry handling",
        description="""
            User received an unexpected charge on their credit card and is
            concerned but polite. They have their account information ready.
        """,
        agents=[
            CustomerSupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for account information to investigate",
                "Agent explains the charge clearly",
                "Agent offers a solution or next steps",
                "Agent maintains a helpful and empathetic tone",
                "Agent should not make promises about refunds without verification"
            ])
        ],
        max_turns=8
    )

    assert result.success
```
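
The `CustomerSupportAgent` referenced above is your agent under test. As a minimal sketch, it would subclass `scenario.AgentAdapter` and forward the simulated conversation to your real agent (the hard-coded reply below is a placeholder, not a working support agent):

```python
import scenario

class CustomerSupportAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # In a real adapter, pass the conversation history (input.messages)
        # to your support agent and return its response here.
        return "Placeholder reply from the support agent."
```

Run the test with `pytest` as usual; the async test function relies on the `pytest-asyncio` plugin for the `@pytest.mark.asyncio` marker.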

Configuration Reference

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `criteria` | `List[str]` | No | `[]` | Success criteria to evaluate. Include positive requirements and negative constraints. |
| `model` | `str` | No | Global config | LLM model identifier (e.g., `"openai/gpt-4o"`). |
| `temperature` | `float` | No | `0.0` | Sampling temperature (0.0-1.0). Use 0.0-0.2 for consistent evaluation. |
| `max_tokens` | `int` | No | Model default | Maximum tokens for judge reasoning and explanations. |
| `system_prompt` | `str` | No | Built-in | Custom system prompt to override the default judge behavior. |
| `api_base` | `str` | No | Global config | Base URL for custom API endpoints. |
| `api_key` | `str` | No | Environment | API key for the model provider. |
| `**extra_params` | `dict` | No | `{}` | Additional LiteLLM parameters (headers, timeout, client). |
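
For example, to pin the judge to a specific model with deterministic sampling (both values below are illustrative; the model string is the table's own example):

```python
scenario.JudgeAgent(
    criteria=[
        "Agent asks for account information to investigate",
        "Agent should not make promises about refunds without verification",
    ],
    model="openai/gpt-4o",  # overrides the globally configured model
    temperature=0.0,        # deterministic, repeatable evaluations
)
```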

Writing Effective Criteria

Good criteria are specific, measurable, relevant, and actionable:

```python
# Good - specific and measurable
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's order number",
    "Agent provides a tracking link",
    "Agent offers to help with anything else",
    "Agent should not promise delivery dates without checking the system"
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",            # Too vague
    "Agent does everything right", # Not measurable
])
```
