Testing an OpenAI Realtime Voice Model with Scenario
Audience: a developer whose "agent" is an OpenAI Realtime model with instructions and tools. There's no separate app layer — the model is the agent. They want regression tests.
How this differs from the hosted-provider path: ElevenLabs (and platforms like it) hosts the STT→LLM→TTS loop; you test the platform's agent. With OpenAI Realtime, the model itself is the agent — no hosting layer, the session is the product. Scenario treats both the same way via VoiceAgentAdapter, but the configuration surface differs.
Prerequisites
- Python 3.11+
- OPENAI_API_KEY with model.request scope (for chat completions + Realtime + TTS + STT)
All-in-one key: unlike the ElevenLabs path, everything here runs on OpenAI: the Realtime model itself, the user simulator TTS, the judge LLM, and the default STT. One key, one provider.
Step 1 — Install
pip install scenario
No extras flag. Voice is first-class.
Step 2 — Set env var
OPENAI_API_KEY=sk-...
The key must have access to the Realtime API (most standard keys do).
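Because every component in this path reads the same variable, a missing key tends to show up as several unrelated-looking failures. A minimal sketch of a fail-fast check, assuming you call it from conftest.py or at the top of the test module; the helper name require_openai_key is hypothetical and not part of Scenario:

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of Scenario.
    """
    # Allow a plain dict to be injected so the check is easy to unit test.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running voice tests"
        )
    return key
```

Calling require_openai_key() once before the suite starts turns a missing key into a single readable error.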
Step 3 — Write a scenario
Create test_realtime_agent.py:
import pytest
import scenario
from scenario.voice import OpenAIRealtimeAgentAdapter


@pytest.mark.asyncio
async def test_realtime_handles_cancellation():
    result = await scenario.run(
        name="cancellation_flow",
        description="Realtime agent must handle a subscription cancel without friction.",
        agents=[
            # The model IS the agent: instructions + voice + tools live here.
            OpenAIRealtimeAgentAdapter(
                model="gpt-4o-realtime-preview",
                voice="alloy",
                instructions=(
                    "You are a customer support agent. Help users cancel "
                    "subscriptions. If they change their mind, acknowledge "
                    "and move on. No upsells."
                ),
            ),
            scenario.UserSimulatorAgent(voice="openai/nova"),
            scenario.JudgeAgent(
                criteria=[
                    "The agent processed the cancellation request",
                    "The agent did not attempt to upsell or retain the customer",
                ]
            ),
        ],
        script=[
            scenario.user("I want to cancel my subscription"),
            scenario.agent(),
            scenario.user("No, I'm sure"),
            scenario.agent(),
            scenario.judge(),
        ],
    )
    assert result.success, result.reasoning

The key difference from the ElevenLabs path: OpenAIRealtimeAgentAdapter takes instructions directly. You're configuring the agent at test time, not pointing at a deployed agent. This lets you A/B test instruction variations in the same suite.
Step 4 — Run it
pytest test_realtime_agent.py -v
Same as any other voice scenario. Pass/fail + verdict + audio.
Step 5 — Tools
If your agent uses OpenAI Realtime tool calling, pass the tool schemas:
OpenAIRealtimeAgentAdapter(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    instructions="...",
    tools=[
        {
            "type": "function",
            "name": "cancel_subscription",
            "description": "Cancel the user's subscription immediately.",
            "parameters": {
                "type": "object",
                "properties": {
                    "user_id": {"type": "string"},
                    "reason": {"type": "string"},
                },
                "required": ["user_id"],
            },
        },
    ],
)

Tool calls flow through as events on result.timeline. Use a callable script step to assert the right tool fired with the right args:
def assert_cancellation_called(state):
    # Collect every tool-call event recorded so far in the conversation.
    calls = [e for e in state.timeline if e.type == "tool_call"]
    assert any(
        c.tool_name == "cancel_subscription" for c in calls
    ), "Expected cancel_subscription to fire"


script = [
    scenario.user("Cancel my subscription"),
    scenario.agent(),
    assert_cancellation_called,
    scenario.judge(),
]
Step 6 — Simulate natural-prosody users
Normally scenario.user("text") runs the text through the OpenAI TTS (voice nova by default). For Realtime-specific tests you may want the user simulator itself to be a Realtime model with natural prosody, emotion, pacing:
from scenario.types import AgentRole

agents = [
    OpenAIRealtimeAgentAdapter(
        model="gpt-4o-realtime-preview",
        voice="alloy",
        instructions="You are a helpful support agent.",
        role=AgentRole.AGENT,  # the agent under test
    ),
    OpenAIRealtimeAgentAdapter(
        model="gpt-4o-realtime-preview",
        voice="nova",
        instructions="You are a confused elderly customer. Speak slowly.",
        role=AgentRole.USER,  # the user simulator
    ),
    scenario.JudgeAgent(criteria=["The agent was patient and clear"]),
]

When a scripted scenario.user("text") turn fires with a role=USER Realtime adapter present, the text is routed through that adapter's send_text (not the default TTS), so the Realtime model natively synthesises speech with the persona it's been given.
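The routing rule described here can be pictured with a small stand-alone sketch. Everything in it is a stand-in written for illustration: Role mimics AgentRole, FakeRealtimeAdapter and default_tts replace the real adapter and TTS path, and route_user_text is a hypothetical name, not Scenario's internal function.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Stand-in for scenario.types.AgentRole; only the two roles used above.
    AGENT = "agent"
    USER = "user"


@dataclass
class FakeRealtimeAdapter:
    # Stand-in adapter: records text instead of opening a Realtime session.
    role: Role
    sent: list = field(default_factory=list)

    def send_text(self, text):
        self.sent.append(text)
        return f"realtime:{text}"


def default_tts(text):
    # Stand-in for the default text-to-speech path.
    return f"tts:{text}"


def route_user_text(text, agents):
    """Send scripted user text through a USER-role adapter if one exists."""
    for agent in agents:
        is_user = getattr(agent, "role", None) is Role.USER
        if is_user and hasattr(agent, "send_text"):
            return agent.send_text(text)
    # No USER-role adapter present: fall back to plain TTS.
    return default_tts(text)
```

With a USER-role adapter in the list, the scripted text reaches its send_text; without one, the same call falls back to the TTS stand-in.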
Step 7 — Add to CI
- run: pip install scenario
- run: pytest test_realtime_agent.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
One secret. One provider. Cheapest voice-regression CI you can set up.
Pattern: branched scenarios for A/B'ing instructions
Instruction-level changes to a Realtime agent ARE the behavior change. Testing them side-by-side is the point:
@pytest.mark.asyncio
@pytest.mark.parametrize("instructions,expected_to_pass", [
    ("You are a support agent. Be friendly and solve problems.", True),
    ("You are a support agent. Prioritize upsells.", False),  # should fail
])
async def test_instruction_variants(instructions, expected_to_pass):
    result = await scenario.run(
        name="instruction_variant",
        description="Realtime agent must process a cancellation without upselling.",
        agents=[
            OpenAIRealtimeAgentAdapter(
                model="gpt-4o-realtime-preview",
                voice="alloy",
                instructions=instructions,
            ),
            scenario.UserSimulatorAgent(voice="openai/nova"),
            scenario.JudgeAgent(
                criteria=["The agent did not attempt to upsell"]
            ),
        ],
        script=[
            scenario.user("Just processing my cancellation"),
            scenario.agent(),
            scenario.judge(),
        ],
    )
    assert result.success == expected_to_pass

This is the thing you can't do with a hosted agent: you're configuring the agent at test time.
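A bare equality assertion reports only True versus False when a variant behaves unexpectedly. A minimal sketch of a helper that puts the instructions and the judge's reasoning into the failure message; variant_failure_message is a hypothetical name, and it takes plain values so it does not depend on Scenario's result type:

```python
def variant_failure_message(instructions, expected_to_pass, success, reasoning):
    """Return None when the outcome matches the expectation, else a readable message."""
    if success == expected_to_pass:
        return None
    expected = "pass" if expected_to_pass else "fail"
    actual = "passed" if success else "failed"
    return (
        f"Expected variant to {expected} but it {actual}.\n"
        f"Instructions: {instructions}\n"
        f"Judge reasoning: {reasoning}"
    )
```

Inside the parametrized test this could be used as message = variant_failure_message(instructions, expected_to_pass, result.success, result.reasoning), followed by assert message is None, message.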
Next
- ElevenLabs (hosted provider) happy path: /voice/happy-path-elevenlabs
- Feature contract: specs/voice-agents.feature
- Capability matrix per adapter: /voice/capability-matrix
