Voice Adapter Capability Matrix

Every voice adapter in Scenario declares its capabilities via a frozen AdapterCapabilities dataclass. Capability-gated script steps — such as interrupt(after_words=N) (needs streaming transcripts), dtmf() (needs telephony), or interrupt(content) over a native cancel signal (needs interruption=True) — check this record and either route correctly or raise UnsupportedCapabilityError when the underlying adapter cannot implement the requested behavior.
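As a rough illustration of the gating described above, the sketch below uses local stand-in classes rather than the SDK's real ones. The field names mirror the capability columns documented on this page; the `require` helper and the locally defined `UnsupportedCapabilityError` are hypothetical and exist only so the example runs on its own.

```python
from dataclasses import dataclass, FrozenInstanceError


class UnsupportedCapabilityError(RuntimeError):
    """Stand-in for the SDK error; defined locally so the sketch runs alone."""


@dataclass(frozen=True)
class AdapterCapabilities:
    """Stand-in record; field names follow the capability table on this page."""
    streaming_transcripts: bool = False
    native_vad: bool = False
    dtmf: bool = False
    interruption: bool = False
    input_formats: tuple = ("pcm16/24000",)
    output_formats: tuple = ("pcm16/24000",)


def require(caps: AdapterCapabilities, flag: str, step: str) -> None:
    """Hypothetical gate: raise if `step` needs a flag the adapter lacks."""
    if not getattr(caps, flag):
        raise UnsupportedCapabilityError(
            f"{step} requires {flag}=True; see the capability matrix"
        )


telephony = AdapterCapabilities(dtmf=True, interruption=True)
require(telephony, "dtmf", "dtmf()")  # flag is set, so nothing is raised

try:
    require(telephony, "streaming_transcripts", "interrupt(after_words=N)")
except UnsupportedCapabilityError as exc:
    gate_message = str(exc)

try:
    telephony.dtmf = False  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
```

Freezing the record means a capability cannot be flipped at runtime to slip past a gate; the declaration made on the class is the one every script step sees.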

This page is the authoritative rendering of what each shipped adapter advertises. Error messages from UnsupportedCapabilityError and PendingTransportError link here.

The table below is auto-generated from the capability declarations in source and kept in sync by a CI gate — if you change an adapter's capabilities, regenerate with:

cd python
uv run python scripts/gen_capability_matrix.py

Capabilities

Adapter | streaming_transcripts | native_vad | dtmf | interruption | input_formats | output_formats
ComposableVoice |  |  |  |  | pcm16/24000 | pcm16/24000
ElevenLabs |  |  |  |  | pcm16/24000 | pcm16/24000
GeminiLive |  |  |  |  | pcm16/16000 | pcm16/24000
LiveKit |  |  |  |  | pcm16/48000 | pcm16/48000
OpenAIRealtime |  |  |  |  | pcm16/24000 | pcm16/24000
Pipecat |  |  |  |  | pcm16/24000, mulaw/8000, opus | pcm16/24000, mulaw/8000, opus
Twilio |  |  |  |  | mulaw/8000 | mulaw/8000
Vapi |  |  |  |  | pcm16/16000 | pcm16/16000
WebRTC |  |  |  |  | pcm16/24000 | pcm16/24000
WebSocket |  |  |  |  | pcm16/24000 | pcm16/24000

Column key

Column | Meaning
streaming_transcripts | Adapter emits incremental transcript events during a turn
native_vad | Adapter has built-in voice activity detection
dtmf | Adapter can detect and forward DTMF (keypad) tones
interruption | Adapter supports barge-in / user-initiated interruption
input_formats | Audio formats the adapter accepts from the user simulator
output_formats | Audio formats the adapter sends to the scenario harness

Internal audio format is always PCM16 @ 24 kHz mono (AudioChunk); each adapter converts at its send/recv boundary.
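To make the boundary conversion concrete, here is a minimal sketch of turning a mulaw/8000 payload into PCM16 at 24 kHz. It is not the SDK's converter: the function names are hypothetical, the decoder follows the standard G.711 mu-law expansion, and the resampler is plain linear interpolation with no filtering, which a production path would likely replace with a proper resampler.

```python
import struct


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def upsample_3x(samples: list[int]) -> list[int]:
    """Linear interpolation from 8 kHz to 24 kHz (no anti-imaging filter)."""
    out = []
    for i, cur in enumerate(samples):
        # Hold the final sample, since there is no successor to interpolate to.
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        for k in range(3):
            out.append(cur + (nxt - cur) * k // 3)
    return out


def mulaw8k_to_pcm16_24k(payload: bytes) -> bytes:
    """Convert a mulaw/8000 payload to little-endian pcm16/24000 bytes."""
    decoded = [mulaw_byte_to_pcm16(b) for b in payload]
    resampled = upsample_3x(decoded)
    return struct.pack(f"<{len(resampled)}h", *resampled)
```

Each input byte becomes three 16-bit samples, so a payload of n bytes yields 6n bytes of output.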

Wire transport and shipping status

The capabilities table above describes what each adapter supports. The table below describes how each adapter is wired and whether it is shipping or still stubbed behind PendingTransportError.

Adapter | Wire transport | Real I/O?
ComposableVoiceAgent | STT + LLM + TTS pipeline (provider-defined) | ✅ shipping
ElevenLabsAgentAdapter | WebSocket (ElevenLabs Convai) | ✅ shipping
GeminiLiveAgentAdapter | WebSocket (Gemini Live) | ✅ shipping
LiveKitAgentAdapter | WebRTC (LiveKit room) | 🚧 stub (PendingTransportError)
OpenAIRealtimeAgentAdapter | WebSocket (OpenAI Realtime) | ✅ shipping
PipecatAgentAdapter | WebSocket (Twilio Media Streams protocol) | ✅ shipping
TwilioAgentAdapter | Media Streams (WebSocket over Twilio) | ✅ shipping
VapiAgentAdapter | REST (Vapi outbound) | 🚧 stub (PendingTransportError)
WebRTCAgentAdapter | WebRTC (datachannel + audio track) | 🚧 stub (PendingTransportError)
WebSocketAgentAdapter | WebSocket (bring-your-own protocol) | ✅ shipping

Adapters marked 🚧 raise PendingTransportError on connect() and are tracked as follow-up issues. Their capability declarations are final (they match the wire spec); only the transport glue code is pending.

Use case × provider — demos

The examples/voice/ directory has one demo per use case. Each demo picks a provider that supports the capability it proves; the remaining cells show where the same use case could also work by substituting the adapter.

Legend:

  • ✅ shipped — running demo lives at examples/voice/<file>.py for the listed provider, or via simple adapter substitution.
  • 🟡 supported, no demo — the capability works on the listed adapter but no demo file exists yet. Track in follow-up issues.
  • ❌ unsupported — the adapter's transport or capability flags do not allow this use case. Don't try.
  • ⏸ skipped — possible in principle but cost-prohibitive (real phone call, paid voice, etc.); covered manually rather than in CI.
Use case | Demo | Pipecat WS | Twilio | OpenAI Realtime | ElevenLabs | Gemini Live
Basic greeting | basic_greeting.py | 🟡, 🟡, 🟡, 🟡
Interruption recovery | interruption_recovery.py | 🟡, 🟡, ❌ until SDK wires interrupt, ❌ until SDK wires interrupt
Random interruptions | random_interruptions.py | 🟡, 🟡
DTMF IVR navigation | dtmf_ivr.py | ❌ no DTMF, ❌ no DTMF, ❌ no DTMF, ❌ no DTMF
Pre-recorded audio | prerecorded_audio.py | 🟡, 🟡, 🟡, 🟡
Tool call verification | tool_verification.py | 🟡, 🟡, 🟡, 🟡
Silence handling | silence_handling.py | 🟡, 🟡, 🟡, 🟡
Long hold (15s wait) | long_hold.py | 🟡, 🟡, 🟡, 🟡
Multi-intent in one turn | multi_intent.py | 🟡, 🟡, 🟡, 🟡
Background handoff (effects) | background_handoff.py | 🟡, 🟡, 🟡, 🟡
Accent-misunderstanding loop | accent_loop.py | 🟡, 🟡, 🟡, 🟡
Angry customer + cafe noise | angry_customer.py | 🟡, 🟡, 🟡, 🟡
Emotional escalation | emotional_escalation.py | 🟡, 🟡, 🟡, 🟡
Twilio inbound call | twilio_inbound.py | ⏸ real phone
Twilio outbound call | twilio_outbound.py | ⏸ real phone
ElevenLabs branded composable | elevenlabs_branded.py
ElevenLabs hosted ConvAI | elevenlabs_hosted.py
Gemini Live native audio | gemini_live.py
OpenAI Realtime as agent | openai_realtime_agent.py
OpenAI Realtime as user sim | openai_realtime_user.py | n/a | n/a | currently skip-guarded; no cross-adapter audio bridge yet | n/a | n/a
Pipecat WebSocket happy path | pipecat_ws.py
Pipecat scenario harness | pipecat_scenario.py
Recording + playback | recording_playback.py | 🟡, 🟡, 🟡, 🟡
STT provider swap | stt_swap.py | 🟡, 🟡, 🟡, 🟡
Voice/text entrypoint parity | voice_text_parity.py | 🟡, 🟡, 🟡, 🟡
Observability hooks + latency | observability.py | 🟡, 🟡, 🟡, 🟡

🟡 cells convert to ✅ by swapping the adapter in the demo's agents=[...] list. They're 🟡 not because the use case fails — it generally works — but because a verified, recorded, rendered demo doesn't yet exist for that combination. File issues for the gaps you care about.

Capability semantics

  • Streaming transcripts — the adapter emits incremental transcript tokens as the agent speaks. Required for scenario.interrupt(after_words=N). Without it, that step raises UnsupportedCapabilityError and points here.

  • Native VAD — the adapter emits user_start_speaking / user_stop_speaking events from its own voice-activity-detection pipeline. When False, the SDK falls back to webrtcvad-wheels on the incoming audio stream and emits a one-shot UserWarning ("Adapter X has no native VAD — using SDK-side webrtcvad, accuracy may differ").

  • DTMF — the adapter can transmit DTMF tones over a telephony transport. Required for scenario.dtmf("1234#"). Without it, that step raises UnsupportedCapabilityError.

  • Interruption (native cancel) — the adapter can send a transport-level cancel signal that stops the agent under test mid-utterance (Twilio Media Streams clear, OpenAI Realtime response.cancel, etc.). Required for first-class barge-in. Without it, scenario.interrupt(content) falls back to overlapping user audio with the agent's TTS and relying on the AUT's own VAD-based barge-in (less deterministic).

    Interrupts are inherently a duplex-channel capability: the SDK has to send a control frame while the agent is still streaming. HTTP/REST transports cannot support this. WebSocket and WebRTC adapters can.

    Two flavours exist in the wild:

    1. Client-initiated cancel: the SDK sends a control frame (response.cancel for OpenAI Realtime, clear for Twilio Media Streams / Pipecat-over-Twilio). Deterministic and explicit. The adapter publishes interruption=True and implements async def interrupt().
    2. Server-side VAD barge-in: the provider's own VAD listens to incoming user audio and cancels its current response when speech is detected (ElevenLabs ConvAI). The client only needs to keep streaming user audio; there is no separate cancel frame and no interrupt() method. The adapter advertises interruption=False because the SDK cannot send a cancel signal; the only knob is "send the next user chunk." Barge-in still works, but its timing is the server's call, not the SDK's. Gemini Live also runs server-side VAD, but its adapter uses the provider's Activity markers for explicit cancel and therefore publishes interruption=True.
  • Input formats / Output formats — wire formats the adapter accepts / emits. The SDK converts internally.
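The two interruption flavours can be sketched as a single dispatch on the interruption flag. The adapters below are stand-ins, `barge_in` is a hypothetical executor step, and the ordering (cancel first, then user audio) is an assumption rather than documented SDK behaviour.

```python
import asyncio
from types import SimpleNamespace


class CancelCapableAdapter:
    """Stand-in for an adapter with a client-initiated cancel frame."""
    capabilities = SimpleNamespace(interruption=True)

    def __init__(self):
        self.sent = []

    async def interrupt(self):
        # A real adapter would send e.g. a `clear` or `response.cancel` frame.
        self.sent.append("cancel")

    async def send_audio(self, chunk: bytes):
        self.sent.append("audio")


class ServerVadAdapter:
    """Stand-in for an adapter whose provider handles barge-in server-side."""
    capabilities = SimpleNamespace(interruption=False)

    def __init__(self):
        self.sent = []

    async def send_audio(self, chunk: bytes):
        self.sent.append("audio")


async def barge_in(adapter, user_audio: bytes) -> str:
    """Hypothetical handling of interrupt(content) for either flavour."""
    if adapter.capabilities.interruption:
        await adapter.interrupt()            # deterministic native cancel
        await adapter.send_audio(user_audio)
        return "native-cancel"
    # No cancel frame exists: overlap user audio and let the remote VAD react.
    await adapter.send_audio(user_audio)
    return "overlap"


cancel_adapter = CancelCapableAdapter()
vad_adapter = ServerVadAdapter()
cancel_mode = asyncio.run(barge_in(cancel_adapter, b"\x00\x00"))
overlap_mode = asyncio.run(barge_in(vad_adapter, b"\x00\x00"))
```

In the overlap branch the only thing the caller controls is when the user audio is sent, which is why that path is described as less deterministic.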

Errors that reference this page

  • scenario.voice.capabilities.UnsupportedCapabilityError — raised when a script step requests a capability the adapter does not advertise (e.g., dtmf() on a non-telephony adapter, interrupt(after_words=N) on an adapter without streaming transcripts).
  • scenario.voice.adapters.PendingTransportError — raised by adapter stubs whose send_audio / recv_audio implementations have not landed yet. Points users here so they can pick an adapter with a real transport (today: Pipecat WS, Twilio, OpenAI Realtime, ElevenLabs, Gemini Live) or subclass and implement their own.
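One way to cope with stubbed transports is to try candidate adapters in order and skip any that report a pending transport. The sketch below is illustrative only: the classes are stand-ins, `connect_first_available` is a hypothetical helper, and it assumes the stub raises from connect().

```python
import asyncio


class PendingTransportError(NotImplementedError):
    """Stand-in for the SDK error; defined locally so the sketch runs alone."""


class StubAdapter:
    name = "stub"

    async def connect(self):
        # Assumption: the stub fails as soon as a connection is attempted.
        raise PendingTransportError("transport not implemented yet")


class WorkingAdapter:
    name = "working"

    async def connect(self):
        return None


async def connect_first_available(candidates):
    """Try adapters in order; skip any whose transport is still a stub."""
    skipped = []
    for adapter in candidates:
        try:
            await adapter.connect()
        except PendingTransportError:
            skipped.append(adapter.name)
            continue
        return adapter, skipped
    raise RuntimeError(f"no adapter with a real transport among: {skipped}")


chosen, skipped = asyncio.run(
    connect_first_available([StubAdapter(), WorkingAdapter()])
)
```

Only PendingTransportError is caught here, so genuine connection failures on a shipping adapter still surface instead of being silently skipped.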

Checking capabilities programmatically

import scenario

adapter = scenario.PipecatAgentAdapter(url="ws://localhost:8765/ws")
script = []  # script steps to pass to the scenario run

# DTMF needs a telephony-capable adapter.
if adapter.capabilities.dtmf:
    script.append(scenario.dtmf("1#"))

# Word-count interrupts need incremental transcript events.
if adapter.capabilities.streaming_transcripts:
    script.append(scenario.interrupt(after_words=3, content="Wait"))
else:
    # Event-driven barge-in works on every adapter; native cancel fires
    # only if capabilities.interruption is True.
    script.append(scenario.interrupt(content="Wait"))

Authoring a custom adapter

When subclassing VoiceAgentAdapter, re-declare capabilities with accurate flags. Inheriting a parent's AdapterCapabilities ClassVar and not re-auditing it will silently break capability-gated script steps. For instance, claiming streaming_transcripts=True when your transport only delivers completed transcripts will cause interrupt(after_words=N) to hang indefinitely because no partial-transcript events ever arrive. Claiming interruption=True without implementing async def interrupt() will make the executor call a method that doesn't exist.

Python
import scenario

class MyCustomAdapter(scenario.VoiceAgentAdapter):
    # Declare only what the transport really delivers; see the pitfalls above.
    capabilities = scenario.voice.AdapterCapabilities(
        streaming_transcripts=False,
        native_vad=False,
        dtmf=False,
        interruption=False,
        input_formats=["pcm16/24000"],
        output_formats=["pcm16/24000"],
    )
TypeScript
import { voice } from "@langwatch/scenario";
 
class MyCustomAdapter extends voice.VoiceAgentAdapter {
  readonly capabilities = new voice.AdapterCapabilities({
    streamingTranscripts: false,
    nativeVad: false,
    dtmf: false,
    interruption: false,
    inputFormats: ["pcm16/24000"],
    outputFormats: ["pcm16/24000"],
  });
 
  async connect() { /* ... */ }
  async disconnect() { /* ... */ }
  async sendAudio(_chunk: voice.AudioChunk) { /* ... */ }
  async receiveAudio(_timeout: number): Promise<voice.AudioChunk> {
    throw new Error("not implemented");
  }
  async call(_input: any): Promise<any> { /* ... */ }
}
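The two pitfalls named in this section (inheriting a parent's declaration without re-auditing it, and claiming interruption=True without an async interrupt()) can be caught before a run with a small check. The `audit_adapter` helper below is hypothetical and not part of the SDK; the classes are stand-ins that carry only the attributes the check reads.

```python
import inspect
from types import SimpleNamespace


def audit_adapter(cls) -> list[str]:
    """Hypothetical pre-flight check for a custom adapter class."""
    problems = []
    # vars(cls) only contains attributes defined on this class itself.
    if "capabilities" not in vars(cls):
        problems.append("capabilities inherited, not re-declared")
    caps = getattr(cls, "capabilities", None)
    if caps is None:
        return problems + ["no capabilities declared at all"]
    has_interrupt = inspect.iscoroutinefunction(getattr(cls, "interrupt", None))
    if caps.interruption and not has_interrupt:
        problems.append("interruption=True but no async interrupt() method")
    return problems


class Base:
    capabilities = SimpleNamespace(interruption=False)


class Inherits(Base):
    pass  # silently reuses the parent's declaration


class Overclaims(Base):
    capabilities = SimpleNamespace(interruption=True)  # no interrupt() defined


class Honest(Base):
    capabilities = SimpleNamespace(interruption=True)

    async def interrupt(self):
        pass


inherits_report = audit_adapter(Inherits)
overclaims_report = audit_adapter(Overclaims)
honest_report = audit_adapter(Honest)
```

A check like this cannot tell whether streaming_transcripts=True is truthful, since that depends on what the transport emits at runtime; that flag still needs a manual audit.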

Source of truth

Capability values live in each adapter's capabilities: ClassVar[AdapterCapabilities] declaration. The canonical source file is python/scenario/voice/capabilities.py.

The generator script that produces the auto-generated table above is python/scripts/gen_capability_matrix.py.
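For readers curious how a table can be derived from class-level declarations, the following is a minimal sketch of the idea, not the contents of gen_capability_matrix.py. The dataclass is a stand-in, the `render_row` and `render_matrix` helpers are hypothetical, and the example values are invented for the demonstration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AdapterCapabilities:
    """Stand-in record; field order doubles as column order."""
    streaming_transcripts: bool
    native_vad: bool
    dtmf: bool
    interruption: bool
    input_formats: tuple
    output_formats: tuple


def render_row(name: str, caps: AdapterCapabilities) -> str:
    """Render one adapter as a pipe-separated row."""
    cells = [name]
    for f in fields(caps):
        value = getattr(caps, f.name)
        if isinstance(value, bool):
            cells.append("yes" if value else "no")
        else:
            cells.append(", ".join(value))
    return " | ".join(cells)


def render_matrix(adapters: dict) -> str:
    """Render a header plus one row per adapter, sorted by name."""
    header = " | ".join(["Adapter"] + [f.name for f in fields(AdapterCapabilities)])
    rows = [render_row(name, caps) for name, caps in sorted(adapters.items())]
    return "\n".join([header] + rows)


# Example values are made up for the demo, not taken from any real adapter.
demo = {
    "Example": AdapterCapabilities(
        True, False, False, True, ("pcm16/24000",), ("pcm16/24000",)
    )
}
table = render_matrix(demo)
```

Deriving the columns from the dataclass fields means a newly added capability shows up in the rendered table without touching the renderer, which suits a CI check that compares the generated output against the committed page.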

Deferred / follow-up items

  • Native interrupt for ElevenLabs. Investigated; the provider runs server-side VAD and has no client-initiated cancel frame in its public protocol. Setting interruption=True would be incorrect because interrupt() would have nothing to send. Barge-in takes effect as soon as the executor's next user audio chunk hits the wire; no SDK change is required. ElevenLabs emits a server→client interruption event when its VAD fires; surfacing that in the voice timeline is a separate enhancement. (Gemini Live also runs server-side VAD but additionally exposes Activity markers; the Gemini Live adapter uses those for explicit cancel, so it does publish interruption=True.)
  • Transport implementations for LiveKit, Vapi, WebRTC. Stubs raise PendingTransportError at send_audio / recv_audio. The capability declarations describe what they will support.
  • OpenAIRealtimeAgentAdapter(role=USER) cross-adapter audio bridging. When the OpenAI Realtime user simulator is paired with a different agent adapter (e.g. Pipecat), there's no bridge piping the user-side audio into the agent-side input. Demo openai_realtime_user.py skip-guards rather than crashing.
  • Use-case demos for non-default providers (the 🟡 cells above). File issues per (use case × provider) you want covered.