Optional agent
Set when the adapter has emitted its first agent audio chunk for the
current turn — gates timing-based barge-in. Concrete adapters expose
this so scenario.interrupt can wait for real speech before
firing the interruption. Optional: adapters without server-VAD-style
interrupt sequencing can leave it undefined.
Readonly capabilities
Declaration of what this adapter can and cannot do. Concrete subclasses MUST publish a non-default value; the base instance defaults to "nothing supported" so capability-gated steps fail safely when an adapter forgets to declare.
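A minimal sketch of the "nothing supported" default, assuming a flat record of boolean flags. Only the flag names dtmf and streamingTranscripts appear in this reference; the interface, the class names, and the requireCapability helper are hypothetical and shown for illustration only.

```typescript
// Hypothetical shape: only the two flag names come from this reference.
interface VoiceCapabilities {
  dtmf: boolean;
  streamingTranscripts: boolean;
}

// Base default: every capability is off.
const NO_CAPABILITIES: Readonly<VoiceCapabilities> = Object.freeze({
  dtmf: false,
  streamingTranscripts: false,
});

class BaseAdapter {
  readonly capabilities: Readonly<VoiceCapabilities> = NO_CAPABILITIES;
}

// A concrete adapter publishes its own, non-default declaration.
class StreamingAdapter extends BaseAdapter {
  override readonly capabilities: Readonly<VoiceCapabilities> = Object.freeze({
    dtmf: false,
    streamingTranscripts: true,
  });
}

// Hypothetical gate a capability-dependent step could call before running.
function requireCapability(adapter: BaseAdapter, key: keyof VoiceCapabilities): void {
  if (!adapter.capabilities[key]) {
    throw new Error(`adapter does not support "${key}"`);
  }
}
```

With this layout an adapter that forgets to declare anything inherits the all-false record, so a gated step throws instead of running against an unsupported transport.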
Readonly instructions
Most recent finalized agent transcript (post audio_transcript.done).
Most recent user-side transcript from the Whisper input pipeline.
Readonly model
Optional name
Hard cap on a single agent turn's audio. Prevents runaway loops if a transport never signals end-of-stream. 30s = a long sentence.
Tail silence: once the first agent chunk arrives, keep draining receiveAudio until no chunk shows up within this many seconds — that's how we detect the agent finished talking.
Seconds to wait for agent audio after sending user audio.
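The three timing limits above interact roughly as sketched here. This is a simulation over chunk arrival times, not the adapter's drain loop. The field names responseTimeoutS, tailSilenceS and maxTurnS are hypothetical, and the sketch assumes the hard cap is measured in wall-clock seconds from the first agent chunk.

```typescript
// Hypothetical names for the three limits described above.
interface TurnTiming {
  responseTimeoutS: number; // wait for the first agent chunk
  tailSilenceS: number;     // gap that means "agent finished talking"
  maxTurnS: number;         // hard cap on one agent turn
}

type TurnEnd = "no-response" | "tail-silence" | "max-duration";

// arrivals: seconds since the user audio was sent, one entry per agent chunk,
// in ascending order.
function drainTurn(arrivals: number[], t: TurnTiming): { kept: number; reason: TurnEnd } {
  const first = arrivals[0];
  if (first === undefined || first > t.responseTimeoutS) {
    return { kept: 0, reason: "no-response" };
  }
  let last = first;
  let kept = 1;
  for (const at of arrivals.slice(1)) {
    // No chunk within the tail-silence window: the turn is over.
    if (at - last > t.tailSilenceS) return { kept, reason: "tail-silence" };
    // Assumption: the cap counts from the first agent chunk.
    if (at - first > t.maxTurnS) return { kept, reason: "max-duration" };
    last = at;
    kept++;
  }
  // The stream ran dry, so the next wait would end in tail silence.
  return { kept, reason: "tail-silence" };
}
```

For example, chunks arriving at 0.5, 0.6, 0.7 and 3.0 seconds with a 1 second tail window keep the first three chunks; the fourth arrives after the turn has already been closed.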
Optional streaming
Incremental transcript text emitted while the agent speaks. Populated
by adapters that advertise capabilities.streamingTranscripts. Read
by scenario.interrupt when afterWords: N is set.
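A rough sketch of how an afterWords: N check could read the incremental transcript. The whitespace-based word count and both function names are assumptions; the real trigger logic in scenario.interrupt is not shown in this reference.

```typescript
// Count whitespace-separated words; an empty or blank string has zero words.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical helper: true once the agent has spoken at least afterWords words.
// An adapter without streaming transcripts leaves the text undefined.
function reachedWordThreshold(streamingText: string | undefined, afterWords: number): boolean {
  return streamingText !== undefined && wordCount(streamingText) >= afterWords;
}
```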
Readonly tools
Readonly voice
Surface realtime tool calls alongside the spoken audio turn (#630).
The base call() (defaultVoiceCall) returns a single assistant audio
message and does all the recording bookkeeping. We keep that intact and,
when the agent called any tools this turn, append ONE extra role:"tool"
message carrying every call as AI-SDK tool-result parts — the shape
state.hasToolCall / state.lastToolCall consume (AC4).
Returns:
[audioMessage, toolMessage] when ≥1 tool was called (AC4/AC10).
convertAgentReturnTypesToMessages passes a list through verbatim into
the run's messages.
Per-turn tool state is reset HERE (turn start) so tool calls never leak
across turns; the function-call events for THIS turn are consumed inside
super.call()'s drain and finalized onto _completedToolCalls.
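The return shape described above can be sketched as follows. All types here are local stand-ins. The tool-result part uses the field names type, toolCallId, toolName and result as an assumption about the AI SDK shape (newer SDK versions may name the payload field differently), and returning the bare audio message when no tool was called is likewise assumed from the base call() behavior.

```typescript
// Local stand-in types; the real message types live in the library.
interface ToolResultPart {
  type: "tool-result";
  toolCallId: string;
  toolName: string;
  result: unknown;
}
interface AudioMessage { role: "assistant"; content: string; }
interface ToolMessage { role: "tool"; content: ToolResultPart[]; }
interface CompletedToolCall { id: string; name: string; result: unknown; }

// Append ONE tool message carrying every call made during this turn.
function appendToolMessage(
  audioMessage: AudioMessage,
  completed: CompletedToolCall[],
): AudioMessage | [AudioMessage, ToolMessage] {
  if (completed.length === 0) return audioMessage; // assumption: unchanged base result
  const toolMessage: ToolMessage = {
    role: "tool",
    content: completed.map((c): ToolResultPart => ({
      type: "tool-result",
      toolCallId: c.id,
      toolName: c.name,
      result: c.result,
    })),
  };
  return [audioMessage, toolMessage];
}
```

Keeping all calls in a single tool message means the audio message stays first in the list and the turn adds at most one extra entry to the run's messages.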
Open the Realtime WebSocket and send the initial session.update.
Close the WebSocket if open.
Send response.cancel — the OpenAI Realtime API's first-class
interrupt. The model stops generating audio and text immediately. No
timing race against VAD: deterministic stop, then the next user turn
flows normally through sendAudio + receiveAudio.
Whether the Realtime WebSocket is open (Gap #11).
Commit any pending audio, request a response, and return the first audio chunk the model produces.
Loops over incoming events until a response.output_audio.delta
event arrives, then returns decoded PCM16. Transcript events update
lastUserTranscript / lastAgentTranscript. An error event throws.
GA event names are response.output_audio[_transcript].{delta,done}
(the Beta response.audio[_transcript].* names are dead). We accept
both so back-port to a Beta endpoint stays trivial; production hits
the GA path.
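One way to accept both naming schemes is a lookup table that maps each GA and Beta event name to a neutral kind. The event names are the ones listed above; the AudioEventKind labels and the classifyEvent helper are hypothetical.

```typescript
type AudioEventKind = "audio.delta" | "audio.done" | "transcript.delta" | "transcript.done";

// GA names first, then the Beta names kept for back-porting.
const EVENT_KINDS = new Map<string, AudioEventKind>([
  ["response.output_audio.delta", "audio.delta"],
  ["response.output_audio.done", "audio.done"],
  ["response.output_audio_transcript.delta", "transcript.delta"],
  ["response.output_audio_transcript.done", "transcript.done"],
  ["response.audio.delta", "audio.delta"],
  ["response.audio.done", "audio.done"],
  ["response.audio_transcript.delta", "transcript.delta"],
  ["response.audio_transcript.done", "transcript.done"],
]);

// Returns undefined for events the receive loop does not care about.
function classifyEvent(type: string): AudioEventKind | undefined {
  return EVENT_KINDS.get(type);
}
```

The receive loop can then branch on the neutral kind and stay unaware of which endpoint generation produced the event.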
Append a PCM16 audio chunk to the model's input audio buffer.
Only emits input_audio_buffer.append — commit + response are deferred
to the next receiveAudio call. The executor may call sendAudio many
times for a single user turn (TTS streams audio as chunks); committing
per-chunk would confuse the server with sub-second turn boundaries.
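The deferred-commit behavior can be illustrated with a small stand-in class. The three event type strings are OpenAI Realtime client events; the class name, the beginReceive method and the injected send callback are hypothetical, and the real adapter does this work inside sendAudio and receiveAudio.

```typescript
type ClientEvent =
  | { type: "input_audio_buffer.append"; audio: string }
  | { type: "input_audio_buffer.commit" }
  | { type: "response.create" };

class DeferredCommitSender {
  private readonly send: (event: ClientEvent) => void;
  private pendingAudio = false;

  constructor(send: (event: ClientEvent) => void) {
    this.send = send;
  }

  // Called once per TTS chunk: append only, never commit.
  sendAudio(base64Pcm16: string): void {
    this.send({ type: "input_audio_buffer.append", audio: base64Pcm16 });
    this.pendingAudio = true;
  }

  // Called at the start of the receive step: one commit per user turn.
  beginReceive(): void {
    if (!this.pendingAudio) return;
    this.send({ type: "input_audio_buffer.commit" });
    this.send({ type: "response.create" });
    this.pendingAudio = false;
  }
}
```

However many chunks the executor streams, the server sees a single turn boundary.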
Transmit DTMF tones to the telephony peer. Adapters that advertise
capabilities.dtmf MUST implement this; the default raises
UnsupportedCapabilityError so an adapter that forgot to ship
sendDtmf while claiming the capability fails loudly instead of
silently routing through a PCM fallback.
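A minimal sketch of the fail-loud default. UnsupportedCapabilityError and sendDtmf are named in this reference, but the error's constructor argument, the synchronous signature and the two class names are assumptions made to keep the example short.

```typescript
class UnsupportedCapabilityError extends Error {
  constructor(capability: string) {
    super(`capability not implemented: ${capability}`);
    this.name = "UnsupportedCapabilityError";
  }
}

class BaseVoiceAdapter {
  // Default: refuse instead of silently falling back to PCM audio.
  sendDtmf(_digits: string): void {
    throw new UnsupportedCapabilityError("dtmf");
  }
}

// Hypothetical telephony adapter that really implements DTMF.
class TelephonyAdapter extends BaseVoiceAdapter {
  readonly sentDigits: string[] = [];

  override sendDtmf(digits: string): void {
    this.sentDigits.push(digits);
  }
}
```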
Inject scripted text into the realtime session as a user message.
Used when this adapter is the user simulator (role=USER): scripted
user("text") steps route through here instead of spawning TTS. The
model synthesizes the text into spoken audio with natural prosody,
which is then delivered via receiveAudio.
Per §7.2, OpenAI Realtime cannot populate assistant audio messages retroactively; the downstream transcript reflects what the model actually emitted, not what was scripted.
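For orientation, the user message described above would look roughly like this on the wire. The payload follows the OpenAI Realtime conversation.item.create client event with an input_text content part; the buildUserTextItem helper is hypothetical, and the sketch assumes the response itself is requested later by receiveAudio.

```typescript
interface UserTextItemEvent {
  type: "conversation.item.create";
  item: {
    type: "message";
    role: "user";
    content: { type: "input_text"; text: string }[];
  };
}

// Build the client event that injects scripted text as a user message.
function buildUserTextItem(text: string): UserTextItemEvent {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }],
    },
  };
}
```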
Hide the API key when this object lands in error messages or logs.
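One common way to do this is to override the serialization hooks so the key never reaches JSON.stringify or string interpolation. This is a sketch of that general technique, not the adapter's actual mechanism; the class name, the fields and the "[REDACTED]" placeholder are hypothetical.

```typescript
class RealtimeCredentials {
  private readonly apiKey: string;
  readonly model: string;

  constructor(apiKey: string, model: string) {
    this.apiKey = apiKey;
    this.model = model;
  }

  // The only place the raw key is exposed.
  authHeader(): string {
    return `Bearer ${this.apiKey}`;
  }

  // Used by JSON.stringify, and so by most structured loggers.
  toJSON(): { model: string; apiKey: string } {
    return { model: this.model, apiKey: "[REDACTED]" };
  }

  // Used by string interpolation and String(...).
  toString(): string {
    return `RealtimeCredentials(model=${this.model}, apiKey=[REDACTED])`;
  }
}
```

Runtime-specific inspectors, such as console.log on a plain object in Node.js, do not go through toJSON or toString and would need their own hook.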
Exercise OpenAI's Realtime API as either the agent under test (role=AGENT, default) or as the voice-enabled user simulator (role=USER, per §7.2 L1164-1171).
When role=USER, scripted user("text") steps route text through the realtime session's text-input channel rather than triggering TTS.
Transcript observability:
lastUserTranscript: set from conversation.item.input_audio_transcription.completed
lastAgentTranscript: accumulated from response.audio_transcript.delta, reset on done