Building a Restaurant Voice Agent with GPT-Realtime-1.5: A Practical Tutorial

OpenAI's developer team just published a walkthrough of building a production voice agent for restaurant ordering using gpt-realtime-1.5 — and the architecture choices reveal where real-time voice AI actually is in 2026. The system handles menu queries, order customization, dietary restrictions, and payment handoff over a single persistent WebSocket connection. If you've been waiting for a concrete example of the Realtime API beyond demos, this is it. Here's what the implementation looks like, what works, and where the sharp edges are.

What Happened

OpenAI's developer relations team shared a detailed breakdown of a restaurant voice agent built on gpt-realtime-1.5, the latest iteration of their real-time speech-to-speech model. Unlike the traditional pipeline of speech-to-text → LLM → text-to-speech, gpt-realtime-1.5 processes audio natively — the model hears the customer and speaks back without intermediate text transcription.

The restaurant agent handles the full ordering flow: greeting, menu navigation, item customization (size, toppings, modifications), allergy and dietary restriction checking, order confirmation, and handoff to a payment system. The demo uses a pizza restaurant as the reference case, but the architecture generalizes to any menu-driven ordering scenario.

Key specs from the implementation: the system maintains sub-300ms response latency for most exchanges, handles interruptions naturally (a customer can change their mind mid-sentence), and maintains conversation state across a multi-turn ordering session that can span 5-10 minutes. The model runs on OpenAI's Realtime API, which uses WebSocket connections rather than REST calls — the connection stays open for the entire conversation.

This builds on the Realtime API that launched in late 2024, but gpt-realtime-1.5 brings significantly improved function calling reliability and better handling of noisy audio environments — both critical for restaurant use cases where background noise is constant.

Why It Matters

Voice ordering isn't new — IVR systems have existed for decades. What's new is a voice agent that actually understands context, handles edge cases gracefully, and doesn't make customers want to throw their phone.

The economics are compelling. A single restaurant location might spend $3,000-5,000/month on phone order staffing during peak hours. The Realtime API's pricing — roughly $0.06 per minute of conversation at current rates — means a busy restaurant handling 200 phone orders per day at 3 minutes each would spend about $36/day on API costs. That's a 90%+ cost reduction before factoring in 24/7 availability.

But the bigger signal is architectural. The speech-to-speech approach eliminates an entire class of errors that plague transcription-based pipelines. Accents, background noise, and mumbled words that would torpedo a Whisper → GPT → TTS pipeline get handled more gracefully when the model processes audio natively. For restaurant ordering specifically — where proper nouns (menu item names), numbers (quantities, table numbers), and modifications ("no onions, extra cheese, light sauce") are the entire conversation — this accuracy improvement is the difference between a usable product and an expensive frustration machine.

The competitive landscape is shifting too. Google's Gemini Live and Anthropic's voice capabilities are advancing, but OpenAI currently has the most mature real-time voice API. Startups building voice agents should note: this reference implementation essentially provides a production blueprint that previously required months of custom engineering.

Technical Deep-Dive

The architecture has three core components: the WebSocket connection layer, the function calling system, and the state management logic.

WebSocket Connection: The client establishes a persistent WebSocket to wss://api.openai.com/v1/realtime. Audio streams bidirectionally — the client sends microphone input as raw PCM audio frames, and the server streams back generated speech. The connection handles turn detection automatically; the model identifies when the customer stops speaking and begins its response.

Function Calling: This is where the real engineering lives. The agent defines tools for:

const tools = [
  { name: "get_menu", description: "Retrieve current menu items and prices" },
  { name: "check_availability", description: "Verify item is currently available" },
  { name: "add_to_order", parameters: { item: "string", modifications: "string[]", quantity: "number" } },
  { name: "remove_from_order", parameters: { item_id: "string" } },
  { name: "get_order_summary", description: "Read back current order" },
  { name: "submit_order", description: "Finalize and send to kitchen" }
];

Each function call happens mid-conversation. When a customer says "Add a large pepperoni with extra cheese," the model calls add_to_order with the parsed parameters, gets a confirmation response, and speaks the confirmation — all within the same audio stream. gpt-realtime-1.5's improved function calling means this works reliably about 95% of the time, up from roughly 80% with the previous model version.

State Management: The session object tracks the evolving order, customer preferences mentioned earlier in the conversation ("I'm vegetarian"), and conversation phase (browsing, ordering, confirming, paying). This state persists in the system prompt that gets updated with each function call result.

One critical limitation: the Realtime API doesn't support persistent memory across calls. If a regular customer calls back tomorrow, the system doesn't remember their usual order. That requires an external customer database and a lookup function — achievable, but additional engineering work.

Latency optimization matters. The implementation uses input_audio_transcription for logging but relies on the native audio path for responses. Forcing text intermediate steps adds 200-400ms of latency — unacceptable for natural conversation.

What You Should Do

Start with the OpenAI Realtime API playground before writing code. Test your menu-specific terminology and see how well the model handles your domain's vocabulary.
Design your function schema carefully. The quality of your tool definitions directly determines how reliably the agent parses customer requests. Be explicit about parameter types and descriptions.
Build a fallback path. When the voice agent can't understand a request after two attempts, hand off to a human. A graceful escalation preserves the customer experience.
Test with real background noise. Record audio in your actual restaurant environment and use it during development. The gap between quiet-room testing and real-world performance is significant.
Monitor function call accuracy. Log every tool invocation and its parameters. A 95% accuracy rate means 1 in 20 orders has an error — track and improve this metric relentlessly.

Related: Today's newsletter covers the broader AI development landscape this week. See also: Voice Agent for background on real-time voice AI architectures.

Found this useful? Subscribe to AI News for daily AI briefings.