Building Multiplayer Voice-Based AI Apps
Svapnil Ankolkar | 2024-10-03
Earlier this year, Woodside Labs launched one of the first multiplayer conversational AI applications in the world. It currently has around 100,000 users. I’m here to explain a little bit about how we architected the first audio-first social app on Discord, and where we might take it from here.
Presidential AI initially launched as a joke between friends. Today it's one of the few audio-first AI consumer apps, and it profitably brings thousands of people together and makes them laugh at scale.
Presidential AI's growth over the last 24 months
We’ve since built novel techniques to scale conversational AI to handle multiplayer group conversations, and we're excited to share those techniques with you.
Core Product
The initial premise was simple: type a message and hear Donald Trump join your voice call and say it out loud. Presidential AI started as a small demo between friends at the start of the generative audio boom about a year ago, and what started as a demo quickly went viral as Discord users found the app online and started using it in their group chats.
Diagram of the simple TTS architecture
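At its core, that first version just mapped a chat command onto a TTS request. A minimal sketch of that command parsing, assuming a hypothetical `/say` command and a made-up `voice_id` naming scheme (not our actual API surface):

```python
from dataclasses import dataclass

@dataclass
class TTSRequest:
    voice_id: str  # hypothetical identifier for the cloned voice
    text: str      # the message the voice should speak aloud

def build_tts_request(command: str, voice_id: str = "trump-v1") -> TTSRequest:
    """Turn a '/say <message>' chat command into a TTS request payload."""
    prefix = "/say "
    if not command.startswith(prefix):
        raise ValueError("not a /say command")
    return TTSRequest(voice_id=voice_id, text=command[len(prefix):].strip())
```

The request would then be sent to an off-the-shelf TTS provider and the returned audio played into the voice channel.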
Why did our product have such strong initial traction? Firstly, being able to deep-fake the presidents was funny, and people wanted to entertain themselves long-running voice channels they were in.
Since then, we’ve built the project from a simple TTS app to a cascading system of models to enable full conversational audio. This is similar to what one might find in most modern conversational AI audio startups.
Diagram of the one-person cascading voice AI architecture we built to handle conversations
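One turn of a cascading pipeline is just speech-to-text, then an LLM call over the conversation history, then text-to-speech. A simplified sketch of that control flow, with the three model stages passed in as plain callables (stand-ins for real provider SDKs):

```python
from typing import Callable, List, Dict

def cascade_turn(
    stt: Callable[[bytes], str],        # speech-to-text stage
    llm: Callable[[List[Dict]], str],   # LLM over chat history
    tts: Callable[[str], bytes],        # text-to-speech stage
    history: List[Dict],
    audio_in: bytes,
) -> bytes:
    """Run one user turn through the cascade and return reply audio."""
    user_text = stt(audio_in)
    history.append({"role": "user", "content": user_text})
    reply = llm(history)
    history.append({"role": "assistant", "content": reply})
    return tts(reply)
```

Keeping the history list outside the function is what lets the bot carry context across turns in a long-running call.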
Making The Cascading System Multiplayer
Cascading systems have worked well in one-to-one use cases, but the average Discord call that uses Presidential AI is a group call of three to six people. This meant we had a unique opportunity to make the app multiplayer, a first for conversational AI applications.
What we ended up building is a system that parses out audio streams per user on a channel basis and processes the different incoming streams, so we can decide which stream to act on.
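Because Discord delivers each speaker's audio as a separately tagged stream, the per-user split can be as simple as demultiplexing tagged frames into per-speaker buffers before they hit speech-to-text. An illustrative sketch (the tuple shape and buffering are assumptions, not our production data model):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def demux_frames(frames: Iterable[Tuple[int, bytes]]) -> Dict[int, bytearray]:
    """Group (user_id, pcm_frame) tuples into one audio buffer per user.

    Each speaker arrives on their own stream, so frames come pre-tagged
    with a user ID; we accumulate them independently for downstream STT.
    """
    buffers: Dict[int, bytearray] = defaultdict(bytearray)
    for user_id, frame in frames:
        buffers[user_id].extend(frame)
    return dict(buffers)
```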
We ended up building an additional layer on top of the traditional cascading approach: a cascading model system with conversational history, plus a module that determines whether it’s the AI’s turn to talk.
In this approach, we take a multiparty audio stream, parse it by channel, then determine if it’s the right time to make a response. If it is, we generate an LLM response and synthesize it into audio using off-the-shelf TTS providers.
Diagram of the full multiplayer voice AI architecture we built to serve group calls
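The turn-taking module is the piece that makes the system feel multiplayer: the bot has to stay quiet while people talk over each other and jump in only when addressed. A toy heuristic version, purely illustrative of the gate's role (our actual module is more involved than wake words and a silence threshold):

```python
from typing import Tuple

def should_respond(
    transcript: str,
    silence_ms: int,
    wake_words: Tuple[str, ...] = ("trump", "president"),
    min_silence_ms: int = 700,
) -> bool:
    """Decide whether it's the AI's turn to talk.

    Two illustrative conditions: the humans have paused long enough,
    and the most recent utterance actually addressed the bot.
    """
    if silence_ms < min_silence_ms:
        return False  # someone is still mid-turn; don't interrupt
    lowered = transcript.lower()
    return any(word in lowered for word in wake_words)
```

Only when this gate fires does the stream get handed to the LLM and TTS stages, which keeps the bot from narrating every sentence in a six-person call.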
Looking Forward: State of the Art
We also look forward to novel consumer use-cases that are audio first. We think that Presidential AI will be the first of many great consumer applications that use AI through different modalities than a simple text box.
For cascading systems, Deepgram and Cartesia are building incredibly fast and cheap speech-to-text and text-to-speech models respectively. These models are rapidly becoming faster, cheaper, and better. Singular speech-to-speech models will eventually emerge as the cheapest and fastest way to build conversational AI applications, with Tincans and OpenAI building audio-first foundational models.
The recent launch of OpenAI’s Realtime API is extremely promising for magic-like conversational apps, and we look forward to implementing it in our product soon.
Woodside Labs is focused on building novel technology on top of messaging rails, whether that's messaging platforms like Discord or the banking networks. If this work interests you, please follow @hellosvapnil to stay up to date on what we ship.
If you’d like to chat face-to-face, feel free to schedule some time here.