Building an AI-Powered Audio Interface: Our Journey and Insights

Daniel Jodlos
Published in Software Mansion · 3 min read · Mar 8, 2024


In response to popular demand, our team embarked on an exciting experiment — building an audio interface for GPT. The goal was to grant GPT the ability to listen and speak, eliminating the need for written communication. Today, we are thrilled to share our experiences and observations from this project, focusing on the key concepts, ideas, and challenges we encountered during its implementation.

Basic Building Blocks

To realize the audio interface, we designed a simple yet effective pipeline consisting of three essential tasks: speech recognition, response generation, and speech synthesis.
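The pipeline can be sketched as three functions composed in sequence. The function names and bodies below are hypothetical stubs standing in for the real services; only the chaining is meant to be illustrative:

```python
# Minimal sketch of the three-stage pipeline. Each stage is a stub: a real
# implementation would call a speech-to-text service, a language model, and a
# text-to-speech service respectively.

def recognize_speech(audio: bytes) -> str:
    """Stub: a real implementation would call a speech recognition API."""
    return "hello there"

def generate_response(prompt: str) -> str:
    """Stub: a real implementation would query a language model."""
    return f"You said: {prompt}"

def synthesize_speech(text: str) -> bytes:
    """Stub: a real implementation would call a speech synthesis API."""
    return text.encode("utf-8")

def pipeline(audio_in: bytes) -> bytes:
    # Audio in, audio out: the user never has to read or type anything.
    transcript = recognize_speech(audio_in)
    reply = generate_response(transcript)
    return synthesize_speech(reply)
```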

In our approach, we implemented the client-side application as a website. To replicate real conversational experiences accurately, we incorporated an interrupt feature that lets users break into the conversation at any point. However, this novel functionality also presented us with unique challenges, which we will delve into further in this article.

Simplified visualisation of our audio-to-audio GPT interface

Optimizing for Latency

Ensuring a seamless and natural conversation experience required us to focus on reducing the response latency as much as possible. To achieve this, we adopted a streaming approach at every possible stage:

  1. We employed a streaming variant of the Google Cloud Speech API for real-time speech recognition.
  2. The GPT responses were streamed during generation to minimize processing delays.
  3. We utilized the streaming variant of the Google Cloud Text To Speech API for synthesizing speech in real-time.
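The streaming idea behind these three steps can be sketched with Python generators: each stage consumes and yields chunks as soon as they arrive, so no stage waits for its predecessor to finish. The stage bodies are stubs of our own invention, not the real APIs:

```python
from typing import Iterable, Iterator

def transcribe_stream(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stub for streaming speech recognition: a partial transcript per chunk."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")

def gpt_stream(transcript: Iterable[str]) -> Iterator[str]:
    """Stub for a streamed model response: tokens emitted as they appear."""
    for piece in transcript:
        yield piece.upper()

def tts_stream(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stub for streaming speech synthesis: audio bytes per token."""
    for token in tokens:
        yield token.encode("utf-8")

def run(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Because every stage is a generator, the first audio chunk flows all the
    # way through to synthesized output before later chunks even arrive.
    yield from tts_stream(gpt_stream(transcribe_stream(audio_chunks)))
```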

Thanks to this strict streaming strategy, we were able to maintain a conversation flow that felt immediate and responsive, enhancing the overall user experience.

Choosing the Right Transport Protocol

Selecting an appropriate transport protocol was a key decision in building an effective audio interface. After careful consideration, we determined that transmitting audio data over websockets was the optimal choice due to its simplicity: websockets run over TCP, so audio frames arrive in order and without loss. While WebRTC offers advanced capabilities, we found its deployment process to be complex, requiring intricate server setup and potentially introducing challenges with audio packet loss and reordering. Additionally, accurately determining the positions of interruptions proved to be more challenging with the WebRTC integration.
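The ordering guarantee that makes websockets attractive comes from TCP underneath. To keep the sketch below dependency-free, it emulates the same ordered, framed delivery over a raw asyncio TCP stream with a hypothetical 4-byte length prefix; a websocket library would provide this message framing for you:

```python
import asyncio
import struct

def frame(chunk: bytes) -> bytes:
    """Prefix an audio chunk with its length so the receiver can split frames."""
    return struct.pack(">I", len(chunk)) + chunk

async def read_frames(reader: asyncio.StreamReader):
    """Yield chunks in the exact order they were sent; TCP never reorders."""
    while True:
        try:
            header = await reader.readexactly(4)
        except asyncio.IncompleteReadError:
            return  # peer closed the connection
        (length,) = struct.unpack(">I", header)
        yield await reader.readexactly(length)

async def demo() -> list:
    """Send two fake PCM chunks through a loopback server and collect them."""
    received = []
    done = asyncio.Event()

    async def handler(reader, writer):
        # No jitter buffer or resequencing logic is needed, unlike with
        # UDP-based media transport.
        async for chunk in read_frames(reader):
            received.append(chunk)
        writer.close()
        done.set()

    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    _, writer = await asyncio.open_connection("127.0.0.1", port)
    for chunk in (b"\x00\x01", b"\x02\x03\x04"):  # stand-ins for PCM audio
        writer.write(frame(chunk))
    await writer.drain()
    writer.close()
    await writer.wait_closed()
    await done.wait()
    server.close()
    await server.wait_closed()
    return received
```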

Implementing Interruptions for Realistic Conversations

In order to simulate a genuine conversation experience, we dedicated effort to implementing an interrupt feature that lets users cut the bot off mid-sentence, just as they would in a real-life interaction. You might think "what's difficult about that? Interrupting audio playback seems so straightforward!" The catch is ensuring the bot acknowledges the interruption, identifies exactly where in its response it was cut off, and discards everything after that point.

After exploring several approaches, we found that the frontend application was the most reliable and accurate place to determine the precise timing of an interruption. Having control over the audio player on the client side let us capture the exact moment the user interrupted. Although character-level precision was unattainable due to limitations in the Google Cloud Text To Speech API, we were able to identify the interrupted sentence, which yielded satisfactory results for the user.
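One simple way to map a client-side interrupt time to a sentence index, assuming the frontend knows the playback duration of each synthesized sentence, could look like the following. The function name and the cumulative-sum approach are our own illustration, not the exact implementation:

```python
from bisect import bisect_right
from itertools import accumulate

def interrupted_sentence(durations_s, interrupt_at_s):
    """Return the index of the sentence playing at `interrupt_at_s`.

    `durations_s` holds the playback length (seconds) of each synthesized
    sentence, in order. Everything from the returned index onward would be
    discarded after an interruption.
    """
    # Cumulative end time of each sentence, e.g. [2.0, 3.5] -> [2.0, 5.5].
    boundaries = list(accumulate(durations_s))
    # bisect_right finds the first sentence whose end time is past the
    # interrupt; clamp in case the interrupt arrives after playback ended.
    return min(bisect_right(boundaries, interrupt_at_s), len(durations_s) - 1)
```

This is why sentence-level rather than character-level precision suffices: the lookup only needs per-sentence durations, which are known once each sentence's audio has been synthesized.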

Experience the Basic Demo

To bring our audio interface to life, we developed a basic demo that you can try on Software Mansion's website. This demo serves as a testament to the capabilities and potential of our AI-powered audio interface. We invite you to explore its functionality and witness the future of conversational AI.

Conclusion

By leveraging the aforementioned key building blocks, optimizing latency, selecting appropriate transport protocols, and implementing the interruption features, we have successfully developed an audio interface for GPT that revolutionizes the way users interact with AI systems. Our publicly available demo showcases the potential of this technology and we are excited to push the boundaries of AI-driven audio interfaces in future projects.
