At the start of this year, we set a clear technical goal at Aquant to take voice AI from merely a useful innovation and turn it into something genuinely usable in service environments. We wanted to move beyond the buzzwords and see how voice AI could become a reliable, day-to-day tool for scenarios like customer support and technician assistance. Instead of framing this as a grand transformation, we focused on the nuts and bolts: how can voice be a hands-free, always-on way for technicians to access expert help in their language of choice, or for support agents to streamline their workflows?
Over the months, we've learned a lot about these use cases. We discovered how voice can boost not only process efficiency and operational ease but also improve customer satisfaction and even open up new revenue opportunities. This blog post is our technical deep dive into that journey, laying out exactly how we turned useful voice AI concepts into truly usable solutions.
From Vision to Real-World Challenges
Before diving directly into what we learned, it’s important to lay out the landscape we were navigating. Voice AI technology, as promising as it is, is still in relatively early stages. That means everyone in the field is figuring out how to connect the dots and build solutions that truly work end-to-end. Our journey was no different: we had to stitch together core technologies like telephony providers, speech-to-text engines, large language model reasoning, and text-to-speech synthesis. Each of these pieces introduced its own set of challenges, and we learned a lot about what it takes to turn a concept into a reliable, real-world solution.
In other words, we didn’t just encounter customer use cases; we encountered technological puzzles. And while we're still on this journey, we want to share a few critical insights into why these problems matter and how Aquant has approached solving them.
Learning #1: Optimizing for Performance in Continuous Voice Interactions
One of the very first and most critical lessons we learned was around performance. Unlike traditional SaaS applications where each interaction is typically a single request and response cycle, voice AI involves a continuous, conversational connection. As soon as a human starts speaking, there's an expectation that the AI will respond almost like another human would, and that means the system has to process and reply very quickly.
At the beginning of the year, a simple use case like a technician asking for expert help often took seven to ten seconds of processing. That was because we had to run through multiple steps: speech-to-text conversion, large language model reasoning, retrieving relevant information, and then text-to-speech synthesis. Each step introduced latency, and those delays added up to something far slower than a natural human conversation.
Over time, we focused on identifying and removing these bottlenecks; there are no shortcuts here. For example, we initially used a large language model to understand the user's intent, but that created huge performance issues. Switching to a smaller, more efficient model at the start of the conversation reduced that time significantly. We also learned to recognize whether a query was brand new or a follow-up, which let us skip certain processing steps and streamline the response even further.
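To make that concrete, here is a minimal sketch of that kind of routing, written as simplified Python. It is illustrative rather than our production code: the classifier and retrieval functions are stubs standing in for real model calls, and the follow-up check is reduced to a single cheap classification step.

```python
# Illustrative latency-oriented routing: a small, fast check classifies the
# turn first, and follow-ups skip the expensive retrieval stage entirely.
from dataclasses import dataclass, field

@dataclass
class Turn:
    text: str
    is_follow_up: bool

@dataclass
class Session:
    history: list = field(default_factory=list)
    last_context: str = ""   # retrieval results from the previous turn

def classify_turn(text: str, history: list) -> Turn:
    """Placeholder for a small, low-latency intent model. In practice this
    would be one call to a compact classifier, not the large reasoning model."""
    follow_up_markers = ("yes", "no", "done", "still", "it worked")
    is_follow_up = bool(history) and text.lower().startswith(follow_up_markers)
    return Turn(text=text, is_follow_up=is_follow_up)

def retrieve_documents(query: str) -> str:
    """Placeholder for the retrieval stage, the slow step we try to skip."""
    return f"<docs relevant to: {query}>"

def answer(session: Session, user_text: str) -> str:
    turn = classify_turn(user_text, session.history)
    if turn.is_follow_up and session.last_context:
        context = session.last_context            # reuse, no retrieval round trip
    else:
        context = retrieve_documents(user_text)   # only for genuinely new queries
        session.last_context = context
    session.history.append(user_text)
    # The large model is called once here, with the context already in hand.
    return f"<LLM response grounded in {context}>"

session = Session()
print(answer(session, "I'm getting error E42 on the compressor"))
print(answer(session, "Yes, the light is still blinking"))   # skips retrieval
```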
By making these adjustments and a few more, we brought response times down to about two seconds or less, making the voice AI feel much more natural and human-like and making continuous conversations actually usable in real-world scenarios.
Learning #2: Mastering Turn Detection in Voice AI
Another crucial lesson we tackled was turn detection. In simple terms, turn detection is the AI’s ability to know exactly when a user has finished speaking and it's time for the AI to respond. Unlike a simple button click, human speech varies widely. Some people take longer pauses, some speak in bursts, and sometimes the nature of the information, like reciting a long serial number, means they pause naturally as they think.
The challenge is that many speech-to-text models are still figuring out how to handle what we call “dynamic end-pointing.” In other words, they’re working on how to determine exactly when the user is done speaking in a flexible, user-aware way. For us, this was an urgent issue because we couldn’t rely solely on the speech-to-text engine to figure it out.
So, we ended up implementing a reactive solution. We let the speech-to-text model tell us when it thought the user had finished, and we started processing that input right away. However, if the user continued speaking, essentially interrupting the AI's assumption that they were done, we treated that as a cue to stop and update the input. This way, we could handle turn-taking more like a human would, adjusting dynamically whenever the user wasn’t actually finished.
In the end, this approach is aggressive in that we react quickly to any interruption, and it’s a bit of a workaround until speech-to-text models become more proactive and user-aware themselves. But for now, it allows us to handle the natural variability in how people speak and make the overall experience feel smoother and more intuitive.
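Here is a rough sketch of that reactive pattern, assuming an async pipeline where the speech-to-text engine emits events. The event names and the process_utterance coroutine are placeholders, not a real STT SDK; the point is the cancel-and-restart behavior when the user keeps talking past a premature end-of-turn.

```python
# A minimal sketch of reactive turn-taking: start answering as soon as the
# STT engine signals end-of-turn, and cancel if the user keeps speaking.
import asyncio

async def process_utterance(transcript: str) -> str:
    """Stand-in for the full LLM + retrieval + TTS pipeline."""
    await asyncio.sleep(1.5)          # simulated processing latency
    return f"spoken reply to: {transcript!r}"

async def conversation_loop(stt_events: asyncio.Queue) -> None:
    in_flight: asyncio.Task | None = None
    while True:
        kind, text = await stt_events.get()
        if kind == "end_of_turn":
            # React immediately: begin answering as soon as STT thinks
            # the user is done.
            in_flight = asyncio.create_task(process_utterance(text))
        elif kind == "more_speech":
            # The user kept talking, so the earlier end-of-turn was wrong.
            # Cancel the in-flight answer and wait for the updated transcript.
            if in_flight and not in_flight.done():
                in_flight.cancel()
        elif kind == "silence" and in_flight:
            print(await in_flight)
            return

async def main() -> None:
    events: asyncio.Queue = asyncio.Queue()
    # Simulated STT stream: a premature end-of-turn, then the user continues.
    await events.put(("end_of_turn", "the serial number is 4 7"))
    await events.put(("more_speech", "the serial number is 4 7 2 9 B"))
    await events.put(("end_of_turn", "the serial number is 4 7 2 9 B"))
    await events.put(("silence", ""))
    await conversation_loop(events)

if __name__ == "__main__":
    asyncio.run(main())
```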
Learning #3: From RAG to RAC - Making Conversations Flow Naturally
Our third big learning was about moving beyond traditional retrieval-augmented generation, or RAG. In a text or browser-based world, RAG is fine: a user asks a question, and you can return a list of steps or a set of instructions all at once. But real human conversations, especially over voice, don’t work that way. When a technician calls an expert, the expert doesn't list all 12 troubleshooting steps at once. Instead, they ask, "Did you try this first?" and wait for a response before moving on to the next step.
We realized that for voice AI, it wasn't enough to just retrieve information and hand it over. We needed to make it a conversation - what we call Aquant Retrieval-Augmented Conversation, or RAC. That means layering conversational logic on top of RAG so that the AI doesn’t just dump out all the steps. Instead, it guides the user step by step, asks clarifying questions, and waits for feedback before continuing. This turns the interaction into a true conversation rather than just a Q&A session.
By doing this, we made sure that even with millions of documents in our knowledge base, the AI could have a natural, guided conversation. It could prompt the user for more detail if needed or confirm each step before moving on. This was essential for making our voice AI not just a source of information, but a real conversational partner in troubleshooting and support.
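A stripped-down sketch of that conversational layer looks something like the following. The step list, the acknowledgement check, and the wording are simplified placeholders; in practice the LLM decides how to phrase each prompt, but the one-step-at-a-time control flow is the part RAC adds on top of retrieval.

```python
# Retrieval returns a list of steps; a thin conversational layer walks
# through them one at a time instead of dumping them all at once.
from dataclasses import dataclass

@dataclass
class GuidedProcedure:
    steps: list[str]
    current: int = 0

    def next_prompt(self) -> str | None:
        if self.current >= len(self.steps):
            return None
        return f"{self.steps[self.current]} Let me know once you've done that."

    def handle_reply(self, reply: str) -> str:
        # A positive acknowledgement completes the current step.
        if reply.strip().lower() in {"yes", "done", "ok", "it worked"}:
            self.current += 1
            nxt = self.next_prompt()
            return nxt or "Great, that completes the procedure."
        # Anything else is treated as a question or problem with this step.
        return f"Okay, let's stay on this step: {self.steps[self.current]}"

# Imagine these steps came back from retrieval over the knowledge base.
procedure = GuidedProcedure(steps=[
    "First, power the unit down and wait 30 seconds.",
    "Reset the fan module.",
    "Check whether the status light turns green.",
])

print(procedure.next_prompt())
print(procedure.handle_reply("done"))
print(procedure.handle_reply("the fan won't spin"))
```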
Learning #4: Navigating Ambiguity in Conversational AI
The next extension of moving from RAG to RAC was learning how to handle ambiguity. Ambiguity can show up in multiple ways in a conversation. For instance, a user's prompt might be unclear because they’re dealing with different variants of a model. If they ask about an error code but don’t specify which model variant they’re using, the AI needs to resolve that ambiguity first. Instead of jumping straight to an answer, we trained the AI to ask clarifying questions like, “Can you confirm which model you’re referring to?” That way, we ensure the AI gives the right guidance for the right context and keeps the conversation efficient.
But there’s another kind of ambiguity that comes up when the AI gives instructions. For example, if the AI says, “Please reset the fan and then confirm if the status light is green,” the user might simply respond with “Yes.” From the user’s perspective, that might feel like a clear acknowledgement, but for the AI, it’s ambiguous. Did they confirm the fan reset, the light status, or both? This can lead the AI into a loop, not knowing exactly what was acknowledged.
To solve this, we built in ambiguity resolution techniques that treat short acknowledgments as applying to the entire instruction. Unless the user specifies more detail, the AI assumes that a “yes” means they completed the whole step. This not only breaks the loop but also, based on our call analysis, aligns with what users intend when they give those brief responses.
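In code, that rule is simple to express. The following sketch is purely illustrative: the keyword matching is deliberately naive and the phrases are made up, but it shows the policy of letting a bare acknowledgement cover every part of a compound instruction.

```python
# A bare "yes"/"done" confirms the whole compound instruction; otherwise only
# the sub-actions the user actually mentioned are marked as acknowledged.
from dataclasses import dataclass, field

AFFIRMATIVES = {"yes", "yep", "ok", "done", "confirmed"}

@dataclass
class CompoundInstruction:
    sub_actions: list[str]                 # e.g. ["reset the fan", "status light is green"]
    acknowledged: set = field(default_factory=set)

    def apply_reply(self, reply: str) -> str:
        words = set(reply.strip().lower().split())
        if words <= AFFIRMATIVES:
            # A short acknowledgement is taken to confirm the entire instruction.
            self.acknowledged.update(self.sub_actions)
        else:
            # Otherwise, naive keyword matching picks out the mentioned sub-actions.
            for action in self.sub_actions:
                keywords = {w for w in action.lower().split() if len(w) > 3}
                if keywords & words:
                    self.acknowledged.add(action)
        missing = [a for a in self.sub_actions if a not in self.acknowledged]
        if not missing:
            return "Great, moving on to the next step."
        return f"Got it. Can you also confirm: {missing[0]}?"

step = CompoundInstruction(sub_actions=["reset the fan", "status light is green"])
print(step.apply_reply("yes"))               # confirms both sub-actions
step = CompoundInstruction(sub_actions=["reset the fan", "status light is green"])
print(step.apply_reply("I reset the fan"))   # asks about the status light
```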
Learning #5: Making Voice AI Agentic and Seamlessly Orchestrated
Another important lesson we learned is that voice AI isn't a one-trick pony. In the real world, it often needs to do more than just fetch knowledge from documentation. Sometimes it needs to guide users through decision trees, retrieve parts information, or even interact with custom agents that our customers have built on Aquant’s platform.
To make the voice AI truly useful for these varied business use cases, we had to ensure it was agentic in nature. That means it can invoke different agents as needed during a conversation and then seamlessly use those agents’ outputs to respond naturally to the user.
We tackled this by building an agent orchestration layer into the platform. Essentially, we created a mechanism so that the large language model not only reasons about the user’s prompt but also decides which agent to call. By combining agent selection and reasoning into a single step, we saved a lot of time and made the whole process faster.
In other words, instead of running separate steps to figure out which agent to use, we let the LLM handle both tasks at once. This made our voice AI more responsive and able to handle complex, multi-agent scenarios without missing a beat.
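A simplified sketch of that single-call orchestration is below. The agent names are invented for the example, and call_llm is a stub standing in for one structured-output request to the model; the point is that the same response carries both the agent choice and how to use its output.

```python
# One LLM round trip decides which agent to invoke, what to send it, and how
# to phrase the spoken reply, instead of separate selection and reasoning steps.
import json

AGENTS = {
    "knowledge_search": lambda query: f"<answer from documentation for {query!r}>",
    "parts_lookup":     lambda query: f"<part numbers matching {query!r}>",
    "custom_workflow":  lambda query: f"<output of the customer-built agent for {query!r}>",
}

ORCHESTRATION_PROMPT = """You are a voice assistant orchestrator.
Given the user's request, respond with JSON containing:
  "agent": one of {agents}
  "agent_input": the query to send to that agent
  "style": one short hint for how to phrase the spoken reply
User request: {request}"""

def call_llm(prompt: str) -> str:
    """Stub for a single structured-output LLM call (selection + reasoning together)."""
    return json.dumps({
        "agent": "parts_lookup",
        "agent_input": "fan module for model X200",
        "style": "confirm the part number back to the user",
    })

def handle_request(request: str) -> str:
    prompt = ORCHESTRATION_PROMPT.format(agents=list(AGENTS), request=request)
    decision = json.loads(call_llm(prompt))           # one round trip, not two
    agent_output = AGENTS[decision["agent"]](decision["agent_input"])
    # The same decision also tells us how to voice the result back.
    return f"({decision['style']}) {agent_output}"

print(handle_request("I need a replacement fan for the X200"))
```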
Learning #6: Integrating Process Automation into Voice AI
The final key lesson we learned is the importance of process automation. Over time, we realized that many use cases aren’t just about troubleshooting or providing knowledge. Often, our customers wanted voice AI to handle the initial intake process, gather key information, and then trigger downstream workflows like creating a support ticket.
For example, when a customer calls in, the voice AI can act as the intake processor, asking about the reason for the call, collecting asset information like the model and serial number, and even doing lightweight troubleshooting. After that, it can seamlessly hand off to a human agent with all the details already documented. This ensures that every call is logged properly and saves human agents time, which at scale means a lot of efficiency gains.
To make this work, we built process automation directly into the voice AI. That means once the intake is done, the system can automatically trigger workflows, like creating a ticket and populating all the fields, so that everything is ready for the human agent or for any other downstream process. It also means the voice AI can be easily configured to handle both hand-offs and automated actions, making the whole process smoother and more reliable.
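For illustration, here is a minimal sketch of that intake-to-workflow hand-off. The field names, the create_ticket stub, and the hand-off message are placeholders rather than a real CRM integration; the idea is simply that a completed intake can trigger the downstream workflow automatically before the human agent picks up.

```python
# The voice AI collects intake fields; once they are complete, a downstream
# workflow (here, ticket creation) fires automatically before hand-off.
from dataclasses import dataclass, field

INTAKE_FIELDS = ["reason_for_call", "model", "serial_number"]

@dataclass
class Intake:
    collected: dict = field(default_factory=dict)

    def next_question(self) -> str | None:
        for name in INTAKE_FIELDS:
            if name not in self.collected:
                return f"Could you tell me the {name.replace('_', ' ')}?"
        return None                                   # intake complete

    def record(self, name: str, value: str) -> None:
        self.collected[name] = value

def create_ticket(fields: dict) -> str:
    """Stub for the downstream workflow, e.g. ticket creation in a CRM."""
    return f"TICKET-1234 created with {fields}"

def finish_call(intake: Intake) -> str:
    ticket = create_ticket(intake.collected)
    # Hand off to a human agent with everything already documented.
    return f"{ticket}; transferring to an agent with the details attached."

intake = Intake()
intake.record("reason_for_call", "compressor won't start")
intake.record("model", "X200")
intake.record("serial_number", "4729B")
if intake.next_question() is None:
    print(finish_call(intake))
```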
The Road Ahead for Voice AI at Aquant
As you can see, the promise of voice AI is immense. In the future, it can become a powerful technology that changes the way businesses engage with their customers. There's a lot of potential, but it's also an organic journey. We've come a long way since the beginning of the year. Aquant's voice AI has made significant strides in performance, user experience, and technological capabilities. But we also know there's a lot more to do. We need to ensure the system can operate at scale, continually improve turn detection, and deepen its integration into business processes so that it streamlines operations and ensures consistency.
There are still plenty of challenges to solve, from breaking language barriers to making voice AI available everywhere the work happens - web, mobile, phone, messaging, meetings. But with each step forward, we see exponential benefits. We’re excited and confident that voice will become, if not the mainstream way, certainly one of the main ways businesses interact with AI and deliver value to their customers.


