Problem Context
We spoke to over 30 SME owners and operators across Singapore to understand the problem’s impacts in practice.
Small and medium-sized enterprises (SMEs) in Singapore are the backbone of the economy, yet they operate under enormous manpower constraints. Most customer-facing work falls entirely on the business owner or a tiny team, juggling customer service enquiries and outbound calls on top of everything else they already have to do.
For the majority of SMEs, 24/7 customer support is not a staffing challenge. It is a financial impossibility. The cost of a dedicated customer service agent, whether in-house or outsourced, is simply out of reach.
The result? Missed calls. Lost customers. Revenue that quietly walks out the door, long after business hours.
“Lack of 24/7 call customer support because of the high expense of staffing.”
Pierre Tan, Founder of Meide
“Current solutions struggle to grasp the specific context and domain knowledge of businesses.”
Derrick, Senior Manager at SimplyGo
“Existing AI tools provide slow and mechanical responses.”
Elson, Co-founder of Chatavocado
Generic voice bots can handle scripted queries. But the moment a caller asks something specific to your business, your products, or your policies, they fall apart. SMEs are left with a choice between a solution that sounds robotic and one that costs more than they can afford. The phone remains one of the most trusted channels for customer service in Singapore, and right now SMEs lose a customer's trust every time a call goes unanswered.
SMEs deserve 24/7 customer service that actually understands their business — without breaking the bank.
Solution
Introducing Doby, the AI voice agent built for Singapore SMEs
Doby answers every call, around the clock, with the knowledge, tone, and context of your specific business. It is not a generic chatbot with a phone line bolted on. Doby is built to learn from the documents you already have and gets smarter with every conversation.
Doby is also environmentally responsible — operations produce emissions equivalent to just 1.6 standard emails per minute of calling.
Key Features

Low Latency
Doby responds to customers in under a second, powered by a parallel WebSocket pipeline combining Deepgram for speech-to-text, Groq for language model inference, and ElevenLabs for voice synthesis. Through three iterations of optimisation — including Enthusiastic End Utterance prediction and Overlap Stream Piping — we reduced end-to-end response time from a 6–8 second baseline to approximately 0.8 seconds, an 88% improvement. Conversations feel natural, not like waiting on hold.

Context Aware
Upload your company documents (text, PDFs, or images) and Doby builds a knowledge base specific to your business. Our multimodal RAG system converts all ingested content into structured question-and-answer pairs, then embeds them for cosine-similarity retrieval, using the last 20 conversational turns as the query context. This achieves 90% retrieval accuracy during live calls, so Doby's answers stay grounded in your own documents rather than guesswork.
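For readers curious how retrieval over question-and-answer pairs might look, here is a minimal sketch. It is an illustration under stated assumptions, not Doby's actual implementation: the bag-of-words `embed` function stands in for a real embedding model, and the sample Q&A pairs, the `retrieve` helper, and its parameters are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical sample knowledge base; the contents are made up for illustration.
QA_PAIRS = [
    ("What are your opening hours?", "We are open 9am to 6pm, Monday to Saturday."),
    ("How much does a cleaning session cost?", "A standard session starts from $80."),
    ("Do you offer babysitting on weekends?", "Yes, weekend babysitting is available."),
]

def embed(text):
    """Toy bag-of-words vector (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(turns, k=1, window=20):
    """Return the k Q&A pairs most similar to the last `window` turns."""
    query = embed(" ".join(turns[-window:]))
    ranked = sorted(
        QA_PAIRS,
        key=lambda pair: cosine(query, embed(pair[0] + " " + pair[1])),
        reverse=True,
    )
    return ranked[:k]

if __name__ == "__main__":
    best = retrieve(["Hello", "What are your opening hours?"])
    print(best[0][1])
```

Storing content as question-and-answer pairs means a caller's question is compared against text that is already phrased as a question, which is the intuition behind matching on Q&A pairs rather than raw document chunks.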

Self Improving
Every call is scored by our proprietary Call Quality Score (CQS), a hybrid reward function measuring response latency, user interruptions, and silence patterns. An LLM-driven pipeline then analyses weak spots and generates improved call scripts automatically. In testing, this raised CQS from 69 to 90 — a 30% improvement — while compressing script iteration cycles from 20 minutes to under 5. Doby gets measurably better with every conversation.
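As a rough illustration of what a hybrid score over latency, interruptions, and silence could look like, consider the sketch below. The real CQS formula is proprietary and not described here, so the weights, the normalisation thresholds, and the names `CallStats` and `call_quality_score` are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class CallStats:
    avg_latency_s: float   # mean response latency in seconds
    interruptions: int     # times the caller cut the agent off
    silence_ratio: float   # fraction of the call spent in dead air (0 to 1)

def call_quality_score(stats, w_latency=0.4, w_interrupt=0.3, w_silence=0.3):
    """Weighted 0-100 score; weights and thresholds are hypothetical."""
    # Each component is normalised to [0, 1], where 1 is best.
    latency_score = max(0.0, 1.0 - stats.avg_latency_s / 3.0)    # 3 s or more scores 0
    interrupt_score = max(0.0, 1.0 - stats.interruptions / 5.0)  # 5 or more scores 0
    silence_score = max(0.0, 1.0 - stats.silence_ratio)
    return 100 * (
        w_latency * latency_score
        + w_interrupt * interrupt_score
        + w_silence * silence_score
    )

def relative_improvement(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100

if __name__ == "__main__":
    slow_call = CallStats(avg_latency_s=2.0, interruptions=3, silence_ratio=0.3)
    fast_call = CallStats(avg_latency_s=0.8, interruptions=1, silence_ratio=0.1)
    print(call_quality_score(slow_call), call_quality_score(fast_call))
    print(round(relative_improvement(69, 90), 1))  # 30.4
```

The last line checks the arithmetic behind the reported result: moving from 69 to 90 is a gain of 21 points on a base of 69, or about 30.4%, which matches the quoted 30% improvement.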

Business Dashboard
Track call volume, response times, resolution rates, and full call transcripts from a single clean dashboard. You always know how Doby is performing and what your customers are asking about.

Holographic Avatar
Customers do not just hear Doby. They can see a live holographic avatar on screen, a human-like face that builds trust and makes every interaction feel personal. It brings AI efficiency together with the warmth of face-to-face service.
How It Works
Add the business FAQs, product guides, existing customer service scripts, pricing sheets, or even your website that you want Doby to use as context when answering your customers.
Doby automatically processes your business context to prepare detailed, compelling call scripts for your customers.
Doby picks up your customer calls 24/7. Speech is transcribed in real time, matched against your knowledge base, and answered in a natural voice, all in about 0.8 seconds. If the caller interrupts mid-response, Doby stops, listens, and adjusts its reply accordingly.
Doby comes with a first-in-market holographic avatar for physical locations. It displays our call agent with live, lifelike lip-sync, body sway, and 3D depth. According to research, this can increase customer trust and engagement by about 30%, potentially improving customer retention and lifetime value. The avatar's character design can even be customised to your company's branding or mascot.
Traction & Validation
Over 10 weeks, we executed an intensive multi-channel outreach campaign to validate Doby with real businesses:
200+ targeted cold emails to service-based SMEs · 30+ in-depth interviews with SME owners and decision-makers · 5 live demonstrations at SUTD Academy workshops · 1 Luxasia innovation event showcase
These efforts yielded 2 signed Letters of Intent, a paid contract with Meide (babysitting and cleaning services), and active feature collaborations with CultureForte and Allinpay, confirming that Doby addresses a real and pressing need in the market.
System Architecture
Doby is built on a parallel WebSocket pipeline coordinated by a central Call Orchestrator. Incoming phone calls are routed through Twilio, which handles call connection, media streaming, and call control. Speech is transcribed in real time by Deepgram, passed to a Groq-hosted large language model for response generation, and converted back to voice by ElevenLabs. Call data and transcripts are stored in a PostgreSQL database.
Two key innovations power this architecture: Enthusiastic End Utterance, which predicts when a speaker is about to finish and begins processing early, and Overlap Stream Piping, which feeds LLM output directly into text-to-speech with minimal delay. Together, they enable the 0.8-second response time that makes conversations feel lifelike. On the retrieval side, our Q&A Pair Reformulation pipeline converts uploaded documents into structured question-and-answer pairs before embedding, dramatically improving semantic matching during live calls.
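The idea behind Overlap Stream Piping can be sketched with a small producer-consumer example. This is a simplified illustration under assumptions, not Doby's production code: `fake_llm_stream` and `fake_tts` are hypothetical stand-ins for the streaming language model and text-to-speech services, and cutting phrases at punctuation is one plausible chunking strategy chosen for the example.

```python
import asyncio

async def fake_llm_stream(text):
    """Hypothetical stand-in for a streaming LLM: yields one token at a time."""
    for token in text.split():
        await asyncio.sleep(0)  # simulate waiting on the network
        yield token + " "

async def fake_tts(phrase, spoken):
    """Hypothetical stand-in for TTS: records what would be synthesised."""
    await asyncio.sleep(0)
    spoken.append(phrase)

async def produce_phrases(token_stream, queue, boundaries=".,!?"):
    """Buffer tokens and hand off each phrase as soon as it is complete."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        stripped = buffer.strip()
        if stripped and stripped[-1] in boundaries:
            await queue.put(stripped)
            buffer = ""
    if buffer.strip():
        await queue.put(buffer.strip())  # flush any trailing text
    await queue.put(None)  # sentinel: no more phrases

async def speak_phrases(queue, spoken):
    """Synthesise phrases while the LLM is still generating later ones."""
    while True:
        phrase = await queue.get()
        if phrase is None:
            break
        await fake_tts(phrase, spoken)

async def respond(reply):
    """Run generation and synthesis concurrently, linked by a queue."""
    queue = asyncio.Queue()
    spoken = []
    await asyncio.gather(
        produce_phrases(fake_llm_stream(reply), queue),
        speak_phrases(queue, spoken),
    )
    return spoken

if __name__ == "__main__":
    print(asyncio.run(respond("Yes, we are open today. Would you like to book a slot?")))
```

Because the first phrase reaches the speech stage before the model has finished the rest of the reply, the caller starts hearing an answer sooner than if the full response were generated first.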
Our Poster