D’Noise

Multi-Modal Speech Capturing Device

Enhance Decision-Making. Reduce Errors. Increase Safety.


Introducing D’Noise

D’Noise is a wearable aid for security personnel in noisy environments such as airports and malls. Traditional microphones struggle with background noise, causing security personnel to capture speech from civilians and suspects inaccurately during critical encounters. This device integrates audio processing, beamforming, and automatic speech recognition (ASR) to enhance speech clarity. Key features include real-time noise suppression, playback, near real-time transcription, and generative error correction (GER). By improving speech intelligibility, D’Noise enhances decision-making, reduces errors, and increases safety.

Team members

Abel Lee Yang Yeow (CSD), Jone Chong Jin (CSD), Loh Jianyang, John (CSD), Wang Jun Long Ryan (CSD), Caitlin Daphne Tan Chiang (DAI), Leong Wen Jie Lucas (EPD), Ong Jing Ting (EPD)

Instructors:

  • Tan Mei Chee

  • Zhao Fang

Writing Instructors:

  • Belinda Seet

Project Roadmap

Empathize.

Security personnel in high-noise environments—such as airports and shopping malls—often struggle to capture speech from civilians or suspects during critical encounters. The combination of overlapping conversations, environmental noise, public announcements, and unpredictable acoustic conditions severely impacts the intelligibility of spoken instructions and alerts. These conditions can lead to misunderstandings, delayed responses, and potentially missed threats.

Traditional microphone systems and standard communication tools often fall short in such dynamic settings, struggling to isolate speech from ambient noise. As a result, frontline personnel are left without the clarity they need to act swiftly and accurately in high-stakes situations. A new approach is essential—one that ensures every word is heard, captured, and understood in real time.

Define.

After identifying a broad communication issue in the field, the team conducted a series of controlled experiments to convert this vague challenge into measurable, actionable insights. Through on-site testing and user observation, we pinpointed three core obstacles: the prevalence of high-noise environments (averaging above 70dB), the dependency of existing products on internet access, and the lack of clarity in captured speech.

These findings allowed us to sharpen our focus and define the exact needs of security personnel operating in chaotic spaces. We refined our problem statement to the following:

How might we create a speech-capturing device that enhances, plays back, and transcribes audio in noisy or crowded environments?

Solution.

The team decided to work through the problem in steps, identifying key requirements by mapping a solution to each subproblem:

  1. High Noise Environments (70dB) -> Captures Speech Effectively up to 4.5m
  2. Existing Products Require Internet Access -> Operates Fully Offline
  3. Clarity of Speech -> Playback and Live On-Screen Transcriptions


To put it all together, here’s our solution: D’NOISE

A discreet, wearable device that captures and transcribes audio effectively in noisy, offline environments.

It provides dual modality output:

  • Audio – Through real-time playback
  • Text – Through live on-screen transcription

System Architecture

System Components

[HARDWARE]

  • Shotgun Microphone: Captures speech effectively from as far as 6m away (in natural ambient noise of 55dB).

  • Processor: Houses the computer (Jetson Orin Nano) and its battery for local speech processing. It supports a quick battery swap (40s) using lithium-ion batteries with a 4.25hr runtime. At 0.936kg, it sits discreetly inside a small satchel bag.

  • Handheld LCD Device: Pocket-sized device for viewing transcriptions.

[SOFTWARE]

  • Denoising Model: RNNoise --> Exceptional balance of real-time performance (lowest latency), highest perceptual quality, and efficient CPU operation (ideal for offline use).

  • Transcription Model: Whisper --> Significantly outperformed other models on the market, with a word error rate of 8.9% on the Video dataset. A sketch of how the two models chain together is shown below.
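To make the flow concrete, the sketch below chains the two models into a single offline pipeline. It assumes the openai-whisper Python package; `rnnoise_denoise` is a hypothetical placeholder for an RNNoise binding, since real bindings vary by platform.

```python
# Minimal sketch of the offline capture pipeline: denoise, then transcribe.
# Assumes the openai-whisper package; rnnoise_denoise is a hypothetical
# placeholder for an RNNoise binding (real bindings vary by platform).

import numpy as np
import whisper


def rnnoise_denoise(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Hypothetical wrapper; swap in a real RNNoise binding here."""
    return audio  # pass-through placeholder


def transcribe_offline(audio: np.ndarray, sample_rate: int = 16000) -> str:
    # Whisper expects 16 kHz mono float32 samples in [-1, 1].
    denoised = rnnoise_denoise(audio, sample_rate).astype(np.float32)
    model = whisper.load_model("base.en")  # runs locally once weights are cached
    result = model.transcribe(denoised, fp16=False)
    return result["text"]
```

On the device itself, the same stages run continuously on the Jetson Orin Nano, with the denoised stream also routed to real-time playback.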

Hardware Development Process

Through multiple design iterations and rigorous component comparisons, we refined each part of the system—from microphones and processors to display and power—balancing performance, size, and efficiency. This process-driven approach enabled us to arrive at an optimized, field-ready hardware solution.

Software Development Process

The software development journey involved iterative testing, model benchmarking, and real-world scenario simulations. By comparing noise suppression, transcription, and communication modules, we fine-tuned each layer for speed, accuracy, and offline reliability—resulting in a robust, low-latency real-time system.

Noise Suppression: RNNoise

To identify the optimal noise suppression model for real-time, on-device use, the team evaluated a range of static and deep learning methods using key metrics: ΔSNR (Signal-to-Noise Ratio improvement), PESQ (perceptual audio quality), ΔSTOI (speech intelligibility), and latency.

Among all models, RNNoise demonstrated the best balance for wearable deployment:

  • Lowest Latency: RNNoise (CPU) had the lowest latency at 0.755 ms, compared to 52–155 ms for other deep models.

  • Highest Perceptual Quality: It scored the highest PESQ of 1.224, outperforming Facebook Denoiser (1.044) and DeepFilterNet (1.075).

  • Good SNR Improvement: While Facebook Denoiser had the highest ΔSNR (+1.42 dB), RNNoise achieved a solid +0.90 dB with much faster performance.

  • Efficient CPU Operation: RNNoise runs smoothly on CPU, whereas other models like Facebook Denoiser require GPU to achieve lower latency.
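For context on how these numbers are obtained, the sketch below computes ΔSNR, PESQ, and STOI for a single recording. It assumes the open-source pesq and pystoi packages, 16 kHz mono WAV files, and illustrative file names; it is not the team's exact benchmarking harness.

```python
# Sketch of the denoiser evaluation metrics: delta-SNR, PESQ, and STOI.
# Assumes 16 kHz mono WAV files of equal length; file names are illustrative.

import numpy as np
import soundfile as sf
from pesq import pesq     # ITU-T P.862 perceptual quality
from pystoi import stoi   # short-time objective intelligibility


def snr_db(clean: np.ndarray, signal: np.ndarray) -> float:
    # SNR of `signal` relative to the clean reference, in dB.
    noise = clean - signal
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))


clean, fs = sf.read("clean.wav")        # reference speech
noisy, _ = sf.read("noisy.wav")         # same speech plus background noise
denoised, _ = sf.read("denoised.wav")   # denoiser output

delta_snr = snr_db(clean, denoised) - snr_db(clean, noisy)        # ΔSNR (dB)
pesq_score = pesq(fs, clean, denoised, "wb")                      # wideband PESQ
delta_stoi = stoi(clean, denoised, fs) - stoi(clean, noisy, fs)   # ΔSTOI

print(f"ΔSNR: {delta_snr:+.2f} dB, PESQ: {pesq_score:.3f}, ΔSTOI: {delta_stoi:+.3f}")
```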

Automatic Speech Recognition: Whisper

To find the optimal ASR model for offline, on-device use, the team evaluated lightweight models using Normalized Word Error Rate (WER%) across five realistic datasets: Conversational AI, Phone Call, Meeting, Earnings Call, and Video.

Among all models, Whisper demonstrated the best performance for wearable deployment:

  • Lowest Word Error Rate (WER): Whisper achieved the lowest WER across all datasets, e.g., 8.9% on Video and 9.7% on Earnings Call, compared to Kaldi (up to 69.9%) and wav2vec 2.0 (up to 36.3%).

  • Fine-Tuning for Local Context: After fine-tuning on 530 hours of Singlish data, Whisper achieved a WER of 12.87%, showing a 66% improvement over Whisper-base.en and 93% improvement over Whisper-base.
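As a minimal illustration of the metric itself, normalized WER can be computed with the open-source jiwer package. The transcript strings below are placeholders, not project data.

```python
# Sketch of normalized word error rate (WER) computation using jiwer.
# The reference and hypothesis strings are illustrative placeholders.

import string
import jiwer


def normalize(text: str) -> str:
    # Lowercase and strip punctuation so WER reflects word choice only.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


reference = "Please step aside and place your bag on the belt."
hypothesis = "please step aside in place your bag on the belt"

# One substitution ("and" -> "in") over ten reference words gives WER = 10%.
wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"Normalized WER: {wer:.1%}")
```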

Comparison with Market

We measured our product's capabilities against leading market alternatives to demonstrate superior performance and innovation.

Product Showcase

Acknowledgements

Special thanks to Professor Zhao Fang and Professor Tan Mei Chee, the team’s SUTD Capstone Mentors, for their consistent feedback and guidance. The team also appreciates the Capstone Office for its resources and support.

The team is grateful to the KLASS team, especially Lu Zheng Hao (Project Manager) for coordinating meetings and offering valuable feedback, and to Nicholas Chan and Asif for their insightful technical advice. Terence Goh also provided key support during the proposal stage.

Thanks also to Professor Jihong for his helpful feedback during update sessions, and to Belinda Seet from the Center for Writing and Rhetoric for her assistance with presentations and consultations. The team appreciates the technical guidance from Professor Teo Tee Hui and Professor Joel Yang, which has been crucial to the project’s development.
