How I Integrated Whisper Speech-to-Text into Hermes Agent
- Ctrl Man
- AI , Software Development , Developer Tools
- 22 Mar, 2026
🎯 Why Voice Matters for AI Agents
Text-based interaction is limiting. Sometimes you're driving, cooking, or just don't want to type. Voice messages are faster, more natural, and accessible to everyone.
When I built Hermes Agent, my autonomous AI assistant that runs on Telegram, Discord, WhatsApp, and Slack, I knew voice support was essential. But I had specific requirements:
- Privacy-first: process locally when possible
- Free tier: no mandatory API costs
- Multi-platform: work across all messaging apps
- Accurate: filter out Whisper's hallucinations on silence
Hereโs how I did it.
🏗️ Architecture Overview
```
┌──────────────────────────────────────────────┐
│ User sends voice message                     │
│ (Telegram/WhatsApp/etc.)                     │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Gateway receives audio file (.ogg/.wav)      │
│ Saves to a temporary location                │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Transcription pipeline (3 providers)         │
│                                              │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Local      │ │ Groq       │ │ OpenAI     │ │
│ │ (default)  │ │ (free tier)│ │ (paid)     │ │
│ │            │ │            │ │            │ │
│ │ faster-    │ │ Whisper    │ │ Whisper    │ │
│ │ whisper    │ │ API        │ │ API        │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Hallucination filter (critical!)             │
│                                              │
│ Checks for common Whisper mistakes           │
│ on silence:                                  │
│   • "Thank you."                             │
│   • "Thanks for watching."                   │
│   • "Subscribe to my channel."               │
│   • "The end."                               │
│   • Russian/French/Japanese outro text       │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Transcript sent to Hermes Agent              │
│ Agent processes it as a text message         │
│ Response sent back to the user               │
└──────────────────────────────────────────────┘
```
🔧 Three Transcription Providers
1. Local Whisper (Default, Free)
Pros:
- ✅ No API key required
- ✅ Complete privacy: audio never leaves your machine
- ✅ No rate limits
- ✅ Works offline
Cons:
- ❌ Requires the faster-whisper Python package (~150MB model download)
- ❌ Slower than API-based solutions
- ❌ Uses local CPU/GPU resources
Setup:
```shell
pip install faster-whisper
```
The model auto-downloads on first use. I use the `base` model, a good balance of speed and accuracy.
2. Groq Whisper (Free Tier)
Pros:
- ✅ Blazing fast (Groq's LPU inference)
- ✅ Free tier available
- ✅ No local resources used
Cons:
- ❌ Requires GROQ_API_KEY
- ❌ Rate limits on free tier
- ❌ Audio sent to an external API
Setup:
```shell
# Add to ~/.hermes/.env
GROQ_API_KEY=your_key_here
```
3. OpenAI Whisper (Paid)
Pros:
- ✅ Highest accuracy
- ✅ Handles noisy audio well
- ✅ Broad multilingual support
Cons:
- ❌ Paid ($0.006/minute)
- ❌ Requires VOICE_TOOLS_OPENAI_KEY
- ❌ Audio sent to OpenAI
Setup:
```shell
# Add to ~/.hermes/.env
VOICE_TOOLS_OPENAI_KEY=your_key_here
```
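For reference, the OpenAI call itself is short with the official `openai` (v1+) client; `client.audio.transcriptions.create` is the library's real API, while the function wrapper and error handling around it are my sketch, not Hermes' actual code:

```python
import importlib.util
import os
from typing import Any, Dict


def transcribe_via_openai(file_path: str, model: str = "whisper-1") -> Dict[str, Any]:
    """Sketch of an OpenAI Whisper provider with graceful failure modes."""
    if importlib.util.find_spec("openai") is None:
        return {"success": False, "error": "openai package not installed"}
    api_key = os.environ.get("VOICE_TOOLS_OPENAI_KEY")
    if not api_key:
        return {"success": False, "error": "VOICE_TOOLS_OPENAI_KEY not set"}
    from openai import OpenAI

    client = OpenAI(api_key=api_key)
    with open(file_path, "rb") as f:
        resp = client.audio.transcriptions.create(model=model, file=f)
    return {"success": True, "transcript": resp.text}
```

Returning error dictionaries instead of raising keeps the gateway's "Sorry, I couldn't understand that" fallback path simple.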
🐍 Core Implementation
Transcription Tool (transcription_tools.py)
```python
#!/usr/bin/env python3
"""
Transcription Tools Module

Three providers:
- local  (default, free): faster-whisper running locally
- groq   (free tier):     Groq Whisper API
- openai (paid):          OpenAI Whisper API
"""
import importlib.util as _ilu
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Optional imports: degrade gracefully when a backend is missing
_HAS_FASTER_WHISPER = _ilu.find_spec("faster_whisper") is not None
_HAS_OPENAI = _ilu.find_spec("openai") is not None


def transcribe_audio(
    file_path: str,
    provider: str = "local",
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Transcribe an audio file with the specified provider."""
    if provider == "local":
        return _transcribe_local(file_path, model)
    elif provider == "groq":
        return _transcribe_groq(file_path, model)
    elif provider == "openai":
        return _transcribe_openai(file_path, model)
    else:
        return {"success": False, "error": f"Unknown provider: {provider}"}
```
Voice Mode for CLI (voice_mode.py)
For local CLI usage, I added push-to-talk voice support:
```python
import tempfile
import wave
from typing import Optional

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000        # Whisper's native sample rate
CHANNELS = 1               # Mono
MAX_RECORDING_SECONDS = 120


class AudioRecorder:
    """Thread-safe audio recorder using sounddevice.InputStream."""

    def __init__(self):
        self._frames = []
        self._recording = False

    def start(self, on_silence_stop=None):
        """Start recording. Auto-stops on silence if a callback is provided."""
        # ... implementation elided

    def stop(self) -> Optional[str]:
        """Stop recording and save a WAV file."""
        # ... writes to the temp directory
        return wav_path
```
Key features:
- Silence detection: auto-stops after 3 seconds of silence
- RMS threshold: filters out background noise
- WAV output: 16kHz mono (Whisper's native format)
- Temp cleanup: auto-deletes recordings after 1 hour
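The RMS-based silence check behind the first two features can be sketched like this (the threshold value and the `SilenceTracker` class are illustrative assumptions to tune per microphone, not Hermes' actual numbers):

```python
import numpy as np

SAMPLE_RATE = 16000            # matches the recorder above
SILENCE_RMS_THRESHOLD = 0.01   # assumed; tune for your microphone
SILENCE_SECONDS = 3.0          # auto-stop after this much continuous quiet


def rms(chunk: np.ndarray) -> float:
    """Root-mean-square level of a float32 audio chunk in [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(chunk)))) if chunk.size else 0.0


class SilenceTracker:
    """Accumulates consecutive quiet samples; fires once 3s of silence accrue."""

    def __init__(self):
        self._quiet_samples = 0

    def update(self, chunk: np.ndarray) -> bool:
        """Feed one audio chunk; returns True when recording should stop."""
        if rms(chunk) < SILENCE_RMS_THRESHOLD:
            self._quiet_samples += chunk.size
        else:
            self._quiet_samples = 0  # any speech resets the countdown
        return self._quiet_samples >= SILENCE_SECONDS * SAMPLE_RATE
```

In practice the recorder's `InputStream` callback would call `update()` on each incoming block and trigger `stop()` when it returns True.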
🚫 The Hallucination Problem
Whisper has a well-known issue: it hallucinates on silent audio.
When you send near-silent audio (or the user stops speaking), Whisper often outputs:
"Thank you."
"Thanks for watching."
"Subscribe to my channel."
"Продолжение следует..." (Russian: "To be continued...")
"Sous-titres réalisés par la communauté d'amara.org"
This is catastrophic for an AI agent. Imagine:
User: (silence, thinking)
Hermes: "Thank you. Subscribe to my channel."
User: "What?? I didn't say anything!"
My Solution: Hallucination Filter
```python
import re
from typing import Any, Dict

# Entries are stored lowercase and without trailing punctuation,
# since the check below strips '.' and '!' before matching.
WHISPER_HALLUCINATIONS = {
    "thank you",
    "thanks for watching",
    "subscribe to my channel",
    "like and subscribe",
    "bye",
    "the end",
    "продолжение следует",
    "sous-titres",
    "amara.org",
    "ご視聴ありがとうございました",
}


def is_whisper_hallucination(transcript: str) -> bool:
    """Check if a transcript is a known Whisper hallucination."""
    cleaned = transcript.strip().lower()
    if not cleaned:
        return True
    # Exact match, ignoring trailing punctuation
    if cleaned.rstrip('.!') in WHISPER_HALLUCINATIONS:
        return True
    # Repetitive patterns ("Thank you. Thank you. Thank you.")
    if re.match(r'^(?:thank you|thanks|bye|you|ok|the end|[.! ])+$', cleaned):
        return True
    return False


def transcribe_recording(wav_path: str) -> Dict[str, Any]:
    """Transcribe with hallucination filtering."""
    # transcribe_audio and logger come from the module shown earlier
    result = transcribe_audio(wav_path)
    if result.get("success") and is_whisper_hallucination(result["transcript"]):
        logger.info("Filtered Whisper hallucination: %r", result["transcript"])
        return {"success": True, "transcript": "", "filtered": True}
    return result
```
Result: Silent audio returns an empty transcript instead of nonsense.
🌐 Multi-Platform Gateway Integration
The transcription system integrates with Hermesโ messaging gateway:
```python
# gateway/run.py

async def handle_voice_message(event, audio_path: str):
    """Process a voice message from any platform."""
    # Transcribe
    result = transcribe_audio(audio_path)
    if not result["success"]:
        await send_message(event.chat_id, "Sorry, I couldn't understand that.")
        return

    transcript = result["transcript"]

    # Filter hallucinations
    if result.get("filtered"):
        await send_message(event.chat_id, "I didn't catch that. Could you repeat?")
        return

    # Process as a text message
    await handle_text_message(event, transcript)
```
Supported platforms:
- ✅ Telegram (voice messages + audio files)
- ✅ WhatsApp (voice messages)
- ✅ Discord (audio attachments)
- ✅ Slack (audio files)
- ✅ Signal (voice messages)
⚙️ Configuration
~/.hermes/config.yaml
```yaml
stt:
  enabled: true
  provider: local   # local, groq, openai
  model: base       # Whisper model size (local only)
  language: en      # Transcription language

voice:
  auto_transcribe: true    # Auto-transcribe voice messages
  playback_enabled: true   # Play TTS responses
```
~/.hermes/.env
```shell
# Local Whisper (no key needed)
# Just install: pip install faster-whisper

# Groq (free tier)
GROQ_API_KEY=gsk_...

# OpenAI (paid)
VOICE_TOOLS_OPENAI_KEY=sk-...

# Optional: custom local STT command
HERMES_LOCAL_STT_COMMAND=whisper {input_path} --model {model}
```
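One way to wire the config file and these environment variables together is a simple precedence rule: an explicit `stt.provider` wins, otherwise fall back based on which keys are present. A sketch (`pick_stt_provider` is a hypothetical helper, not Hermes' actual selection logic):

```python
import os


def pick_stt_provider(config: dict) -> str:
    """Choose an STT provider: explicit config first, then available API keys,
    then the free local default."""
    provider = config.get("stt", {}).get("provider")
    if provider:
        return provider
    if os.environ.get("GROQ_API_KEY"):
        return "groq"
    if os.environ.get("VOICE_TOOLS_OPENAI_KEY"):
        return "openai"
    return "local"
```

This keeps the zero-configuration path (local faster-whisper) as the default while letting a single env var opt into a faster API backend.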
🎤 CLI Voice Mode
For local terminal usage, I added a voice mode:
```shell
# Enable voice mode
hermes --voice

# Or toggle in-session
/voice on
```
How it works:
- Press and hold a key (or use push-to-talk button)
- Speak (audio level visualized in terminal)
- Release or wait for auto-stop on silence
- Audio transcribed locally
- Transcript sent to Hermes Agent
- Response spoken back via TTS
Requirements:
```shell
pip install sounddevice numpy
```
📊 Performance Comparison
| Provider | Speed | Accuracy | Cost | Privacy |
|---|---|---|---|---|
| Local | 2-5s | Good | Free | ✅ Full |
| Groq | <1s | Very Good | Free tier | ❌ API |
| OpenAI | 1-3s | Excellent | $0.006/min | ❌ API |
My recommendation: Start with local for privacy and zero cost. Switch to Groq if you need faster response times.
🐛 Common Issues & Solutions
Issue: "No audio devices detected"
Cause: Running in headless environment (SSH, Docker, WSL)
Solution:
- Use API-based providers (Groq/OpenAI)
- Or forward audio devices via PulseAudio (WSL)
Issue: "faster-whisper not found"
Cause: Package not installed
Solution:
```shell
pip install faster-whisper
```
Issue: Hallucinations still getting through
Cause: New hallucination patterns not in filter
Solution:
```python
# Add to the WHISPER_HALLUCINATIONS set
WHISPER_HALLUCINATIONS.add("your new phrase")
```
Issue: Slow transcription on local
Cause: CPU-bound, large model
Solution:
- Use the `tiny` or `base` model instead of `large`
- Switch to the Groq API for speed
- Enable GPU acceleration (CUDA)
🔮 Future Improvements
- Streaming transcription: real-time transcription as the user speaks
- Voice activity detection (VAD): better silence detection
- Speaker diarization: "who said what" in group chats
- Multilingual auto-detect: no need to set the language
- Custom wake word: "Hey Hermes" activation
🎯 Key Takeaways
- Voice is essential for natural AI interaction
- Local Whisper works great for privacy-focused setups
- Hallucination filtering is critical. Don't skip this!
- Multi-provider support gives users flexibility
- Auto-stop on silence improves UX dramatically
Next article: I'll cover how I integrated text-to-speech (TTS) for voice responses, completing the full voice conversation loop.
Found this helpful? Share your thoughts on ctrlman.dev or reach out on Telegram @ctrlman.