How I Integrated Whisper Speech-to-Text into Hermes Agent
- Ctrl Man
- AI , Software Development , Developer Tools
- 22 Mar, 2026
🎯 Why Voice Matters for AI Agents
Text-based interaction is limiting. Sometimes you're driving, cooking, or just don't want to type. Voice messages are faster, more natural, and accessible to everyone.
When I built Hermes Agent, my autonomous AI assistant that runs on Telegram, Discord, WhatsApp, and Slack, I knew voice support was essential. But I had specific requirements:
- Privacy-first: process locally when possible
- Free tier: no mandatory API costs
- Multi-platform: work across all messaging apps
- Accurate: filter out Whisper's hallucinations on silence
Hereโs how I did it.
🏗️ Architecture Overview
```
┌──────────────────────────────────────────────┐
│ User sends voice message                     │
│ (Telegram/WhatsApp/etc.)                     │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Gateway receives audio file (.ogg/.wav)      │
│ Saves to a temporary location                │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Transcription pipeline (3 providers)         │
│                                              │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Local      │ │ Groq       │ │ OpenAI     │ │
│ │ (default)  │ │ (free tier)│ │ (paid)     │ │
│ │            │ │            │ │            │ │
│ │ faster-    │ │ Whisper    │ │ Whisper    │ │
│ │ whisper    │ │ API        │ │ API        │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Hallucination filter (critical!)             │
│                                              │
│ Checks for common Whisper mistakes           │
│ on silence:                                  │
│   • "Thank you."                             │
│   • "Thanks for watching."                   │
│   • "Subscribe to my channel."               │
│   • "The end."                               │
│   • Russian/French/Japanese outro text       │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ Transcript sent to Hermes Agent              │
│ Agent processes it as a text message         │
│ Response sent back to the user               │
└──────────────────────────────────────────────┘
```
🔧 Three Transcription Providers
1. Local Whisper (Default, Free)
Pros:
- ✅ No API key required
- ✅ Complete privacy: audio never leaves your machine
- ✅ No rate limits
- ✅ Works offline
Cons:
- ❌ Requires the faster-whisper Python package (~150MB model download)
- ❌ Slower than API-based solutions
- ❌ Uses local CPU/GPU resources
Setup:
```shell
pip install faster-whisper
```
The model auto-downloads on first use. I use the `base` model, a good balance of speed and accuracy.
2. Groq Whisper (Free Tier)
Pros:
- ✅ Blazing fast (Groq's LPU inference)
- ✅ Free tier available
- ✅ No local resources used
Cons:
- ❌ Requires GROQ_API_KEY
- ❌ Rate limits on free tier
- ❌ Audio sent to an external API
Setup:
```shell
# Add to ~/.hermes/.env
GROQ_API_KEY=your_key_here
```
3. OpenAI Whisper (Paid)
Pros:
- ✅ Highest accuracy
- ✅ Handles noisy audio well
- ✅ Broad multilingual support
Cons:
- ❌ Paid ($0.006/minute)
- ❌ Requires VOICE_TOOLS_OPENAI_KEY
- ❌ Audio sent to OpenAI
Setup:
```shell
# Add to ~/.hermes/.env
VOICE_TOOLS_OPENAI_KEY=your_key_here
```
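For reference, the OpenAI call itself is short with the official `openai` (v1+) client; `client.audio.transcriptions.create` is the library's real API, while the function wrapper and error handling around it are my sketch, not Hermes' actual code:

```python
import importlib.util
import os
from typing import Any, Dict


def transcribe_via_openai(file_path: str, model: str = "whisper-1") -> Dict[str, Any]:
    """Sketch of an OpenAI Whisper provider with graceful failure modes."""
    if importlib.util.find_spec("openai") is None:
        return {"success": False, "error": "openai package not installed"}
    api_key = os.environ.get("VOICE_TOOLS_OPENAI_KEY")
    if not api_key:
        return {"success": False, "error": "VOICE_TOOLS_OPENAI_KEY not set"}
    from openai import OpenAI

    client = OpenAI(api_key=api_key)
    with open(file_path, "rb") as f:
        resp = client.audio.transcriptions.create(model=model, file=f)
    return {"success": True, "transcript": resp.text}
```

Returning error dictionaries instead of raising keeps the gateway's "Sorry, I couldn't understand that" fallback path simple.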
🐍 Core Implementation
Transcription Tool (transcription_tools.py)
```python
#!/usr/bin/env python3
"""
Transcription Tools Module

Three providers:
- local  (default, free): faster-whisper running locally
- groq   (free tier):     Groq Whisper API
- openai (paid):          OpenAI Whisper API
"""
import importlib.util as _ilu
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Optional imports: degrade gracefully when a backend is missing
_HAS_FASTER_WHISPER = _ilu.find_spec("faster_whisper") is not None
_HAS_OPENAI = _ilu.find_spec("openai") is not None


def transcribe_audio(
    file_path: str,
    provider: str = "local",
    model: Optional[str] = None,
) -> Dict[str, Any]:
    """Transcribe an audio file with the specified provider."""
    if provider == "local":
        return _transcribe_local(file_path, model)
    elif provider == "groq":
        return _transcribe_groq(file_path, model)
    elif provider == "openai":
        return _transcribe_openai(file_path, model)
    else:
        return {"success": False, "error": f"Unknown provider: {provider}"}
```
Voice Mode for CLI (voice_mode.py)
For local CLI usage, I added push-to-talk voice support:
```python
import tempfile
import wave
from typing import Optional

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000        # Whisper's native sample rate
CHANNELS = 1               # Mono
MAX_RECORDING_SECONDS = 120


class AudioRecorder:
    """Thread-safe audio recorder using sounddevice.InputStream."""

    def __init__(self):
        self._frames = []
        self._recording = False

    def start(self, on_silence_stop=None):
        """Start recording. Auto-stops on silence if a callback is provided."""
        # ... implementation elided

    def stop(self) -> Optional[str]:
        """Stop recording and save a WAV file."""
        # ... writes to the temp directory
        return wav_path
```
Key features:
- Silence detection: auto-stops after 3 seconds of silence
- RMS threshold: filters out background noise
- WAV output: 16kHz mono (Whisper's native format)
- Temp cleanup: auto-deletes recordings after 1 hour
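The RMS-based silence check behind the first two features can be sketched like this (the threshold value and the `SilenceTracker` class are illustrative assumptions to tune per microphone, not Hermes' actual numbers):

```python
import numpy as np

SAMPLE_RATE = 16000            # matches the recorder above
SILENCE_RMS_THRESHOLD = 0.01   # assumed; tune for your microphone
SILENCE_SECONDS = 3.0          # auto-stop after this much continuous quiet


def rms(chunk: np.ndarray) -> float:
    """Root-mean-square level of a float32 audio chunk in [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(chunk)))) if chunk.size else 0.0


class SilenceTracker:
    """Accumulates consecutive quiet samples; fires once 3s of silence accrue."""

    def __init__(self):
        self._quiet_samples = 0

    def update(self, chunk: np.ndarray) -> bool:
        """Feed one audio chunk; returns True when recording should stop."""
        if rms(chunk) < SILENCE_RMS_THRESHOLD:
            self._quiet_samples += chunk.size
        else:
            self._quiet_samples = 0  # any speech resets the countdown
        return self._quiet_samples >= SILENCE_SECONDS * SAMPLE_RATE
```

In practice the recorder's `InputStream` callback would call `update()` on each incoming block and trigger `stop()` when it returns True.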
🚫 The Hallucination Problem
Whisper has a well-known issue: it hallucinates on silent audio.
When you send near-silent audio (or the user stops speaking), Whisper often outputs:
"Thank you."
"Thanks for watching."
"Subscribe to my channel."
"Продолжение следует..." (Russian: "To be continued...")
"Sous-titres réalisés par la communauté d'amara.org"
This is catastrophic for an AI agent. Imagine:
User: (silence, thinking)
Hermes: "Thank you. Subscribe to my channel."
User: "What?? I didn't say anything!"
My Solution: Hallucination Filter
```python
import re
from typing import Any, Dict

# Entries are stored lowercase and without trailing punctuation,
# since the check below strips '.' and '!' before matching.
WHISPER_HALLUCINATIONS = {
    "thank you",
    "thanks for watching",
    "subscribe to my channel",
    "like and subscribe",
    "bye",
    "the end",
    "продолжение следует",
    "sous-titres",
    "amara.org",
    "ご視聴ありがとうございました",
}


def is_whisper_hallucination(transcript: str) -> bool:
    """Check if a transcript is a known Whisper hallucination."""
    cleaned = transcript.strip().lower()
    if not cleaned:
        return True
    # Exact match, ignoring trailing punctuation
    if cleaned.rstrip('.!') in WHISPER_HALLUCINATIONS:
        return True
    # Repetitive patterns ("Thank you. Thank you. Thank you.")
    if re.match(r'^(?:thank you|thanks|bye|you|ok|the end|[.! ])+$', cleaned):
        return True
    return False


def transcribe_recording(wav_path: str) -> Dict[str, Any]:
    """Transcribe with hallucination filtering."""
    # transcribe_audio and logger come from the module shown earlier
    result = transcribe_audio(wav_path)
    if result.get("success") and is_whisper_hallucination(result["transcript"]):
        logger.info("Filtered Whisper hallucination: %r", result["transcript"])
        return {"success": True, "transcript": "", "filtered": True}
    return result
```
Result: Silent audio returns an empty transcript instead of nonsense.
🌐 Multi-Platform Gateway Integration
The transcription system integrates with Hermesโ messaging gateway:
```python
# gateway/run.py

async def handle_voice_message(event, audio_path: str):
    """Process a voice message from any platform."""
    # Transcribe
    result = transcribe_audio(audio_path)
    if not result["success"]:
        await send_message(event.chat_id, "Sorry, I couldn't understand that.")
        return

    transcript = result["transcript"]

    # Filter hallucinations
    if result.get("filtered"):
        await send_message(event.chat_id, "I didn't catch that. Could you repeat?")
        return

    # Process as a text message
    await handle_text_message(event, transcript)
```
Supported platforms:
- ✅ Telegram (voice messages + audio files)
- ✅ WhatsApp (voice messages)
- ✅ Discord (audio attachments)
- ✅ Slack (audio files)
- ✅ Signal (voice messages)
⚙️ Configuration
~/.hermes/config.yaml
```yaml
stt:
  enabled: true
  provider: local   # local, groq, openai
  model: base       # Whisper model size (local only)
  language: en      # Transcription language

voice:
  auto_transcribe: true    # Auto-transcribe voice messages
  playback_enabled: true   # Play TTS responses
```
~/.hermes/.env
```shell
# Local Whisper (no key needed)
# Just install: pip install faster-whisper

# Groq (free tier)
GROQ_API_KEY=gsk_...

# OpenAI (paid)
VOICE_TOOLS_OPENAI_KEY=sk-...

# Optional: custom local STT command
HERMES_LOCAL_STT_COMMAND=whisper {input_path} --model {model}
```
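One way to wire the config file and these environment variables together is a simple precedence rule: an explicit `stt.provider` wins, otherwise fall back based on which keys are present. A sketch (`pick_stt_provider` is a hypothetical helper, not Hermes' actual selection logic):

```python
import os


def pick_stt_provider(config: dict) -> str:
    """Choose an STT provider: explicit config first, then available API keys,
    then the free local default."""
    provider = config.get("stt", {}).get("provider")
    if provider:
        return provider
    if os.environ.get("GROQ_API_KEY"):
        return "groq"
    if os.environ.get("VOICE_TOOLS_OPENAI_KEY"):
        return "openai"
    return "local"
```

This keeps the zero-configuration path (local faster-whisper) as the default while letting a single env var opt into a faster API backend.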
🎤 CLI Voice Mode
For local terminal usage, I added a voice mode:
```shell
# Enable voice mode
hermes --voice

# Or toggle in-session
/voice on
```
How it works:
- Press and hold a key (or use push-to-talk button)
- Speak (audio level visualized in terminal)
- Release or wait for auto-stop on silence
- Audio transcribed locally
- Transcript sent to Hermes Agent
- Response spoken back via TTS
Requirements:
```shell
pip install sounddevice numpy
```
📊 Performance Comparison
| Provider | Speed | Accuracy | Cost | Privacy |
|---|---|---|---|---|
| Local | 2-5s | Good | Free | ✅ Full |
| Groq | <1s | Very Good | Free tier | ❌ API |
| OpenAI | 1-3s | Excellent | $0.006/min | ❌ API |
My recommendation: Start with local for privacy and zero cost. Switch to Groq if you need faster response times.
🐛 Common Issues & Solutions
Issue: "No audio devices detected"
Cause: Running in headless environment (SSH, Docker, WSL)
Solution:
- Use API-based providers (Groq/OpenAI)
- Or forward audio devices via PulseAudio (WSL)
Issue: "faster-whisper not found"
Cause: Package not installed
Solution:
```shell
pip install faster-whisper
```
Issue: Hallucinations still getting through
Cause: New hallucination patterns not in filter
Solution:
```python
# Add to the WHISPER_HALLUCINATIONS set
WHISPER_HALLUCINATIONS.add("your new phrase")
```
Issue: Slow transcription on local
Cause: CPU-bound, large model
Solution:
- Use the `tiny` or `base` model instead of `large`
- Switch to the Groq API for speed
- Enable GPU acceleration (CUDA)
🔮 Future Improvements
- Streaming transcription: real-time transcription as the user speaks
- Voice activity detection (VAD): better silence detection
- Speaker diarization: "who said what" in group chats
- Multilingual auto-detect: no need to set the language
- Custom wake word: "Hey Hermes" activation
🎯 Key Takeaways
- Voice is essential for natural AI interaction
- Local Whisper works great for privacy-focused setups
- Hallucination filtering is critical. Don't skip this!
- Multi-provider support gives users flexibility
- Auto-stop on silence improves UX dramatically
Next article: I'll cover how I integrated text-to-speech (TTS) for voice responses, completing the full voice conversation loop.
Found this helpful? Share your thoughts on ctrlman.dev or reach out on Telegram @ctrlman.