voice-stt
Local push-to-talk speech-to-text for Linux: faster-whisper on CUDA,
hardware PTT via evdev, and Unix-socket fanout to any consumer, including a
Claude Code channel that streams dictated transcripts straight into a
running Claude Code session.
Architecture
PTT key (hardware button, hotkey, etc.)
        │
        ▼
voice-stt-ptt (evdev listener)
        │
        ▼
voice-stt start/stop ──► voice-sttd (holds model in VRAM, captures mic)
                              │
                              ▼
                OUT_SOCK (Unix socket, line-delimited UTF-8)
                              │
  ┌───────────────────────────┼───────────────────────────┐
  ▼                           ▼                           ▼
voice-stt listen        voice-stt type              voice-stt clip
(stdout, pipe           (xdotool into               (xclip clipboard)
 to anything)            focused window)
The daemon broadcasts each utterance to all connected output clients, so
you can run as many consumers in parallel as you want.
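To make the fanout concrete, here is a minimal sketch of the broadcast step. It is not the daemon's actual code: the `Fanout` class and its method names are hypothetical, and it assumes only what the diagram states, namely that each utterance goes out as one newline-terminated UTF-8 line to every connected client.

```python
import socket


class Fanout:
    """Hypothetical sketch: deliver each utterance to every connected client."""

    def __init__(self) -> None:
        self.clients: list[socket.socket] = []

    def add(self, conn: socket.socket) -> None:
        """Register a newly accepted output client."""
        self.clients.append(conn)

    def broadcast(self, utterance: str) -> int:
        """Send one utterance as a single UTF-8 line; return clients reached."""
        payload = (utterance.strip() + "\n").encode("utf-8")
        alive = []
        for conn in self.clients:
            try:
                conn.sendall(payload)
                alive.append(conn)
            except OSError:
                conn.close()  # drop consumers that have gone away
        self.clients = alive
        return len(alive)
```

Dropping a client on the first failed send keeps one dead consumer from blocking the others, which is one plausible way to support an arbitrary number of parallel consumers.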
One-time setup
Clone the repo anywhere you like — the examples below assume ~/projects/voice-stt:
git clone https://github.com/MaxInertia/claude-voice-input-channel.git ~/projects/voice-stt
cd ~/projects/voice-stt
System packages (Ubuntu/Debian):
# required
sudo apt install libportaudio2
# optional — only if you want the X11-specific consumers and hotkey:
sudo apt install xdotool xclip xbindkeys
libportaudio2 is required by sounddevice to open the mic. The X11
packages are only needed if you want the voice-stt type (xdotool) or
voice-stt clip (xclip) consumers, or the keyboard-PTT fallback
(xbindkeys). The daemon, PTT listener, Claude Code channel, and the
listen consumer all work on any Linux display server without them.
Install uv (if you don't already have it):
curl -LsSf https://astral.sh/uv/install.sh | sh
# new shells pick it up automatically; for the current shell:
export PATH="$HOME/.local/bin:$PATH"
CUDA libs (cuBLAS + cuDNN) are pulled in as Python deps (nvidia-cublas-cu12,
nvidia-cudnn-cu12) and dlopen'd at startup by daemon.py, so you do not
need system libcudnn or to fiddle with LD_LIBRARY_PATH. You only need a
working NVIDIA driver (check with nvidia-smi).
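For readers curious what "dlopen'd at startup" can look like, the following is a rough sketch rather than the contents of daemon.py. The function names are hypothetical, and the `nvidia/<package>/lib` directory layout under site-packages is an assumption about how the NVIDIA wheels unpack.

```python
import ctypes
import glob
import os
import sysconfig


def nvidia_lib_dirs(packages=("cublas", "cudnn")):
    """Find lib directories of pip-installed NVIDIA wheels (assumed layout)."""
    roots = {sysconfig.get_paths()[key] for key in ("purelib", "platlib")}
    dirs = []
    for root in sorted(roots):
        for pkg in packages:
            candidate = os.path.join(root, "nvidia", pkg, "lib")
            if os.path.isdir(candidate):
                dirs.append(candidate)
    return dirs


def preload_shared_libs(lib_dirs):
    """dlopen every shared object in lib_dirs; return the paths that loaded."""
    loaded = []
    for lib_dir in lib_dirs:
        for path in sorted(glob.glob(os.path.join(lib_dir, "*.so*"))):
            try:
                # RTLD_GLOBAL makes the symbols visible to libraries loaded later.
                ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
                loaded.append(path)
            except OSError:
                pass  # not loadable on this machine; skip it
    return loaded
```

Calling `preload_shared_libs(nvidia_lib_dirs())` before importing the inference library is the general idea; with the libraries already in the process, the dynamic linker does not need LD_LIBRARY_PATH to find them.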
Install the Python dependencies (from the repo root):
uv sync
Configuration is read from a .env file at the repo root. Copy the example
and edit if you want to change the defaults (model, compute type, input
device, PTT key — all documented inline in the example):
cp .env.example .env
$EDITOR .env
The defaults in .env.example work out of the box for an 8 GB NVIDIA GPU
on a modern Linux desktop with PipeWire. You can skip editing .env
entirely and the daemon will run with the built-in defaults.
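The "file overrides built-in defaults" behaviour can be sketched as below. This is an illustration, not the project's loader: the key names `MODEL` and `COMPUTE_TYPE` and the `float16` default are hypothetical placeholders, and the real names are the ones documented in .env.example.

```python
from pathlib import Path

# Hypothetical keys and defaults, for illustration only.
DEFAULTS = {"MODEL": "medium.en", "COMPUTE_TYPE": "float16"}


def load_env(path, defaults=None):
    """Return the defaults, overridden by KEY=VALUE lines from a .env file."""
    config = dict(DEFAULTS if defaults is None else defaults)
    env_file = Path(path)
    if not env_file.is_file():
        return config  # no .env present: built-in defaults apply
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config
```

Under these assumptions, `load_env(".env")` on a repo without a .env file simply returns the defaults, which matches the "skip editing .env entirely" path described above.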
The first run of the daemon downloads the model (~1.5 GB for medium.en) from
Hugging Face into ~/.cache/huggingface/. After that it runs fully offline: no
audio, transcripts, or telemetry leave the machine.
Run
The scripts/voice-stt-svc helper launches both the daemon and the PTT
listener in the background and tears them down again. You can run it
directly from the repo:
./scripts/voice-stt-svc start # launch voice-sttd + voice-stt-ptt (backgrounded)
./scripts/voice-stt-svc status # show pids / running state
./scripts/voice-stt-svc logs # tail both log files
./scripts/voice-stt-svc stop # kill both, clean up sockets
./scripts/voice-stt-svc restart
Optional: if you have a personal bin directory on your PATH
(commonly ~/bin or ~/.local/bin), symlink the wrapper into it so you
can call it as a bare voice-stt-svc from anywhere:
# example — adjust the target directory to wherever your PATH picks up
# personal binaries (check with: echo $PATH)
ln -sf "$PWD/scripts/voice-stt-svc" ~/.local/bin/voice-stt-svc
Logs land at /tmp/voice-stt-daemon.log and /tmp/voice-stt-ptt.log. There
is no autostart on boot — you launch it when you want it.
Once voice-stt-svc start reports both running, hold your configured PTT
key and speak. To consume the transcripts, run any consumer in the
foreground:
cd ~/projects/voice-stt
uv run voice-stt listen # stdout
uv run voice-stt type # type into focused window
uv run voice-stt clip # copy to clipboard
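Because the output socket carries line-delimited UTF-8, a custom consumer can be small. The sketch below is an assumption-laden example, not part of the project: `iter_utterances` and `connect` are hypothetical helper names, and the socket path has to come from your own configuration (OUT_SOCK).

```python
import socket


def connect(path):
    """Open a stream connection to a Unix socket at the given path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def iter_utterances(sock, bufsize=4096):
    """Yield one decoded utterance per newline-terminated line."""
    buffer = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break  # the daemon closed the connection
        buffer += chunk
        # A chunk may hold several lines or end mid-line, so split on
        # newlines and keep the unfinished remainder for the next recv.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield line.decode("utf-8")
```

A consumer loop would then look like `for text in iter_utterances(connect(out_sock_path)): ...`, where `out_sock_path` is whatever path your setup uses. Splitting on the newline byte before decoding is safe because that byte never occurs inside a multi-byte UTF-8 sequence.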
Push-to-talk hotkey
voice-stt-ptt is a small evdev listener that watches for a chosen key's
press/release and calls voice-stt start / voice-stt stop accordingly.
Hold the key to dictate; release to transcribe.
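The press/release logic amounts to a tiny state machine, sketched here under stated assumptions. `ptt_action` is a hypothetical name, not the listener's real function; the numeric values 0, 1, and 2 are the Linux input-event values for key release, press, and autorepeat.

```python
KEY_UP, KEY_DOWN, KEY_REPEAT = 0, 1, 2  # Linux EV_KEY event values


def ptt_action(event_code, event_value, ptt_code, held):
    """Map one key event to ("start" | "stop" | None, new_held_state)."""
    if event_code != ptt_code:
        return None, held  # some other key: ignore
    if event_value == KEY_DOWN and not held:
        return "start", True  # key went down: begin capturing
    if event_value == KEY_UP and held:
        return "stop", False  # key released: stop and transcribe
    return None, held  # autorepeat or duplicate event


# Example: 70 is KEY_SCROLLLOCK in linux/input-event-codes.h, chosen arbitrarily.
held = False
for value in (KEY_DOWN, KEY_REPEAT, KEY_REPEAT, KEY_UP):
    action, held = ptt_action(70, value, 70, held)
    if action:
        print(action)  # a real listener would invoke voice-stt start / stop here
```

Tracking `held` explicitly matters because a held key produces a stream of autorepeat events; without the flag, each repeat could trigger another start.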
The listener requires read access to /dev/input/event*. Add yourself to the
input group once:
sudo usermod -aG input $USER
# log out and back in
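If the listener still cannot see key events after logging back in, a quick check like the following can narrow things down. This is a diagnostic sketch with hypothetical helper names, not a tool shipped with the project.

```python
import glob
import grp
import os


def unreadable_input_devices(pattern="/dev/input/event*"):
    """Return matching device nodes that the current process cannot read."""
    return [p for p in sorted(glob.glob(pattern)) if not os.access(p, os.R_OK)]


def in_group(name):
    """Return True if the current process belongs to the named group."""
    try:
        gid = grp.getgrnam(name).gr_gid
    except KeyError:
        return False  # the group does not exist on this system
    return gid in os.getgroups() or gid == os.getegid()


if __name__ == "__main__":
    print("in input group:", in_group("input"))
    print("unreadable devices:", unreadable_input_devices())
```

An empty list of unreadable devices together with `in input group: True` suggests permissions are fine and the problem lies elsewhere, for example in the configured PTT key.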