Real-time audio playback patterns for macOS Apple Silicon. TRIGGERS - audio jitter, tts choppy, sounddevice, afplay jitter, audio architecture, playback glitch, GIL contention audio, launchd audio priority, wrong audio device, airpods, bluetooth audio, device switching.
From kokoro-tts: `npx claudepluginhub terrylica/cc-skills --plugin kokoro-tts`. This skill is limited to using the following tools:
- references/device-routing.md
- references/launchd-qos.md
- references/pipeline-synthesis.md
- references/write-based-stream.md
Battle-tested patterns and anti-patterns for jitter-free audio playback on macOS Apple Silicon, learned from building the Kokoro TTS pipeline.
Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
When building audio playback in Python on macOS, choose based on this hierarchy:
1. Write-based sd.OutputStream ← DEFAULT CHOICE
2. Callback-based sd.OutputStream ← Only if you need sample-level control
3. afplay subprocess ← Only for one-shot playback of existing files
4. macOS say ← NEVER for production TTS
The default choice for Python audio playback. stream.write() blocks in PortAudio's C code until the device buffer has space. No Python code runs on the audio thread, so the GIL is irrelevant.
```python
import sounddevice as sd
import numpy as np

def open_audio_stream() -> sd.OutputStream:
    # Refresh PortAudio to discover hot-plugged devices (Bluetooth, HDMI)
    sd._terminate()
    sd._initialize()
    stream = sd.OutputStream(
        samplerate=24000,
        channels=1,
        dtype="float32",
        blocksize=2048,   # ~85ms blocks at 24kHz
        latency="high",   # large internal buffer (not live, so latency is fine)
    )
    stream.start()
    return stream

# Open per request — close after each to follow device changes
stream = open_audio_stream()

# Play audio — blocks in C code, no GIL contention
audio = np.array([...], dtype=np.float32).reshape(-1, 1)
WRITE_BLOCK = 4096  # ~170ms — responsive to stop, smooth playback
for i in range(0, len(audio), WRITE_BLOCK):
    if interrupted:  # e.g. a flag set by a stop handler
        break
    stream.write(audio[i:i + WRITE_BLOCK])
stream.close()  # close after request so next open uses current default device
```
Why this works:
- `stream.write()` calls into PortAudio's C layer → no Python runs on the audio thread
- Stop mechanism: `stream.abort()` immediately stops playback and unblocks `write()`. Reopen the stream for the next playback.
Reference: write-based-stream.md
For chunked TTS, overlap synthesis and playback:
```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as pool:
    ahead = pool.submit(synthesize, chunks[0])
    for i in range(len(chunks)):
        audio = ahead.result()
        if i + 1 < len(chunks):
            ahead = pool.submit(synthesize, chunks[i + 1])
        stream.write(audio)  # plays while next chunk synthesizes
```
Why: Synthesis takes 500-2000ms per chunk. Without pipelining, there's dead silence between chunks while waiting for synthesis. With pipelining, chunk N+1 is ready by the time chunk N finishes playing (since playback is typically longer than synthesis).
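The latency win can be demonstrated with stand-in `synthesize`/`play` functions that just sleep (durations are illustrative, chosen so playback outlasts synthesis as described above):

```python
import time
from concurrent.futures import ThreadPoolExecutor

SYNTH_S, PLAY_S = 0.05, 0.08  # playback longer than synthesis, as in the text

def synthesize(chunk):
    time.sleep(SYNTH_S)  # stand-in for model inference
    return chunk

def play(audio):
    time.sleep(PLAY_S)   # stand-in for stream.write()

chunks = list(range(4))

# Pipelined: chunk N+1 synthesizes while chunk N plays
start = time.monotonic()
with ThreadPoolExecutor(max_workers=1) as pool:
    ahead = pool.submit(synthesize, chunks[0])
    for i in range(len(chunks)):
        audio = ahead.result()
        if i + 1 < len(chunks):
            ahead = pool.submit(synthesize, chunks[i + 1])
        play(audio)
pipelined = time.monotonic() - start

# Serial: dead silence between chunks while synthesizing
start = time.monotonic()
for c in chunks:
    play(synthesize(c))
serial = time.monotonic() - start

assert pipelined < serial  # overlap hides all but the first synthesis
```

With these numbers the pipelined run pays for only the first synthesis (~0.05s + 4×0.08s) versus the serial 4×0.13s.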
CoreAudio's native sample format is 32-bit float. Use it end-to-end:
```python
# Synthesis output → float32 directly
audio = model.synthesize(text)
if audio.dtype != np.float32:
    audio = audio.astype(np.float32)
if np.max(np.abs(audio)) > 2.0:  # int16 range
    audio = audio / 32768.0
```
Why: Avoids WAV encode/decode overhead. No temp files. No format conversion at playback time. CoreAudio receives the data in its preferred format.
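For instance, the int16 branch in action (synthetic sample values, not real model output):

```python
import numpy as np

raw = np.array([0, 16384, -32768], dtype=np.int16)  # synthetic int16 samples
audio = raw.astype(np.float32)
if np.max(np.abs(audio)) > 2.0:  # values this large imply int16 scale
    audio = audio / 32768.0
print(audio.dtype, audio.tolist())  # → float32 [0.0, 0.5, -1.0]
```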
Apply tiny fade-in/out at chunk boundaries to prevent click artifacts:
```python
FADE_SAMPLES = 48  # 2ms at 24kHz

def apply_boundary_fades(audio: np.ndarray) -> np.ndarray:
    if len(audio) < FADE_SAMPLES * 2:
        return audio
    audio = audio.copy()
    audio[:FADE_SAMPLES] *= np.linspace(0, 1, FADE_SAMPLES, dtype=np.float32)
    audio[-FADE_SAMPLES:] *= np.linspace(1, 0, FADE_SAMPLES, dtype=np.float32)
    return audio
```
Why: Adjacent chunks may have different DC offsets or phase. A 2ms fade is inaudible but prevents the discontinuity click. Simpler and more reliable than inter-chunk crossfade.
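A quick sanity check of the fade (the helper is redefined here so the snippet runs standalone):

```python
import numpy as np

FADE_SAMPLES = 48  # 2ms at 24kHz

def apply_boundary_fades(audio: np.ndarray) -> np.ndarray:
    if len(audio) < FADE_SAMPLES * 2:
        return audio
    audio = audio.copy()
    audio[:FADE_SAMPLES] *= np.linspace(0, 1, FADE_SAMPLES, dtype=np.float32)
    audio[-FADE_SAMPLES:] *= np.linspace(1, 0, FADE_SAMPLES, dtype=np.float32)
    return audio

chunk = np.full(2400, 0.8, dtype=np.float32)  # 100ms of constant signal at 24kHz
faded = apply_boundary_fades(chunk)
print(faded[0], faded[-1])  # → 0.0 0.0 (boundaries silenced, interior untouched)
```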
```xml
<!-- CORRECT: Audio process gets CPU priority -->
<key>Nice</key>
<integer>-10</integer>
<key>ProcessType</key>
<string>Adaptive</string>
```
Why:
- `Nice: -10` gives higher CPU scheduling priority (range: -20 highest to 20 lowest)
- `ProcessType: Adaptive` lets macOS boost priority when the process is actively working

One server, one speak queue, shared across all clients (BTT, Telegram bot, CLI):
```text
BTT shortcut → POST /v1/audio/speak → [server queue] → synthesize → play
Telegram bot → POST /v1/audio/speak → [server queue] → synthesize → play
```
Why: Prevents audio conflicts. One lock protocol. One process to tune. Clients are thin HTTP POST callers.
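The serialization boils down to one critical section. A minimal sketch (the `speak` function and lock name are illustrative, not the actual server API):

```python
import threading

_speak_lock = threading.Lock()  # one lock = one speak queue for all clients
events = []                     # records interleaving for the demo

def speak(text):
    """Every client path funnels through this one critical section."""
    with _speak_lock:
        events.append(("start", text))
        # synthesize + play would run here, while holding the lock
        events.append(("end", text))

threads = [threading.Thread(target=speak, args=(t,)) for t in ("btt", "telegram")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each playback ran to completion before the next started: no overlap
assert events[0][1] == events[1][1] and events[2][1] == events[3][1]
```

In the real server the lock lives in one process, so every transport (HTTP, CLI) shares it for free.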
PortAudio caches the device list at Pa_Initialize() time. Bluetooth devices (AirPods) connecting later are invisible. Two-layer strategy:
```python
def _refresh_audio_devices():
    """Re-init PortAudio to discover hot-plugged devices (~1ms)."""
    sd._terminate()
    sd._initialize()

def open_audio_stream():
    """Open stream with fresh device discovery."""
    _refresh_audio_devices()  # ← discovers AirPods, new HDMI, etc.
    stream = sd.OutputStream(samplerate=24000, channels=1, dtype="float32",
                             blocksize=2048, latency="high")
    stream.start()
    return stream

def maybe_reopen_stream(stream):
    """Between-chunk check for device switching (cached devices only).

    CRITICAL: Do NOT call _refresh_audio_devices() here — it invalidates
    the active stream pointer (PaErrorCode -9988).
    """
    current_default = sd.query_devices(kind='output')['index']
    if stream.device != current_default:
        stream.close()
        return open_audio_stream()
    return stream
```
Two layers:
| Layer | When | Handles | Mechanism |
|---|---|---|---|
| Between requests | Stream open | Bluetooth hot-plug, HDMI connect | _refresh_audio_devices() + new stream |
| Between chunks | Mid-playback | Switching between known devices | sd.query_devices() on cached list |
CRITICAL: Never call sd._terminate() while a stream is active — it invalidates all PortAudio stream pointers.
Reference: device-routing.md
```python
# DON'T — GIL contention causes jitter
def callback(outdata, frames, time_info, status):
    data = audio_queue.get_nowait()  # needs GIL!
    outdata[:, 0] = data

stream = sd.OutputStream(callback=callback, ...)
```
Why it fails: The callback runs on PortAudio's real-time audio thread, but queue.get_nowait() acquires Python's GIL to execute. When MLX synthesis (or any CPU-intensive Python work) holds the GIL — even for 10ms — the callback is delayed, causing buffer underruns → audible glitches.
The callback itself is C-level, but the Python code inside it needs the GIL. This is the fundamental trap: the sounddevice docs say "callback runs on real-time thread" which is true for the C wrapper, but your Python code inside still contends for the GIL.
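The deadline arithmetic makes the failure mode concrete (the 256-sample figure is an illustrative low-latency block size, not a value measured from this pipeline):

```python
samplerate = 24000  # Hz, as used throughout this skill

def callback_budget_ms(blocksize):
    """Time budget per callback: the block's duration in milliseconds."""
    return blocksize / samplerate * 1000

print(round(callback_budget_ms(2048), 1))  # → 85.3
print(round(callback_budget_ms(256), 1))   # → 10.7
# A ~10ms GIL stall blows the low-latency budget outright, and repeated
# stalls eat into even the 85ms budget → buffer underruns → audible glitches.
```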
```python
# DON'T — process spawn + device acquisition per chunk = jitter
for chunk in chunks:
    wav_path = write_temp_wav(chunk)
    subprocess.run(["afplay", wav_path])  # new process each time!
    os.unlink(wav_path)
```
Why it fails:
- `fork()` + `exec()` overhead for each chunk
- each spawned process re-acquires the audio device, adding a variable gap between chunks

When afplay IS appropriate: one-shot playback of an existing file (e.g., a notification sound). Not for streaming/chunked audio.
```xml
<!-- DON'T — macOS actively throttles CPU and I/O -->
<key>Nice</key>
<integer>5</integer>
<key>ProcessType</key>
<string>Background</string>
```
Why it fails: `ProcessType: Background` tells macOS this process doesn't need timely CPU access. macOS will:
- throttle its CPU scheduling
- deprioritize its I/O

For audio playback, this causes sporadic jitter that's hard to reproduce — it only happens when other processes are active.
say as TTS Fallback

```shell
# DON'T — quality cliff, unexpected behavior
if ! kokoro_synthesize "$text"; then
    say "$text"  # "fallback"
fi
```
Why it fails:
- `say` has different timing, volume, and behavior (an abrupt quality cliff)

Instead: fail loudly with a notification. Let the user know the TTS server is down and how to fix it.
```python
# DON'T — stream binds to whatever device was default at process start
stream = sd.OutputStream(samplerate=24000, channels=1, dtype="float32")
stream.start()
# ... reuse forever, never close/reopen
```
Why it fails:
- `Pa_Initialize()` scans devices once. Bluetooth devices connecting later are invisible — a stream open to them fails silently or crashes the playback worker.

Instead: open the stream lazily per request and close it after each. Call `sd._terminate()` + `sd._initialize()` before opening to refresh the device list.
If you hear jitter/choppiness:
```shell
# Check the server's scheduling priority (expect nice -10)
ps -o pid,nice,pri,command -p $(pgrep -f tts_server)

# Any afplay invocations? (should be none on the streaming path)
grep -c afplay ~/.local/state/launchd-logs/kokoro-tts-server/stdout.log

# Look for "audio callback status: output underflow" in the logs (underrun marker)

# Confirm launchd priority settings
plutil -p ~/Library/LaunchAgents/com.terryli.kokoro-tts-server.plist | grep -E 'Nice|ProcessType'
```
If audio goes to wrong device:
grep "Audio stream opened" ~/.local/state/launchd-logs/kokoro-tts-server/stdout.log | tail -3
grep "PaErrorCode\|PortAudio error" ~/.local/state/launchd-logs/kokoro-tts-server/stdout.log | tail -5
- `PaErrorCode -9988` = stream pointer invalidated (device refresh while stream active)
- Check the current default output device:

```shell
~/.local/share/kokoro/.venv/bin/python3 -c "import sounddevice as sd; print(sd.query_devices(kind='output'))"
```

Related skill: devops-tools:macbook-desktop-mode — complementary skill covering USB device resilience (sleep/wake recovery, uhubctl port cycling, battery longevity, pmset desktop configuration). This skill handles the application/playback layer; that one handles the system/USB layer.

After this skill completes, check before closing:
Only update if the issue is real and reproducible — not speculative.