```bash
npx claudepluginhub sacredvoid/skillkit --plugin skillkit
```

Process video and audio recordings into text transcripts and visual frame captures for analysis. Auto-detects your hardware and picks the optimal transcription model, device, and settings.
Run this at the start of every invocation. It determines everything downstream.
python3 -c "
import platform, shutil, subprocess, json
info = {'os': platform.system(), 'arch': platform.machine(), 'ram_gb': 0, 'gpu': 'none', 'vram_gb': 0, 'device': 'cpu', 'dtype': 'float32'}
# RAM
try:
if platform.system() == 'Darwin':
import os; info['ram_gb'] = round(os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3))
elif platform.system() == 'Linux':
with open('/proc/meminfo') as f:
for line in f:
if line.startswith('MemTotal'):
info['ram_gb'] = round(int(line.split()[1]) / (1024**2))
break
else:
import ctypes
mem = ctypes.c_ulonglong(0)
ctypes.windll.kernel32.GetPhysicallyInstalledMemory(ctypes.byref(mem))
info['ram_gb'] = round(mem.value / (1024**2))
except: pass
# GPU detection
try:
import torch
if torch.cuda.is_available():
info['gpu'] = torch.cuda.get_device_name(0)
info['vram_gb'] = round(torch.cuda.get_device_properties(0).total_mem / (1024**3))
info['device'] = 'cuda'
info['dtype'] = 'float16'
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
info['gpu'] = 'Apple Silicon (MPS)'
info['vram_gb'] = info['ram_gb'] # unified memory
info['device'] = 'mps'
info['dtype'] = 'float16'
elif hasattr(torch, 'hip') or 'AMD' in str(getattr(torch, '_C', '')):
info['gpu'] = 'AMD (ROCm)'
info['device'] = 'cuda' # ROCm uses cuda API
info['dtype'] = 'float16'
except ImportError:
pass
# ffmpeg detection
info['ffmpeg'] = shutil.which('ffmpeg') is not None
info['ffprobe'] = shutil.which('ffprobe') is not None
print(json.dumps(info))
"
Parse the JSON output and store it internally as COMPUTE. Present the results to the user:
Detected hardware:
- OS: {os} ({arch})
- RAM: {ram_gb} GB
- GPU: {gpu} ({vram_gb} GB VRAM)
- Compute device: {device}
- ffmpeg: {installed/missing}
Based on hardware, choose the transcription engine:
| Condition | Engine | Reason |
|---|---|---|
| CUDA GPU available | faster-whisper | GPU-accelerated CTranslate2, 4x faster than standard Whisper |
| MPS (Apple Silicon) | openai-whisper | Standard Whisper, uses MPS acceleration when available |
| CPU only | openai-whisper | Standard Whisper on CPU, reliable fallback |
Based on hardware and recording length, recommend a model:
| Condition | Recommended Model | Speed (1hr recording) | Accuracy |
|---|---|---|---|
| CUDA >= 8 GB VRAM | large-v3 | ~5 min | Best |
| CUDA 4-7 GB VRAM | medium | ~8 min | High |
| MPS (Apple Silicon, 16+ GB) | medium | ~15 min | High |
| MPS (Apple Silicon, 8 GB) | small | ~10 min | Good |
| CPU + RAM >= 16 GB | small | ~30 min | Good |
| CPU + RAM < 16 GB | base | ~15 min | Adequate |
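A minimal sketch of both lookup tables as code, assuming COMPUTE is the parsed dict from the detection step (the thresholds simply mirror the tables above):

```python
# Sketch: pick engine and model from the detection output (COMPUTE).
# Mirrors the engine and model tables above; adjust if the tables change.
def pick_engine_and_model(compute: dict) -> tuple[str, str]:
    device = compute["device"]
    if device == "cuda":
        engine = "faster-whisper"
        model = "large-v3" if compute["vram_gb"] >= 8 else "medium"
    elif device == "mps":
        engine = "openai-whisper"
        model = "medium" if compute["ram_gb"] >= 16 else "small"
    else:  # CPU fallback
        engine = "openai-whisper"
        model = "small" if compute["ram_gb"] >= 16 else "base"
    return engine, model

# Example: a CUDA box with 12 GB VRAM -> ("faster-whisper", "large-v3")
```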
Present the recommendation and let the user choose:
AskUserQuestion: "Which transcription model should I use?"
Options:
- [Recommended model] (Recommended) - [reason based on their hardware]
- large-v3 - Best accuracy, needs 8+ GB VRAM or lots of RAM
- medium - High accuracy, balanced speed
- small - Good accuracy, faster processing
- base - Adequate for most meetings, fastest
Store the chosen model as WHISPER_MODEL and engine as WHISPER_ENGINE.
Find the file. Zoom recordings typically follow the pattern:
- `GMT{date}-{time}_Recording_{resolution}.mp4` (video)
- `GMT{date}-{time}_Recording.m4a` (audio only)
- `GMT{date}-{time}_RecordingnewChat.txt` (chat log)

Check for all three artifacts. Read the chat file first (it's small and gives immediate context).
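A minimal discovery sketch, assuming the recordings live in a known directory (`recording_dir` below is a hypothetical location):

```python
# Sketch: locate Zoom recording artifacts by their GMT* naming convention.
from pathlib import Path

recording_dir = Path("~/Documents/Zoom").expanduser()  # hypothetical location
artifacts = {
    "video": sorted(recording_dir.glob("GMT*_Recording_*.mp4")),
    "audio": sorted(recording_dir.glob("GMT*_Recording.m4a")),
    "chat": sorted(recording_dir.glob("GMT*newChat.txt")),
}
for kind, files in artifacts.items():
    # names sort by their GMT timestamp prefix, so [-1] is the newest
    print(kind, files[-1] if files else "not found")
```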
Check and install what's needed based on platform.
```bash
# Check if ffmpeg is available
ffmpeg -version 2>&1 | head -1
ffprobe -version 2>&1 | head -1
```
If ffmpeg is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (Homebrew) | brew install ffmpeg |
| macOS (no Homebrew) | Guide user to install Homebrew first, then ffmpeg |
| Ubuntu/Debian | sudo apt update && sudo apt install -y ffmpeg |
| Fedora/RHEL | sudo dnf install -y ffmpeg |
| Arch Linux | sudo pacman -S ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |
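If you'd rather script the decision, a sketch that mirrors the table (package-manager detection is deliberately naive; it only prints a suggestion):

```python
# Sketch: suggest an ffmpeg install command per platform, per the table above.
import platform, shutil

def suggest_ffmpeg_install() -> str:
    system = platform.system()
    if system == "Darwin":
        return ("brew install ffmpeg" if shutil.which("brew")
                else "install Homebrew first, then: brew install ffmpeg")
    if system == "Linux":
        for mgr, cmd in [("apt", "sudo apt update && sudo apt install -y ffmpeg"),
                         ("dnf", "sudo dnf install -y ffmpeg"),
                         ("pacman", "sudo pacman -S ffmpeg")]:
            if shutil.which(mgr):
                return cmd
        return "install ffmpeg with your distribution's package manager"
    # Windows: prefer winget, then choco, then scoop
    for mgr, cmd in [("winget", "winget install Gyan.FFmpeg"),
                     ("choco", "choco install ffmpeg"),
                     ("scoop", "scoop install ffmpeg")]:
        if shutil.which(mgr):
            return cmd
    return "install a package manager (winget/choco/scoop), then install ffmpeg"

print(suggest_ffmpeg_install())
```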
python3 -c "import torch; print(torch.__version__)" 2>&1
If torch is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (MPS) | pip3 install torch torchvision torchaudio |
| Linux (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| Linux (ROCm) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
| Linux/Windows (CPU) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| Windows (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
Then install the transcription engine:
```bash
# For CUDA systems (faster-whisper with CTranslate2 acceleration)
pip3 install faster-whisper

# For MPS / CPU systems (standard OpenAI Whisper)
pip3 install openai-whisper
```
Use ffprobe (cross-platform) to get the file duration:
```bash
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "/path/to/file"
```
This returns duration in seconds. Convert to human-readable format and display to user. Use this to estimate transcription time based on the selected model and hardware.
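A sketch of that conversion, shelling out to ffprobe from Python (`media_path` is a placeholder):

```python
# Sketch: read duration via ffprobe and format it as H:MM:SS.
import subprocess

media_path = "/path/to/file"  # placeholder
out = subprocess.run(
    ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
     "-of", "csv=p=0", media_path],
    capture_output=True, text=True, check=True,
).stdout.strip()
seconds = float(out)
hours, rem = divmod(int(seconds), 3600)
mins, secs = divmod(rem, 60)
print(f"{hours}:{mins:02d}:{secs:02d}")
```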
Present the plan:
Processing plan:
- File: {filename} ({duration})
- Engine: {WHISPER_ENGINE} with {WHISPER_MODEL} model on {device}
- Estimated transcription time: {estimate}
- Frame extraction: 1 per minute ({frame_count} frames)
Want me to adjust anything before starting?
Rough transcription speed per hour of audio:

| Device | base | small | medium | large-v3 |
|---|---|---|---|---|
| CUDA (openai-whisper, RTX 3060+) | ~1 min/hr | ~2 min/hr | ~4 min/hr | ~5 min/hr |
| CUDA (faster-whisper) | ~30s/hr | ~1 min/hr | ~2 min/hr | ~3 min/hr |
| MPS (M1/M2/M3/M4) | ~3 min/hr | ~6 min/hr | ~15 min/hr | ~25 min/hr |
| CPU (16GB+ RAM) | ~8 min/hr | ~15 min/hr | ~30 min/hr | ~60 min/hr |
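To turn the table into a concrete estimate, multiply the per-hour rate by the recording length. A sketch with the table's figures hard-coded (using the openai-whisper CUDA row; these are ballpark numbers, not benchmarks):

```python
# Sketch: estimate transcription time from the rough per-hour rates above.
# Rates are minutes of processing per hour of audio, keyed by (device, model).
RATES_MIN_PER_HOUR = {
    ("cuda", "base"): 1, ("cuda", "small"): 2, ("cuda", "medium"): 4, ("cuda", "large-v3"): 5,
    ("mps", "base"): 3, ("mps", "small"): 6, ("mps", "medium"): 15, ("mps", "large-v3"): 25,
    ("cpu", "base"): 8, ("cpu", "small"): 15, ("cpu", "medium"): 30, ("cpu", "large-v3"): 60,
}

def estimate_minutes(device: str, model: str, duration_seconds: float) -> float:
    return RATES_MIN_PER_HOUR[(device, model)] * duration_seconds / 3600

# Example: a 90-minute recording, medium model on MPS -> ~22.5 minutes
print(round(estimate_minutes("mps", "medium", 90 * 60), 1))
```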
Extract one frame per minute for visual context. Use a cross-platform temp directory:
```bash
# Create temp directory (cross-platform)
python3 -c "import tempfile, os; d = os.path.join(tempfile.gettempdir(), 'av-frames'); os.makedirs(d, exist_ok=True); print(d)"
```
Then extract frames:
```bash
ffmpeg -i "/path/to/video.mp4" \
  -vf "fps=1/60,scale=1280:-1" -q:v 3 {TEMP_DIR}/frame_%03d.jpg
```
Use the Read tool on each frame to view it. Frames capture the recording's visual context: slides, screen shares, and anything else on screen.
Run the transcription engine. Choose the correct code based on WHISPER_ENGINE:

```python
# WHISPER_ENGINE == "faster-whisper" (CUDA systems)
from faster_whisper import WhisperModel
import json

model = WhisperModel("{WHISPER_MODEL}", device="cuda", compute_type="float16")
# transcribe() returns a lazy generator of segments plus detection info
segments, info = model.transcribe("/path/to/audio", beam_size=5, language="en")

results = []
for segment in segments:
    results.append({
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip()
    })

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump({"language": info.language, "segments": results}, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in results:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text']}\n")
print(f"Transcribed {len(results)} segments")
```
```python
# WHISPER_ENGINE == "openai-whisper" (MPS / CPU systems)
import whisper, json

model = whisper.load_model("{WHISPER_MODEL}")
result = model.transcribe("/path/to/audio", verbose=False)

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump(result, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in result["segments"]:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n")
print(f"Transcribed {len(result['segments'])} segments")
```
Run transcription in the background since it takes significant time. While it runs, review the extracted frames visually.
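One way to background it, assuming the chosen code block above has been saved as a script (`transcribe.py` is a hypothetical name):

```python
# Sketch: launch transcription as a subprocess and poll it while
# reviewing frames. transcribe.py is a hypothetical wrapper script.
import subprocess

proc = subprocess.Popen(
    ["python3", "transcribe.py"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
# ... review frames with the Read tool ...
if proc.poll() is None:
    print("transcription still running")
else:
    print("transcription finished with code", proc.returncode)
```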
Combine the visual frames, transcript, and chat log to build a full understanding of the recording.

Model reference:
| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39M | Very fast | Low | Quick scan, checking if audio is usable |
| base | 74M | Fast | Adequate | Default for short/clear meetings |
| small | 244M | Moderate | Good | General meetings with accents |
| medium | 769M | Slow | High | Important meetings, customer-facing |
| large-v3 | 1.5B | Very slow | Best | Critical recordings, poor audio quality |
Troubleshooting:

| Error | Cause | Fix |
|---|---|---|
| ffmpeg: command not found | ffmpeg not installed | Install per platform table above |
| No module named 'whisper' | Whisper not installed | pip3 install openai-whisper |
| No module named 'faster_whisper' | faster-whisper not installed | pip3 install faster-whisper |
| CUDA out of memory | Model too large for VRAM | Switch to a smaller model |
| RuntimeError: MPS backend | MPS issue on older macOS | Update macOS, set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to relax the MPS memory limit, or fall back to CPU |
| No module named 'torch' | Python deps missing | Install torch per platform table above |
| Garbled transcript | Wrong language or poor audio | Specify language="en" or use a larger model |
| Very slow transcription | Running a large model on CPU | Switch to the base or small model |