whisperx

Name: whisperx
Author: ThePlasmak
ASecurity
Speech-to-text with word-level timestamps, speaker diarization, and forced alignment using WhisperX. Built on faster-whisper with batched inference for 70x realtime speed.
2 stars
0 votes
0 copies
5 views
Added 5/27/2026
developmentpythongoshellbashkubernetes
Works with

cliapi
Security Analysis

A96/100
mediumInstalls packages at runtime which could introduce malicious dependencies
Scanned 5/27/2026
Install via CLI
$openskills install ThePlasmak/whisperx
Files
SKILL.md
---
name: whisperx
description: Speech-to-text with word-level timestamps, speaker diarization, and forced alignment using WhisperX. Built on faster-whisper with batched inference for 70x realtime speed.
version: 1.0.0
author: Sarah Mak
tags: ["audio", "transcription", "whisperx", "speech-to-text", "diarization", "alignment", "subtitles", "ml", "cuda", "gpu"]
homepage: https://github.com/ThePlasmak/whisperx
platforms: ["linux", "macos", "wsl2"]
metadata: {"openclaw":{"emoji":"🎙️","requires":{"bins":["ffmpeg","python3"]}}}
---

# WhisperX

Speech-to-text with **word-level timestamps**, **speaker diarization**, and **forced alignment** — built on faster-whisper with batched inference for up to **70x realtime** transcription speed.

WhisperX extends Whisper with three key capabilities that faster-whisper alone doesn't provide:
1. **Forced alignment** — precise word-level timestamps via phoneme ASR models (wav2vec2)
2. **Speaker diarization** — label who said what (via pyannote.audio)
3. **Batched inference** — process audio in parallel chunks for massive speedup

## When to Use

Use this skill when you need to:
- **Transcribe with word-level timing** — subtitles, captions, karaoke-style highlighting
- **Identify speakers** — meetings, interviews, podcasts, multi-speaker recordings
- **Generate subtitle files** — SRT, VTT with accurate word timestamps
- **Translate speech to English** — from any supported language
- **Batch transcribe** — efficient processing of multiple files
- **Transcribe a section** — extract just part of a long recording

**Trigger phrases:** "transcribe with speakers", "who said what", "diarize", "make subtitles", "word timestamps", "speaker identification", "meeting transcript", "karaoke subtitles"

**When NOT to use:**
- Simple transcription without speaker/timing needs → use **faster-whisper** (lighter, faster)
- Real-time/streaming transcription
- Files <10 seconds where setup overhead isn't worth it

**WhisperX vs faster-whisper:**
| Feature | faster-whisper | WhisperX |
|---------|---------------|----------|
| Basic transcription | ✅ | ✅ |
| Word timestamps | ✅ (approximate) | ✅ (precise, aligned) |
| Speaker diarization | ❌ | ✅ |
| Forced alignment | ❌ | ✅ |
| Batched inference | ❌ | ✅ |
| Word-level subtitles | ❌ | ✅ (karaoke-style) |
| Subtitle generation | Manual | Built-in (SRT/VTT/TSV) |
| Time range trimming | ❌ | ✅ (--start/--end) |
| Hotwords | ❌ | ✅ (boost specific terms) |
| Initial prompt | ❌ | ✅ (domain terms) |
| Speaker renaming | ❌ | ✅ (--speaker-names) |
| Stdin pipe | ❌ | ✅ (read from `-`) |
| Setup complexity | Simple | Requires HF token for diarization |

## Quick Reference

All commands use `./scripts/transcribe` (the skill wrapper), **not** the `whisperx` CLI directly. The wrapper applies a required PyTorch compatibility patch — see "PyTorch 2.6+ Compatibility" below.

| Task | Command | Notes |
|------|---------|-------|
| **Basic transcription** | `./scripts/transcribe audio.mp3` | Word-aligned by default |
| **With speakers** | `./scripts/transcribe audio.mp3 --diarize` | Auto-reads `~/.cache/huggingface/token` |
| **Clean speaker output** | `./scripts/transcribe audio.mp3 --diarize --merge-speakers` | Merges consecutive same-speaker segments |
| **Named speakers** | `./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"` | Replaces SPEAKER_00, SPEAKER_01 |
| **SRT subtitles** | `./scripts/transcribe audio.mp3 --srt -o subs.srt` | Ready for video players |
| **Word-level SRT** | `./scripts/transcribe audio.mp3 --srt --word-level` | Karaoke-style, one word per cue |
| **Wrapped subtitles** | `./scripts/transcribe audio.mp3 --srt --max-line-width 42` | Standard TV subtitle width |
| **VTT subtitles** | `./scripts/transcribe audio.mp3 --vtt -o subs.vtt` | Web-compatible with `<v>` speaker tags |
| **JSON output** | `./scripts/transcribe audio.mp3 --json` | Full data with word timestamps |
| **TSV output** | `./scripts/transcribe audio.mp3 --tsv` | Spreadsheet-friendly |
| **Translate to English** | `./scripts/transcribe audio.mp3 --translate` | Any language → English |
| **Fast, no alignment** | `./scripts/transcribe audio.mp3 --no-align` | Skip forced alignment |
| **Specific language** | `./scripts/transcribe audio.mp3 -l en` | Faster than auto-detect |
| **Partial transcription** | `./scripts/transcribe audio.mp3 --start 1:30 --end 5:00` | Only a section |
| **Boost terms** | `./scripts/transcribe audio.mp3 --hotwords "Kubernetes gRPC"` | Improve rare term recognition |
| **Domain accuracy** | `./scripts/transcribe audio.mp3 --initial-prompt "OpenAI, GPT-4"` | Condition the model |
| **From stdin** | `cat audio.mp3 \| ./scripts/transcribe -` | Pipe from other tools |
| **Auto-detect format** | `./scripts/transcribe audio.mp3 -o out.srt` | Format from extension |

**⚠️ Do NOT run `whisperx` CLI directly** — it will crash on PyTorch 2.6+ with pyannote models. Always use this skill's `./scripts/transcribe` wrapper.

## Model Selection

| Model | Size | Speed | Accuracy | Use Case |
|-------|------|-------|----------|----------|
| `tiny` | 39M | Fastest | Basic | Quick drafts, testing |
| `base` | 74M | Very fast | Good | General use |
| `small` | 244M | Fast | Better | Default for whisperx CLI |
| `medium` | 769M | Moderate | High | Quality transcription |
| `large-v2` | 1.5GB | Slower | Excellent | Best diarization compat |
| `large-v3` | 1.5GB | Slower | Best | Maximum accuracy |
| **`large-v3-turbo`** | 809M | Fast | Excellent | **Recommended (default)** |

**Note:** WhisperX defaults to `small` but this skill defaults to `large-v3-turbo` for the best speed/accuracy balance with GPU.

## Setup

### First-Time Setup

**Prerequisites:** Python 3.10+, ffmpeg, NVIDIA GPU with CUDA (strongly recommended)

**Step 1: Install whisperx**

```bash
# Option A: Run the setup script (auto-detects GPU, creates venv if needed)
./setup.sh

# Option B: Install globally (if you prefer)
pip install whisperx
```

**Step 2 (optional, for diarization): Set up Hugging Face token**

Speaker diarization requires a free Hugging Face account and access to gated models. Skip this if you only need transcription/alignment.

1. Create account at [huggingface.co](https://huggingface.co) (if you don't have one)
2. Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and create a **read** access token
3. Accept the model agreement(s) (click "Agree and access repository"):
   - **whisperx ≥3.8.0** (recommended): Accept [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1) — uses pyannote v4 with better accuracy
   - **whisperx <3.8.0**: Accept **both** [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) and [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
4. Save the token so the script auto-detects it:
   ```bash
   mkdir -p ~/.cache/huggingface && echo -n "hf_YOUR_TOKEN" > ~/.cache/huggingface/token && chmod 600 ~/.cache/huggingface/token
   ```
   Alternatively: set `HF_TOKEN` env var, or pass `--hf-token` per-command.

**Note:** The token and model access are completely free. The models are just gated behind a click-to-agree license. Without step 3, you'll get a 403 error even with a valid token.

### Subsequent Runs

No setup needed — just run `./scripts/transcribe`. The wrapper script:
- Auto-detects GPU/CPU and picks optimal compute type
- Auto-reads the HF token from `~/.cache/huggingface/token` for diarization
- Applies the PyTorch 2.6+ compatibility patch automatically (see below)
- First run for a new model downloads it to `~/.cache/huggingface/` (one-time per model)

### Checking If It's Working

```bash
# Quick test — should print transcript to stdout
./scripts/transcribe some_audio.mp3

# Test diarization — should show [SPEAKER_00], [SPEAKER_01], etc.
./scripts/transcribe some_audio.mp3 --diarize

# Check version
./scripts/transcribe --version

# If diarization fails with 403: model agreements not accepted (see step 3 above)
# If it crashes with pickle/weights_only error: you're running `whisperx` CLI directly instead of the wrapper
```

### Platform Support

| Platform | Acceleration | Speed |
|----------|-------------|-------|
| **Linux + NVIDIA GPU** | CUDA (batched) | ~70x realtime 🚀 |
| **WSL2 + NVIDIA GPU** | CUDA (batched) | ~70x realtime 🚀 |
| macOS Apple Silicon | CPU | ~3-5x realtime |
| macOS Intel | CPU | ~1-2x realtime |
| Linux (no GPU) | CPU | ~1x realtime |

## Usage

All commands use `./scripts/transcribe` — resolve the path relative to this skill's directory.

```bash
# Basic transcription (word-aligned)
./scripts/transcribe audio.mp3

# With speaker diarization (auto-reads ~/.cache/huggingface/token)
./scripts/transcribe audio.mp3 --diarize

# Diarize with merged same-speaker segments (cleaner output)
./scripts/transcribe audio.mp3 --diarize --merge-speakers

# Rename speakers to real names
./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob,Charlie"

# Generate SRT subtitles
./scripts/transcribe audio.mp3 --srt -o subtitles.srt

# SRT with line wrapping (standard TV width)
./scripts/transcribe audio.mp3 --srt --max-line-width 42 -o subtitles.srt

# Word-level karaoke subtitles (one word per cue with precise timing)
./scripts/transcribe audio.mp3 --srt --word-level -o karaoke.srt

# VTT with speaker voice tags (<v Speaker>text</v>)
./scripts/transcribe audio.mp3 --vtt --diarize -o subtitles.vtt

# Auto-detect format from output filename
./scripts/transcribe audio.mp3 -o transcript.json
./scripts/transcribe audio.mp3 -o subtitles.vtt

# TSV for spreadsheets/data analysis
./scripts/transcribe audio.mp3 --tsv -o transcript.tsv

# Transcribe only a section (useful for long recordings)
./scripts/transcribe podcast.mp3 --start 10:30 --end 15:00

# Boost recognition of specific terms (hotwords)
./scripts/transcribe meeting.mp3 --hotwords "Kubernetes gRPC OAuth2"

# Improve accuracy for domain context (initial prompt)
./scripts/transcribe meeting.mp3 --initial-prompt "Attendees: Alice, Bob. Topics: Kubernetes, gRPC, OAuth2"

# Maximum accuracy
./scripts/transcribe audio.mp3 --model large-v3

# Translate non-English audio to English
./scripts/transcribe audio.mp3 --translate -l ja

# Fast mode (skip alignment)
./scripts/transcribe audio.mp3 --no-align

# JSON with full metadata and word timestamps
./scripts/transcribe audio.mp3 --json -o transcript.json

# Specify known speaker count for better diarization
./scripts/transcribe audio.mp3 --diarize --min-speakers 2 --max-speakers 4

# Read audio from stdin (pipe from other tools)
cat audio.mp3 | ./scripts/transcribe -
ffmpeg -i video.mp4 -f wav - | ./scripts/transcribe - --diarize
```

## Options

```
AUDIO_FILE               Path to audio/video file, or '-' to read from stdin

Model options:
  -m, --model NAME       Whisper model (default: large-v3-turbo)
  --batch-size N         Batch size for inference (default: 8, lower if OOM)
  --beam-size N          Beam search size (higher = slower but more accurate)
  --initial-prompt TEXT   Condition the model with domain terms, names, acronyms
  --hotwords TEXT         Space-separated hotwords to boost recognition of rare terms

Device options:
  --device               cpu, cuda, or auto (default: auto)
  --compute-type         int8, float16, float32, or auto (default: auto)
  --threads N            CPU threads for CTranslate2 inference (default: 4)

Language options:
  -l, --language CODE    Language code (auto-detects if omitted)
  --translate            Translate to English

Time range:
  --start TIME           Start time — seconds (90), MM:SS (1:30), HH:MM:SS
  --end TIME             End time — same formats as --start

Alignment options:
  --no-align             Skip forced alignment (no word timestamps)
  --align-model MODEL    Custom phoneme ASR model for alignment

Speaker diarization:
  --diarize              Enable speaker labels
  --hf-token TOKEN       Hugging Face access token (also reads ~/.cache/huggingface/token or HF_TOKEN env)
  --min-speakers N       Minimum speaker count hint
  --max-speakers N       Maximum speaker count hint
  --merge-speakers       Merge consecutive segments from same speaker (cleaner output)
  --speaker-names NAMES  Comma-separated names to replace SPEAKER_00, SPEAKER_01, etc.

Output options:
  -j, --json             JSON output with segments and word timestamps
  --srt                  SRT subtitle format
  --vtt                  WebVTT subtitle format (uses <v> voice tags for speakers)
  --tsv                  TSV (tab-separated values) for data analysis
  --word-level           Word-level subtitles (SRT/VTT only) — karaoke-style
  --max-line-width N     Maximum characters per subtitle line (wraps at word boundaries)
  --output-format FMT    Explicit format (srt, vtt, txt, json, tsv)
  -o, --output FILE      Save to file (format auto-detected from extension)

Miscellaneous:
  -V, --version          Show version
  -q, --quiet            Suppress progress messages
```

## Output Formats

### Plain Text (default)
```
Hello and welcome to the show.
Today we're talking about AI transcription.
```

### Plain Text with Diarization (`--diarize`)
```
[SPEAKER_00] Hello and welcome to the show.
[SPEAKER_01] Thanks for having me.
```

### Plain Text with Named Speakers (`--diarize --speaker-names "Alice,Bob"`)
```
[Alice] Hello and welcome to the show.
[Bob] Thanks for having me.
```

### SRT (`--srt`)
Standard subtitle format, compatible with VLC, YouTube, etc.
```
1
00:00:00,000 --> 00:00:03,500
Hello and welcome to the show.

2
00:00:03,500 --> 00:00:06,200
Today we're talking about AI transcription.
```

### Word-Level SRT (`--srt --word-level`)
One word per cue — for karaoke-style highlighting or precise editing.
```
1
00:00:00,000 --> 00:00:00,320
Hello

2
00:00:00,320 --> 00:00:00,560
and

3
00:00:00,560 --> 00:00:01,100
welcome
```

### WebVTT (`--vtt`)
Web-native subtitle format for HTML5 `<video>` and `<track>`. When used with `--diarize`, uses proper VTT `<v>` voice tags for speaker identification.
```
WEBVTT

00:00:00.000 --> 00:00:03.500
<v Alice>Hello and welcome to the show.</v>

00:00:03.500 --> 00:00:05.000
<v Bob>Thanks for having me.</v>
```

### JSON (`--json`)
Structured output with word-level timestamps and confidence scores.
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.98},
        {"word": "and", "start": 0.32, "end": 0.56, "confidence": 0.95}
      ]
    }
  ]
}
```

### TSV (`--tsv`)
Tab-separated values for spreadsheets and data pipelines.
```
start	end	text
0.000	3.500	Hello and welcome to the show.
3.500	6.200	Today we're talking about AI transcription.
```

## Examples

```bash
# Transcribe a meeting recording with speakers (clean output)
./scripts/transcribe meeting.mp3 --diarize --merge-speakers \
  --speaker-names "Alice,Bob,Charlie" \
  --min-speakers 3 --max-speakers 5 --json -o meeting.json

# Generate subtitles for a video (wrapped to standard width)
./scripts/transcribe video.mp4 --srt --max-line-width 42 -o video.srt

# Karaoke-style word-level subtitles
./scripts/transcribe song.mp3 --vtt --word-level -o karaoke.vtt

# Transcribe just the interesting part of a podcast
./scripts/transcribe podcast.mp3 --start 45:00 --end 1:02:30 --diarize --merge-speakers

# Improve accuracy for technical content (hotwords + initial prompt)
./scripts/transcribe lecture.mp3 \
  --hotwords "PyTorch CTranslate2 whisperx" \
  --initial-prompt "A lecture on ML model optimization"

# Batch transcribe a folder
for file in recordings/*.mp3; do
  ./scripts/transcribe "$file" --json -o "${file%.mp3}.json"
done

# Transcribe YouTube audio (with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3 --diarize --merge-speakers

# Pipe directly from ffmpeg (extract audio on the fly)
ffmpeg -i video.mp4 -f wav -ac 1 -ar 16000 - 2>/dev/null | ./scripts/transcribe -

# Quick draft (fast, no alignment)
./scripts/transcribe audio.mp3 --model base --no-align

# German audio with TSV output for analysis
./scripts/transcribe audio.mp3 -l de --tsv -o transcript.tsv

# Auto-detect format from filename
./scripts/transcribe audio.mp3 -o transcript.srt   # → SRT
./scripts/transcribe audio.mp3 -o data.json         # → JSON
./scripts/transcribe audio.mp3 -o export.tsv        # → TSV
```

## Common Mistakes

| Mistake | Problem | Solution |
|---------|---------|----------|
| **Using CPU when GPU available** | 10-70x slower | Check `nvidia-smi`; verify CUDA |
| **Missing HF token for diarize** | Diarization fails | Get token from huggingface.co/settings/tokens |
| **Not accepting model agreements** | 403 error on diarization model | whisperx ≥3.8.0: accept community-1. Earlier: accept **both** pyannote/speaker-diarization-3.1 AND segmentation-3.0 (see Setup) |
| **Running `whisperx` CLI directly** | Crashes on PyTorch 2.6+ | Always use `./scripts/transcribe` wrapper (applies torch.load patch) |
| **batch_size too high** | CUDA OOM | Lower `--batch-size` (try 4 or 2) |
| **Using large-v3 when turbo works** | Unnecessary slowdown | `large-v3-turbo` is faster with near-identical accuracy |
| **Forgetting --language** | Wastes time auto-detecting | Specify `-l en` when you know the language |
| **Using WhisperX for simple transcription** | Heavier setup for no benefit | Use faster-whisper for basic transcription |
| **--word-level without --srt/--vtt** | Flag is ignored | Word-level only applies to subtitle formats |
| **--merge-speakers without --diarize** | Flag is ignored | Merge only works when speakers are identified |

## Performance Notes

- **First run**: Downloads model + alignment model (one-time)
- **GPU batched inference**: Up to 70x realtime with large-v2
- **Diarization**: Adds ~30-60s overhead for model loading
- **Memory** (GPU VRAM):
  - `large-v3-turbo`: ~2-3GB
  - `large-v3` + diarization: ~4-5GB
  - Reduce `--batch-size` if OOM
- **Completion stats**: The tool prints segment/word counts, speed ratio, and speaker breakdown (when diarizing) at the end

## Supported Languages

WhisperX supports all languages that Whisper supports (99 languages). **Forced alignment** (word timestamps) is available for a subset — if alignment fails for a language, the tool falls back gracefully to segment-level timestamps.

**Languages with alignment support** (common subset):
`en` English, `zh` Chinese, `de` German, `es` Spanish, `fr` French, `it` Italian, `ja` Japanese, `ko` Korean, `pt` Portuguese, `ru` Russian, `nl` Dutch, `pl` Polish, `tr` Turkish, `ar` Arabic, `sv` Swedish, `da` Danish, `fi` Finnish, `hu` Hungarian, `uk` Ukrainian, `el` Greek, `cs` Czech, `ro` Romanian, `vi` Vietnamese, `th` Thai, `hi` Hindi, `he` Hebrew, `id` Indonesian, `ms` Malay, `no` Norwegian, `fa` Persian, `bg` Bulgarian, `ca` Catalan, `hr` Croatian, `sk` Slovak, `sl` Slovenian, `ta` Tamil, `te` Telugu, `ur` Urdu

For the full list, see [whisperx/alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py).

## Hotwords vs Initial Prompt

Both improve accuracy for specific terms, but they work differently:

| | `--hotwords` | `--initial-prompt` |
|---|---|---|
| **How it works** | Boosts probability of specific tokens during decoding | Conditions the model as if these words appeared earlier |
| **Best for** | Rare terms, proper nouns, technical jargon | Setting domain context, style, formatting |
| **Example** | `--hotwords "Kubernetes gRPC OAuth2"` | `--initial-prompt "A technical meeting about cloud infrastructure"` |
| **Can combine** | ✅ Yes, use both together for best results | ✅ |
| **Requires** | whisperx ≥3.7.5 | Any version |

**Tip:** Use hotwords for the specific words you need recognized correctly, and initial prompt for broader context about the audio content.

## PyTorch 2.6+ Compatibility (CRITICAL)

**⚠️ PyTorch 2.6 changed `torch.load()` to default to `weights_only=True`.** This breaks pyannote.audio's model loading (used by both VAD and diarization), because the model checkpoints contain globals like `omegaconf.listconfig.ListConfig` and `torch.torch_version.TorchVersion` that aren't allowlisted.

**Symptoms:**
- `_pickle.UnpicklingError: Weights only load failed` when loading VAD or diarization models
- `'NoneType' object has no attribute 'to'` (pipeline silently returns None)

**How this skill handles it:**

The `scripts/transcribe.py` uses the **whisperx Python API directly** (not as a subprocess) so it can monkey-patch `torch.load` before any model loading happens:

```python
import torch
_original_torch_load = torch.load
def _patched_torch_load(*args, **kwargs):
    kwargs['weights_only'] = False  # Must FORCE, not setdefault — lightning_fabric passes True explicitly
    return _original_torch_load(*args, **kwargs)
torch.load = _patched_torch_load
```

**Key details:**
- The patch **must** use `kwargs['weights_only'] = False` (forced override), NOT `kwargs.setdefault('weights_only', False)` — because `lightning_fabric` explicitly passes `weights_only=True`, which `setdefault` won't override
- The patch **must** be applied before importing `whisperx`, `pyannote`, or any model loading code
- This is why the skill uses the Python API instead of shelling out to `whisperx` CLI — a subprocess can't inherit the monkey-patch
- VAD defaults to **silero** (not pyannote) to avoid a second torch.load issue in the VAD pipeline. Silero loads fine without the patch, but diarization still needs it

**Note:** whisperx ≥3.8.0 migrated to pyannote-audio v4 with `speaker-diarization-community-1`, which may resolve some of these compatibility issues. The patch is kept for broad version support.

**If whisperx CLI is updated to fix this upstream**, the monkey-patch can be removed and the script could switch back to subprocess mode. Track: [whisperX#972](https://github.com/m-bain/whisperX/issues/972)

## Troubleshooting

**`_pickle.UnpicklingError: Weights only load failed`**: PyTorch 2.6+ compat issue. If running via CLI (`whisperx` command directly), this can't be fixed without patching the installed library. Use this skill's `scripts/transcribe` wrapper instead, which applies the patch automatically. See "PyTorch 2.6+ Compatibility" section above.

**"CUDA not available"**: Install PyTorch with CUDA (`pip install torch --index-url https://download.pytorch.org/whl/cu121`)

**"No module named whisperx"**: Run `./setup.sh` or `pip install whisperx`

**Diarization 403 error**: You must accept the model agreement(s). For whisperx ≥3.8.0: accept [community-1](https://huggingface.co/pyannote/speaker-diarization-community-1). For earlier versions: accept **both** [speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) and [segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0). See Setup above.

**Diarization fails but transcription continues**: v1.1.0+ gracefully handles diarization failures — it prints a diagnostic error and continues without speaker labels instead of crashing.

**`'NoneType' object has no attribute 'to'`**: Either the HF token is invalid, the model agreements haven't been accepted, or the torch.load patch isn't applied. Check all three.

**OOM on GPU**: Lower `--batch-size` to 4 or 2

**Alignment fails for language X**: The language may not have a wav2vec2 alignment model. The tool will fall back to segment-level timestamps and print a warning. Check supported languages in [whisperx alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py).

**Slow on CPU**: Expected — use GPU for practical transcription. Even `tiny` model on CPU is ~5-10x slower than `large-v3-turbo` on a mid-range GPU.

**Empty output / no segments**: Audio may be silence or too short. Check with `ffprobe audio.mp3` to verify the file has actual audio content. v1.1.0+ prints a warning and produces valid empty output instead of crashing.

**Timestamps wrong after trimming**: If using `--start`, timestamps in the output reflect the original file's timeline (not relative to the trim point). This is by design — subtitle timecodes stay correct for the source video.

**"No speech detected" warning**: The audio file may contain only music, silence, or non-speech sounds. This is expected behavior, not an error.

## Upstream Changes

**whisperx 3.8.0** (Feb 2026): Migrated to pyannote-audio v4 with `speaker-diarization-community-1`. This model has lower diarization error rates across all benchmarks compared to the older `speaker-diarization-3.1`. Upgrade recommended: `pip install --upgrade whisperx`

**whisperx 3.7.5**: Added `--hotwords` support for boosting recognition of specific terms. This skill exposes it via the `--hotwords` flag.

## References

- [WhisperX GitHub](https://github.com/m-bain/whisperX)
- [WhisperX Paper (INTERSPEECH 2023)](https://arxiv.org/abs/2303.00747)
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) — speaker diarization
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper) — CTranslate2 backend
whisperx

Works with

Security Analysis

Attribution

Comments (0)