watch-video

Name: watch-video
Author: Newuxtreme
SLASH-COMMAND-ONLY. Invoke ONLY when the user explicitly types the literal `/watch-video` slash command. Never auto-trigger on phrases like "watch this video," "summarize this video," "take notes on this video," or on a bare YouTube URL. For ad-hoc YouTube questions, fetch the transcript directly (YouTube page or yt-dlp) and skim — do not run this heavyweight pipeline.
Added 5/26/2026
Install via CLI
$ openskills install Newuxtreme/watch-video-skill
Files
SKILL.md
---
name: watch-video
description: SLASH-COMMAND-ONLY. Invoke ONLY when the user explicitly types the literal `/watch-video` slash command. Never auto-trigger on phrases like "watch this video," "summarize this video," "take notes on this video," or on a bare YouTube URL. For ad-hoc YouTube questions, fetch the transcript directly (YouTube page or yt-dlp) and skim — do not run this heavyweight pipeline.
---

# Watch Video

Claude can't stream video directly. This skill fakes it: a Python pipeline (vendored from [bradautomates/claude-video](https://github.com/bradautomates/claude-video) under `scripts/`) downloads the video, extracts auto-scaled JPEG frames with ffmpeg, pulls a timestamped transcript (native captions first, Whisper API fallback), and prints a markdown report listing every frame path. Claude then `Read`s each frame, aligns it to the spoken text, and writes a structured notes file.

## When to invoke

**Slash-command only.** Run this skill ONLY when the user literally types `/watch-video`. That is the sole trigger.

Do NOT invoke on:
- Casual phrases like "watch this video," "summarize this," "take notes on this YouTube video," "analyze this reel"
- A bare YouTube URL pasted with a question ("what useful tips are in this video?" + URL)
- Any natural-language ask about a video that lacks the explicit `/watch-video` command

**Default for YouTube questions without the slash command:** pull the transcript the fastest way available — fetch the YouTube page / use yt-dlp captions / a transcript site — skim it, and answer from that. The frame-extraction pipeline is overkill unless the user explicitly asks for it via `/watch-video`.

> If you'd rather have auto-trigger behavior, edit the `description:` line above to match the keywords you want Claude to fire on (e.g. "Use when the user wants Claude to watch, analyze, or take notes on a video"). Slash-only is the default in this repo because explicit invocation prevents accidental token burn on long videos.

## Dependencies

- **ffmpeg + ffprobe** on PATH — for frame and audio extraction
- **yt-dlp** on PATH — for downloading and caption fetching
- **Python 3.9+** — the bundled scripts use `from __future__ import annotations` so 3.9 works
- **Optional:** Whisper API key for videos without native captions. Set `GROQ_API_KEY` (preferred — cheaper/faster, runs `whisper-large-v3`) or `OPENAI_API_KEY` in `~/.config/watch/.env`. Without one, captioned videos work fine; uncaptioned videos return frames-only.

Run `python scripts/setup.py --check` to verify dependencies, or `python scripts/setup.py` to scaffold the `.env` and check binaries. On macOS, the installer auto-installs missing binaries via Homebrew. On Linux/Windows, it prints exact install commands.
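
For a rough idea of what a dependency check involves, here is a minimal sketch. It is not the bundled `setup.py`; the function names `missing_binaries` and `python_ok` are hypothetical, and the only facts taken from this document are the three binary names and the Python 3.9 minimum.

```python
import shutil
import sys

# Binaries this skill expects on PATH (from the Dependencies list).
REQUIRED_BINARIES = ("ffmpeg", "ffprobe", "yt-dlp")


def missing_binaries(names=REQUIRED_BINARIES, which=shutil.which):
    """Return the binary names that cannot be resolved on PATH."""
    return [name for name in names if which(name) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 9)):
    """True when the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum
```

An empty list from `missing_binaries()` plus `python_ok()` returning `True` means the required dependencies are present; the optional Whisper key is not checked here.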

## Pipeline

The work happens in `scripts/watch.py`. It downloads the video, extracts frames, transcribes the audio, and prints a markdown report to stdout that lists every frame path. The pipeline auto-scales the frame budget by duration (hard cap 100 frames / 2 fps), so no manual interval tuning is needed.

```
python scripts/watch.py "<youtube-url-or-local-path>" [flags]
```

Flags worth knowing:
- `--start T` / `--end T` — focus on a section (`SS`, `MM:SS`, or `HH:MM:SS`). Auto-scales fps denser inside the range. Use this for any question about a specific moment, or for any video > 10 min where the user's question is about one part.
- `--max-frames N` — lower the cap for tighter token budget (default 80, hard max 100).
- `--resolution W` — frame width in px (default 512; bump to 1024 only if on-screen text is unreadable).
- `--whisper groq|openai` — force a specific Whisper backend (default: prefer Groq if both keys exist).
- `--no-whisper` — skip transcription entirely if no captions. Frames-only output.
- `--out-dir DIR` — keep working files somewhere specific (default: an auto-generated tmp dir).
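
The three timestamp shapes accepted by `--start` / `--end` can be normalized to seconds as sketched below. This is an illustration only: `parse_timestamp` is a hypothetical helper, and it assumes whole-second values, which may not match how `watch.py` parses them.

```python
def parse_timestamp(value: str) -> int:
    """Convert 'SS', 'MM:SS', or 'HH:MM:SS' into whole seconds.

    Assumption: every field is a non-negative integer; fractional
    seconds are rejected here even if the real script accepts them.
    """
    parts = value.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdecimal() for p in parts):
        raise ValueError(f"unsupported timestamp: {value!r}")
    seconds = 0
    for part in parts:
        # Each colon shifts the running total up by a factor of 60.
        seconds = seconds * 60 + int(part)
    return seconds
```

With this helper, `--start 2:15 --end 2:45` corresponds to the range from 135 to 165 seconds.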

Auto-fps budgets (full-video mode):
- ≤30s → up to 30 frames
- 30s–1min → ~40 frames
- 1–3min → ~60 frames
- 3–10min → ~80 frames
- \>10min → 100 frames sparse (warning printed; consider `--start`/`--end`)
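
The tiers above amount to a simple lookup. The sketch below restates the table and is not the code in `watch.py`; `frame_budget` and `seconds_per_frame` are hypothetical names, and treating each upper bound as inclusive (so exactly 60 s falls in the 40-frame tier) is an assumption.

```python
# (upper bound in seconds, frame budget), copied from the table above.
TIERS = [(30, 30), (60, 40), (180, 60), (600, 80)]
SPARSE_BUDGET = 100  # anything longer than 10 minutes


def frame_budget(duration_s: float) -> int:
    """Return the full-video frame budget for a given duration."""
    for upper_s, frames in TIERS:
        if duration_s <= upper_s:  # assumption: bounds are inclusive
            return frames
    return SPARSE_BUDGET


def seconds_per_frame(duration_s: float) -> float:
    """Average gap between frames if the whole budget is used."""
    return duration_s / frame_budget(duration_s)
```

For a 60-minute video this gives `seconds_per_frame(3600) == 36.0`, roughly one frame every 36 seconds, which is why `--start`/`--end` is worth considering for long sources.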

## Step-by-step workflow

### 1. Run the pipeline

Default invocation, no flags:

```
python scripts/watch.py "<source>"
```

For long videos where the user asked about a specific moment, pass `--start`/`--end`:

```
python scripts/watch.py "<source>" --start 2:15 --end 2:45
```
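
When driving the script from Python rather than a shell, the invocations above could be wrapped roughly as follows. `build_command` and `run_pipeline` are hypothetical helpers; they use only the `--start` and `--end` flags documented here and assume the script is run from the skill's root directory.

```python
import subprocess
import sys


def build_command(source, start=None, end=None, script="scripts/watch.py"):
    """Assemble the argv list for one watch.py run."""
    cmd = [sys.executable, script, source]
    if start is not None:
        cmd += ["--start", start]
    if end is not None:
        cmd += ["--end", end]
    return cmd


def run_pipeline(source, start=None, end=None):
    """Run the pipeline and return its stdout (the markdown report)."""
    result = subprocess.run(
        build_command(source, start, end),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface stderr verbatim instead of retrying silently.
        raise RuntimeError(result.stderr)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.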

The script writes everything to a tmp working directory and prints a markdown report to stdout. Capture the stdout — it contains:
- Header (Title, Uploader, Duration, Transcript source: `captions` / `whisper (groq)` / `whisper (openai)` / `none available`)
- `## Frames` section with `` - `<absolute-path>` (t=MM:SS) `` lines
- `## Transcript` section with `[MM:SS] text...` lines
- Footer with `Work dir: <path>`
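
If the report needs to be handled programmatically, the layout above can be parsed with a few regular expressions. This is a sketch under assumptions: the line shapes are taken literally from the list above, the real report may decorate them differently (for example with bold markers), and `parse_report` is a hypothetical name.

```python
import re

# Patterns mirror the line shapes listed above; they are assumptions
# about the exact formatting, not taken from the script itself.
FRAME_RE = re.compile(r"^- `(?P<path>[^`]+)` \(t=(?P<ts>\d+:\d{2})\)$")
LINE_RE = re.compile(r"^\[(?P<ts>\d+:\d{2})\]\s*(?P<text>.*)$")
WORKDIR_RE = re.compile(r"^Work dir:\s*(?P<path>.+)$")


def to_seconds(ts: str) -> int:
    """Convert 'MM:SS' to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_report(report: str) -> dict:
    """Extract frame paths, transcript lines, and the work dir."""
    frames, transcript, work_dir = [], [], None
    for raw in report.splitlines():
        line = raw.strip()
        if (m := FRAME_RE.match(line)):
            frames.append((m["path"], to_seconds(m["ts"])))
        elif (m := LINE_RE.match(line)):
            transcript.append((to_seconds(m["ts"]), m["text"]))
        elif (m := WORKDIR_RE.match(line)):
            work_dir = m["path"].strip()
    return {"frames": frames, "transcript": transcript, "work_dir": work_dir}
```

The result keeps frames as `(path, seconds)` and transcript lines as `(seconds, text)`, which makes later timestamp matching straightforward.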

### 2. Read every frame

Read all the listed frame paths in a single message (parallel `Read` tool calls). The Read tool renders JPEGs as images. Each frame's filename + `t=MM:SS` from the report tells you when it occurred — pair each frame with the matching transcript line at that timestamp.

For very long videos (>10 min, sparse mode), the budget is already capped at 100 frames, so reading all of them is fine.
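
The frame-to-transcript pairing described in this step can be pictured as a nearest-preceding-line lookup. The sketch below is one possible way to do it, not something the pipeline provides; `align` is a hypothetical helper that assumes timestamps are already in seconds and the transcript is sorted by time.

```python
from bisect import bisect_right


def align(frames, transcript):
    """Pair each frame with the last transcript line at or before it.

    frames:     list of (path, seconds)
    transcript: list of (seconds, text), sorted by seconds
    Returns a list of (path, seconds, text_or_None).
    """
    starts = [t for t, _ in transcript]
    pairs = []
    for path, t in frames:
        # Index of the latest line whose start time is <= the frame time.
        i = bisect_right(starts, t) - 1
        text = transcript[i][1] if i >= 0 else None
        pairs.append((path, t, text))
    return pairs
```

Frames that occur before the first spoken line get `None`, which is a useful signal that the moment is purely visual.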

### 3. Write the summary

Output file: `<work-dir>/<slug>-notes.md` by default. If the user said "save notes somewhere permanent," ask where.

Structure the markdown like this:

```markdown
# <Title>

**Source:** <URL or local path>
**Duration:** <mm:ss>
**Uploader:** <if from YouTube>
**Transcript source:** <captions / whisper (groq) / whisper (openai) / none>

## One-line summary
<≤20 words — the core claim or hook of the video>

## TL;DR
<3–5 bullet points capturing the main arguments, moments, or beats>

## Timeline
- **[00:00]** <what's happening visually + the key line being said>
- **[00:15]** ...
<one row per meaningful beat, not per frame>

## Key quotes
> "<verbatim quote>" — [mm:ss]

## Visual notes
<what the video shows that the transcript alone would miss — setting, B-roll, on-screen text, graphics, transitions, subject's emotion>

## Takeaways (optional)
<Only include if the content is directly relevant to the user's domain or goals. Omit otherwise.>
```

### 4. Clean up

After the `.md` is written, delete the work dir (it contains the full downloaded video + frames, which is large). The script prints `Work dir: <path>` in the footer of its report; pass that path to `rm -rf`.

If the user specified a non-tmp `--out-dir`, ask before deleting.
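
A guarded alternative to a bare `rm -rf` is sketched below. It deletes only directories that sit under the system temp directory and refuses anything else, which matches the "ask before deleting" rule for a custom `--out-dir`. The helper names are hypothetical, and it assumes the auto-generated work dir lives under `tempfile.gettempdir()`.

```python
import shutil
import tempfile
from pathlib import Path


def is_inside_tmp(path) -> bool:
    """True when path is strictly below the system temp directory."""
    tmp_root = Path(tempfile.gettempdir()).resolve()
    target = Path(path).resolve()
    return target != tmp_root and tmp_root in target.parents


def cleanup(work_dir) -> bool:
    """Delete work_dir if it is a tmp dir; return whether it was removed."""
    if not is_inside_tmp(work_dir):
        # Not a tmp location: leave it alone and let the caller ask the user.
        return False
    shutil.rmtree(work_dir, ignore_errors=True)
    return True
```

A `False` return is the cue to confirm with the user before removing anything.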

## Common gotchas

- **YouTube Shorts / age-gated / members-only** — yt-dlp may fail. Surface its stderr verbatim; don't retry silently.
- **No captions + no Whisper key** — the report says `Transcript: none available` and points at `setup.py`. Tell the user they can add a Groq key to `~/.config/watch/.env` for Whisper, or use `--no-whisper` for frames-only.
- **Local file with no audio track** — Whisper extraction errors out cleanly. Use `--no-whisper` for frames-only.
- **Very long videos (>30 min)** — confirm with the user before running. The pipeline caps at 100 frames so the budget is bounded, but a sparse 100-frame scan of a 60-min video isn't very useful. Almost always better to run focused on the specific section.
- **Cloudflare 403 on Groq** — `whisper.py` already sets a custom User-Agent to clear Cloudflare's default-Python-UA block. If you ever see a 403, that's the failure mode.
- **No automatic Groq → OpenAI fallback** — the script picks one Whisper backend at the start of each run (Groq if its key is set, else OpenAI) and stays on it. If Groq rate-limits, the script retries Groq twice then errors out. To use OpenAI instead, pass `--whisper openai`.
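
The backend choice described in the last item can be summarized as a small decision function. This is a sketch of the stated rule (Groq if its key is set, else OpenAI, with `--whisper` forcing one), not the code in `whisper.py`; `pick_backend` is a hypothetical name, and returning `None` when a forced backend has no key is an assumption about behavior the document does not specify.

```python
KEY_NAMES = {"groq": "GROQ_API_KEY", "openai": "OPENAI_API_KEY"}


def pick_backend(env, forced=None):
    """Choose one Whisper backend for the whole run, or None.

    env:    mapping of environment variables
    forced: value of --whisper, if the user passed it
    """
    if forced is not None:
        if forced not in KEY_NAMES:
            raise ValueError(f"unknown backend: {forced!r}")
        # Assumption: a forced backend without a key yields no backend.
        return forced if env.get(KEY_NAMES[forced]) else None
    for name in ("groq", "openai"):  # Groq is preferred when both exist
        if env.get(KEY_NAMES[name]):
            return name
    return None
```

Because the choice is made once, a Groq rate limit mid-run does not switch to OpenAI; rerunning with `--whisper openai` maps to `forced="openai"` here.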

## What NOT to do

- Don't attempt to "watch" a video by inventing content based on title or thumbnail. If the pipeline fails, say so.
- Don't write the summary before actually reading frames. The transcript alone misses visual context (B-roll, on-screen text, emotion, graphics, transitions).
- Don't skip cleanup. Frame dumps + downloaded videos are large.
- Don't auto-fire on a video URL. Slash-command-only — see "When to invoke" above.

## Engine attribution

The pipeline scripts under `scripts/` are vendored from [bradautomates/claude-video](https://github.com/bradautomates/claude-video) (MIT). See [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md) for the full upstream license.