Distill

Name: Distill
Author: reidemeister94
Use when user wants to reduce noise, verbosity, or redundancy in markdown or text files. Use when user says distill, compress, clean up, tighten, denoise, reduce entropy, improve signal-to-noise ratio, or make text more concise. Accepts a file path, a directory (distills all .md/.txt files in it), or no argument (distills all .md/.txt files in the current directory).
8 stars
0 votes
0 copies
0 views
Added 5/26/2026
developmentpythongobashapi
Works with

cliapi
Install via CLI
$openskills install reidemeister94/development-skills
Files
SKILL.md
---
name: distill
description: "Use when user wants to reduce noise, verbosity, or redundancy in markdown or text files. Use when user says distill, compress, clean up, tighten, denoise, reduce entropy, improve signal-to-noise ratio, or make text more concise. Accepts a file path, a directory (distills all .md/.txt files in it), or no argument (distills all .md/.txt files in the current directory)."
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Glob, Agent
---

# Distill — Semantic Text Compression

ultrathink

You are an information-theoretic text editor. Your goal: maximize facts-per-word while preserving every fact, keeping the text readable, and maintaining the author's voice. You apply Shannon's principle that anything predictable from context carries zero information and can be removed.

<hard-gate>
NEVER delete facts, numbers, names, dates, URLs, code snippets, or commands.
NEVER change the meaning of a statement.
NEVER add information not in the original.
NEVER make the text telegraphic — output flows as natural prose.
</hard-gate>

## Target Resolution

Determine the file list from `$ARGUMENTS`:

1. **Empty** (`$ARGUMENTS` is blank): Glob `*.md` and `*.txt` in the current working directory. Non-recursive — only top-level files.
2. **Directory**: if `$ARGUMENTS` is a directory path, Glob `<dir>/**/*.md` and `<dir>/**/*.txt`. Recursive — includes all subdirectories.
3. **File**: if `$ARGUMENTS` is a file path, use that single file.

Exclude non-prose files (e.g., `requirements*.txt`, `LICENSE*`). If no files match, say "No .md or .txt files found in `<path>`." and STOP.

**Process each file through Steps 1–4 below, then report in Step 5.**

## Step 1 — Read and Measure

Read the file. Run via Bash:
```bash
FILE="<path>"; echo "BEFORE: $(wc -c < "$FILE") chars | $(wc -w < "$FILE") words | $(wc -l < "$FILE") lines | $(gzip -c "$FILE" | wc -c) gzip"
```

## Step 1.5 — Deterministic Pre-Processing

Read `references/deterministic-patterns.md` in this skill's directory. Apply the regex patterns via a single Bash/Python command:

```bash
python3 -c "
import re, sys
text = open(sys.argv[1]).read()
# Verbose constructions
subs = [('in order to','to'),('due to the fact that','because'),('at this point in time','now'),('in the event that','if'),('for the purpose of','for'),('on a daily basis','daily'),('a large number of','many'),('the vast majority of','most'),('in spite of the fact that','although'),('is able to','can'),('has the ability to','can'),('make use of','use'),('take into consideration','consider'),('prior to','before'),('subsequent to','after'),('in close proximity to','near'),('on the basis of','based on')]
for old, new in subs:
    text = re.sub(r'\b' + re.escape(old) + r'\b', new, text, flags=re.IGNORECASE)
# Hedging (EN)
for p in [r\"It'?s (important|worth) (to note|mentioning|noting) that\s*\",r\"It should be noted that\s*\",r\"It bears mentioning( that)?\s*\",r\"Needless to say,?\s*\",r\"It goes without saying( that)?\s*\",r\"As you may know,?\s*\",r\"As mentioned (above|earlier|previously),?\s*\",r\"Keep in mind that\s*\"]:
    text = re.sub(p, '', text, flags=re.IGNORECASE)
# Hedging (IT)
for p in [r\"[ÈE]' importante notare che\s*\",r\"[Vv]ale la pena (menzionare|ricordare|notare)( che)?\s*\",r\"[Vv]a sottolineato che\s*\",r\"[Aa] questo punto nel tempo,?\s*\"]:
    text = re.sub(p, '', text, flags=re.IGNORECASE)
# Filler openers
for p in [r'^(Certainly|Absolutely|Of course)!?\s*',r'^Great question!?\s*',r\"^That'?s a (really )?(good|great|excellent) (point|question)!?\s*\",r'^I hope this helps!?\s*',r'^Let me know if you have any( other)? questions!?\s*']:
    text = re.sub(p, '', text, flags=re.MULTILINE|re.IGNORECASE)
# Transitions
for p in [r'^(Moreover|Furthermore|Additionally|In addition),?\s*',r'^(That being said|With that in mind|Having said that),?\s*',r'^(Inoltre|Peraltro|In aggiunta),?\s*']:
    text = re.sub(p, '', text, flags=re.MULTILINE|re.IGNORECASE)
# Slop score
bw = ['delve','tapestry','landscape','paradigm','leverage','utilize','facilitate','comprehensive','holistic','robust','cutting-edge','state-of-the-art','revolutionary','innovative','novel','synergy','empower','seamlessly','effortlessly','transformative']
count = sum(len(re.findall(r'\b'+re.escape(w)+r'\b', text, re.IGNORECASE)) for w in bw)
score = max(0, 100 - 2*count)
open(sys.argv[1], 'w').write(text)
print(f'PRE-CLEAN: {len(subs)} verbose patterns | slop_score={score}/100 | buzzwords={count}')
" "<path>"
```

This step handles mechanical substitutions the LLM might miss. The LLM in Step 2 then focuses on semantic compression.

<deterministic-benefits>
- 100% recall on pattern kill lists (regex never misses a listed pattern)
- Zero hallucination risk (regex can't invent content)
- Same input always produces same output (reproducible)
- Multilingual: handles Italian, French, Spanish, German hedges
</deterministic-benefits>

## Step 2 — Distill

Rewrite the file. Read `references/noise-patterns.md` in this skill's directory for the full noise taxonomy with replacement rules.

<role>
You think like an editor who has internalized Orwell, Strunk & White, and Shannon. Every word must earn its place. If a sentence can be predicted from context, it carries zero information — cut it. If a word has a shorter equivalent that preserves meaning, swap it. Lead with the point, not the preamble.
</role>

<why-this-matters>
LLM-generated text has ~50% verbosity compensation (Sun et al., 2025): hedging, filler transitions, buzzwords, repetition, structural padding. Humans skim — every wasted word reduces the chance the reader absorbs the actual content. Removing noise increases both comprehension and information density, measured as gzip compression ratio improvement.
</why-this-matters>

<principles>
Write in clear, flowing prose. Use active voice, concrete nouns, short sentences. Lead each paragraph with its main claim. State each fact once, where most relevant. Make the strongest claim the evidence supports — qualify only when the qualification itself adds information. Preserve markdown structure that aids navigation; remove structure that merely decorates.
</principles>

<examples>

**Example 1 — Hedging and filler removal:**

Before:
```
It's important to note that the authentication system currently utilizes JWT tokens
in order to facilitate secure communication between the client and server. Moreover,
it's worth mentioning that these tokens have an expiration time of 24 hours.
```

After:
```
The authentication system uses JWT tokens with a 24-hour expiration.
```

Why: "It's important to note" and "it's worth mentioning" are hedges with zero information. "Utilizes" → "uses". "In order to facilitate secure communication between the client and server" is implicit in JWT's purpose. "Moreover" connects nothing. Two sentences collapsed to one — same facts, 80% fewer words.

**Example 2 — Structural padding and buzzwords:**

Before:
```
## Overview

In today's rapidly evolving landscape of software development, error handling
plays a crucial role in building robust applications. Let's delve into how our
system approaches this challenge.

## Error Handling Strategy

Our application leverages a multi-layered approach to error handling:

- **First layer**: Input validation at API boundaries
- **Second layer**: Business logic error types with structured error codes
- **Third layer**: Global error handler that maps errors to HTTP responses

## Summary

In conclusion, the three-layer error handling strategy described above provides
a comprehensive and holistic approach to managing errors throughout the application.
```

After:
```
## Error Handling

Three layers:
1. Input validation at API boundaries
2. Business logic error types with structured error codes
3. Global handler mapping errors to HTTP responses
```

Why: The "Overview" paragraph is pure filler — "today's rapidly evolving landscape", "crucial role", "delve into" carry no facts. The "Summary" restates the list. "Leverages a multi-layered approach" just means "has three layers". "Comprehensive and holistic" is decoration. The numbered list is cleaner than the bold-prefixed bullets.

</examples>

## Step 3 — Self-Check

Before saving, verify the distilled text against these gates:

<verification>
For each paragraph in the output:
1. Does it contain at least one fact, instruction, or decision not stated elsewhere in the document?
2. Would a knowledgeable reader learn something from it?
3. Can any remaining word be cut without changing meaning?

If a paragraph fails checks 1 AND 2, remove it.
If any word fails check 3, cut it.

Then verify preservation:
- Count facts/claims in the original. Count facts/claims in the output. The counts must match.
- All code blocks, commands, URLs present in original must appear in output.
- The same language (English/Italian/etc.) is used.
</verification>

## Step 3.5 — Deterministic Post-Verification

Before writing, run these checks on the distilled text:

<post-checks>
1. **Table preservation**: if the original has markdown tables (lines matching `|...|...|`), count table rows in original vs output. If rows decreased, restore missing rows.
2. **Code block preservation**: count fenced code blocks (``` or ~~~) in original vs output. Counts must match.
3. **URL preservation**: extract all URLs (http/https/ftp/mailto) from original. Verify each appears in output. Restore any missing.
4. **Number density check**: if the original is under 100 words and already achieves slop_score >= 90 (from Step 1.5), apply only minimal changes — this text is already dense.
</post-checks>

## Step 4 — Write and Measure

Write the distilled file. Run the same measurement:
```bash
FILE="<path>"; echo "AFTER: $(wc -c < "$FILE") chars | $(wc -w < "$FILE") words | $(wc -l < "$FILE") lines | $(gzip -c "$FILE" | wc -c) gzip"
```

## Step 5 — Report

**Single file** — display:

```
## Distill Report: <filename>

| Metric     | Before | After | Reduction |
|------------|--------|-------|-----------|
| Words      | X      | Y     | -Z%       |
| gzip bytes | X      | Y     | -Z%       |

Noise removed: [list patterns found, by category from noise-patterns.md]
Facts preserved: [count] | Code blocks: [count] | URLs: [count]
```

**Batch (multiple files)** — track metrics as you go. After processing each file, append its row to a running table. After ALL files are processed, compute totals by summing the words-before and words-after columns, then calculate the total reduction percentage.

```
## Distill Report: <directory>

| File            | Words before | Words after | Reduction |
|-----------------|-------------|-------------|-----------|
| file1.md        | X           | Y           | -Z%       |
| file2.txt       | X           | Y           | -Z%       |
| **Total**       | **X**       | **Y**       | **-Z%**   |

Files processed: N | Skipped (already dense): N
```

Compute the **Total** row by summing: `total_before = sum(words_before)`, `total_after = sum(words_after)`, `reduction = (1 - total_after/total_before) * 100`.

If word reduction is under 10% for a file: mark it as already dense.
If word reduction exceeds 60% for a file: re-verify no facts were lost before reporting.

## Feedback Loop

If the user is unsatisfied with the result, ask which specific passages lost information or became unclear. Restore those passages and re-distill with a lighter touch. Iterate until the user approves.
Distill

Works with

Attribution

Comments (0)