Scanned 6/6/2026
Install via CLI
openskills install frank-luongt/faos-skills-marketplace
<!-- AUTO-GENERATED by export-skills.py — DO NOT EDIT -->
---
name: skill-judge
description: Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files and skill packages. Provides multi-dimensional scoring (120 points across 8 dimensions) with actionable improvement suggestions.
tags: [skills, quality, evaluation, meta]
---
# Skill Judge
Evaluate Agent Skill quality using an 8-dimension scoring framework. A Skill is a knowledge externalization mechanism, not a tutorial.
## When to Use
- Reviewing a new SKILL.md before merging
- Auditing existing skills for quality improvement
- Comparing skill alternatives
- Training skill authors on best practices
## Core Formula
```
Good Skill = Expert-only Knowledge - What the Model Already Knows
```
### Knowledge Types
| Type | Description | Action |
|---|---|---|
| **Expert Knowledge** | Domain expertise that took years to learn | KEEP -- this is the skill's value |
| **Activation Knowledge** | Trigger words, when-to-use context | KEEP -- helps model select the skill |
| **Redundant Knowledge** | Things the model already knows (language syntax, common patterns) | REMOVE -- wastes context budget |
## 8 Evaluation Dimensions (120 points)
### D1: Knowledge Delta (20 pts) -- THE core dimension
Does the skill add genuine expert knowledge the model doesn't already have?
| Score | Criteria |
|---|---|
| 18-20 | Expert practitioners would recognize non-obvious insights |
| 13-17 | Useful domain knowledge, some obvious content |
| 7-12 | Mostly available in training data |
| 0-6 | Entirely redundant with model knowledge |
**Red flags**: Generic programming tutorials, language syntax explanations, well-documented API wrappers.
**Green flags**: "I learned this the hard way", non-obvious failure modes, domain-specific decision frameworks.
### D2: Mindset + Procedures (15 pts)
Does the skill install the right thinking patterns and domain-specific procedures?
| Score | Criteria |
|---|---|
| 13-15 | Domain expert mindset with specific procedures |
| 9-12 | Some mindset guidance, generic procedures |
| 0-8 | No mindset, or procedures that are obvious |
### D3: Anti-Pattern Quality (15 pts)
Are anti-patterns specific with concrete consequences?
| Score | Criteria |
|---|---|
| 13-15 | Specific NEVER rules with reasons and alternatives |
| 9-12 | Some anti-patterns but vague consequences |
| 0-8 | Generic warnings or no anti-patterns |
**Good**: "NEVER use `SELECT *` in production queries -- it rules out index-only scans and breaks callers when columns are added or reordered. List the columns you need."
**Bad**: "Avoid writing bad queries"
### D4: Specification Compliance (15 pts)
Is the frontmatter valid and the description field effective?
The `description` field is THE most critical field -- it determines when the skill gets loaded.
| Check | Points |
|---|---|
| Valid YAML frontmatter with required fields | 5 |
| Description answers WHAT, WHEN, KEYWORDS | 5 |
| Appropriate tags and domain | 5 |
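The first check in the table can be roughed out in a few lines. This is a minimal sketch, assuming `name` and `description` are the required fields (the table does not list them) and that the frontmatter is a flat block between `---` lines. `read_frontmatter` and `missing_fields` are hypothetical helper names, and the parser is deliberately naive rather than a full YAML implementation.

```python
import re

# Assumption: these are the required fields; adjust to the spec you follow.
REQUIRED_FIELDS = ("name", "description")


def read_frontmatter(text: str) -> dict[str, str]:
    """Extract flat `key: value` pairs from a leading '---' block.

    Handles only top-level scalar lines. Use a real YAML parser
    for nested or multi-line values.
    """
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, re.DOTALL)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        # Skip indented continuation lines and comments.
        if line.startswith((" ", "\t", "#")):
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(text: str) -> list[str]:
    """Return the required fields that are absent or empty."""
    fields = read_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

An empty result from `missing_fields` only covers the structural part of the check. Whether the description answers WHAT, WHEN, and KEYWORDS still needs a human or model judgment.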
### D5: Progressive Disclosure (15 pts)
Does the skill load efficiently without wasting context?
| Layer | Purpose |
|---|---|
| Metadata (frontmatter) | Skill selection -- always loaded |
| SKILL.md body | Core knowledge -- loaded on activation |
| references/ directory | Deep dives -- loaded on demand |
| Score | Criteria |
|---|---|
| 13-15 | Clear three-layer structure, explicit loading triggers |
| 9-12 | Two layers, some loading guidance |
| 0-8 | Everything in one file, no progressive disclosure |
### D6: Freedom Calibration (15 pts)
Does the skill match specificity to task fragility?
| Task Type | Freedom Level | Example |
|---|---|---|
| Creative (UI design, naming) | High -- guidelines, not rules | "Prefer semantic naming" |
| Standard (CRUD, REST APIs) | Medium -- patterns with flexibility | "Use repository pattern" |
| Fragile (security, compliance, data migration) | Low -- strict procedures | "MUST validate input before..." |
### D7: Pattern Recognition (10 pts)
Does the skill follow a recognized pattern?
| Pattern | Typical Size | Use Case |
|---|---|---|
| Mindset | ~50 lines | Install thinking approach |
| Navigation | ~30 lines | Guide to external resources |
| Philosophy | ~150 lines | Design principles and trade-offs |
| Process | ~200 lines | Step-by-step workflows |
| Tool | ~300 lines | Comprehensive tool usage |
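The typical sizes above can serve as a sanity check once a skill's pattern is known. The sketch below is hypothetical: the function name and the 2x tolerance are assumptions rather than part of the framework, and size alone does not identify a pattern.

```python
# Typical sizes from the pattern table, in lines.
TYPICAL_LINES = {
    "Mindset": 50,
    "Navigation": 30,
    "Philosophy": 150,
    "Process": 200,
    "Tool": 300,
}


def is_oversized(pattern: str, line_count: int, tolerance: float = 2.0) -> bool:
    """True when a skill exceeds `tolerance` times its pattern's typical size.

    The 2.0 default is an illustrative assumption, not a framework rule.
    """
    return line_count > TYPICAL_LINES[pattern] * tolerance
```

For example, `is_oversized("Mindset", 120)` flags a Mindset skill that has grown to more than twice its typical 50 lines.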
### D8: Practical Usability (15 pts)
Is the skill immediately actionable?
| Score | Criteria |
|---|---|
| 13-15 | Decision trees, working code, error handling, edge cases |
| 9-12 | Some examples, missing edge cases |
| 0-8 | Abstract guidance, no concrete examples |
## Evaluation Protocol
### Step 1: Knowledge Delta Scan
Read the entire skill and mark each section:
- **E** (Expert) -- Non-obvious, hard-won knowledge
- **A** (Activation) -- Trigger/context information
- **R** (Redundant) -- Model already knows this
If more than 50% of the content is marked R, the skill needs significant rework.
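The marking step can be turned into percentages with a small helper. This sketch assumes each section is weighted by its line count, which the protocol does not specify. `delta_breakdown` and `needs_rework` are hypothetical names.

```python
from collections import Counter

LABELS = ("E", "A", "R")


def delta_breakdown(sections: list[tuple[str, int]]) -> dict[str, float]:
    """Convert (label, line_count) pairs into percentage shares per label.

    Assumption: shares are weighted by line count.
    """
    totals = Counter()
    for label, lines in sections:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        totals[label] += lines
    whole = sum(totals.values())
    if whole == 0:
        raise ValueError("no content to analyze")
    return {label: 100 * totals[label] / whole for label in LABELS}


def needs_rework(breakdown: dict[str, float]) -> bool:
    """Apply the rule from the text: more than 50% redundant content."""
    return breakdown["R"] > 50
```

The resulting shares map directly onto the Knowledge Delta Analysis lines of the report.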
### Step 2: Structure Analysis
Check that the skill follows progressive disclosure (metadata, SKILL.md body, references/) and has clearly delimited sections.
### Step 3: Score Each Dimension
Score each dimension (D1-D8) and justify each score.
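A short sketch of totalling the scores, using the per-dimension maximums from the headings above. `total_score` is a hypothetical helper, and the strict validation is a design choice, not something the framework mandates.

```python
# Maximum points per dimension, taken from the D1-D8 headings.
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}


def total_score(scores: dict[str, int]) -> int:
    """Validate one score per dimension and return the total out of 120."""
    mismatch = scores.keys() ^ MAX_POINTS.keys()
    if mismatch:
        raise ValueError(f"missing or unknown dimensions: {sorted(mismatch)}")
    for dim, maximum in MAX_POINTS.items():
        if not 0 <= scores[dim] <= maximum:
            raise ValueError(f"{dim} must be between 0 and {maximum}")
    return sum(scores.values())
```

Rejecting out-of-range values early keeps a typo such as 11 points for D7 from silently inflating the grade.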
### Step 4: Calculate Grade
| Grade | Score | Meaning |
|---|---|---|
| A | 96-120 (80%+) | Excellent -- merge as-is |
| B | 78-95 (65-79%) | Good -- minor improvements |
| C | 60-77 (50-64%) | Acceptable -- needs work |
| D | 42-59 (35-49%) | Below standard -- major rework |
| F | 0-41 (<35%) | Reject -- fundamental issues |
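The grade bands can be expressed as a lookup over their lower bounds. A minimal sketch, with `grade` as a hypothetical function name:

```python
# Lower bound of each grade band, highest first (from the table above).
GRADE_FLOORS = [(96, "A"), (78, "B"), (60, "C"), (42, "D"), (0, "F")]


def grade(total: int) -> str:
    """Map a total score out of 120 to its letter grade."""
    if not 0 <= total <= 120:
        raise ValueError("total must be between 0 and 120")
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    raise AssertionError("unreachable: the F band starts at 0")
```

Checking the floors from highest to lowest means each band needs only its lower bound, so the table and the code cannot disagree about where one grade ends and the next begins.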
### Step 5: Generate Report
```markdown
# Skill Evaluation: {skill-name}
## Score: {total}/120 (Grade: {grade})
| Dimension | Score | Max | Notes |
|---|---|---|---|
| D1: Knowledge Delta | X | 20 | ... |
| D2: Mindset + Procedures | X | 15 | ... |
| D3: Anti-Pattern Quality | X | 15 | ... |
| D4: Spec Compliance | X | 15 | ... |
| D5: Progressive Disclosure | X | 15 | ... |
| D6: Freedom Calibration | X | 15 | ... |
| D7: Pattern Recognition | X | 10 | ... |
| D8: Practical Usability | X | 15 | ... |
## Top Strengths
1. ...
2. ...
## Priority Improvements
1. ...
2. ...
## Knowledge Delta Analysis
- Expert (E): X% of content
- Activation (A): X% of content
- Redundant (R): X% of content
```
## Common Failure Patterns
| Pattern | Description | Fix |
|---|---|---|
| **The Tutorial** | Teaches what the model already knows | Remove redundant sections, keep only expert insights |
| **The Dump** | Entire API reference pasted in | Extract key patterns, move details to references/ |
| **The Orphan References** | references/ directory exists but SKILL.md never says when to load its files | Add explicit loading triggers |
| **The Checkbox Procedure** | Step 1, Step 2... with no expert judgment | Add decision points and trade-off guidance |
| **The Vague Warning** | "Be careful with X" | Specify what goes wrong and how to prevent it |
| **The Invisible Skill** | Poor description, never gets selected | Rewrite description with WHAT, WHEN, KEYWORDS |
| **The Wrong Location** | Expert knowledge buried in references/ | Move critical knowledge to SKILL.md body |
| **The Over-Engineered** | 1000+ lines for a simple concept | Simplify to match complexity of the domain |
| **The Freedom Mismatch** | Strict rules for creative tasks or loose guidance for fragile operations | Calibrate freedom to task fragility |
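Of the patterns above, The Orphan References is the easiest to detect mechanically. The sketch below assumes a skill directory with a `SKILL.md` and an optional `references/` folder, and it treats a filename mention in `SKILL.md` as a rough proxy for a loading trigger. `orphan_references` is a hypothetical helper.

```python
from pathlib import Path


def orphan_references(skill_dir: Path) -> list[str]:
    """List files in references/ whose names never appear in SKILL.md.

    Assumption: mentioning the filename counts as a loading trigger.
    A mention without guidance on when to load still passes this check.
    """
    body = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    refs_dir = skill_dir / "references"
    if not refs_dir.is_dir():
        return []
    return sorted(
        path.name
        for path in refs_dir.iterdir()
        if path.is_file() and path.name not in body
    )
```

An empty list means every reference file is at least named in the body. It does not confirm that the trigger text is useful.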
## The Meta-Question
> "Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"
If the answer is no, the skill needs more expert knowledge and less tutorial content.
## References
- Based on [softaworks/agent-toolkit skill-judge](https://github.com/softaworks/agent-toolkit/tree/main/skills/skill-judge) (MIT License)
- Derived from analysis of 17+ official Agent Skill examples
<!-- Source: .faos/custom/skills/tools/skill-judge/SKILL.md -->