Skill Judge

Name: Skill Judge
Author: frank-luongt
Evaluate Agent Skill quality using an 8-dimension scoring framework. A Skill is a knowledge externalization mechanism, not a tutorial.
Added 6/6/2026
Install via CLI
$openskills install frank-luongt/faos-skills-marketplace
Files
SKILL.md
<!-- AUTO-GENERATED by export-skills.py — DO NOT EDIT -->
---
name: skill-judge
description: Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files and skill packages. Provides multi-dimensional scoring (120 points across 8 dimensions) with actionable improvement suggestions.
tags: [skills, quality, evaluation, meta]
---

# Skill Judge

Evaluate Agent Skill quality using an 8-dimension scoring framework. A Skill is a knowledge externalization mechanism, not a tutorial.

## When to Use

- Reviewing a new SKILL.md before merging
- Auditing existing skills for quality improvement
- Comparing skill alternatives
- Training skill authors on best practices

## Core Formula

```
Good Skill = Expert-only Knowledge - What the Model Already Knows
```

### Knowledge Types

| Type | Description | Action |
|---|---|---|
| **Expert Knowledge** | Domain expertise that took years to learn | KEEP -- this is the skill's value |
| **Activation Knowledge** | Trigger words, when-to-use context | KEEP -- helps model select the skill |
| **Redundant Knowledge** | Things the model already knows (language syntax, common patterns) | REMOVE -- wastes context budget |

## 8 Evaluation Dimensions (120 points)

### D1: Knowledge Delta (20 pts) -- THE core dimension

Does the skill add genuine expert knowledge the model doesn't already have?

| Score | Criteria |
|---|---|
| 18-20 | Expert practitioners would recognize non-obvious insights |
| 13-17 | Useful domain knowledge, some obvious content |
| 7-12 | Mostly available in training data |
| 0-6 | Entirely redundant with model knowledge |

**Red flags**: Generic programming tutorials, language syntax explanations, well-documented API wrappers.
**Green flags**: "I learned this the hard way", non-obvious failure modes, domain-specific decision frameworks.

### D2: Mindset + Procedures (15 pts)

Does the skill install the right thinking patterns and domain-specific procedures?

| Score | Criteria |
|---|---|
| 13-15 | Domain expert mindset with specific procedures |
| 9-12 | Some mindset guidance, generic procedures |
| 0-8 | No mindset, or procedures that are obvious |

### D3: Anti-Pattern Quality (15 pts)

Are anti-patterns specific with concrete consequences?

| Score | Criteria |
|---|---|
| 13-15 | Specific NEVER rules with reasons and alternatives |
| 9-12 | Some anti-patterns but vague consequences |
| 0-8 | Generic warnings or no anti-patterns |

**Good**: "NEVER use `SELECT *` in production queries -- it fetches unneeded columns, defeats covering indexes, and breaks callers when the schema changes; list the columns you need"
**Bad**: "Avoid writing bad queries"

### D4: Specification Compliance (15 pts)

Is the frontmatter valid and the description field effective?

The `description` field is THE most critical field -- it determines when the skill gets loaded.

| Check | Points |
|---|---|
| Valid YAML frontmatter with required fields | 5 |
| Description states WHAT the skill does and WHEN to use it, and includes trigger KEYWORDS | 5 |
| Appropriate tags and domain | 5 |
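
The first check in the table above can be partly automated. The following is a minimal sketch, not a full YAML parser: it reads only flat `key: value` pairs, and the `REQUIRED_FIELDS` tuple is an assumption, since this document does not list which fields are required. The function names are hypothetical.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Return top-level `key: value` pairs from the leading `---` block.

    Deliberately naive: flat scalar fields only, no nested YAML.
    """
    lines = text.splitlines()
    start = 0
    # Skip blank lines and single-line HTML comments (e.g. an auto-generated banner).
    while start < len(lines) and (
        not lines[start].strip() or lines[start].strip().startswith("<!--")
    ):
        start += 1
    if start == len(lines) or lines[start].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[start + 1:]:
        if line.strip() == "---":
            return fields  # closing delimiter found
        key, sep, value = line.partition(":")
        if sep and key.strip() and not key.startswith((" ", "\t", "#")):
            fields[key.strip()] = value.strip()
    return {}  # block never closed: treat as invalid


# Assumption: the required fields are not listed in this document.
REQUIRED_FIELDS = ("name", "description")


def frontmatter_points(text: str) -> int:
    """Score the first D4 check only: 5 if every required field is non-empty."""
    fields = parse_frontmatter(text)
    return 5 if all(fields.get(name) for name in REQUIRED_FIELDS) else 0
```

The other two checks (description quality, tag fit) still need a human or model judgment; this sketch only covers the mechanical part.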

### D5: Progressive Disclosure (15 pts)

Does the skill load efficiently without wasting context?

| Layer | Purpose |
|---|---|
| Metadata (frontmatter) | Skill selection -- always loaded |
| SKILL.md body | Core knowledge -- loaded on activation |
| `references/` directory | Deep dives -- loaded on demand |

| Score | Criteria |
|---|---|
| 13-15 | Clear three-layer structure, explicit loading triggers |
| 9-12 | Two layers, some loading guidance |
| 0-8 | Everything in one file, no progressive disclosure |
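
One mechanical signal for "explicit loading triggers" is whether SKILL.md mentions each reference file at all. The sketch below assumes a layout of `SKILL.md` next to a lowercase `references/` directory and treats a literal `references/<filename>` mention as sufficient; both are assumptions, and `orphan_references` is a hypothetical helper name.

```python
from pathlib import Path


def orphan_references(skill_dir: Path) -> list[str]:
    """List files under references/ that SKILL.md never mentions by path."""
    body = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    refs_dir = skill_dir / "references"
    if not refs_dir.is_dir():
        return []  # no third layer, so nothing can be orphaned
    return sorted(
        path.name
        for path in refs_dir.iterdir()
        if path.is_file() and f"references/{path.name}" not in body
    )
```

A mention is necessary but not sufficient: the reviewer still has to confirm that the surrounding text says when the file should be loaded.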

### D6: Freedom Calibration (15 pts)

Does the skill match specificity to task fragility?

| Task Type | Freedom Level | Example |
|---|---|---|
| Creative (UI design, naming) | High -- guidelines, not rules | "Prefer semantic naming" |
| Standard (CRUD, REST APIs) | Medium -- patterns with flexibility | "Use repository pattern" |
| Fragile (security, compliance, data migration) | Low -- strict procedures | "MUST validate input before..." |

### D7: Pattern Recognition (10 pts)

Does the skill follow a recognized pattern?

| Pattern | Typical Size | Use Case |
|---|---|---|
| Mindset | ~50 lines | Install thinking approach |
| Navigation | ~30 lines | Guide to external resources |
| Philosophy | ~150 lines | Design principles and trade-offs |
| Process | ~200 lines | Step-by-step workflows |
| Tool | ~300 lines | Comprehensive tool usage |

### D8: Practical Usability (15 pts)

Is the skill immediately actionable?

| Score | Criteria |
|---|---|
| 13-15 | Decision trees, working code, error handling, edge cases |
| 9-12 | Some examples, missing edge cases |
| 0-8 | Abstract guidance, no concrete examples |

## Evaluation Protocol

### Step 1: Knowledge Delta Scan

Read the entire skill and mark each section:
- **E** (Expert) -- Non-obvious, hard-won knowledge
- **A** (Activation) -- Trigger/context information
- **R** (Redundant) -- Model already knows this

If more than 50% of the content is marked R, the skill needs significant rework.
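
The tally can be sketched as follows. It assumes each section counts equally; the text does not say whether the percentage is by section or by line count, so weighting by length would be an equally valid reading. The function names are hypothetical.

```python
from collections import Counter

VALID_LABELS = ("E", "A", "R")


def knowledge_delta(labels: list[str]) -> dict[str, float]:
    """Return the percentage of sections carrying each label."""
    if not labels:
        raise ValueError("no sections were labelled")
    unknown = set(labels) - set(VALID_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {label: 100 * counts[label] / len(labels) for label in VALID_LABELS}


def needs_rework(labels: list[str]) -> bool:
    """Apply the rework rule: strictly more than 50% Redundant."""
    return knowledge_delta(labels)["R"] > 50
```

The same percentages feed the "Knowledge Delta Analysis" part of the final report.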

### Step 2: Structure Analysis

Check that the skill follows the three-layer progressive disclosure structure (see D5) and is divided into clearly labelled sections.

### Step 3: Score Each Dimension

Score each dimension (D1-D8) and record a justification for every score.
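
When scores are recorded programmatically, it helps to validate them against the per-dimension maxima before summing. A minimal sketch, with the maxima copied from the dimension headings above and `total_score` as a hypothetical helper name:

```python
# Maximum points per dimension, as stated in the D1-D8 headings.
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}


def total_score(scores: dict[str, int]) -> int:
    """Validate one score per dimension and return the total out of 120."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("expected exactly one score for each of D1-D8")
    for dim, value in scores.items():
        if not 0 <= value <= MAX_POINTS[dim]:
            raise ValueError(f"{dim} must be between 0 and {MAX_POINTS[dim]}")
    return sum(scores.values())
```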

### Step 4: Calculate Grade

| Grade | Score | Meaning |
|---|---|---|
| A | 96-120 (80%+) | Excellent -- merge as-is |
| B | 78-95 (65-79%) | Good -- minor improvements |
| C | 60-77 (50-64%) | Acceptable -- needs work |
| D | 42-59 (35-49%) | Below standard -- major rework |
| F | 0-41 (<35%) | Reject -- fundamental issues |
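
The grade bands above translate directly into a lookup. This sketch uses the lower bound of each band from the table; `grade` is a hypothetical function name.

```python
def grade(total: int) -> str:
    """Map a total out of 120 to a letter grade using the table's lower bounds."""
    if not 0 <= total <= 120:
        raise ValueError("total must be between 0 and 120")
    for letter, floor in (("A", 96), ("B", 78), ("C", 60), ("D", 42)):
        if total >= floor:
            return letter
    return "F"
```

Checking from the highest band down means each threshold only needs its lower bound, which keeps the table and the code easy to compare.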

### Step 5: Generate Report

```markdown
# Skill Evaluation: {skill-name}

## Score: {total}/120 (Grade: {grade})

| Dimension | Score | Max | Notes |
|---|---|---|---|
| D1: Knowledge Delta | X | 20 | ... |
| D2: Mindset + Procedures | X | 15 | ... |
| D3: Anti-Pattern Quality | X | 15 | ... |
| D4: Spec Compliance | X | 15 | ... |
| D5: Progressive Disclosure | X | 15 | ... |
| D6: Freedom Calibration | X | 15 | ... |
| D7: Pattern Recognition | X | 10 | ... |
| D8: Practical Usability | X | 15 | ... |

## Top Strengths
1. ...
2. ...

## Priority Improvements
1. ...
2. ...

## Knowledge Delta Analysis
- Expert (E): X% of content
- Activation (A): X% of content
- Redundant (R): X% of content
```
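
The header and score table of the template can be filled in mechanically. The sketch below covers only that part and leaves strengths, improvements, and the knowledge delta analysis to the reviewer. The dimension labels and maxima are copied from the template; `render_score_section` and its parameters are hypothetical.

```python
# (label, max points) pairs in the order used by the report template.
DIMENSIONS = [
    ("D1: Knowledge Delta", 20),
    ("D2: Mindset + Procedures", 15),
    ("D3: Anti-Pattern Quality", 15),
    ("D4: Spec Compliance", 15),
    ("D5: Progressive Disclosure", 15),
    ("D6: Freedom Calibration", 15),
    ("D7: Pattern Recognition", 10),
    ("D8: Practical Usability", 15),
]


def render_score_section(
    skill_name: str, grade: str, scores: list[int], notes: list[str]
) -> str:
    """Render the report title, score line, and per-dimension table as Markdown."""
    if len(scores) != len(DIMENSIONS) or len(notes) != len(DIMENSIONS):
        raise ValueError("expected one score and one note per dimension")
    lines = [
        f"# Skill Evaluation: {skill_name}",
        "",
        f"## Score: {sum(scores)}/120 (Grade: {grade})",
        "",
        "| Dimension | Score | Max | Notes |",
        "|---|---|---|---|",
    ]
    for (label, maximum), score, note in zip(DIMENSIONS, scores, notes):
        lines.append(f"| {label} | {score} | {maximum} | {note} |")
    return "\n".join(lines)
```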

## Common Failure Patterns

| Pattern | Description | Fix |
|---|---|---|
| **The Tutorial** | Teaches what the model already knows | Remove redundant sections, keep only expert insights |
| **The Dump** | Entire API reference pasted in | Extract key patterns, move details to references/ |
| **The Orphan References** | `references/` directory exists but SKILL.md never says when to load its files | Add explicit loading triggers |
| **The Checkbox Procedure** | Step 1, Step 2... with no expert judgment | Add decision points and trade-off guidance |
| **The Vague Warning** | "Be careful with X" | Specify what goes wrong and how to prevent it |
| **The Invisible Skill** | Poor description, never gets selected | Rewrite description with WHAT, WHEN, KEYWORDS |
| **The Wrong Location** | Expert knowledge buried in references/ | Move critical knowledge to SKILL.md body |
| **The Over-Engineered** | 1000+ lines for a simple concept | Simplify to match complexity of the domain |
| **The Freedom Mismatch** | Strict rules for creative tasks or loose guidance for fragile operations | Calibrate freedom to task fragility |

## The Meta-Question

> "Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"

If the answer is no, the skill needs more expert knowledge and less tutorial content.

## References

- Based on [softaworks/agent-toolkit skill-judge](https://github.com/softaworks/agent-toolkit/tree/main/skills/skill-judge) (MIT License)
- Derived from analysis of 17+ official Agent Skill examples

<!-- Source: .faos/custom/skills/tools/skill-judge/SKILL.md -->