Claude Code's 17 Official Skills: the Rule vs the Practice

Anthropic's docs are clear: a SKILL.md should be under 500 lines. Their own docx skill is 590. I read all 17 of Anthropic's official skills, frontmatter by frontmatter, while building my own marketplace alongside. Every time, the same gap: the docs say one thing, the corpus does another. And the corpus is the one that's right.

Here's what actually comes out when you read the code instead of the docs.

The description is 90% of the skill

A skill is useless if it doesn't trigger at the right moment. And what decides triggering isn't the body of the SKILL.md, it's its description in the frontmatter. Claude reads the list of available skills (just name + description) and picks. So all of the "when to use it" must live in the description, not the body.
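To make that selection step concrete, here is a minimal sketch of a metadata-only listing. It assumes a simple frontmatter (a block between two `---` lines holding single-line `key: value` pairs); the `changelog-writer` skill and the helper names are hypothetical, and this is not how Claude Code itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class SkillMetadata:
    name: str
    description: str


def read_metadata(skill_md: str) -> SkillMetadata:
    """Extract name and description from the frontmatter of a SKILL.md string.

    Assumption: frontmatter is delimited by '---' lines and holds
    single-line 'key: value' pairs. A real YAML parser handles more.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("no frontmatter found")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    else:
        raise ValueError("frontmatter not closed")
    return SkillMetadata(fields["name"], fields["description"])


def skill_listing(skills: list[str]) -> str:
    """Build the listing used for selection: name + description only."""
    metas = [read_metadata(s) for s in skills]
    return "\n".join(f"- {m.name}: {m.description}" for m in metas)


# Hypothetical skill, for illustration only.
EXAMPLE = """---
name: changelog-writer
description: Use this skill whenever the user asks for release notes or a changelog, even if they only mention a version tag.
---
# Body
When to use: this sentence is invisible at selection time.
"""

print(skill_listing([EXAMPLE]))
```

Anything written under "When to use" in the body never reaches the listing, which is why the trigger conditions belong in the description.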

The counter-intuitive part, which Anthropic repeats in its own skill-creator: Claude under-triggers skills far more than it over-triggers them. It only consults a skill for a task it can't handle trivially. A one-step request like "Read this PDF" may not fire the PDF skill at all, even with a perfect description. Hence an explicit instruction: write descriptions that are a little pushy.

You see it in the code: the xlsx skill says "Use this skill any time a spreadsheet file is the primary input or output", pptx says "any time a .pptx is involved in any way". No shyness. You push triggering, you don't hold it back.

The 500-line rule nobody follows

"Keep the SKILL.md under 500 lines" is one of the most quoted rules. In practice, docx is 590 lines, and the docs themselves add "feel free to go longer if needed". The others range from 32 to 404. The truth: 500 is a target, not a law.

The real principle behind it isn't length, it's the context budget. The SKILL.md body loads on every trigger. If it's big because it holds the essentials, fine, go over. If it's big because it drags along detail that could live elsewhere, that's the problem. Size alone isn't the sin.
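A small check can treat 500 as the soft target it is. This is a sketch under assumptions: it counts only the body (everything after the closing `---` of the frontmatter), which is my reading of the rule rather than something the docs spell out, and the function names are hypothetical.

```python
SOFT_TARGET = 500  # the documented target, treated here as a soft limit


def body_line_count(skill_md: str) -> int:
    """Count the lines that follow the closing '---' of the frontmatter.

    If there is no frontmatter, every line counts as body.
    """
    lines = skill_md.splitlines()
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return len(lines) - (i + 1)
    return len(lines)


def budget_report(skill_md: str, soft_target: int = SOFT_TARGET) -> str:
    """Report against the target without failing: going over is allowed."""
    n = body_line_count(skill_md)
    if n <= soft_target:
        return f"{n} lines: within target"
    return (
        f"{n} lines: over target by {n - soft_target}; "
        "check what could move to scripts/ or references/"
    )
```

The report prompts a review instead of raising an error, since a long body that holds only essentials is acceptable.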

Progressive disclosure: mostly scripts, rarely references

The central skill pattern is staged loading: metadata (name + description) always in context, the body loaded on trigger, and resources (scripts/, references/) only on demand. A script can even execute without being loaded into context.
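The three stages can be sketched as a small class. This is an illustrative model only, not Claude Code's loader: `LazySkill`, its methods and the `loaded` log are hypothetical names, and the directory layout (`SKILL.md`, `references/`, `scripts/`) follows the one described above.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class LazySkill:
    root: Path
    name: str
    description: str
    # Hypothetical log of every file whose text entered the context.
    loaded: list[str] = field(default_factory=list)

    def metadata(self) -> str:
        """Stage 1: always in context, costs one line per skill."""
        return f"{self.name}: {self.description}"

    def body(self) -> str:
        """Stage 2: read the SKILL.md only when the skill triggers."""
        self.loaded.append("SKILL.md")
        return (self.root / "SKILL.md").read_text(encoding="utf-8")

    def reference(self, filename: str) -> str:
        """Stage 3: read a reference file only when it is asked for."""
        self.loaded.append(f"references/{filename}")
        return (self.root / "references" / filename).read_text(encoding="utf-8")

    def run_script(self, filename: str, *args: str) -> str:
        """Stage 3: run a script; only its output comes back, not its source."""
        result = subprocess.run(
            [sys.executable, str(self.root / "scripts" / filename), *args],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
```

Calling `run_script` leaves `loaded` untouched, which is the point: the script's source never has to be read for its result to be used.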

Except that in the corpus, 11 skills out of 17 are a single SKILL.md, with no subfolder at all. Splitting isn't the norm, it's the exception you reach for when the domain warrants it. And when you do split, it's mostly into scripts/: pdf, docx, xlsx and pptx are first and foremost toolboxes of scripts for deterministic operations. references/ (docs loaded on demand) appears in only two skills. The lesson: pull the deterministic stuff into executable code long before you pull prose into a side file.
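If you want to run the same count on a local checkout of a skills folder, a sketch like this one does it. It assumes one directory per skill with a `SKILL.md` at its top level; the category labels and function names are mine, not an official taxonomy.

```python
from collections import Counter
from pathlib import Path


def classify_skill(skill_dir: Path) -> str:
    """Label a skill by which of the two resource folders it contains."""
    has_scripts = (skill_dir / "scripts").is_dir()
    has_refs = (skill_dir / "references").is_dir()
    if has_scripts and has_refs:
        return "scripts+references"
    if has_scripts:
        return "scripts"
    if has_refs:
        return "references"
    return "no scripts/ or references/"


def survey(skills_root: Path) -> Counter:
    """Count layouts across every directory that holds a SKILL.md."""
    return Counter(
        classify_skill(d)
        for d in sorted(skills_root.iterdir())
        if (d / "SKILL.md").is_file()
    )
```

Usage: `survey(Path("path/to/skills"))` returns a `Counter` keyed by layout, which you can compare against the 11-of-17 figure.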

The "NOT for" Anthropic never writes

Writing my own skills, I systematically added a "NOT for X" clause to the description, to avoid false triggers. Reading the official ones, surprise: they almost never do it. Their descriptions list inclusions ("this includes…"), not exclusions.

The reason is simple: their domains don't overlap. The PDF skill and the Excel skill can't be confused, so no exclusion is needed, and they'd rather maximize triggering. My case is different: my skills touch each other (reviewing a diff, shipping a feature, wrapping up a branch are adjacent). The "NOT for" lets me disambiguate between my own skills. So it isn't a universal rule, it's a response to overlap. If your skills don't compete, don't add it: you'd only reduce triggering, and under-triggering is the worst flaw of all.

How they actually optimize a description

The most instructive part of the official skill-creator is that they don't guess a good description, they measure it. The protocol, unrolled:

- 20 eval queries: 8-10 that should trigger, 8-10 that shouldn't. Realistic and messy, the way a real user would type (file paths, personal context, typos, lowercase).
- The negatives are near-misses, not obvious ones. "Write a fibonacci function" as a negative for a PDF skill tests nothing. Good negatives share keywords with the skill but need something else.
- Each query is run 3 times for a reliable trigger rate, then an improvement loop runs for 5 iterations.
- You pick the description by its score on a held-out test set (60% train, 40% test), so you don't overfit the queries you already know.
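The protocol above can be sketched in a few functions. The numbers (3 runs, 5 iterations, 60/40 split) come from the steps listed; everything else is an assumption made for illustration: `triggers` stands in for whatever call decides whether a description fires on a query, `improve` stands in for the rewrite step, and the scoring formula is my own choice, not the skill-creator's actual code.

```python
import random
from typing import Callable

Query = tuple[str, bool]  # (query text, should the skill trigger?)
TriggerFn = Callable[[str, str], bool]  # hypothetical: (description, query) -> fired?
ImproveFn = Callable[[str, list[Query], float], str]  # hypothetical rewrite step


def split_queries(queries: list[Query], train_fraction: float = 0.6,
                  seed: int = 0) -> tuple[list[Query], list[Query]]:
    """Shuffle, then cut into a train set and a held-out test set."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def trigger_rate(description: str, query: str, triggers: TriggerFn,
                 runs: int = 3) -> float:
    """Run the same query several times to smooth out nondeterminism."""
    return sum(triggers(description, query) for _ in range(runs)) / runs


def score(description: str, queries: list[Query], triggers: TriggerFn,
          runs: int = 3) -> float:
    """Assumed metric: trigger rate on positives, 1 - rate on negatives, averaged."""
    total = 0.0
    for text, should_trigger in queries:
        rate = trigger_rate(description, text, triggers, runs)
        total += rate if should_trigger else 1.0 - rate
    return total / len(queries)


def optimize(initial: str, train: list[Query], test: list[Query],
             triggers: TriggerFn, improve: ImproveFn,
             iterations: int = 5) -> str:
    """Rewrite against the train set, then select on the held-out test set."""
    candidates = [initial]
    current = initial
    for _ in range(iterations):
        current = improve(current, train, score(current, train, triggers))
        candidates.append(current)
    return max(candidates, key=lambda d: score(d, test, triggers))
```

The design choice that matters is the last line: candidates are rewritten using train-set feedback but selected by test-set score, so a description that merely memorizes the known queries doesn't win.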

It's an eval pipeline, not guesswork. Most people write a description by feel and move on. Anthropic treats it as the product.

At a glance: the assumption vs the practice

The gap, row by row, across the 17 official skills:

| The common assumption | What the official corpus does |
| --- | --- |
| A SKILL.md is under 500 lines | `docx` is 590, and the docs add "go longer if needed" |
| The description just says when to use the skill | It must be pushy: the real risk is under-triggering |
| You split into `references/` and `scripts/` (progressive disclosure) | 11 of 17 skills are a single file; when you split, mostly `scripts/` |
| You restrict the description ("NOT for") to avoid false triggers | Almost never written: distinct domains, they maximize triggering |
| A good description is written by feel | It's measured: 20 queries, each run 3 times, a 5-iteration loop, scored on a held-out test set |

What to remember

The real lesson isn't in the list of rules, it's in the gap. The docs give clean heuristics; the corpus shows how competent people bend them when the situation demands it. Reading the 17 skills teaches more than reading the doc page, because you see the real trade-offs: going past 500 lines when the content deserves it, skipping the split two-thirds of the time, pushing triggering rather than restricting it.

If you keep one thing: the risk isn't that your skill triggers too much, it's that it doesn't trigger at all. Write the description to be found.

🧩 My skills, installable

I built a marketplace of Claude Code skills from my real workflow, applying these patterns. Browse and install them on the Skills page, including a skill-builder skill that condenses these principles. To give an agent its standing context, see also the CLAUDE.md contexts.

Frequently asked questions

How many lines should a SKILL.md be?

Aim for under 500, but it's not a law: the official docx skill is 590. What matters is that the body doesn't drag detail that should live in a script or a reference file. Go over if the essential content warrants it, then add pointers to side files.

Why doesn't my skill trigger?

Almost always the description. Claude under-triggers by default and only consults a skill for a non-trivial task. Make the description more inclusive and "pushy" ("use this skill whenever… even if the user doesn't say…"), with real trigger phrases. And test with realistic tasks: a one-step task may not trigger a skill at all, and that's normal.

Do I need references/ and scripts/ folders?

Not by default: 11 of the 17 official skills are a single SKILL.md. You split when the domain is heavy, and then mostly into scripts/ for deterministic work (file manipulation, validation). references/ only earns its place for large multi-variant domains.

Skill, CLAUDE.md or hook?

Skill = a capability triggered by its description when the situation calls for it. CLAUDE.md = context always loaded for a specific project. Hook = deterministic automation on an event, enforced by the harness, not the model. If it's "always do X", it's a hook or a CLAUDE.md, not a skill.
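The rule of thumb reduces to two questions. A minimal sketch, with hypothetical parameter names, of how the choice could be written down:

```python
def choose_mechanism(always_applies: bool, needs_enforcement: bool) -> str:
    """Map the two questions from the answer above to a mechanism.

    always_applies: should this happen every time, not only when a
        situation calls for it?
    needs_enforcement: must it run deterministically on an event,
        enforced by the harness instead of left to the model?
    """
    if not always_applies:
        return "skill"
    if needs_enforcement:
        return "hook"
    return "CLAUDE.md"
```

For example, "format every file after an edit" is `choose_mechanism(True, True)`, while "this project uses tabs" is `choose_mechanism(True, False)`.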