← Writing
Best practices to make Claude Code skills token-efficient
AI10 min read

Best practices to make Claude Code skills token-efficient

Field-tested rules for cutting context-window tokens from your Claude Code skills and plugins, each backed by real before/after numbers.

ShareXLinkedInFacebook

Every AI coding agent has the same hidden tax, and it's the context window. Anthropic's guide to effective context engineering calls context "a critical yet limited resource," and the research is blunt about what happens when you ignore it: as the window fills up, the model gets worse. They call it context rot. So token optimization isn't just a cost thing. Every token your plugin injects is a token the model doesn't get to spend on the actual problem your user is trying to solve.

I got a close look at this recently. I maintain the Postman plugin for Claude Code, which is pure instructional Markdown: commands, skills, and agents that teach Claude the whole Postman API lifecycle. There's no runtime to profile, no binary to shrink. Its entire footprint is context-window tokens, and every user pays that bill in every session. I ran an optimization pass on it, and the rules that came out of that work are the ones I'd hand to anyone building a Claude Code skill or plugin. Each one below comes with the number that earned it a spot.

First, find out where your tokens actually go

You can't optimize what you can't see. Before changing anything, run /context and look at what your plugin is costing. A Markdown plugin spends tokens in three places, and they're not equally expensive.

The first is always-on cost. Every skill, command, and agent description in your YAML front matter gets injected into the system prompt of every session, whether or not the user touches your plugin that day. This is the single most expensive token you own. Everyone pays it, every time.

The second is per-trigger cost. When Claude decides a skill is relevant, the whole SKILL.md body loads into context. A 19 KB skill runs you about 4,800 tokens every time it fires, even if the user only needed a third of what's in there. The Claude Code skills docs spell out this layered loading model: metadata is cheap and always present, the body is the part that costs you on each trigger.

The third is runtime cost: tool output, polling loops, and the model narrating its way through a long task. Tool schemas pile up here too. A discussion on the MCP spec repo clocked roughly 1,000 tokens for a single complex tool definition.

Knowing which bucket a cost lives in tells you how much leverage fixing it has. A description fix helps every user forever. A skill-body fix helps only the sessions that trigger it. Spend your time accordingly.

Treat front-matter descriptions as your most expensive real estate

This is the highest-leverage change you can make, because it's the one every user pays for. A lot of my command descriptions had grown into trigger-phrase phrasebooks:

description: Run Postman collection tests using Postman CLI - use when
  user says "run tests", "run collection", "run my postman tests",
  "verify changes", "check if tests pass", or wants to execute API
  test suites after code changes

Claude's router doesn't need you to list every way a person might phrase a request. It needs to know what the thing does and when to reach for it:

description: Run Postman Collection tests with the Postman CLI and
  report failures. Use after code changes or when the user asks to
  run API tests.

Rewriting these across the plugin shrank the always-on description block by 20%. That sounds small until you remember it's 20% off a cost that every user pays in every session, which makes those the highest-value bytes in the whole repo. Anthropic's Agent Skills announcement lands on the same advice: discovery metadata works best as one or two tight sentences. Write the capability, write when to use it, stop.

Keep SKILL.md to the workflow, defer the rest to references/

The biggest per-trigger saving came from progressive disclosure. Anthropic's engineering team describes this pattern as the backbone of the Agent Skills standard: metadata loads at startup, the SKILL.md body loads when the skill is relevant, and bundled reference files load only when a specific step actually needs them.

Your skill does not need to front-load every rule it might ever apply. It needs the workflow plus a pointer to the detailed rules, and Claude reads those at the step that calls for them:

## Step 4: Generate the client code
 
Before writing any code, read `references/code-generation.md` in this
skill's directory. It contains the full rule catalog for idiomatic
client generation.

I applied this split to the plugin's two largest skills:

SkillBeforeAfterSaving per trigger
postman-context~4,760 tokens~1,930 tokens~2,800 tokens (60%)
generate-spec~2,640 tokens~1,800 tokens~840 tokens (32%)

I didn't delete a single rule or template. They're all still there, just deferred instead of removed. Someone asking "find me an email API" no longer pays 2,800 tokens for code-generation rules they're never going to use. Someone who does generate code pays exactly what they did before, just at the moment they need it. If your largest SKILL.md is over 10 KB, this is where your easy win is.

Don't rebuild what the harness already does

The plugin used to ship a postman-routing skill, about 835 tokens, whose trigger was "use when user mentions APIs." Think about how broad that is. It fired in nearly any backend session, Postman-related or not. And its body was a routing table that restated what every command's description already told Claude.

Modern Claude Code routes natively by matching user intent against your component descriptions. So that skill was pure duplicate state: tokens spent to tell Claude something it already knew, plus a second copy of the routing logic to keep in sync every time I added a command. I deleted it. That's 835 tokens back in every session where it used to fire, and given the trigger, it fired a lot.

The general rule: before you write a skill, check whether the harness already does the job. Routing by description is built in. If you find yourself maintaining a table that mirrors your own front matter, that's the tell.

Scope allowed-tools to what each component actually calls

Every MCP-backed command and the plugin's readiness-analyzer agent used to declare a wildcard, which handed them all 100+ tools on the Postman MCP Server. I changed each one to list exactly the tools it uses. The readiness analyzer dropped from 111 tools to 11. The setup command now declares six.

For subagents and clients that resolve tool schemas from the allowlist, that's an order of magnitude fewer schemas loaded into context. It also keeps the model from wandering off into unrelated tools mid-command, which is the same ambiguity problem Anthropic warns about when tool sets overlap. Clear tool boundaries are a big part of whether an agent can actually use your surface well, which is the whole thing I measure over at axrank.ai.

There was a bonus I didn't expect. The audit turned up three latent permission bugs, commands told to write or edit files without the permission to do it. Wildcards hide that class of bug completely. Explicit lists put it right in front of you in review. Least privilege paid off twice.

Make async workflows report only the final result

Some Postman MCP Server tools return an HTTP 202 and make you poll for completion. Left alone, the model will cheerfully narrate every single poll, and all that chatter piles up in context. The fix is two lines of instruction in the affected commands: poll with increasing waits (2s, then 4s, then 8s), and report only the final outcome. Fewer round-trips, far less narration, same result. Any async workflow you ship needs this, or the model turns a background job into a running commentary.

Proof it works: one real task, measured

Numbers in a table are easy to argue with, so I ran the same task against a live GitHub repo using both the old and new versions of the plugin. The job was a genuinely agentic one: scan a workspace for API spec drift, validate the contract, and open a pull request with any fixes.

look at my workspace. Identify all API spec drift. Ensure the API contract is valid.
Open a PR in https://github.com/buildwithtalia/enterprise-resource-planning
with any code fixes.

Both runs are real, and the raw /cost output is where every number below comes from. Here's the run on the old plugin:

Total cost:               $3.60
Total duration (API):     15m 52s
Total duration (wall):    17m 10s
Total code changes:       1159 lines added, 13 lines removed
Usage by model:
  claude-haiku-4-5:       104.3k input, 22.8k output, 0 cache read, 0 cache write ($0.2181)
  claude-sonnet-4-6:      1.8k input, 60.5k output, 4.5m cache read, 298.7k cache write ($3.38)

And the same task on the optimized plugin:

Total cost:               $2.64
Total duration (API):     7m 47s
Total duration (wall):    9m 12s
Total code changes:       119 lines added, 12 lines removed
Usage by model:
  claude-haiku-4-5:       482 input, 18 output, 0 cache read, 0 cache write ($0.0006)
  claude-sonnet-4-6:      395 input, 31.4k output, 5.5m cache read, 137.3k cache write ($2.64)

That second claude-sonnet-4-6 line is where the headline numbers live: the 5.5m cache reads, the 137.3k cache writes, and the 31.4k output tokens. Here's how the two runs stack up side by side.

MetricPre-optimizationPost-optimizationChange
Session cost$3.60$2.6427% cheaper
Wall time17m 10s9m 12s~46% faster
Actual new tokens (Sonnet)~361k~169k~53% fewer
Cache efficiency (read:write)~15:1~40:12.7x better

The 5.5 million cache reads in the optimized run might look alarming until you remember cache reads cost about 10% of input tokens. That 40:1 read-to-write ratio means the context window stayed stable across turns, so almost nothing had to be re-cached. A well-cached session is supposed to look exactly like that.

The detail I keep coming back to is in the code-changes column, which isn't even in the table. The old run made 1,159 line changes. The new one made 119. Same task, same repo. The optimized plugin didn't just spend fewer tokens, it spent them more precisely and produced a tighter, more surgical diff. Less noise in the context, more signal in the output. I didn't predict that one, and it's the result that convinced me this work was worth more than the dollar figure.

Where to start tomorrow

None of this took clever engineering. It took measuring where the tokens went and being honest about which ones earned their place. If you maintain a Claude Code skill or plugin, here's the short version:

  1. Measure first. Run /context and find out what your plugin costs before you change anything.
  2. Cut your descriptions to one or two sentences. They load every session for every user, so they're the most expensive real estate you own. Capability plus when-to-use, nothing more.
  3. Move bulk out of SKILL.md into references/. Keep the body to the workflow, around 6 KB or less, and let Claude pull the rule catalogs and templates on demand.
  4. Delete anything that duplicates the harness. Native routing already exists. Don't pay for it twice.
  5. Scope allowed-tools and add backoff to async work. Fewer schemas, less narration, and you'll probably find a permission bug while you're in there.

The same context-engineering discipline Anthropic recommends for agents applies one level down, to the tooling we hand them. If you want to see the patterns in place, clone the plugin repo, then go check what your own skills cost per trigger. If your biggest SKILL.md is over 10 KB, try the references/ split and compare your /context before and after. That's the whole game: look at the number, then earn it back.

ShareXLinkedInFacebook