Tools

Your AI tools cost more than you think. Here's how to cut LLM spend by 65%.

April 11, 2026 | 9 min read

Terminal showing AI token usage metrics and cost optimization dashboard

Output tokens cost 3-5x more than input tokens across every major LLM provider. A 5-engineer team spends $500-$3,000/month on AI coding tools, and 30-60% of those output tokens are filler: pleasantries, hedging, restated questions, articles. Token compression tools like Caveman (open-source) cut that waste by 65% with no workflow change.

Five engineers. Claude Code, Cursor, and Copilot in daily rotation. $500-$3,000 per month in AI token costs. That number grows every quarter as adoption increases and nobody tracks the spend.

Most of that budget goes to output tokens, the text your AI tools generate. Output tokens cost 3-5x more than input tokens across every provider. And 30-60% of those output tokens are filler. Pleasantries, hedging, preambles, restated questions, connector words. Tokens that carry zero information but bill at full rate.

Strip the filler, and you cut 65% of output token volume without changing your tools, your workflow, or your answers.

Why output tokens eat your budget

LLM providers charge more for output tokens because generation requires more compute than comprehension. Reading your prompt is cheap. Writing the response is expensive.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Output/input ratio
Claude Sonnet	$3	$15	5x
Claude Opus	$15	$75	5x
Claude Haiku	$0.25	$1.25	5x
GPT-4o	$2.50	$10	4x
GPT-4o mini	$0.15	$0.60	4x

That 3-5x multiplier means a verbose response costs far more than the prompt that triggered it. A model that opens with "I'd be happy to help you with that! Let me walk you through this step by step" bills 16 words at output rates before saying anything useful.

Run the numbers on a single interaction. Your engineer asks Claude Code to fix a failing test. The useful answer: 150 tokens of code plus a 30-token explanation. The model delivers 600 tokens because it restates the problem, walks through its reasoning, adds caveats, and closes with a courtesy line. At Sonnet pricing ($15/M output), those 420 filler tokens cost $0.0063 per interaction. Small.

Multiply by 200 interactions per day across 5 engineers. Filler costs $6.30 per day. $138 per month. $1,656 per year. On text nobody reads.

Where the filler hides

Break a typical AI coding response into parts. Four categories of waste show up in every session:

Pleasantries and preambles (5-15% of output). "Certainly! Let me help you with that." "Great question!" The engineer skips past these tokens. They're social filler in a non-social interaction.
Hedging language (5-10% of output). "It might be worth considering..." "You could try..." "This may or may not work depending on..." Hedging pads the response without adding information.
Repeated context (10-20% of output). The model restates your question, narrates what it plans to do, then summarizes what it did. You asked once. You got the answer sandwiched between a recap and a summary.
Filler words (10-15% of output). Articles (the, a), qualifiers (basically, simply, really, just, actually). In technical explanations, removing these changes nothing about the meaning.

Total: 30-60% of output tokens in a typical coding session carry no information.

One interaction, before and after compression:

Default response (47 tokens)

"Sure, I'd be happy to help! The issue is that your useEffect hook is missing the dependency array. You should add [userId] as a dependency so that the effect re-runs when the user ID changes. Here's the updated code:"

Compressed response (13 tokens)

"useEffect missing dependency array. Add [userId] for re-run on user change. Updated:"

Same fix. Same accuracy. 73% fewer tokens.

Token compression in practice

Token compression instructs the model to drop filler while keeping technical content intact. Caveman, an open-source MIT-licensed tool, does this through a skill/plugin system compatible with Claude Code, Cursor, Copilot, Windsurf, and 40+ other AI coding environments.

The rules are explicit:

Drop: articles, filler words, pleasantries, hedging language.
Keep: technical terms (exact), code blocks (unchanged), error messages (quoted exactly).

Three intensity levels let you dial the compression to your comfort:

Lite mode

Professional tone. Full sentences. Drops filler and hedging, keeps articles and grammar. Good for pair programming where you want concise but readable explanations.

Full mode (default)

Fragments allowed. Articles dropped. Short synonyms. For engineers who know the codebase and want answers, not tutorials.

Ultra mode

Telegraphic. Abbreviations (DB, auth, config). Arrows for causality. For senior engineers who want raw signal with zero padding.

Setup takes under a minute:

claude plugin add JuliusBrussee/caveman

Type "/caveman" in any session. The mode stays active until you disable it.

Benchmark results

Caveman's benchmarks across 10 real coding tasks show a 65% average token reduction. The range: 22% on tasks with heavy code output (code passes through unchanged) to 87% on explanation-heavy tasks (where prose carries the most waste).

Task type	Default tokens	Compressed tokens	Reduction
Code generation	800	624	22%
Bug fix with explanation	600	210	65%
Architecture explanation	1,200	360	70%
Debugging walkthrough	900	270	70%
Concept explanation	1,500	195	87%

A March 2026 research paper found that brevity constraints on language models improved accuracy by 26 percentage points on certain benchmarks. Less filler meant more focus. Compressed responses scored higher, not lower.

What stays untouched

Compression has built-in safety exceptions. Caveman auto-suspends for:

Security warnings. Full verbosity when the model flags a vulnerability or dangerous operation.
Irreversible actions. "This will delete your production database" gets spelled out in full.
Multi-step sequences. When fragment order could cause misunderstanding, full sentences return.
Code output. Code blocks pass through unchanged. No abbreviated variable names. No dropped syntax.
Commit messages and PRs. Written artifacts use normal prose.

These guardrails target the routine interactions where verbosity wastes money, without compromising the interactions where precision prevents disasters.

Four more ways to cut LLM costs

Token compression is one optimization. Stack it with these four:

Prompt caching

Anthropic offers a 90% discount on cached input tokens. If your team sends the same system prompt or codebase context across sessions (you do), caching drops that portion of your bill by 90%. One configuration change. Minutes to set up.

Model routing

Use cheaper models for simple tasks. A code formatting question doesn't need Opus at $75/M output tokens. Sonnet handles it at $15/M. Route complex architecture decisions to the expensive model, routine completions to the efficient one. 80% savings on routed tasks.

Structured prompts

Vague prompts generate long responses because the model hedges across multiple interpretations. "Fix the auth bug" produces 800 tokens of exploration. "The JWT validation in auth.ts:42 returns 403 for expired tokens; it should return 401" produces 150 tokens of fix. Specific input, shorter output. Less cost.

Batch processing

Anthropic's Batch API gives 50% off for requests that don't need real-time responses. Background code reviews, documentation generation, test suite creation. Queue them and save half.

Optimization	Typical savings	Setup effort
Token compression	65% output reduction	1 minute
Prompt caching	90% input reduction (repeated context)	30 minutes
Model routing	80% on routed tasks	1-2 hours
Structured prompts	40-60% fewer output tokens	Team habit (ongoing)
Batch API	50% on queued requests	1-2 hours

Stack compression (65% output savings) with caching (90% on repeated input) and routing (80% on simple tasks). A $2,000/month AI bill drops to $400-$600.

Measure before you optimize

Track your team's token spend for one week before changing anything. Most AI coding tools ship with usage dashboards:

Claude Code: Anthropic Console shows daily token counts by model.
Cursor: Settings > Usage shows monthly consumption.
GitHub Copilot: Organization admins see per-seat usage in billing.

Multiply your output tokens by the per-token rate. That's your baseline. Enable compression for one week and compare. Teams doing heavy code generation see 22-40% savings. Teams using AI for explanations, debugging, and pair programming see 60-87%.

LLM costs are the new cloud bill

Ten years ago, teams treated AWS bills as an afterthought. Usage grew unchecked. The monthly number hit five figures. Someone in finance started asking questions. Cloud cost optimization became a discipline: reserved instances, right-sizing, FinOps teams.

LLM costs follow the same curve. Teams adopt AI tools. Engineers find them productive. Usage grows. Six months later the "AI tools" line item has doubled, and nobody made a conscious decision to spend more.

The difference: LLM cost optimization takes less effort than cloud optimization. You don't need a FinOps team. Token compression takes one minute to set up. Prompt caching takes an afternoon. Model routing takes a day. The savings start the same week.

At Savi, our engineers run Claude Code and Cursor on every project. Token optimization is part of the workflow. When a client pays for AI-accelerated development, we make sure every API call delivers value, not filler.

Frequently asked questions

How much do AI coding tools cost per month?

A team of 5 engineers using Claude Code, Cursor, and Copilot spends $500-$3,000 per month on AI tokens. Output tokens (model responses) cost 3-5x more than input tokens (your prompts). Heavy API users can exceed these ranges.

What is token compression?

Token compression instructs the AI model to drop filler words, pleasantries, hedging, and restated context while keeping technical terms, code blocks, and error messages intact. Open-source tools like Caveman achieve 65% average token reduction across coding tasks.

Does compression affect code quality?

No. Code blocks pass through unchanged. Error messages stay quoted exactly. A 2026 research paper found brevity constraints improved model accuracy by 26 percentage points. The model spends fewer tokens on filler and more on the answer.

How do I reduce Claude Code costs?

Four ways: enable token compression (65% output savings), turn on prompt caching (90% off repeated input context), route simple tasks to cheaper models like Sonnet instead of Opus, and use the Batch API (50% off) for non-real-time tasks like code reviews and documentation.

Stop overpaying for AI tokens

Our engineers use AI tools daily and optimize every API call. 30-minute call. No commitment.

Book a free consultation