Tools
Your AI tools cost more than you think. Here's how to cut LLM spend by 65%.
Output tokens cost 3-5x more than input tokens across every major LLM provider. A 5-engineer team spends $500-$3,000/month on AI coding tools, and 30-60% of those output tokens are filler: pleasantries, hedging, restated questions, articles. Token compression tools like Caveman (open-source) cut that waste by 65% with no workflow change.
Five engineers. Claude Code, Cursor, and Copilot in daily rotation. $500-$3,000 per month in AI token costs. That number grows every quarter as adoption increases and nobody tracks the spend.
Most of that budget goes to output tokens, the text your AI tools generate. Output tokens cost 3-5x more than input tokens across every provider. And 30-60% of those output tokens are filler. Pleasantries, hedging, preambles, restated questions, connector words. Tokens that carry zero information but bill at full rate.
Strip the filler, and you cut 65% of output token volume without changing your tools, your workflow, or your answers.
Why output tokens eat your budget
LLM providers charge more for output tokens because generation requires more compute than comprehension. Reading your prompt is cheap. Writing the response is expensive.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/input ratio |
|---|---|---|---|
| Claude Sonnet | $3 | $15 | 5x |
| Claude Opus | $15 | $75 | 5x |
| Claude Haiku | $0.25 | $1.25 | 5x |
| GPT-4o | $2.50 | $10 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
That 3-5x multiplier means a verbose response costs far more than the prompt that triggered it. A model that opens with "I'd be happy to help you with that! Let me walk you through this step by step" bills 16 words at output rates before saying anything useful.
Run the numbers on a single interaction. Your engineer asks Claude Code to fix a failing test. The useful answer: 150 tokens of code plus a 30-token explanation. The model delivers 600 tokens because it restates the problem, walks through its reasoning, adds caveats, and closes with a courtesy line. At Sonnet pricing ($15/M output), those 420 filler tokens cost $0.0063 per interaction. Small.
Multiply by 200 interactions per day across 5 engineers. Filler costs $6.30 per day. $138 per month. $1,656 per year. On text nobody reads.
Where the filler hides
Break a typical AI coding response into parts. Four categories of waste show up in every session:
- Pleasantries and preambles (5-15% of output). "Certainly! Let me help you with that." "Great question!" The engineer skips past these tokens. They're social filler in a non-social interaction.
- Hedging language (5-10% of output). "It might be worth considering..." "You could try..." "This may or may not work depending on..." Hedging pads the response without adding information.
- Repeated context (10-20% of output). The model restates your question, narrates what it plans to do, then summarizes what it did. You asked once. You got the answer sandwiched between a recap and a summary.
- Filler words (10-15% of output). Articles (the, a), qualifiers (basically, simply, really, just, actually). In technical explanations, removing these changes nothing about the meaning.
Total: 30-60% of output tokens in a typical coding session carry no information.
One interaction, before and after compression:
Default response (47 tokens)
"Sure, I'd be happy to help! The issue is that your useEffect hook is missing the dependency array. You should add [userId] as a dependency so that the effect re-runs when the user ID changes. Here's the updated code:"
Compressed response (13 tokens)
"useEffect missing dependency array. Add [userId] for re-run on user change. Updated:"
Same fix. Same accuracy. 73% fewer tokens.
Token compression in practice
Token compression instructs the model to drop filler while keeping technical content intact. Caveman, an open-source MIT-licensed tool, does this through a skill/plugin system compatible with Claude Code, Cursor, Copilot, Windsurf, and 40+ other AI coding environments.
The rules are explicit:
- Drop: articles, filler words, pleasantries, hedging language.
- Keep: technical terms (exact), code blocks (unchanged), error messages (quoted exactly).
Three intensity levels let you dial the compression to your comfort:
Lite mode
Professional tone. Full sentences. Drops filler and hedging, keeps articles and grammar. Good for pair programming where you want concise but readable explanations.
Full mode (default)
Fragments allowed. Articles dropped. Short synonyms. For engineers who know the codebase and want answers, not tutorials.
Ultra mode
Telegraphic. Abbreviations (DB, auth, config). Arrows for causality. For senior engineers who want raw signal with zero padding.
Setup takes under a minute:
Type "/caveman" in any session. The mode stays active until you disable it.
Benchmark results
Caveman's benchmarks across 10 real coding tasks show a 65% average token reduction. The range: 22% on tasks with heavy code output (code passes through unchanged) to 87% on explanation-heavy tasks (where prose carries the most waste).
| Task type | Default tokens | Compressed tokens | Reduction |
|---|---|---|---|
| Code generation | 800 | 624 | 22% |
| Bug fix with explanation | 600 | 210 | 65% |
| Architecture explanation | 1,200 | 360 | 70% |
| Debugging walkthrough | 900 | 270 | 70% |
| Concept explanation | 1,500 | 195 | 87% |
A March 2026 research paper found that brevity constraints on language models improved accuracy by 26 percentage points on certain benchmarks. Less filler meant more focus. Compressed responses scored higher, not lower.
What stays untouched
Compression has built-in safety exceptions. Caveman auto-suspends for:
- Security warnings. Full verbosity when the model flags a vulnerability or dangerous operation.
- Irreversible actions. "This will delete your production database" gets spelled out in full.
- Multi-step sequences. When fragment order could cause misunderstanding, full sentences return.
- Code output. Code blocks pass through unchanged. No abbreviated variable names. No dropped syntax.
- Commit messages and PRs. Written artifacts use normal prose.
These guardrails target the routine interactions where verbosity wastes money, without compromising the interactions where precision prevents disasters.
Four more ways to cut LLM costs
Token compression is one optimization. Stack it with these four:
Prompt caching
Anthropic offers a 90% discount on cached input tokens. If your team sends the same system prompt or codebase context across sessions (you do), caching drops that portion of your bill by 90%. One configuration change. Minutes to set up.
Model routing
Use cheaper models for simple tasks. A code formatting question doesn't need Opus at $75/M output tokens. Sonnet handles it at $15/M. Route complex architecture decisions to the expensive model, routine completions to the efficient one. 80% savings on routed tasks.
Structured prompts
Vague prompts generate long responses because the model hedges across multiple interpretations. "Fix the auth bug" produces 800 tokens of exploration. "The JWT validation in auth.ts:42 returns 403 for expired tokens; it should return 401" produces 150 tokens of fix. Specific input, shorter output. Less cost.
Batch processing
Anthropic's Batch API gives 50% off for requests that don't need real-time responses. Background code reviews, documentation generation, test suite creation. Queue them and save half.
| Optimization | Typical savings | Setup effort |
|---|---|---|
| Token compression | 65% output reduction | 1 minute |
| Prompt caching | 90% input reduction (repeated context) | 30 minutes |
| Model routing | 80% on routed tasks | 1-2 hours |
| Structured prompts | 40-60% fewer output tokens | Team habit (ongoing) |
| Batch API | 50% on queued requests | 1-2 hours |
Stack compression (65% output savings) with caching (90% on repeated input) and routing (80% on simple tasks). A $2,000/month AI bill drops to $400-$600.
Measure before you optimize
Track your team's token spend for one week before changing anything. Most AI coding tools ship with usage dashboards:
- Claude Code: Anthropic Console shows daily token counts by model.
- Cursor: Settings > Usage shows monthly consumption.
- GitHub Copilot: Organization admins see per-seat usage in billing.
Multiply your output tokens by the per-token rate. That's your baseline. Enable compression for one week and compare. Teams doing heavy code generation see 22-40% savings. Teams using AI for explanations, debugging, and pair programming see 60-87%.
LLM costs are the new cloud bill
Ten years ago, teams treated AWS bills as an afterthought. Usage grew unchecked. The monthly number hit five figures. Someone in finance started asking questions. Cloud cost optimization became a discipline: reserved instances, right-sizing, FinOps teams.
LLM costs follow the same curve. Teams adopt AI tools. Engineers find them productive. Usage grows. Six months later the "AI tools" line item has doubled, and nobody made a conscious decision to spend more.
The difference: LLM cost optimization takes less effort than cloud optimization. You don't need a FinOps team. Token compression takes one minute to set up. Prompt caching takes an afternoon. Model routing takes a day. The savings start the same week.
At Savi, our engineers run Claude Code and Cursor on every project. Token optimization is part of the workflow. When a client pays for AI-accelerated development, we make sure every API call delivers value, not filler.
Frequently asked questions
How much do AI coding tools cost per month?
A team of 5 engineers using Claude Code, Cursor, and Copilot spends $500-$3,000 per month on AI tokens. Output tokens (model responses) cost 3-5x more than input tokens (your prompts). Heavy API users can exceed these ranges.
What is token compression?
Token compression instructs the AI model to drop filler words, pleasantries, hedging, and restated context while keeping technical terms, code blocks, and error messages intact. Open-source tools like Caveman achieve 65% average token reduction across coding tasks.
Does compression affect code quality?
No. Code blocks pass through unchanged. Error messages stay quoted exactly. A 2026 research paper found brevity constraints improved model accuracy by 26 percentage points. The model spends fewer tokens on filler and more on the answer.
How do I reduce Claude Code costs?
Four ways: enable token compression (65% output savings), turn on prompt caching (90% off repeated input context), route simple tasks to cheaper models like Sonnet instead of Opus, and use the Batch API (50% off) for non-real-time tasks like code reviews and documentation.
Related reading
AI coding assistants: what they can and can't do for your product
84% of developers use AI coding tools. They ship boilerplate 30-50% faster. They also generate 2.74x more security vulnerabilities. Here's how to get the speed without the risk.
How much does it cost to build an AI agent in 2026?
Off-the-shelf AI agents cost $500-$5,000/month. Custom builds run $20K-$180K+. But initial development is only 25-35% of your three-year cost. Full pricing breakdown with a build-vs-buy decision framework.
AI is changing what software costs to build. Here's what to demand from your dev partner.
Developers are 40% faster with AI tools, but enterprise software spending rose 15% to $1.4 trillion in 2026. Where's the savings going? A pricing breakdown for founders hiring dev teams.
Stop overpaying for AI tokens
Our engineers use AI tools daily and optimize every API call. 30-minute call. No commitment.
Book a free consultation