AI isn't free. It never was — but a lot of people are finding that out for the first time this week.

Anthropic just announced that third-party tools like OpenClaw and Cline can no longer piggyback on Claude Pro and Max subscriptions. If you want to use Claude through these tools, you now pay per token, either through the API or through Anthropic's new "extra usage" bundles. The era of all-you-can-eat AI subscriptions quietly subsidizing heavy automation use? Over.

I get why this is a shock for a lot of people. But here's the thing: I've been paying per token from day one on Deep Dugout, my AI baseball simulation project, and managing that cost was one of the more interesting engineering challenges of the whole build. I want to share one specific technique — prompt caching — that cut my costs by 57%. If you're about to start paying for AI by the token for the first time, this is the single most impactful thing you can do.

For context: Deep Dugout is a full baseball simulation where Claude makes every managerial decision — lineups, starting pitchers, bullpen calls — for all 30 MLB teams, grounded in real player data from FanGraphs and the MLB Stats API. I simulated a projected 2026 World Series, generated post-game articles and audio podcasts, and then broadcast all 15 Opening Day games live on Discord. Total API cost for the entire project: about $50.

That $50 figure is the part that surprises people. I made over 2,500 API calls across 86 baseball games. That's a lot of calls! So how on earth did it cost less than a decent dinner?

Prompt caching. And it is so, so much simpler than you probably think it is.

Here's the deal. Every time one of my AI managers makes a decision, the API call includes a big system prompt: the team's manager personality profile (built from real interviews and press conferences — Dave Roberts manages aggressively, Dan Wilson trusts his starters, Kevin Cash platoons everything), the full roster with player stats, and the response format instructions. That's about 3,100 tokens that are completely identical across every single call within a game.

Without caching, you're paying full price to send those same 3,100 tokens to Claude 20 to 30 times per game. You're essentially photocopying the same document and handing it over again and again and again, paying for a new photocopy each time. It's incredibly wasteful.

With prompt caching, you send it once at full price (plus a small one-time cache-write premium of about 25%), and every subsequent call gets a 90% discount on those cached tokens. Claude says "oh I've already read this, got it" and you pay almost nothing to include it.

The implementation is almost comically simple. You add a single parameter — cache_control: {"type": "ephemeral"} — to the system prompt in your API call. That's it. One line of code. (One caveat: prompts below a minimum length, roughly 1,024 tokens on most Claude models, aren't eligible for caching, so a tiny system prompt won't benefit.) The cache persists for five minutes, which is more than enough time when your AI manager is making rapid-fire decisions throughout a baseball game.
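Here's a sketch of what that one-line change looks like in Python. The request shape follows the Anthropic Messages API, but the model ID, prompt text, and token limit are placeholders, not values from the project:

```python
# Sketch of an API request with prompt caching enabled. In real code
# you'd pass these kwargs to anthropic.Anthropic().messages.create(**request).

STATIC_SYSTEM_PROMPT = (
    "You manage the Los Angeles Dodgers. Manage aggressively.\n"   # persona
    "Roster and stats: ...\n"                                      # reference data
    "Respond with JSON: {\"decision\": ..., \"reasoning\": ...}"   # format rules
)

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model ID
    "max_tokens": 1024,
    # "system" accepts a list of text blocks; cache_control on a block
    # tells the API to cache the prefix up to and including that block.
    "system": [
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # the one-line change
        }
    ],
    # Only the per-decision game state changes between calls.
    "messages": [
        {"role": "user", "content": "Bottom of the 7th, runner on 2nd. Bullpen call?"}
    ],
}
print(request["system"][0]["cache_control"])
```

Everything above the user message stays byte-identical across calls, which is exactly what makes it cacheable.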

The results across my 15-series batch run (86 games, 2,588 API calls):

  • 88% of all input tokens hit the cache
  • Without caching, the run would have cost about $37
  • With caching, it cost $15.81
  • That's a 57% reduction in total cost
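To see where numbers like that come from, here's a back-of-the-envelope calculator. The prices are assumptions (Sonnet-class rates: $3 per million input tokens, cached reads at a 90% discount, cache writes at a 25% premium) and the per-game token counts are illustrative, so it won't reproduce my exact totals; it also covers input tokens only, while the real bill includes output tokens too.

```python
# Illustrative input-token cost math for one simulated game.
# Prices are assumptions; check Anthropic's current pricing page.

PRICE_PER_MTOK = 3.00

def input_cost(fresh_tokens, cache_read_tokens, cache_write_tokens):
    """Dollar cost of input tokens for a batch of calls."""
    return (
        fresh_tokens * PRICE_PER_MTOK
        + cache_read_tokens * PRICE_PER_MTOK * 0.10   # 90% discount on reads
        + cache_write_tokens * PRICE_PER_MTOK * 1.25  # 25% premium on writes
    ) / 1_000_000

# Example: a 3,100-token static prompt reused across 25 calls in a game,
# plus ~500 fresh tokens of game state per call.
calls, prompt, state = 25, 3_100, 500
without = input_cost(calls * (prompt + state), 0, 0)
with_cache = input_cost(calls * state, (calls - 1) * prompt, prompt)

print(f"without caching: ${without:.4f}")
print(f"with caching:    ${with_cache:.4f}")
print(f"input-token savings: {1 - with_cache / without:.0%}")
```

The savings on the input side alone can exceed the headline 57% because output tokens, which never get discounted, dilute the overall percentage.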

Now look — I'll be the first to admit that saving $21 is not going to change anyone's life. But the whole project cost $50! The percentage is the point. Scale those numbers up to whatever you're building — a chatbot, an AI agent, a content pipeline — and you're talking about real money. Anything where the "instructions" stay the same but the "questions" change is a candidate for this. And I would bet that describes the vast majority of what people are building with AI right now.

This mattered a lot less a month ago. When your Claude usage was bundled into a $20/month subscription, who cared about token efficiency? Nobody! You were already paying! But now that Anthropic is pushing everyone toward per-token billing, the people who understand their cost structure — and know tricks like this — are going to have a real advantage over the people who don't.

A few practical tips if you're new to this:

First, identify your static content. System prompts, persona definitions, format instructions, reference data — anything that doesn't change between calls is a caching candidate. In my case, the manager personality and roster data were obvious targets.

Second, put the static parts at the beginning of your prompt. This is important! The cache works on prefix matching: Claude checks whether the start of your new prompt exactly matches something it has already seen. (The prefix is assembled in order, tools first, then system, then messages, so a change early in that sequence invalidates the cache for everything after it.) If your cacheable content is buried in the middle or sprinkled throughout, the cache won't hit. Front-load the stuff that doesn't change.
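As a toy illustration (the function and strings here are made up, not from the project), the goal is for every call in a session to share a byte-identical static prefix:

```python
# Toy illustration of prefix-friendly prompt assembly. Everything static
# (persona, roster, format rules) goes first and never varies; the
# changing game state goes last, so the cached prefix still matches.

def build_prompt(persona, roster, game_state):
    static_prefix = f"{persona}\n\n{roster}\n\nRespond with JSON."
    return f"{static_prefix}\n\nCurrent situation: {game_state}"

call_1 = build_prompt("Dave Roberts profile", "Dodgers roster", "top of the 1st")
call_2 = build_prompt("Dave Roberts profile", "Dodgers roster", "bottom of the 9th")

# Both calls share an identical prefix, so the cache can hit.
shared = "Dave Roberts profile\n\nDodgers roster\n\nRespond with JSON."
print(call_1.startswith(shared) and call_2.startswith(shared))
```

If the game state were interpolated into the middle of the roster text instead, the two prompts would diverge early and the cache would miss every time.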

Third, keep your calls sequential within a session. The cache has a time-to-live of five minutes, and each cache hit resets that clock. If you're making calls in rapid succession — which most agents and pipelines do — you'll get excellent hit rates. My AI managers make decisions every few seconds during a game, so the cache was almost always warm.

Fourth, monitor your cache hits. The usage object in every API response includes cache_read_input_tokens (tokens served from cache) alongside input_tokens and cache_creation_input_tokens. If cache reads are low relative to your total input tokens, something is wrong with your prompt structure and you should investigate.
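A small helper makes that check concrete. This is a sketch: the field names match the Messages API usage block, but it assumes input_tokens counts only uncached tokens (with cache reads and writes reported separately), which is worth verifying against the current docs.

```python
# Compute the share of input tokens served from cache for one response.
# Assumes "input_tokens" excludes cached tokens, which are reported
# separately in the usage block.

def cache_hit_rate(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    total = usage["input_tokens"] + read + written
    return read / total if total else 0.0

# e.g. a warm-cache decision call: 3,100 cached prompt tokens,
# 500 tokens of fresh game state.
usage = {"input_tokens": 500, "cache_read_input_tokens": 3100,
         "cache_creation_input_tokens": 0}
print(f"{cache_hit_rate(usage):.0%}")
```

Logging this one number per call is usually enough to spot a broken prefix within minutes of deploying a change.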

I built an entire AI baseball universe — 200+ simulated games, 30 AI managers with distinct personalities, post-game journalism, audio podcasts, a live Discord broadcast — for $50. Prompt caching is a big reason why that number isn't $120.

AI costs are engineering problems, not just business problems. And now that we're all paying by the token? It's worth learning how to engineer them well.

Deep Dugout: https://www.deepdugout.com