Anthropic released Claude Opus 4.7 today. They're touting a 13% benchmark improvement on coding tasks, enhanced instruction following, and a new tokenizer.

Benchmarks are important. But they're also abstract. A number in a spreadsheet doesn't tell you what "better" actually looks like when a model has to make a decision that matters.

So I gave it a baseball team.

If you've been following Deep Dugout — my AI baseball simulation project — you know the setup: a full baseball simulation engine where Claude AI models make every managerial decision. Real MLB rosters, real stats, real tactical complexity. Lineups, starting pitchers, bullpen calls. Every decision logged with reasoning and confidence levels.
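To make "every decision logged" concrete, here's a minimal sketch of what one logged decision might look like. The field names are my illustration, not Deep Dugout's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ManagerDecision:
    """One logged managerial call (illustrative fields, not the real schema)."""
    game: int            # game number within the series
    inning: int
    situation: str       # e.g. "bases loaded, nobody out, up 1"
    action: str          # e.g. "pull_pitcher", "stay_with_starter"
    target_player: str   # who the action applies to
    reasoning: str       # the model's stated rationale, verbatim
    confidence: float    # the model's self-reported confidence, 0.0-1.0
```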

Today, within hours of the Opus 4.7 release, I ran a new 2026 World Series simulation: the Los Angeles Dodgers managed by Opus 4.7 vs. the Seattle Mariners managed by Opus 4.6. Same seed (same dice rolls for identical situations), same rosters, same personality prompts. The only variable: the model generation calling the shots.
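The "same seed" detail matters: if both runs draw from identically seeded random streams, any divergence in outcomes traces back to the managers' choices rather than the dice. A minimal sketch of the idea, with a stub standing in for the real engine (function names and the toy game logic are mine, not Deep Dugout's):

```python
import random

def simulate_game(manager_model: str, rng: random.Random) -> str:
    """Stand-in for the real engine: one chance event decides the game.
    In the real engine, every probabilistic event (contact, hit location,
    errors...) draws from rng, so identical seeds mean identical dice rolls."""
    return "W" if rng.random() < 0.5 else "L"

def play_series(manager_model: str, seed: int, games: int = 5) -> list[str]:
    """Replay a series under a fixed random stream for one managing model."""
    rng = random.Random(seed)  # same seed => same chance outcomes per situation
    return [simulate_game(manager_model, rng) for _ in range(games)]

# identical seeds isolate the model generation as the only variable
print(play_series("opus-4.6", seed=2026))
print(play_series("opus-4.7", seed=2026))
```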

Opus 4.7's Dodgers won the series 4-1.

But the scoreboard is the least interesting part. What's fascinating is what the decision logs reveal about HOW the two models reason differently under pressure:

Opus 4.7 reasons about situations. Opus 4.6 recognizes them.

When Opus 4.6 was deciding whether to pull George Kirby at 130 pitches, down 6 runs, it kept repeating: "He's only been through the order once, let him work." At 61 pitches, 111 pitches, 130 pitches — same reasoning, same conclusion, same 82% confidence. The model was pattern-matching to a framework rather than updating on new information.

Opus 4.7, facing a similar scenario with Yoshinobu Yamamoto laboring through 6 walks, held multiple variables simultaneously: pitch count vs. times through order, tonight's evidence vs. season stats, current leverage vs. bullpen conservation. Its confidence ranged from 65% to 82% depending on the actual difficulty of the call.

The decisive moment came in Game 4's ninth inning. Opus 4.7 pulled closer Edwin Díaz with bases loaded and nobody out, citing evidence from this game: "2 walks in a 7-batter sample, leverage index is 2.84 and climbing, his command profile tonight tells me the stuff isn't playing." That's not a template. That's judgment.

The counterpoint: Every AI decision in Deep Dugout has to come back as structured data — which pitcher to pull, which reliever to insert, a confidence level, reasoning. Five times across the series, Opus 4.7 returned something the system couldn't parse: malformed output instead of a clean decision. When that happens, the system falls back to a simple rule-based manager and logs the failure. Opus 4.6? Zero failures. Not once. Despite Anthropic touting "improved instruction following," the older model was more mechanically reliable at the basic task of returning a well-formatted answer under pressure.
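This failure mode is the classic structured-output problem: the model has to return machine-readable data, and when it doesn't, you need a safety net. A hedged sketch of the parse-then-fallback pattern (the fallback rule, field names, and pitch-count threshold are my guesses, not Deep Dugout's actual code):

```python
import json
import logging

logger = logging.getLogger("deep_dugout")

def parse_decision(raw: str) -> dict | None:
    """Try to extract a well-formed decision from the model's reply."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # require the fields downstream code depends on
    if all(field in decision for field in ("action", "reasoning", "confidence")):
        return decision
    return None

def get_decision(raw_model_output: str, pitch_count: int) -> dict:
    """Use the model's call if it parses; otherwise fall back and log it."""
    decision = parse_decision(raw_model_output)
    if decision is None:
        logger.warning("malformed model output, using rule-based fallback")
        # illustrative fallback rule: pull any pitcher past 110 pitches
        return {"action": "pull_pitcher" if pitch_count > 110 else "no_change",
                "reasoning": "rule-based fallback", "confidence": 0.5}
    return decision
```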

So what does a 13% benchmark improvement actually look like outside of benchmarks?

It looks like a manager who changes his mind when the evidence changes, who knows what he doesn't know, and who can hold more than one variable in his head when the situation demands it. It also looks like a manager who occasionally fumbles the paperwork — better thinking, slightly less reliable execution.

The full series — complete with box scores, AP-style recaps, press conferences, beat writer analysis, and the deep-dive model performance article — is live now on deepdugout.com.

176 AI decisions. 5 games. $3–5 in API costs, which works out to under three cents per managerial decision. One afternoon of work.

The future of AI evaluation isn't just benchmarks. It's watching models make decisions that actually matter.

deepdugout.com/series/2026-world-series-generation-war/