Using Anthropic Fable Efficiently

Fable is Anthropic’s new frontier tier. It’s astonishingly good. The first day I had it, I used it the way I’d used every model before it: point it at a task and let it implement the whole thing. I hit my usage limits within a couple of hours. Mid-task, nothing to do but wait.

That’s the trap. Frontier tokens are the scarcest resource in the workflow. They cost the most, they’re capped the hardest, and they fill the context window the fastest. All three pressures point the same direction, and the model is good enough that it’s easy to ignore all of them until you hit a wall.

Here’s the counterintuitive part. The most expensive way to use the smartest model is to let it type. You don’t pay a principal engineer to implement a spec that any competent engineer could follow. You pay them to decide what’s worth building and to catch what everyone else missed. Fable is the same. Its real value isn’t the code it can write. It’s the judgment it brings before anyone writes anything.

The mental model: three tiers, three jobs

Once I stopped treating Fable like a faster Opus, the workflow sorted itself into three tiers.

Fable, the frontier tier. Code review, architecture decisions, converting a roadmap into a spec, any call that hinges on ambiguity. This is where judgment lives.

Opus, the workhorse. Implementing a well-specified feature, multi-file refactors, anything with clear acceptance criteria. Give it a good spec and it executes.

Sonnet, the fast tier. Mechanical changes, boilerplate, test scaffolding, docs. When you know exactly what you want and the task doesn’t need judgment, Sonnet gets it done.

The insight that ties it together: delegation quality depends entirely on spec quality. The frontier model’s real output isn’t code. It’s a specification precise enough that a cheaper model can’t get lost. When you use Fable well, you’re not asking it to build. You’re asking it to produce the document that lets something else build.

Case study 1: the Current watch app release

I ship an iOS app, Current: Recovery Score. This summer I was planning its 2.0 release, and it’s a clean example of the frontier model doing frontier work.

I had Fable run full code reviews on both apps. It came back with 19 findings on the phone app and 15 on the watch app. All fixed.

Then it did something more interesting. I gave it the roadmap and asked what to do next. The plan had a v1.8 growth release scheduled before the v2.0 watch app. But the watch app had landed about nine months early. Fable’s recommendation: skip the standalone v1.8, fold the highest-value items into v2.0, and rescope the whole release around the watch launch as the re-launch moment. That’s a strategy call, not a coding task. It’s exactly what you bring a frontier model in for.

The artifact it produced is the point. The v2.0 plan isn’t a to-do list. Every item carries scope, file-level guidance, and acceptance criteria, written to be handed straight to an implementation agent. Two examples.

The Siri Shortcut spec: read from the shared score store, do not re-fetch from HealthKit, honor Weekly Mode so the answer matches the widget’s phrasing. Acceptance: “Hey Siri, what’s my recovery score” answers with the same number the widget shows, and it works with the app closed. No ambiguity left for the implementer to resolve.

The day-tags correlation feature: a fixed set of tags, a pure engine that computes mean score on tagged versus untagged days, and a minimum of five tagged samples before it reports anything. Deliberately not free-text, deliberately not AI, deliberately a stats table. The spec locked in what it was and what it wasn’t.

Implementation agents took that plan and built it. The release came together building clean on four schemes with 89 unit tests passing. Fable never wrote a line of Swift. It decided what to build and how to prove it worked, and cheaper models did the building.

Case study 2: this website, one session

The same loop, smaller scale, one afternoon. I had Fable take a pass over this site.

It came back with a ranked findings list. Two were real bugs: a corrupted SVG path in a share button, and a reading progress bar that broke under View Transitions. It flagged a photo lightbox serving 18 MB original camera JPGs, a missing 404 page, no robots.txt, and no structured data. A clean triage of what actually mattered.

Every finding became a spec: file paths, current behavior, desired behavior, and a verification bar. Build passes, all Playwright tests green, measurable outcomes where they applied. Opus executed all 11 fixes. The built site dropped from 146 MB to 35 MB.

The feature batch worked the same way. A command palette, project case studies, EXIF extraction at build time. Fable wrote the spec, including the accessibility requirements and the new tests it wanted. Opus built it. Then Fable came back at the end to review the copy, and while it was there it queried the real health database and corrected three stats I’d gotten wrong.

That last part is the shape of the whole thing. Review, delegate, verify is a loop, not a handoff. The frontier model comes back at the end to check the work, and checking costs a fraction of what doing the work would have. You pay frontier rates twice, for the spec and the review, and workhorse rates for the expensive middle.

What makes a spec delegable

The difference between a spec that saves tokens and one that wastes them comes down to a few things.

Self-contained. File paths, current behavior, desired behavior. Assume the implementing agent starts cold, because it does. It has none of the context that produced the spec.

Acceptance criteria the agent can check itself. “Works app-closed and returns the same number as the widget” is verifiable. “Make it good” is not.

Guardrails written down. Conventions to follow, what not to touch, what to report back. The day-tags spec saying “not free-text, not AI” saved a round of the implementer guessing.

A required report format. Files changed, test results, and deviations. Deviations matter most. The best moment of the website session was Opus refusing part of a spec: it found that an “unused” image file was actually load-bearing for OG image generation, and said so instead of deleting it. A good spec gives the implementer room to push back, and a good report surfaces it.

The anti-pattern is vague delegation. “Clean this up” burns more total tokens than doing it yourself, because you pay for the implementation and then you pay again for the rework when it goes the wrong direction.

When not to delegate

Delegation isn’t always right. Skip it when:

The problem is genuinely ambiguous and the spec would run longer than the diff. At that point you’re doing the hard part anyway.

The work needs rounds of clarification. Resolving ambiguity is the frontier model’s job. Don’t spec it away and hand a cheaper model a guessing game.

You’re debugging something with an unknown root cause. There’s nothing to specify yet. Find the bug first, then decide who fixes it.

The thing I keep coming back to is that the token budget is really an attention budget. Frontier tokens go to judgment and verification, workhorse tokens go to execution, and the discipline is knowing which is which before you start. It’s the same discipline as running a good engineering team: your best people decide and review, and you don’t have them typing out the parts anyone could type.

I still catch myself reaching for Fable to do the easy thing, because it’s right there and it’s so good at everything. Then I remember that first afternoon, and I write the spec instead.