You can build the most elegant agent-ready architecture in the world. None of it matters if you can’t afford to run it.
Where We Left Off
In the previous posts, we established the foundation for RepCheck. A system design document built through structured Q&A. All the Scala code patterns documented. Compile-time enforcement configured. A multi-repository architecture designed.
That work produced real artifacts: a 1,700-line code patterns reference, a system design document with Mermaid diagrams, and compile-time tooling (WartRemover, Scalafix, tpolecat) that catches mistakes during compilation.
This post is about what happened next, and the problems I did not see coming.
The Documentation Was Too Expensive to Use
The documentation I created was thorough. That was the point. Give coding agents enough context to implement components independently. But thorough has a cost, and that cost is measured in tokens.
The system design doc alone was about 8,000 words. The code patterns reference was another 6,500. The patterns guide, the annotated references, the skeleton templates. All told, roughly 33,000 words of documentation, or about 40,000 tokens. Every time an agent starts a task, it needs to load the relevant subset. With Claude Code’s context window, that is a significant chunk consumed before the agent writes a single line of code.
I had built documentation that was good. I had also built documentation that was expensive to use.
The fix turned out to be simple: stop loading everything. I created CLAUDE.md as a routing guide that maps specific tasks to the exact files an agent needs to read. Instead of loading all 33,000 words to build a Pub/Sub subscriber, the agent reads the routing table and loads only the three files it actually needs. About 1,500 words instead of 33,000.
The routing table has 13 task categories covering everything from “building a Congress.gov API client” to “writing tests for any module.” Each entry lists exactly which files to load. Nothing more, nothing less.
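To make the shape concrete, here is a hedged sketch of what a routing entry can look like. Apart from SCALA_CODE_PATTERNS.md, the file names below are illustrative, not the actual contents of RepCheck’s CLAUDE.md.

```markdown
## Task routing

| Task                               | Load only these files                                                                   |
|------------------------------------|-----------------------------------------------------------------------------------------|
| Building a Congress.gov API client | templates/annotated-api-client.md, SCALA_CODE_PATTERNS.md, SYSTEM_DESIGN.md (ingestion) |
| Building a Pub/Sub subscriber      | templates/skeleton-pubsub-subscriber.md, SCALA_CODE_PATTERNS.md, SYSTEM_DESIGN.md (eventing) |
| Writing tests for any module       | templates/skeleton-test-scaffold.md, SCALA_CODE_PATTERNS.md                             |
```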
The lesson was obvious in hindsight. Comprehensive documentation is useless if the consumer cannot find the relevant parts quickly. This applies to human developers too, but it is critical for agents because they pay per token for everything they read.
The 24 Templates
Before the cost problems hit, I built the template library that the routing table points to. This was the largest deliverable of this phase.
Eight of the templates are annotated references. These take existing, working code from the prototype and add inline annotations explaining every pattern decision. Each one references actual source file paths so agents can cross-reference. They cover the core patterns: tagless final traits, DTO/DO layering with Either-based conversion, IOApp entry points with for-comprehension orchestration, Firestore persistence wrapping the Java SDK, PureConfig loading, enum parsing with Circe emap, semi-auto Circe codecs, and test patterns with equivalence class negative testing.
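To give a feel for what an annotated reference covers, here is a compressed sketch of two of those patterns: a tagless final algebra and an Either-based DTO-to-domain conversion. The names and fields are hypothetical, not copied from the actual templates.

```scala
import io.circe.Codec
import io.circe.generic.semiauto.deriveCodec

// DTO: mirrors the external JSON shape (fields are hypothetical)
final case class BillDto(billId: String, congress: Int, title: String)
object BillDto {
  implicit val codec: Codec[BillDto] = deriveCodec // semi-auto Circe codec
}

// Domain objects: the validated internal representation
final case class BillId(value: String) extends AnyVal
final case class Congress(value: Int)  extends AnyVal
final case class Bill(id: BillId, congress: Congress, title: String)

// Either-based DTO -> domain conversion, surfacing validation failures as Left
object BillConverter {
  def toDomain(dto: BillDto): Either[String, Bill] =
    for {
      id       <- Either.cond(dto.billId.nonEmpty, BillId(dto.billId), "billId is empty")
      congress <- Either.cond(dto.congress > 0, Congress(dto.congress), s"invalid congress: ${dto.congress}")
    } yield Bill(id, congress, dto.title)
}

// Tagless final algebra: the effect type F[_] stays abstract until the edge of the app
trait BillRepository[F[_]] {
  def find(id: BillId): F[Option[Bill]]
  def save(bill: Bill): F[Unit]
}
```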
The other sixteen are skeleton templates. Copy-and-fill scaffolds for new components. Each one compiles structurally and uses TODO markers where project-specific values go. They cover retry wrappers, error patterns, config patterns, Pub/Sub publisher and subscriber, orchestrator, Doobie repositories, GCS readers, snapshot services, streaming pipelines, LLM client adapters, prompt engines, Firestore repositories, pipeline apps, workflow definitions, and test scaffolds.
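For comparison, a skeleton template is closer to the shape below: structurally complete, with TODO markers where project-specific values go. This retry-wrapper sketch assumes cats-effect and is illustrative rather than the real template.

```scala
import scala.concurrent.duration._
import cats.effect.IO

// Skeleton shape: compiles structurally; TODO markers show what an agent fills in
object RetryWrapper {
  // TODO: replace with the project's configured retry policy
  val maxAttempts: Int          = 3
  val baseDelay: FiniteDuration = 500.millis

  def withRetry[A](action: IO[A]): IO[A] = {
    def loop(attempt: Int): IO[A] =
      action.handleErrorWith { err =>
        if (attempt >= maxAttempts) IO.raiseError(err) // TODO: wrap in a domain error type
        else IO.sleep(baseDelay * attempt.toLong) *> loop(attempt + 1)
      }
    loop(1)
  }
}
```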
Every skeleton follows the patterns from SCALA_CODE_PATTERNS.md. Every one was reviewed against the system design to make sure it reflects actual architectural decisions, not generic boilerplate.
The Documentation Needed Compression
The routing table helped, but some tasks still require loading several files. Building a scoring engine, for example, needs five templates plus architecture docs. The documentation was written for humans and agents alike. Readable prose, full code blocks, detailed explanations.
Agents do not need readable prose. They need dense, accurate context.
So I built a doc compressor: a Scala program that reads each documentation file, sends it to Claude’s API, and produces a semantically compressed version optimized for LLM consumption. The compressed versions live in .claude/agent-docs/ and the routing guide points agents there.
The compression preserves all code blocks exactly as-is, preserves all table data, and compresses prose into terse single-line statements. Introductory and summary paragraphs get removed. The output has to be self-contained. An agent reading only the compressed file must understand the pattern without needing the original.
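A minimal sketch of that loop, assuming circe for the request body and the JDK’s built-in HTTP client; the prompt wording, directory layout, and model id are placeholders rather than the actual implementation.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._
import io.circe.Json

object DocCompressor {
  private val client = HttpClient.newHttpClient()

  // Calls the Anthropic Messages API; the compression prompt here is illustrative
  def compress(doc: String, apiKey: String): String = {
    val body = Json.obj(
      "model"      -> Json.fromString("claude-3-5-haiku-latest"), // placeholder model id
      "max_tokens" -> Json.fromInt(4096),
      "messages" -> Json.arr(
        Json.obj(
          "role" -> Json.fromString("user"),
          "content" -> Json.fromString(
            "Compress this documentation for LLM consumption. Keep all code blocks and tables verbatim; " +
              "reduce prose to terse single-line statements; drop intros and summaries.\n\n" + doc
          )
        )
      )
    )
    val request = HttpRequest.newBuilder(URI.create("https://api.anthropic.com/v1/messages"))
      .header("x-api-key", apiKey)
      .header("anthropic-version", "2023-06-01")
      .header("content-type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body.noSpaces))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body() // TODO: parse the response JSON
  }

  def main(args: Array[String]): Unit = {
    val apiKey = sys.env("ANTHROPIC_API_KEY")
    val outDir = Paths.get(".claude/agent-docs")
    Files.createDirectories(outDir)
    Files.list(Paths.get("docs")).iterator().asScala
      .filter(_.toString.endsWith(".md"))
      .foreach { path =>
        val compressed = compress(Files.readString(path), apiKey)
        Files.writeString(outDir.resolve(path.getFileName), compressed)
      }
  }
}
```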
This was supposed to be straightforward. It was not.
The API Key That Did Not Work
To run the compressor, I needed an Anthropic API key. I had a Claude Pro subscription. I assumed that meant I had API access.
It does not.
Claude Pro is the $20/month subscription that gives you access to claude.ai, the chat interface. The Anthropic API is a completely separate product with separate billing. Having one does not give you the other. This is not obvious from Anthropic’s marketing. I suspect many developers hit this same wall.
After purchasing API credits, the key still returned “credit balance too low.” The credits were in the right organization and the right workspace, and the limits were fine. The issue turned out to be the API key itself. It was not registering the credits. Deleting it and generating a fresh key from the same workspace fixed it.
I lost about 45 minutes to API key debugging. That is 45 minutes of a Claude Code session spent not writing code, not designing architecture. Just figuring out why an HTTP 400 kept coming back.
The Real Cost of LLM-Powered Features
The doc compressor was the wake-up call, but the real issue was bigger.
RepCheck’s core design depends on LLMs. The bill analysis pipeline sends every Congressional bill through Claude to extract key provisions, classify political alignment, identify earmarks, and generate plain-language summaries. This is the heart of the product.
I ran the numbers on what that would cost with Claude Opus, which was the original plan. About 15,000 bills per Congress, average bill around 5,000 tokens input and 2,000 tokens output, Opus pricing at $15 per million input tokens and $75 per million output tokens. The initial extraction pass alone would run about $3,375. And that is one pass. The original design called for multiple analysis layers. With Opus for everything, I was looking at roughly $10,000 per Congress in LLM costs alone.
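For transparency, here is the back-of-the-envelope arithmetic behind that figure, written out as a small calculation:

```scala
// Single extraction pass, Claude Opus pricing ($15 / $75 per million input / output tokens)
val bills         = 15000
val inputTokens   = bills * 5000L                 // 75,000,000 input tokens
val outputTokens  = bills * 2000L                 // 30,000,000 output tokens
val inputCostUsd  = inputTokens / 1e6 * 15.0      // $1,125
val outputCostUsd = outputTokens / 1e6 * 75.0     // $2,250
val passCostUsd   = inputCostUsd + outputCostUsd  // about $3,375 for one pass
```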
These are rough estimates. Bill sizes vary enormously. A one-page naming resolution might be 200 tokens. A 2,700-page omnibus spending bill could be 500,000 tokens or more. The actual distribution is heavily skewed: most bills are short, but the long ones are very long. I would not know the real numbers until the pipeline ran against actual bill text. But the direction was clear. For a side project, $10,000 per two-year Congressional session was not viable.
The Tiered Model Strategy
The fix was acknowledging that not every task needs the most expensive model. I redesigned the bill analysis pipeline into three passes.
Pass 1 sends every bill through Haiku first. Structured extraction: key provisions, spending amounts, affected agencies, topic classification. Haiku handles this well and costs a fraction of Opus. Pass 2 uses Sonnet, but only on bills that score above a relevance threshold in Pass 1. Typically about 30% of bills warrant deeper analysis. This is where plain-language summaries, political alignment signals, and nuance detection happen. Pass 3 uses Opus, but only when Sonnet flags genuine ambiguity or conflicting signals. Typically less than 5% of the filtered set.
The total estimated cost dropped from roughly $10,000 to about $500 per Congress. Most bills are straightforward. A farm subsidy bill does not need Opus-level reasoning to classify. Haiku handles it fine. You only escalate when the cheaper model signals uncertainty.
I added this to the system design as a formal architecture decision, with configurable thresholds per pass and budget caps that halt processing if costs exceed projections.
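A rough sketch of how that decision could look in code, with illustrative names and default values rather than the real configuration:

```scala
// Sketch of the escalation decision; field names and thresholds are illustrative
final case class TierConfig(
  relevanceThreshold: Double = 0.6,          // Pass 1 -> Pass 2 cutoff
  ambiguityThreshold: Double = 0.8,          // Pass 2 -> Pass 3 cutoff
  budgetCapUsd: BigDecimal   = BigDecimal(500) // halt processing if projected spend exceeds this
)

sealed trait Model
case object Haiku  extends Model
case object Sonnet extends Model
case object Opus   extends Model

final case class Pass1Result(relevance: Double) // plus extracted provisions, amounts, topics
final case class Pass2Result(ambiguity: Double) // plus summary, alignment signals

object TieredAnalysis {
  // Every bill gets Pass 1 on Haiku; later passes run only when the cheaper model signals uncertainty
  def afterPass1(r: Pass1Result, cfg: TierConfig): Option[Model] =
    if (r.relevance >= cfg.relevanceThreshold) Some(Sonnet) else None

  def afterPass2(r: Pass2Result, cfg: TierConfig): Option[Model] =
    if (r.ambiguity >= cfg.ambiguityThreshold) Some(Opus) else None
}
```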
The Token Limit Problem During Development
There is a separate cost that is easy to overlook: the tokens consumed while building all of this with an AI agent.
Claude Code sessions have context windows. When you are doing extended architectural work (reading files, discussing design decisions, writing code, reviewing changes, debugging Git issues) you consume context fast. I hit the limit multiple times during this phase.
When you hit the limit, the conversation gets summarized and compressed. You lose the nuanced context of earlier decisions. The agent may contradict something you decided two hours ago. You spend tokens re-establishing context that was already established.
This is a structural problem with long-running agent sessions. The work I did in this phase (24 templates, a routing guide, a compression generator, a tiered cost strategy, system design updates) took multiple sessions across multiple days. Each session boundary meant re-loading context, re-reading files, and re-explaining decisions.
The practical mitigation was keeping a plan file that tracked exactly what was done and what remained. When a session ran out, the next one could pick up from the plan rather than re-deriving the state. This helped, but it is a workaround, not a solution. Later posts in this series cover how plan files evolved into something more deliberate.
Vector-Based Context Memory
A colleague pointed me to crowd-control, an open-source tool that tackles the session memory problem. Studying how it works taught me some useful concepts, even though I do not plan to use the tool directly.
The approach is clever. After each session ends, a hook sends the transcript to Haiku to distill it into discrete learnings (architectural decisions, discovered patterns, gotchas). Those learnings get embedded as vectors and stored in a local LanceDB database. At the start of a new session, the agent searches the vector DB with semantically relevant queries and gets back distilled knowledge from past sessions. A warm start.
Embeddings do not reduce context. They change where it comes from. The savings come from two places: a massive reduction compared to loading full transcripts, and semantic search that retrieves only the relevant learnings instead of dumping everything into every session.
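A library-free sketch of that retrieval idea, not crowd-control’s actual implementation; the embedding step is assumed to happen elsewhere.

```scala
// Learnings stored with their embeddings, queried by cosine similarity at session start
final case class Learning(text: String, embedding: Vector[Double])

object LearningStore {
  private def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Return only the top-k learnings relevant to the new session's query,
  // instead of replaying entire past transcripts
  def warmStart(query: Vector[Double], store: List[Learning], k: Int = 5): List[Learning] =
    store.sortBy(l => -cosine(query, l.embedding)).take(k)
}
```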
I saw two potential applications. For agent development context, I could store coding rules, guidelines, skeleton templates, and annotated examples in a vector database so agents can semantically search for the right pattern instead of loading everything through the routing table. For the application itself, embeddings would be useful for concept-level discovery in the bill analysis pipeline (“find laws related to environmental liability”) but wrong for precise retrieval of legal text where exact wording matters. Legal text needs a traditional database for exact citation lookup, with embeddings only as a discovery layer.
I did not implement either at this point. But understanding how embeddings, vector search, and semantic retrieval work influenced how I thought about both agent tooling and application architecture going forward.
What Got Shipped and What Did Not
The concrete output of this phase: 24 code templates covering every component type in the system, a routing guide that cuts agent token consumption significantly, a doc compression generator, a tiered LLM cost strategy, and a CI/CD pipeline running compile checks, Scalafix linting, and tests on every PR.
What did not get done: no code had been generated from the templates yet. The templates existed, the routing worked, but no agent had actually used them to implement a component end-to-end. No acceptance criteria per component. The templates told agents how to build something but not when it was done. No Congress.gov API specs for votes, members, and amendments. No Docker or deployment templates. And the cost estimates still needed validation against real bill text sizes.
Project Repositories
All code for RepCheck is on GitHub:
- votr: Main monorepo. Pipelines, migrations, infrastructure code, acceptance criteria, and this blog
- repcheck-shared-models: Shared models library. DTOs, domain objects, Circe codecs, Doobie codecs
- repcheck-pipeline-models: Pipeline models library. Events, workflow schemas, error handling, configuration
- repcheck-ingestion-common: Ingestion common library. API client, XML parsing, change detection, event publishing, repository base, placeholders, execution helpers, structured logging
- repcheck-g8: Giter8 template for scaffolding new RepCheck Scala repositories
- tf-repcheck-infra: Terraform infrastructure-as-code for GCP (dev/staging/prod)
This is part of an ongoing series documenting RepCheck’s development. Previous posts: Introducing RepCheck | Building Agent-Ready Context