Building documentation for AI agents is tedious, time-consuming work. The question is whether it pays off when you finally let the agents loose.
Where We Left Off
In the previous post, we talked about the economics of LLM-powered development: token costs, the doc compressor, tiered model strategies, and the routing table that cuts 85% of context loading. We had identified 10 documentation gaps that needed to be filled before agents could independently implement RepCheck’s repositories.
This post covers the grind of actually closing those gaps, and the unexpected problems that showed up along the way.
The Tedium of Context Creation
Here’s something nobody tells you about agent-ready documentation: it’s boring to create.
Writing a system design document is interesting. Debating architectural trade-offs is engaging. But sitting down to write the fifteenth skeleton template, carefully ensuring every F[_] constraint is correct, every error pattern follows the convention, every config loader uses the exact PureConfig pattern, is tedious work.
And it’s not just writing. Each document needs to be:
- Internally consistent with every other document
- Cross-referenced in the routing table
- Compressed into a token-efficient version for agents
- Verified to ensure compression didn’t lose critical details
For 30 files, that’s a pipeline. Every new document means updating CLAUDE.md, updating PATTERNS_GUIDE.md, running the compressor, and checking the output. Add a behavioral spec? That’s 5 routing table entries to update. Rename a file? That’s a cascade through the index.
The question I keep coming back to is: is this template-able? Not just within this project. Can this structure of routing tables, compressed docs, and task-specific file lists transfer to other repositories and other projects entirely? We’ve invested significant effort building this scaffolding for RepCheck. If it only works here, it’s a sunk cost. If the patterns generalize, it’s an investment.
The honest answer is: we won’t know until we try. The real test comes in two phases: first when agents use these templates to generate RepCheck’s repositories, and second when we apply the same approach to a completely different project. Only then will we know whether this is a repeatable process or a one-off effort that happened to work for one codebase. We’ve designed for portability (the doc compressor auto-discovers files, the routing table is task-oriented, the skeletons use placeholder patterns), but design intent and practical reality are different things.
Skills: A Way to Stop Paying the Context Tax
Every conversation with Claude Code starts by loading CLAUDE.md into context. Our CLAUDE.md is lean, mostly a routing table pointing to other files, but it’s still ~3,000 tokens occupying the context window for the entire session. That’s true whether the conversation is a 10-step implementation task or a quick “what branch am I on?”
Claude Code supports skills: markdown files in .claude/skills/ whose full instructions are lazy-loaded. Each installed skill contributes about 100 tokens of metadata at startup (the YAML frontmatter: name, description, trigger), but the body only loads when the skill is triggered. You can install many skills for only a marginal context cost: the full instructions enter the context window only when the skill fires.
There are three ways a skill fires. The explicit one: you type /skill-name as a slash command and it loads immediately. The description-based one: Claude reads the skill descriptions from the startup metadata and decides on its own whether the current task matches. If you ask it to “build the scoring engine,” it checks whether any installed skill covers that, and if one does, it loads the full instructions without you having to ask. No slash command required. The path-based one: a skill can declare a paths field in its frontmatter with glob patterns (e.g., "src/pipelines/**/*.scala"), and it fires automatically whenever Claude is working with files that match. Useful for project-specific skills where you want them active any time someone touches a particular part of the codebase, regardless of what they typed.
The quality of the description in the YAML frontmatter determines how reliably the automatic trigger works. A vague description (“helps with code tasks”) means the skill fires inconsistently or not at all. A precise one (“load when implementing a RepCheck pipeline component from scratch: scoring, ingestion, members, or bills”) fires when it should and stays quiet when it shouldn’t.
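To make that concrete, here is roughly what the frontmatter for a scoring-engine skill could look like. The file name, description wording, and glob pattern are illustrative, not a skill we actually ship:

```markdown
---
name: build-pipeline-component
description: >
  Load when implementing a RepCheck pipeline component from scratch:
  scoring, ingestion, members, or bills.
paths:
  - "src/pipelines/**/*.scala"
---

Full instructions go here and only enter the context window when the skill fires.
```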
The math is compelling:
| Approach | Tokens per session |
|---|---|
| Everything in CLAUDE.md | ~3,000+ always resident |
| Slim CLAUDE.md + skills | ~500 base + ~100/skill metadata + full body only when triggered |
Each routing table entry in our CLAUDE.md could become its own skill file. “Building the scoring engine” would be a skill that loads the relevant skeleton templates, BEHAVIORAL_SPECS.md references, and scoring-specific rules, but only when an agent is actually building the scoring engine.
We haven’t converted yet. This is an experiment we’re planning, not a result we’re reporting. But knowing the option exists changes the calculus: instead of worrying about CLAUDE.md growing unwieldy, we have an escape hatch.
When the Agent Can’t Run Anything
Here’s a problem that consumed more time than any architectural decision: Claude Code’s shell environment kept breaking between sessions.
Claude Code runs in a bash shell on Windows (via Git Bash). To run our Scala build, it needs Java 21 on the PATH and access to sbt. We configured this: added JAVA_HOME to ~/.bashrc, put Coursier’s bin directory on the PATH, exported the Anthropic API key for the doc compressor.
It worked. Then the next session started, and none of it was there.
The shell environment doesn’t persist the way you’d expect. Environment variables set in one session don’t carry over. The agent would configure Java 21, run the compressor successfully, and then in the next conversation start fresh with Java 11 and no API key.
We tried multiple approaches:
- Setting variables in `~/.bashrc` (works for interactive shells, but sbt spawns subprocesses that don’t always inherit)
- Setting variables in PowerShell’s `$PROFILE` (the file didn’t exist and had to be created)
- Using Coursier’s `--setup` flag (installed Java 21 but didn’t update the PATH correctly)
The fix that eventually stuck was explicitly sourcing ~/.bashrc before running sbt, and using the full path to both the Java installation and the sbt binary. It’s not elegant. It works.
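In practice the workaround is a small wrapper around every sbt invocation, something like this (the paths are placeholders, not our actual install locations):

```bash
# Rebuild the environment explicitly, then invoke sbt by its full path.
source ~/.bashrc
export JAVA_HOME="/c/Program Files/Java/jdk-21"             # placeholder path
"$HOME/AppData/Local/Coursier/data/bin/sbt" clean compile   # placeholder path
```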
We’re still experimenting with this. The root cause seems to be that Claude Code’s bash environment initializes differently from an interactive terminal. What works when you type commands manually doesn’t always work when the agent executes them. We haven’t found a definitive solution, just workarounds that hold for now.
The practical cost: at least an hour across sessions spent on environment debugging instead of actual development work. For a tool that’s supposed to accelerate development, that’s a painful tax.
The Tooling We Added
The original 10 gaps were about giving agents the information they need to write correct code. There’s also a gap in what happens after they write it: nothing prevents agent-generated code from being sloppy, unsafe, or inconsistent. We closed that with tooling.
Scalafmt (Code Formatting)
We configured sbt-scalafmt with project-specific rules: 120-column lines, 2-space indentation, trailing commas, redundant brace cleanup. All 26 existing source files were auto-formatted in one pass. CI now runs scalafmtCheckAll before compilation; unformatted code won’t merge.
Without an enforced formatter, every agent session produces slightly different styling. With Scalafmt, it doesn’t matter how the agent formats its output. The formatter normalizes it.
Codecov with Patch Coverage Enforcement
We set up sbt-scoverage for local coverage reports and Codecov for CI integration. The key decision: enforce coverage on patch, not project.
A project-wide coverage minimum would block every PR if legacy code has low coverage. Patch coverage checks only the lines changed in the PR. Our codecov.yml requires 90% coverage on new and changed code. The project-level target floats with the repo’s actual coverage, allowing a 5% threshold before failing.
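The relevant slice of the codecov.yml is small. A sketch with the thresholds described above (the rest of the file is omitted):

```yaml
coverage:
  status:
    project:
      default:
        target: auto     # floats with the repo's current coverage
        threshold: 5%    # allow up to a 5% drop before failing
    patch:
      default:
        target: 90%      # new and changed lines must reach 90%
```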
When an agent implements a new component from our templates, Codecov will verify that the agent also wrote adequate tests. If coverage on the new code is below 90%, the PR is blocked.
Scalafmt + Scalafix + WartRemover + tpolecat + Codecov
The full quality gate stack now looks like this:
| Tool | What It Catches | When |
|---|---|---|
| Scalafmt | Formatting inconsistencies | CI (pre-compile) |
| Scalafix | Import ordering, unused imports | CI (post-compile) |
| WartRemover | Unsafe patterns (nulls, vars, .get calls) | Compile time |
| tpolecat | Compiler warnings as errors | Compile time |
| Codecov | Insufficient test coverage on new code | PR check |
An agent can write sloppy code. These tools ensure it doesn’t get merged.
WartRemover is the one with the most teeth for functional-style Scala. It rejects patterns that are legal Scala but unsound in practice: null literals, var declarations, Option.get and .head calls, asInstanceOf casts, throw expressions, and type inference resolving to Any or Nothing. If an agent writes val x: String = null, reaches for .get on an Option, or declares a mutable var instead of using a Ref, it won’t compile. The errors are specific enough that an agent can self-correct without human intervention.
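To illustrate, a minimal sketch of code WartRemover rejects and the shapes that compile instead (the names are illustrative, not RepCheck code):

```scala
import cats.effect.{IO, Ref}
import cats.syntax.all._

object WartRemoverExamples {
  // Rejected at compile time:
  //   val name: String = null        // Wart.Null
  //   var processed = 0              // Wart.Var
  //   val first = names.head         // partial .head call on a possibly empty list
  //   val value = maybeName.get      // Wart.OptionPartial

  // Shapes that pass:
  def firstOr(names: List[String], fallback: String): String =
    names.headOption.getOrElse(fallback)

  def countProcessed(names: List[String]): IO[Int] =
    for {
      counter <- Ref.of[IO, Int](0)                         // Ref instead of a var
      _       <- names.traverse_(_ => counter.update(_ + 1))
      total   <- counter.get
    } yield total
}
```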
tpolecat enables a set of strict compiler flags that Scala doesn’t turn on by default. Unused imports become errors. Discarded values in for-comprehensions get flagged (the common case: forgetting to chain an effect, so a line’s result silently disappears). Exhaustivity checks are tightened. The practical effect is that code which compiles cleanly under tpolecat has a noticeably smaller surface area for common Scala bugs. It doesn’t catch everything, but it catches enough that the agent’s first compile attempt is usually much closer to correct.
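The discarded-effect case is the one agents hit most often, so it’s worth a concrete sketch (fetchMember and upsert are hypothetical stand-ins, not RepCheck’s real methods):

```scala
import cats.effect.IO

final case class Member(id: String, name: String)

object ValueDiscardExample {
  def fetchMember(id: String): IO[Member] = IO.pure(Member(id, "placeholder"))
  def upsert(member: Member): IO[Unit]    = IO.unit

  // Correct: the write is chained into the program, so it actually runs.
  def store(id: String): IO[Member] =
    for {
      member <- fetchMember(id)
      _      <- upsert(member)
    } yield member

  // The bug -Wvalue-discard catches: the IO returned by upsert is discarded
  // inside map, so the write never executes. With warnings promoted to errors
  // under tpolecat, this version refuses to compile.
  // def storeBroken(id: String): IO[Member] =
  //   fetchMember(id).map { member =>
  //     upsert(member)   // discarded non-Unit value
  //     member
  //   }
}
```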
Gap Tracker: 8 of 10 Done
Six of the eight closed gaps had the same root cause: RepCheck’s codebase was built during prototyping without documentation written for agents. The patterns, conventions, and decisions existed in code and in developers’ heads. When we decided to bring agents in, we had to retrofit everything.
Gap 1: Scala code patterns/signatures. An agent writing Scala for RepCheck without guidance would produce syntactically valid code that violates every project convention. Tagless final F[_] everywhere, parEvalMap for streaming, Sync[F].delay for side effects: none of this was written down. We produced SCALA_CODE_PATTERNS.md and a set of subsection files covering the effect system, streaming, error handling, and config patterns.
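To give a flavor of what those conventions look like on the page, a compressed sketch (the trait and class names are illustrative, not RepCheck’s actual interfaces):

```scala
import cats.effect.{Async, Sync}
import cats.syntax.all._
import fs2.Stream

final case class Bill(id: String, title: String)

// Tagless final: components are parameterized on F[_], never hard-coded to IO.
trait BillRepository[F[_]] {
  def upsert(bill: Bill): F[Unit]
}

final class BillIngestor[F[_]: Async](repo: BillRepository[F], parallelism: Int) {

  // Side effects go through Sync[F].delay instead of running eagerly.
  private def logProcessed(bill: Bill): F[Unit] =
    Sync[F].delay(println(s"processed ${bill.id}"))

  // Streaming fan-out uses parEvalMap, bounded by an explicit parallelism knob.
  def run(bills: Stream[F, Bill]): Stream[F, Unit] =
    bills.parEvalMap(parallelism) { bill =>
      repo.upsert(bill) *> logProcessed(bill)
    }
}
```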
Gap 2: References to existing code as templates. Abstract pattern descriptions don’t give agents enough signal. They need annotated examples they can mirror directly. We extracted the most representative patterns from the existing codebase into documented templates: the paginated API client, AlloyDB repository, DTO/DO layering, and the Circe codec conventions.
Gap 3: Congress.gov API specs. Congress.gov publishes no official OpenAPI spec. To build ingestion clients, agents need endpoint shapes, field names, pagination behavior, and rate limits from somewhere. Without a local reference, agents hallucinate field names or stop to ask. We wrote congress-gov-api.yaml from real API responses.
Gap 4: build.sbt / project scaffolding. RepCheck is a multi-project SBT build with specific plugin versions, compiler flag configurations, and dependency patterns. An agent creating a new repository without documented scaffolding would invent its own build structure or diverge from project conventions in ways that break later. We documented the build patterns and created the repcheck-g8 Giter8 template for scaffolding new repos.
Gap 5: Error handling & retry strategy. RepCheck uses a specific error model: flat exception hierarchy (not sealed ADTs), ErrorClassifier per subsystem categorizing failures as Transient or Systemic, and RetryWrapper with exponential backoff. Without documentation, every agent invents its own error handling, producing code that’s incompatible with the pipeline’s fault tolerance design. We produced error-pattern.scala and retry-wrapper.scala skeleton templates.
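A sketch of the shape those skeletons describe; the Transient/Systemic classification comes from the description above, everything else is illustrative:

```scala
import scala.concurrent.duration._

// Note: RepCheck's exceptions themselves are a flat hierarchy; this small ADT
// is only the classifier's verdict. Transient failures get retried with
// exponential backoff; Systemic failures fail fast.
sealed trait FailureKind
case object Transient extends FailureKind
case object Systemic  extends FailureKind

trait ErrorClassifier {
  def classify(error: Throwable): FailureKind
}

// Hypothetical classifier for an ingestion subsystem.
object IngestionErrorClassifier extends ErrorClassifier {
  def classify(error: Throwable): FailureKind = error match {
    case _: java.net.SocketTimeoutException => Transient
    case _                                  => Systemic
  }
}

// The backoff shape used by the retry wrapper: delay doubles on each attempt.
final case class BackoffPolicy(maxAttempts: Int, baseDelay: FiniteDuration) {
  def delayFor(attempt: Int): FiniteDuration = baseDelay * (1L << attempt)
}
```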
Gap 6: Testing guidance. The testing stack is specific: ScalaTest with AnyFlatSpec, AlloyDB Omni for integration tests, WireMock for HTTP simulation, equivalence class negative testing, and WartRemover’s Wart.Null applying to test code. Without guidance, agents write tests that pass locally but fail CI, or skip entire categories of failure testing. We produced test-patterns.md, testing-infrastructure.md, and test-templates.scala.
Gap 7: Behavioral ambiguity. Covered in its own section below.
Gap 8: GCP integration patterns. RepCheck wraps Pub/Sub and GCS in F[_] with specific resource management and error propagation conventions. The code existed but was never extracted into templates agents could reuse. Without them, agent-generated cloud integration code works but integrates awkwardly with the rest of the pipeline. We produced Pub/Sub publisher and subscriber skeletons, plus a GCS reader skeleton.
Gap 9: Acceptance criteria per component. Without a success spec for each component, an agent knows how to build but not when to stop. Every unspecified detail is a decision it makes on its own, and some of those will be wrong. The spec needs to go down to exact class signatures, method behaviors, edge case handling, and test expectations — not “implement the members repository” but “the upsert method takes a MemberDO, does this, returns that, throws this specific exception on failure.” We deferred it because accurate acceptance criteria require a stable system design first. That stability exists now. The work is 11 spec files, one per pipeline component. More detail on why this gap matters more than the others in the section below.
Gap 10: Docker / CI / GitHub Actions templates. Partially done. GitHub Actions workflows for compile, test, coverage, formatting, and lint checks are in place and running on every PR. What’s missing are validated Dockerfiles and Cloud Run deployment YAML. They’re written from spec but haven’t been tested against actual deployments. We prioritized code quality infrastructure first; the deployment templates need a working pipeline to validate against, and we’re not there yet.
| # | Gap | Status |
|---|---|---|
| 1 | Scala code patterns/signatures | Done |
| 2 | References to existing code as templates | Done |
| 3 | Congress.gov API specs | Done |
| 4 | build.sbt / project scaffolding | Done |
| 5 | Error handling & retry strategy | Done |
| 6 | Testing guidance | Done |
| 7 | Behavioral ambiguity | Done |
| 8 | GCP integration patterns | Done |
| 9 | Acceptance criteria per component | Not started |
| 10 | Docker / CI / GitHub Actions templates | Partial |
8 fully done. 1 partial. 1 not started.
There’s also one more gap that doesn’t appear on the original list but became obvious once we added coverage enforcement: we need to backfill missing unit tests. The existing codebase was built during rapid prototyping, and test coverage is thin. Now that Codecov enforces 90% patch coverage on every PR, any agent working on existing code will immediately run into the coverage floor. We need to bring the baseline up before we start generating new code on top of it.
Behavioral Specs: The Gap That Required Design Decisions
Of the 10 documentation gaps we identified, most were about describing patterns that already existed in the codebase. Gap #7 was different. It required making design decisions that hadn’t been made yet.
Gap #7: Behavioral ambiguity. How do pipelines detect changes? What triggers scoring? How do votes link to bills? What happens when events arrive out of order?
These aren’t implementation details. They’re product decisions. The answers required sitting down and thinking through scenarios:
- When a bill’s text is updated, do we re-analyze it? Yes, emit `bill.text.available` on every update where `updateDate` changed. Cost is managed by the tiered pass routing.
- Are votes immutable? No, they’re diffed and upserted. Prior versions are archived to a history subcollection.
- Does a legislator’s score span congresses? Yes, we maintain both an aggregate (lifetime) score and per-congress scores.
- What happens if a vote arrives before the bill has been analyzed? The scoring message is requeued with exponential backoff. After 5 retries, it goes to dead-letter.
- Do all votes count equally? No, floor passage votes (weight 1.0) matter more than committee votes (weight 0.4). Vote type is detected from the Congress.gov `question` field.
The resulting document, BEHAVIORAL_SPECS.md, is 250 lines of tables, rules, and Firestore schema definitions. It’s not exciting to read. But without it, an agent implementing the scoring engine would need to ask 10+ clarifying questions, each one consuming context and breaking flow.
Gap 9: Acceptance Criteria (The Hardest Gap)
Gap 9 is the biggest piece of remaining work, and the most complicated. It’s also where the quality of documentation most directly determines the quality of what agents produce.
Acceptance criteria for a software component aren’t just a checklist. They’re a specification of every class signature, every method’s behavior (input, output, edge cases), and every test expectation. Not “the repository should persist members” but “the upsert method takes a MemberDO, runs an INSERT ... ON CONFLICT DO UPDATE with these columns in this order, returns Unit in F, and throws MemberPersistenceException if the connection fails.” That level of specificity.
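What that looks like in practice: the spec pins the signature, the SQL behavior, the return type, and the failure mode before an agent writes any implementation. A sketch in that spirit (MemberDO and MemberPersistenceException come from the example above; the trait name and the fields are illustrative, not copied from the actual spec files):

```scala
final case class MemberDO(bioguideId: String, fullName: String, party: String, state: String)

final class MemberPersistenceException(message: String, cause: Throwable)
    extends RuntimeException(message, cause)

trait MembersRepository[F[_]] {

  /** Upserts a member row.
    *
    *   - Runs INSERT ... ON CONFLICT (bioguide_id) DO UPDATE with the
    *     documented columns in the documented order.
    *   - Returns Unit in F on success.
    *   - Raises MemberPersistenceException in F if the connection fails.
    */
  def upsert(member: MemberDO): F[Unit]
}
```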
The reason this matters for agents: vague criteria produce vague code. An agent given “implement the bills pipeline” will make decisions on its own: method signatures, error handling edge cases, retry behavior, what to log. Some of those decisions will be wrong. Each ambiguity in the spec is a place where the agent’s output diverges from what you actually wanted, and you find out at review time, not compile time.
The inverse is also true. The more precisely you define what “done” looks like for a component, the less the agent has to guess, and the more likely it is to produce working code on the first pass. This is the gap that cashes in everything else. All the patterns, templates, and behavioral specs are inputs to writing good acceptance criteria. Without them, you can’t be specific enough. With them, you can define exactly what each component does, down to the test assertion.
We deferred it because writing accurate acceptance criteria requires the system design to be stable. It is now. Each of RepCheck’s 11 pipeline components needs its own spec file. That’s the work.
We’re Almost There
The entire premise of this project has been: invest heavily in preparation, then let agents do the implementation. We’ve been in the preparation phase for about a week. It’s been slower and more tedious than expected. The documentation, the tooling, the behavioral specs, the shell debugging, none of it is the “fun part.”
But we’re close. Two more gaps, and we’ll have everything an agent needs to independently implement a RepCheck repository from scratch: the architecture, the code patterns, the behavioral rules, the templates, the API specs, the quality gates, and the compressed versions optimized for LLM consumption.
The next step is to close gaps 9 and 10, and then do what this entire effort has been building toward: point an agent at a repository specification and let it generate the code.
That’s the real test. Not whether we can write good documentation, but whether good documentation produces good code.
Project Repositories
All code for RepCheck is on GitHub:
- votr: Main monorepo. Pipelines, migrations, infrastructure code, acceptance criteria, and this blog
- repcheck-shared-models: Shared models library. DTOs, domain objects, Circe codecs, Doobie codecs
- repcheck-pipeline-models: Pipeline models library. Events, workflow schemas, error handling, configuration
- repcheck-ingestion-common: Ingestion common library. API client, XML parsing, change detection, event publishing, repository base, placeholders, execution helpers, structured logging
- repcheck-g8: Giter8 template for scaffolding new RepCheck Scala repositories
- tf-repcheck-infra: Terraform infrastructure-as-code for GCP (dev/staging/prod)
This is part of an ongoing series documenting RepCheck’s development. Previous posts: Introducing RepCheck | Building Agent-Ready Context | Token Costs and Template Architecture