The compile-test-fix loop is what keeps an agent’s output honest. When it skips the loop, even with the compiler right there, you get code that reads right and does not run.
Where We Left Off
The previous post ended on an optimistic note. Ten of eleven documentation gaps closed. The giter8 template built. Acceptance criteria (Gap #9) was the last remaining piece before we could point an agent at a repository and let it generate real code.
What actually happened next was not that.
The Overnight Session
Claude Code has a mobile interface called Dispatch. I had work to do on the bill ingestion pipeline, which was code I had written by hand, before I started letting agents generate production work. That pipeline was the seed, the example the agents could study, copy, and extend from when I pointed them at new components. My overnight task was adding test coverage to one of its API clients.
Before going to sleep, I queued up the task and let the agent run.
Eight hours of agent time while I slept. Wake up, review the PR, merge, move on. That was the idea.
The code that came back looked impressive. New test files, thoughtful structure, proper mock setup. It read like the right answer.
It did not compile.
It was way off. Multiple errors. Scalafmt violations. Test structures that referenced method signatures that did not exist. JSON decoders expecting a shape that did not match the actual response. The kind of failures sbt compile catches in seconds.
The agent had not run sbt compile. But it could have. Dispatch relays commands back to a machine you have registered. In my case that was my home PC, which had sbt installed, the JVM ready, the full project checked out. All the tools were right there. The agent just did not use them, because the prompt I had queued up before bed did not tell it to. It told the agent to write the tests. Nothing about compiling them. Nothing about running them. Nothing about fixing the failures and looping. I was sleepy. The prompt was naive.
What came back was the kind of code you write on a whiteboard. Structurally plausible. Logically coherent. Completely untested against reality.
Why It Went Wrong
The tool did not fail. The environment did not fail. Dispatch had everything the agent needed. The failure was in the prompt. I asked for the work, and I did not require the feedback loop.
When I am at the keyboard, the loop happens whether I spell it out or not. The agent writes something, I see it try to declare the work done, I say “run sbt test first.” The nudge happens because I am there. Take me out of the room and the nudge does not happen. The agent ships whatever it wrote and calls it done.
Remove the iteration, whether by disabling the compiler or just by not asking for it, and you get code that looks like it went through that process. Right patterns. Right structure. Plausible imports. None of it verified against the actual codebase, the actual dependencies, or the actual type signatures.
The lesson is about the prompt. The feedback loop has to be explicit. “Add coverage for this file” is not enough. “Add coverage, run sbt test, fix every failure, and re-run until the suite is green” is. Without that instruction the agent will happily write confident approximations that do not compile, even when the compiler is sitting right there on the same machine, waiting to be called.
The fix for the night I lost was simple. Redo the work in a session where I was watching and the loop was enforced. It cost more time than the overnight experiment saved.
What We Were Actually Trying to Fix
When we reran the work correctly, the CI failure told us what needed attention. The PR had 66% patch coverage on BillIdentifierApp.scala. Two categories of lines were uncovered:
- The `BillConversionFailed` exception class, defined but never exercised by tests
- The `IO.fromEither(bill.toDO.left.map(BillConversionFailed(_)))` line, inside a private method that had no test entry point
Both were in the diff. Both were flagged by Codecov. The coverage gate was blocking the PR.
The agent’s first instinct was to add BillIdentifierApp.scala to the codecov ignore list. That was the easy answer: the coverage gate stops failing, the PR goes green, you move on. I had to stop it and walk the change back. Adding a file to the ignore list hides the problem; it does not solve it. If a line cannot be reached by a test, the right answer is to make it reachable.
This is the kind of steering you have to do constantly. The agent will reliably pick the path that unblocks the immediate goal. Left unchecked, it would have shipped a PR where the uncovered lines were simply hidden from the tool measuring them. Technically the PR would have passed. The actual problem would still be there.
That rule is now written into CLAUDE.md explicitly. It came from this moment.
The Testability Pattern That Emerged
The underlying problem was structural. BillIdentifierApp was an IOApp, the outermost entry point of the pipeline. It created its own dependencies. It initialized Firestore, constructed the HTTP client, loaded config. All of that wiring lived in run(), tangled up with domain logic like streaming bills and handling conversion failures.
You cannot unit test an IOApp that creates its own Firestore connection. The test would need GCP credentials, a real Firestore instance, live network access. That is an integration test at best, and it is not what a 90% patch coverage gate is asking for.
The refactoring pattern we landed on has five steps.
1. Extract logic into a class with constructor injection.
Move the pipeline logic from BillIdentifierApp into a new BillProcessor class whose constructor takes Firestore, LegislativeBillsApi[IO], and Logger as parameters. Tests construct it with mocks. Production code constructs it with real instances.
Before:
object BillIdentifierApp extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = for {
    config <- ConfigLoader.LoadConfig(args)
    firestore <- FirestoreScala[IO](config.projectId).InitializeFirestore()
    api <- LegislativeBillsApi[IO](config.apiKey, config.pageSize)
    _ <- streamBills(firestore, api)
  } yield ExitCode.Success
}
After:
class BillProcessor(
  firestore: Firestore,
  api: LegislativeBillsApi[IO],
  logger: Logger,
) {
  def execute: IO[Seq[WriteResult]] = ...
}

object BillIdentifierApp extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = for {
    config <- ConfigLoader.LoadConfig(args)
    firestore <- FirestoreScala[IO](config.projectId).InitializeFirestore()
    api <- LegislativeBillsApi[IO](config.apiKey, config.pageSize)
    processor = new BillProcessor(firestore, api, logger)
    _ <- processor.execute
  } yield ExitCode.Success
}

// in tests:
val processor = new BillProcessor(mockFirestore, mockApi, mockLogger)
processor.execute.unsafeRunSync()
2. Scope helpers as private[app] instead of private.
Methods scoped to private are invisible to tests, even in the same package. Widening to private[app] lets tests in package app call each helper directly without touching the integration wiring. This is the key mechanism. You get unit-level access without making methods part of the public API.
Before:
class BillProcessor(...) {
  private def saveBillsBatch(bills: List[Bill]): IO[Seq[WriteResult]] = ...
  private def convertAndSave(pages: List[Page]): IO[List[Bill]] = ...
}
After:
class BillProcessor(...) {
  private[app] def saveBillsBatch(bills: List[Bill]): IO[Seq[WriteResult]] = ...
  private[app] def convertAndSave(pages: List[Page]): IO[List[Bill]] = ...
}

// in a test file that lives under package repcheck.bills.app:
"saveBillsBatch" should "batch writes and return one WriteResult per bill" in {
  val processor = new BillProcessor(mockFirestore, mockApi, mockLogger)
  val results = processor.saveBillsBatch(sampleBills).unsafeRunSync()
  results.size shouldBe sampleBills.size
}
3. Add a package declaration to any file that lacks one, before widening scope.
private[app] only works if the file actually sits in package app. Older files sometimes lacked the declaration entirely.
Before:
// no package line at the top of the file
import cats.effect.IO
class BillProcessor(...) { ... }
After:
package repcheck.bills.app
import cats.effect.IO
class BillProcessor(...) { ... }
4. Extract each multi-line for-comprehension RHS into its own named method.
logOffset, saveBillsBatch, convertAndSave, logBatchProgress. Each one is now independently testable. Each one has its own test case. Each one has a name that appears in test output, making failures legible.
Before (one fused for with the work inline):
def execute: IO[Seq[WriteResult]] = for {
  offset <- IO(Instant.now().minus(lookbackDays, ChronoUnit.DAYS))
  _ <- logger.info(s"Starting from offset $offset")
  pages <- api.fetchBills(offset)
  bills <- IO.fromEither(pages.traverse(_.toDO.left.map(BillConversionFailed(_))))
  results <- firestore.batchWrite(bills)
  _ <- logger.info(s"Saved ${bills.size} bills")
} yield results
After (each sub-step is a named, callable, testable method):
def execute: IO[Seq[WriteResult]] = for {
  offset <- lookbackStart
  _ <- logOffset(offset)
  pages <- api.fetchBills(offset)
  bills <- convertAndSave(pages)
  results <- saveBillsBatch(bills)
  _ <- logBatchProgress(bills.size)
} yield results

private[app] def lookbackStart: IO[Instant] =
  IO(Instant.now().minus(lookbackDays, ChronoUnit.DAYS))

private[app] def logOffset(o: Instant): IO[Unit] =
  logger.info(s"Starting from offset $o")

private[app] def convertAndSave(pages: List[Page]): IO[List[Bill]] =
  IO.fromEither(pages.traverse(_.toDO.left.map(BillConversionFailed(_))))
5. For IOApp run methods specifically: move ALL pipeline logic into a companion object method that accepts factory functions.
This last step is the one we had not fully articulated before. Even after extracting BillProcessor, the run() method still contained lines like:
val firestoreProjectId = sys.env.getOrElse("FIRESTORE_PROJECT_ID", "votr-421801")
db <- FirestoreScala[IO](firestoreProjectId).InitializeFirestore()
api <- LegislativeBillsApi[IO](config.apiKey, config.pageSize)
Those lines cannot be covered by unit tests. They require real GCP infrastructure. So we moved them too, but instead of creating another class, we created a private[app] companion object method (BillProcessor.run) that accepts factory functions:
private[app] def run(
  args: List[String],
  configLoader: List[String] => IO[BillIdentifierConfig],
  firestoreInit: IO[Firestore],
  apiFactory: BillIdentifierConfig => IO[LegislativeBillsApi[IO]],
  logger: Logger,
): IO[Seq[WriteResult]]
Tests inject stub factories that return mocks. Production code passes the real implementations. The orchestration logic (“load config, init firestore, create API, run pipeline”) is now fully unit-testable without touching a single cloud service.
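To make the injection concrete, a test against the companion method could look roughly like this. This is a sketch, assuming the same ScalaTest style as the saveBillsBatch example above; stubConfig, mockFirestore, mockApi, and mockLogger are hypothetical fixtures, not names from the real spec.

// Hypothetical test sketch: every factory is a stub, so no GCP service is touched.
"BillProcessor.run" should "load config, build dependencies, and execute the pipeline" in {
  val stubConfigLoader: List[String] => IO[BillIdentifierConfig] = _ => IO.pure(stubConfig)
  val stubFirestoreInit: IO[Firestore] = IO.pure(mockFirestore)
  val stubApiFactory: BillIdentifierConfig => IO[LegislativeBillsApi[IO]] = _ => IO.pure(mockApi)

  val results = BillProcessor
    .run(List.empty, stubConfigLoader, stubFirestoreInit, stubApiFactory, mockLogger)
    .unsafeRunSync()

  // assuming mockApi is stubbed to return no pages, the pipeline produces no writes
  results shouldBe empty
}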
What is left in BillIdentifierApp.run after all of this:
override def run(args: List[String]): IO[ExitCode] = {
  val firestoreProjectId: String = sys.env.getOrElse("FIRESTORE_PROJECT_ID", "votr-421801")
  BillProcessor
    .run(
      args,
      ConfigLoader.LoadConfig,
      FirestoreScala[IO](firestoreProjectId).InitializeFirestore(),
      config => LegislativeBillsApi[IO](config.apiKey, config.pageSize),
      logger,
    )
    .as(ExitCode.Success)
}
Pure wiring. No domain logic. Nothing testable remains in the entry point because there is nothing to test. It is just dependency construction and delegation.
At this point we can safely add BillIdentifierApp to the coverage-exclusion list. Not because we want to hide anything, but because we have proven by construction that there is nothing left to cover. Every branch that could fail, every line that could be wrong, now lives inside BillProcessor and has a test case. The only thing in the App is the line that reads an env var and the call that hands control to BillProcessor.run. Excluding it is an honest statement about what the file is, not a workaround for what it is not.
The Coverage Policy We Codified
The refactoring raised a question we had not had to answer before. Is there ever a legitimate case for a coverage exclusion?
Yes. Exactly one.
An App or IOApp class that has been fully reduced to infrastructure wiring (env var reads, dependency construction, delegation) contains no domain logic and cannot be tested without live cloud infrastructure. Excluding it from coverage is honest. You are acknowledging that this tiny shell of a class is a wiring harness, not application code, and its behavior is validated by the classes it delegates to.
Here is the rule that was added to our CLAUDE.md, verbatim, so other people can copy it into their own:
Coverage: All newly created or changed code must have test coverage above 90% (enforced by Codecov patch coverage on PRs). Run `sbt coverage test coverageReport` locally to verify before pushing. Never add files to the `ignore` list in `codecov.yml` to work around missing coverage — instead, use the testability refactoring pattern below.
The test for “is this really just wiring” is simple. Could you move any remaining line into a testable class? If yes, move it. If all you are left with is sys.env reads and object construction, you are done.
BillIdentifierApp now meets that bar. It is excluded. Everything it used to do is in BillProcessor, tested, and passing.
The Test Count
The refactoring added 8 new test cases to BillProcessorSpec:
- `BillProcessor.lookbackStart`: time bounds check (moved from App)
- `BillProcessor.currentTime`: time bounds check (moved from App)
- `BillProcessor.execute`: empty page case, page-with-bills case
- `BillProcessor.run` (companion): empty result, result with bills, config failure propagation
Total: 33 tests passing across the bill-identifier and gov-apis modules. Patch coverage well above 90%.
What I Took Away
The agent is not the variable. The prompt is.
When I hand off a task now, the question I ask myself first is not “does the agent have enough context?” It is “does the prompt require the agent to prove its work?”
If it does not, the session produces plausible output that still needs a human to run the actual tests. The time you think you are saving disappears into the debugging session on the other side.
I am planning to add this as an explicit rule to CLAUDE.md. Do not consider a task complete until sbt test has passed in the current session. The agent runs the checks, fixes the failures, and re-runs until the suite is green. No silent handoffs.
A Storage Architecture Question We Did Not Expect
While the coverage work was underway, a different conversation was happening in parallel. It may end up being a bigger change than any of the refactoring.
The original system design stores legislative data (bills, votes, members) in Firestore and user data (profiles, preferences) in Cloud SQL via Doobie. The scoring engine bridges the two. It reads a user’s preference profile, reads the bill analysis results, and computes an alignment score.
That works fine on paper. But as we started thinking through what “compute an alignment score” actually means at implementation time, the gap in the design became obvious.
Alignment scoring is not a lookup. It is a similarity computation. A user’s political profile is not a set of key-value pairs you match against a bill’s tags. It is a nuanced set of stances across multiple issue areas. A bill is not a simple record with a label. It is a multi-dimensional analysis result. To score them against each other meaningfully, you need embeddings: numerical representations of both that can be compared geometrically.
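For a sense of what "compared geometrically" means, here is a minimal sketch of cosine similarity between two embedding vectors. It is illustrative only; in the real system the embeddings would come from an embedding model and the comparison would run inside the database, not in application code.

// Illustrative only: cosine similarity between two embeddings of the same dimension.
// A value near 1.0 means the vectors point the same way; near 0.0 means they are unrelated.
def cosineSimilarity(a: Vector[Double], b: Vector[Double]): Double = {
  require(a.length == b.length, "embeddings must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}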
Firestore cannot do that. It is a document store. You can retrieve documents from it efficiently, but you cannot ask it “find me the bills whose embeddings are closest to this user profile vector.” For that, you need a vector database or a database with vector search built in.
The obvious Google-native answer is Vertex AI Vector Search (formerly Matching Engine), a managed service for large-scale approximate nearest-neighbor search. The architecture would be: store documents in Firestore, store embeddings in Vertex AI, query both when scoring.
That is two systems, two billing models, two clients to maintain, and a synchronization problem. Every time a bill is analyzed, you write the result to Firestore and index the embedding in Vertex AI. Every time a user profile changes, you re-index their embedding too. Two sources of truth that need to stay in sync.
AlloyDB is a different answer.
AlloyDB is Google’s PostgreSQL-compatible managed database, built on PostgreSQL but with significantly higher throughput than Cloud SQL and native support for the pgvector extension. With pgvector, you store embeddings as a column in a regular table and run similarity queries in SQL:
SELECT bill_id, analysis_summary,
embedding <=> $userProfileEmbedding AS distance
FROM bill_analyses
ORDER BY distance
LIMIT 20;
That is a cosine similarity search across all bill embeddings, in a single query, returning ranked results alongside the full analysis data. No second system. No synchronization. No separate vector index to maintain.
It also fits the existing architecture better than Vertex AI does. The user data is already in a PostgreSQL-compatible database (Cloud SQL). We are already using Doobie for typed SQL access. AlloyDB is wire-compatible with PostgreSQL, so the migration from Cloud SQL is substantially simpler than introducing an entirely new managed service like Vertex AI Vector Search.
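Since the codebase already uses Doobie, the same similarity query could plausibly be expressed as a typed SQL fragment. This is a sketch under assumptions: the bill_analyses table, its column names, and the text-literal encoding of the embedding are placeholders rather than a finalized schema, and it presumes the pgvector extension is installed.

import doobie._
import doobie.implicits._

// Hypothetical Doobie version of the similarity query above.
// The embedding is bound as a text literal like "[0.1,0.2,...]" and cast to pgvector's vector type.
def nearestBills(userEmbedding: Vector[Float], limit: Int): Query0[(String, String, Double)] = {
  val embeddingLiteral = userEmbedding.mkString("[", ",", "]")
  sql"""
    SELECT bill_id, analysis_summary,
           embedding <=> $embeddingLiteral::vector AS distance
    FROM bill_analyses
    ORDER BY distance
    LIMIT $limit
  """.query[(String, String, Double)]
}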
The trade-off is scale. Vertex AI Vector Search is engineered for billions of vectors with millisecond latency. AlloyDB with pgvector is efficient up to tens of millions of rows with the right indexing (ivfflat or hnsw). RepCheck’s realistic scale (roughly 15,000 to 30,000 bills per Congress, millions of users at ambitious projections) fits comfortably within what AlloyDB can handle. We are not Google Search. Vertex AI’s scale guarantees are more than we need.
The working hypothesis: consolidate bill analysis data, user profiles, embeddings, and the similarity search into AlloyDB. Drop the Firestore dependency for the scoring path entirely. Keep Firestore only for the raw legislative data ingestion pipeline where it currently works well.
What still needs evaluation is cost. AlloyDB is not cheap. A minimum viable cluster starts around $100 to $150 per month before storage and I/O, compared to Firestore’s pay-per-operation model which can be very low at low traffic. At RepCheck’s expected scale and query patterns, the monthly AlloyDB cost needs to be modeled against what Firestore plus Vertex AI Vector Search would actually run. The architecturally cleaner answer is not always the economically sensible one.
This decision is not made yet. It is the kind of question that belongs in the acceptance criteria work (Gap #9), because “how does scoring work” is exactly what acceptance criteria are supposed to define. But the direction is clear enough that it is worth naming. We think Firestore is the wrong store for the scoring path, AlloyDB is probably the right one, and we need to run the numbers before committing.
Where We Are Now
The pipeline is in better shape than it was before the overnight session, because fixing the overnight session’s mistakes forced us to articulate patterns we had been applying intuitively.
The testability refactoring pattern is now written down. The coverage policy is explicit. The PR that was stuck at 66% coverage passed CI. And we have a cleaner BillProcessor architecture that makes every sub-operation independently testable.
Gap #9 (acceptance criteria per component) is still on the list. The overnight experiment delayed it, but it also gave us something arguably more valuable. A concrete, repeatable pattern for making any entry point fully testable, and a documented policy for the one case where coverage exclusion is legitimate.
That is worth a sleepless night of debugging.
Project Repositories
All code for RepCheck is on GitHub:
- votr: Main monorepo. Pipelines, migrations, infrastructure code, acceptance criteria, and this blog
- repcheck-shared-models: Shared models library. DTOs, domain objects, Circe codecs, Doobie codecs
- repcheck-pipeline-models: Pipeline models library. Events, workflow schemas, error handling, configuration
- repcheck-ingestion-common: Ingestion common library. API client, XML parsing, change detection, event publishing, repository base, placeholders, execution helpers, structured logging
- repcheck-g8: Giter8 template for scaffolding new RepCheck Scala repositories
- tf-repcheck-infra: Terraform infrastructure-as-code for GCP (dev/staging/prod)
This is part of an ongoing series documenting RepCheck’s development. Previous posts: Introducing RepCheck | Building Agent-Ready Context | Token Costs and Template Architecture | Closing the Gaps | Almost Ready