The best guardrails don’t slow the agent down. They prevent it from sprinting confidently in the wrong direction.
Where We Left Off
The previous post covered the 29-table schema design session — the JSONB confrontations, the AlloyDB-to-Cloud-SQL cost reversal, and guiding the agent to: “ask me what to do rather than choosing on your own.”
That post ended with the schema committed and the Firestore-to-PostgreSQL migration underway across three repositories. PR #59 landed the big change: Doobie replacing Firestore in the bill-identifier pipeline. But the PR that passed CI wasn’t done. The SQL was untested against real PostgreSQL, the code still used concrete IO where polymorphic F[_] belonged, and the agent’s test documentation didn’t reflect what we’d actually learned.
This session was about closing those gaps — and discovering, in the process, that agents do not follow instructions reliably when those instructions live as rules in CLAUDE.md; they treat them more like guidelines, to be violated as they choose. Getting true quality requires enforcing process, not just documenting it.
Tagless Final: Removing the Concrete Where It Doesn’t Belong
The bill-identifier pipeline had a common Scala problem: IO everywhere. BillProcessor took IO-typed dependencies. ConfigLoader.LoadConfig returned IO[BillIdentifierConfig]. Every method signature was pinned to a concrete effect type, even though nothing about the logic required it.
This matters for testing. When a class is parameterized on F[_]: Sync instead of hardcoded to IO, you can test it with IO in integration tests and potentially with SyncIO or other lightweight effects in unit tests. More importantly, it’s how the rest of the codebase is designed — tagless final is a universal rule in CLAUDE.md.
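To make the testing benefit concrete, here is a dependency-free sketch. The MiniSync trait, Saver class, and Res type are hypothetical stand-ins for cats.effect.Sync and the real pipeline classes, trimmed down so the sketch runs without cats-effect:

```scala
// Hypothetical, dependency-free stand-in for cats.effect.Sync, only to
// illustrate why parameterizing on F[_] helps testing.
trait MiniSync[F[_]] {
  def delay[A](a: => A): F[A]
  def fromEither[A](e: Either[Throwable, A]): F[A]
}

// Parameterized on F: production would instantiate with IO, while unit
// tests can plug in a trivial Either-based effect with no runtime at all.
class Saver[F[_]](implicit F: MiniSync[F]) {
  def save(url: String): F[String] =
    F.fromEither(Right(s"saved: $url"))
}

object MiniSyncDemo {
  // The "test effect": results are plain Eithers you can assert on directly.
  type Res[A] = Either[Throwable, A]

  implicit val testSync: MiniSync[Res] = new MiniSync[Res] {
    def delay[A](a: => A): Res[A] = Right(a)
    def fromEither[A](e: Either[Throwable, A]): Res[A] = e
  }

  def main(args: Array[String]): Unit =
    println(new Saver[Res].save("hr1319-117")) // prints Right(saved: hr1319-117)
}
```

The point of the sketch: swapping the effect means swapping one implicit instance, not rewriting the class under test.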
The refactoring was mechanical but instructive. Here’s what BillProcessor looked like before:
class BillProcessor(repo: BillRepository[IO], api: LegislativeBillsApi[IO], logger: Logger) {
  def convertAndSave(bill: LegislativeBillDTO): IO[String] =
    IO.fromEither(bill.toDO.left.map(BillConversionFailed(_))).flatMap { billDO =>
      IO.delay(logger.info(s"Saving bill: ${billDO.url}")) *> repo.upsert(billDO)
    }
}
And after:
class BillProcessor[F[_]: Sync](repo: BillRepository[F], api: LegislativeBillsApi[F], logger: Logger) {
  private[app] def convertAndSave(bill: LegislativeBillDTO): F[String] =
    Sync[F].fromEither(bill.toDO.left.map(BillConversionFailed(_))).flatMap { billDO =>
      Sync[F].delay(logger.info(s"Saving bill: ${billDO.url}")) *> repo.upsert(billDO)
    }
}
IO.fromEither becomes Sync[F].fromEither. IO.delay becomes Sync[F].delay. IO.raiseError becomes Sync[F].raiseError. The only place that mentions IO is the application entry point — BillIdentifierApp — where it passes IO as the concrete type:
BillProcessor.run[IO](args, configLoader = ConfigLoader.LoadConfig[IO], ...)
The agent handled this refactoring cleanly. The pattern is well-established, the transformation is regular, and the compiler catches every missed substitution. This is the kind of task where AI agents excel: mechanically consistent transformations across multiple files with a clear rule to follow.
What the agent didn’t do, and what I had to prompt for, was apply the same transformation to the companion object’s run method and to ConfigLoader. Left to its own devices, the agent would have refactored the class and stopped. The full surface area required a nudge.
Lesson: Agents are good at applying a pattern within a scope you define. They’re less good at recognizing that the scope should be wider than the file they’re currently editing. So be sure to define the scope explicitly. Say something like “across the bill-identifier project, change all uses of IO to F[_]”.
Integration Tests: Validating SQL Against Real PostgreSQL
PR #59 had replaced Firestore with Doobie for bill persistence — a significant architectural redirection, but the correct one. Firestore is a good database for fast, small, unstructured data. In RepCheck, though, bills are inherently hierarchical and structured; storing them in Firestore was admittedly a choice I made because it was easy. That was during the days of manual development, when my time was at a premium. Now, with agents handling the most time-consuming work, I can afford to be a perfectionist.
The agent’s newly created SQL compiled. The unit tests passed with mocks. But nobody had run the actual INSERT and SELECT queries against a real PostgreSQL instance with the real schema.
This is the gap that integration tests exist to fill. The project already had the infrastructure — DockerPostgresSpec, a custom trait built two sessions ago that spins up a pgvector/pgvector:pg16 container, applies all Liquibase migrations, and tears it down after the suite. Any module can reuse it via SBT’s cross-project test dependency:
lazy val billIdentifier = (project in file("bill-identifier"))
.dependsOn(govApis, dbMigrations % "test->test")
The "test->test" configuration makes db-migrations’ test classes — including DockerPostgresSpec — available in bill-identifier’s test classpath. You don’t duplicate container management logic. You inherit it.
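As a rough sketch of what that inheritance buys: any suite mixing in the shared trait gets setup and teardown for free instead of duplicating them. The trait below is a hypothetical, dependency-free simplification — the real DockerPostgresSpec uses Testcontainers and Liquibase, not a boolean flag:

```scala
// Hypothetical simplification of the shared-fixture idea behind
// DockerPostgresSpec: suites mix in the trait and inherit lifecycle
// management rather than reimplementing it.
trait ContainerFixture {
  private var running = false

  protected def startContainer(): Unit = { running = true }  // real: start pgvector, apply migrations
  protected def stopContainer(): Unit = { running = false }  // real: tear down the container
  protected def containerRunning: Boolean = running

  // Run a test body with the fixture guaranteed up, then torn down.
  def withContainer[A](body: => A): A = {
    startContainer()
    try body
    finally stopContainer()
  }
}

object FixtureDemo extends ContainerFixture {
  def check(): Boolean = withContainer(containerRunning)

  def main(args: Array[String]): Unit =
    println(check()) // prints true
}
```

The "test->test" SBT configuration is what makes this kind of trait visible across module boundaries without publishing a separate test-support artifact.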
I pointed the agent at the routing table in CLAUDE.md:
“Writing tests for any module” — read test-patterns.md, testing-infrastructure.md, test-templates.scala
The agent read the templates, found the DockerPostgresSpec pattern, and produced a first draft of DoobieBillRepositorySpec with five test cases: insert, persist-and-readback, upsert-on-conflict, no-duplicate-rows, and multiple-distinct-bills.
It failed immediately.
The Unique Constraint Lesson
The first test run produced a PSQLException:
ERROR: duplicate key value violates unique constraint "uq_bills_natural_key"
Detail: Key (congress, bill_type, number)=(117, 1, 0) already exists.
The agent’s test data helper function looked like this:
private def makeBill(billId: String = "hr1319-117"): LegislativeBillDO =
  LegislativeBillDO(congress = 117, billType = BillTypes.HouseBill, ...)
Each test varied billId — the primary key — but left congress, billType, and number at their defaults. The bills table has both a bill_id primary key and a UNIQUE (congress, bill_type, number) composite constraint. The agent didn’t know about the second constraint because it hadn’t read the migration SQL.
The fix was straightforward — expose the constrained columns as parameters:
private def makeBill(
  billId: String = "hr1319-117",
  congress: Int = 117,
  billType: BillTypes = BillTypes.HouseBill,
  title: String = "American Rescue Plan Act of 2021",
): LegislativeBillDO = ...
Each test now varies the natural key columns:
makeBill(billId = "test-1", congress = 100)
makeBill(billId = "test-2", congress = 200, title = "Persisted")
makeBill(billId = "test-3", congress = 300, title = "Original")
This is a pattern, not a one-off fix. When writing integration tests that INSERT into real PostgreSQL tables, your test data must have unique values for all unique constraints on the table — not just the primary key. If you only vary the PK, you’ll get constraint violations that don’t appear in unit tests with mocks.
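One way to make that systematic, rather than relying on each test to pick distinct values, is to derive every constrained column from a single counter. The helper below is hypothetical (not from the repo), with simplified column types, but illustrates the idea:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: every unique-constrained column is derived from
// one counter, so two generated rows can never collide on the primary
// key OR on the (congress, bill_type, number) natural key.
object UniqueBillData {
  private val counter = new AtomicInteger(0)

  // Simplified row shape for illustration; the real code uses LegislativeBillDO.
  final case class BillRow(billId: String, congress: Int, billType: String, number: Int)

  def nextBill(): BillRow = {
    val n = counter.incrementAndGet()
    BillRow(billId = s"test-bill-$n", congress = 100 + n, billType = "hr", number = n)
  }
}
```

With a generator like this, a new unique constraint on the table still requires a code change, but only in one place instead of in every test.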
The agent learned this the hard way. I made sure the lesson went into the documentation — test-patterns.md, testing-infrastructure.md, and test-templates.scala all got updated with explicit constraint awareness guidance. Future agent sessions reading those templates won’t make the same mistake.
Lesson: Agents trust their mocks. Mocks don’t have constraints. Real databases do. If your test templates don’t explicitly call out constraint-aware test data patterns, the agent will produce tests that pass against mocks and fail against PostgreSQL. Adding these lessons to the test documentation has proven, over time, to be an effective fix.
The Routing Table: Making the Agent Self-Check
Halfway through the session, I asked the agent a pointed question:
“How many times did you utilize the routing table in CLAUDE.md?”
The honest answer was: not enough. The routing table had been in CLAUDE.md since the template architecture session, but the agent treated it as optional background context rather than a mandatory checklist.
I updated CLAUDE.md with an explicit enforcement rule:
MANDATORY: You must use the routing table below whenever you are planning a task. Before writing any code, match your task to the closest entry and read every file listed under it. Additionally, every 5th task step during implementation, pause and check whether you need information from the routing table documentation that you have not yet loaded.
Two changes packed into one instruction. First, the routing table is no longer advisory — it’s a pre-work requirement. Second, the periodic re-check prevents the agent from reading the docs at the start and then drifting as the work evolves. By the fifth step of any implementation, new questions have emerged that the initial reading might not have covered.
This is an example of a meta-guardrail: instead of adding a rule about a specific technical decision, you add a rule about how the agent makes decisions. “Read the docs before writing code” is more durable than “use Transactor.fromDriverManager” because it applies to every task, not just database tests. One open question is whether to enforce this with a hook instead. The concern is that a hook might trigger too frequently; if we set one up, we would probably make it a nudge rather than a mandatory command.
Lesson: Don’t just document patterns. Document the process for discovering which patterns apply. A routing table that the agent is required to consult before every task is more valuable than a hundred pages of documentation it might not think to read.
Shell Functions: The Guardrail That Actually Works
Two sessions ago, we added CreatePR and pushToPR to scripts/ci-functions.sh. They wrap git push with mandatory local CI checks: sbt compile, sbt test, sbt scalafmtCheckAll, sbt scalafixAll --check. If any check fails, the push doesn’t happen.
Here’s the thing: they work.
Not “they work” in the sense of “the code runs.” They work in the sense of changing behavior. Before these functions existed, the agent would push code that failed scalafmt in CI. It would push code with import ordering violations. It would push code that compiled locally but hadn’t been tested. Each failure meant a round-trip: push, wait for CI, read the failure, fix, push again.
Now the feedback loop is local and immediate. The agent runs pushToPR, scalafmt fails, the agent fixes the formatting, runs pushToPR again, and pushes clean code. No wasted CI minutes. No PR with a red X that needs to be fixed in a follow-up commit.
This session was the proof. Every push went through pushToPR. Every push passed CI on the first try. Zero failed GitHub Actions runs. The shell functions have eliminated an entire class of friction.
Why do they work better than a CLAUDE.md rule saying “run CI checks before pushing”? Because they’re structural, not advisory. The agent can’t forget to run the checks because the checks are embedded in the only path to pushing. There’s no git push to reach for — there’s only pushToPR, which runs the checks first.
This is the most transferable insight from this project so far: structural guardrails beat documentation guardrails. A rule in a docs file says “you should do X.” A shell function that wraps the dangerous operation with X says “you will do X, or the operation doesn’t happen.”
If you’re working with AI agents and you find yourself writing the same correction repeatedly — “run the tests first,” “check the formatting,” “don’t push to main” — stop writing documentation and start writing wrapper functions. Make the correct behavior the only available behavior.
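The pattern generalizes beyond this repo. Here is a minimal sketch — run_guarded and CHECKS are hypothetical names, and the real pushToPR in scripts/ci-functions.sh hardcodes the sbt checks rather than taking them as configuration:

```shell
# Minimal sketch of a structural guardrail: the guarded operation runs
# only if every pre-check passes. There is no path that skips the checks.
run_guarded() {
  local check
  for check in "${CHECKS[@]}"; do
    if ! $check; then
      echo "pre-check failed: $check -- aborting" >&2
      return 1
    fi
  done
  "$@"  # the guarded operation, e.g. git push
}

# Usage sketch:
# CHECKS=("sbt compile" "sbt test" "sbt scalafmtCheckAll")
# run_guarded git push origin HEAD
```

The key design choice is that the dangerous command is an argument to the wrapper, not a sibling of it: the agent never types git push directly, so the checks cannot be forgotten.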
The Doc Compressor Pipeline
The test documentation updates created a downstream task: the compressed agent docs in .claude/agent-docs/ were now stale. The doc compressor — an SBT task that sends each doc through Claude Haiku for ~30% size reduction — needed to run.
This had been a pain point. The compressor needs an ANTHROPIC_API_KEY environment variable. On Windows with Git Bash, environment variables set in PowerShell aren’t automatically available in the bash session. Every time we needed to run the compressor, we had to manually export the key.
The fix was a new shell function in ci-functions.sh:
runDocCompressor() {
  local threshold="${1:-0.20}"
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    local key
    key="$(powershell.exe -Command \
      "[System.Environment]::GetEnvironmentVariable('ANTHROPIC_API_KEY', 'User')" \
      | tr -d '\r\n')"
    export ANTHROPIC_API_KEY="$key"
  fi
  "$sbt_cmd" "docGenerator/run $threshold"
}
It checks for the key in the bash environment first, then falls back to reading it from PowerShell’s user environment variables. The tr -d '\r\n' strips Windows line endings. One function call, no manual steps.
This is the same principle as pushToPR: wrap a multi-step manual process in a single function that handles the annoying parts automatically. The agent can call runDocCompressor without knowing how API keys work on Windows. The friction disappears.
The compressor regenerated all 38 compressed files — a 30% reduction from 41,880 to 29,278 words. Those files were committed and synced to the G8 template repository via a separate PR.
Cross-Repo Synchronization
The documentation updates touched files that exist in multiple repositories. The main votr repo is the source of truth, but the repcheck-g8 template repo needs copies of:
- All files under docs/
- All compressed files under .claude/agent-docs/
- CLAUDE.md
- scripts/ci-functions.sh
This session, that sync was manual: copy files, commit, push, create PR. While writing this post, the agent suggested automating this with a syncDocsToG8 shell function — and we went ahead and built it. The function lives alongside pushToPR and runDocCompressor in ci-functions.sh:
syncDocsToG8 # uses default sibling path
syncDocsToG8 /path/to/g8 # custom path
It validates both repos exist, copies the four file groups into the G8 template layout (src/main/g8/), creates a timestamped branch, commits, pushes, and opens a PR — all in one call. If the G8 repo is already up to date, it exits cleanly with no commit.
This is the structural guardrail pattern applied to a different problem. The manual sync worked, but it had multiple steps where you could forget something — miss a directory, commit with the wrong message, forget to push. The function makes it one command with one outcome. Another case where the agent identified the friction, and the solution was to eliminate the manual path entirely.
The infrastructure repo (tf-repcheck-infra) doesn’t need doc sync — it’s pure Terraform with no Scala code or agent docs.
What This Session Taught Me About Guardrails
Every session in this series has produced rules. “No JSONB.” “No @nowarn.” “Ask before deciding.” “Run CI before pushing.” They’re all in CLAUDE.md now, and they all help.
But they don’t all help equally.
The rules that work best are the ones the agent can’t accidentally skip. pushToPR isn’t a suggestion — it’s the mechanism. The routing table mandate isn’t “consider reading the docs” — it’s “you must read these specific files before writing code, and re-check every 5 steps.” The tagless final rule isn’t “prefer F[_]” — it’s “use F[_] everywhere except the top-level entry point.”
The rules that work least well are the ones that require the agent to remember to do something at the right moment. “Check for unique constraints in test data” is useful once you’ve been burned. Before that, it’s just another line in a long document.
The progression I’m seeing across these sessions:
- Documentation rules — write it down so the agent knows (necessary but insufficient)
- Routing table rules — force the agent to find the right documentation (better)
- Structural rules — make the wrong thing impossible to do (best)
Each level builds on the previous one. You still need the documentation. You still need the routing table. But wherever you can replace “remember to do X” with “X happens automatically,” do it.
What’s Next
Tomorrow: Gap #9 — acceptance criteria per component. This is the last documented gap from the original project assessment. Acceptance criteria define what “done” looks like for each module — the behavioral contracts that tests should verify, the error handling paths that must exist, the performance characteristics that matter.
It’s the piece that turns templates and patterns into working code. Everything we’ve built so far — the routing table, the test infrastructure, the CI guardrails, the tagless final architecture — is scaffolding. Acceptance criteria are the blueprint that tells the agent what to build on that scaffolding.
Project Repositories
All code for RepCheck is on GitHub:
- votr: Main monorepo. Pipelines, migrations, infrastructure code, acceptance criteria, and this blog
- repcheck-shared-models: Shared models library. DTOs, domain objects, Circe codecs, Doobie codecs
- repcheck-pipeline-models: Pipeline models library. Events, workflow schemas, error handling, configuration
- repcheck-ingestion-common: Ingestion common library. API client, XML parsing, change detection, event publishing, repository base, placeholders, execution helpers, structured logging
- repcheck-g8: Giter8 template for scaffolding new RepCheck Scala repositories
- tf-repcheck-infra: Terraform infrastructure-as-code for GCP (dev/staging/prod)
This is part of an ongoing series documenting RepCheck’s development. Previous posts: Introducing RepCheck | Building Agent-Ready Context | Token Costs and Template Architecture | Closing the Gaps | Almost Ready | Code in the Dark | Designing the Database