How much work does it actually take before an AI agent can write good code independently? A lot more than you think. This is a practical guide based on a real project — still in progress.
What This Article Is
This is not a success story. This is a field report from the early stages of a multi-phase project called RepCheck — a platform designed to help citizens understand how their legislators vote relative to their personal interests.
The goal was to build enough architectural context — design documents, code pattern references, compile-time enforcement — so that AI coding agents could eventually implement components independently, across multiple repositories, without constant hand-holding.
We are not there yet. But the process of getting closer to that goal revealed just how much deliberate, human-driven effort is required before an agent can produce anything resembling “good” code on its own.
This article walks through exactly what we did, step by step, so you can replicate it.
Step 1: Establish What Exists Before You Design Anything
Before any design work began, the first task was having the agent read through the entire existing codebase and describe what it found. This seems obvious, but it matters because the agent needs to ground its future suggestions in what actually exists — not what it imagines a project might look like.
The existing codebase was a Scala 3 prototype: a batch pipeline that fetches U.S. Congressional bills from the Congress.gov API and stores them in Firestore. Two SBT modules (bill-identifier and gov-apis), Cats Effect IO, Http4s, Circe, FS2 streaming, PureConfig, Firebase Admin SDK.
This baseline scan is critical. Without it, the agent will suggest patterns that conflict with your existing code, recommend libraries you’ve already chosen against, or propose architectures that don’t account for decisions you’ve already made.
Takeaway: Always start by having the agent read and summarize your codebase. Don’t assume it knows what’s there.
Step 2: Design Through Structured Q&A, Not Free-Form Prompting
The system design document was not produced by saying “design me a system.” It was produced through an extended question-and-answer process where I described what I wanted at a high level, and the agent asked increasingly specific questions to pin down the details.
Here’s what that actually looked like:
The opening prompt:
“I want you to ask me questions, and work on a design mermaid diagram with a markdown explaining how each component behaves. I want to then use those diagrams as the context for coding agents to start implementing the different behaviors.”
This told the agent three things: (1) ask questions, don’t assume, (2) produce visual diagrams plus written specs, and (3) the audience is other agents, not humans browsing docs.
What the agent asked:
The agent came back with foundational questions about:
- What data sources exist (Congress.gov endpoints for bills, votes, members, amendments)
- How data flows through the system (ingestion → analysis → scoring → presentation)
- What the user-facing behavior should be (personalized alignment scores)
- Where data should live (Firestore for legislative data, Cloud SQL for user profiles)
- What triggers processing (Pub/Sub events vs cron schedules)
Each question I answered generated follow-up questions. My answer about wanting “LLM-powered bill analysis” led to questions about which LLMs, how prompts are structured, whether analysis runs per-bill or in batches, and how results are stored.
Takeaway: The Q&A format forces you to make decisions you’d otherwise defer. The agent surfaces questions you haven’t considered — but only if you tell it to ask rather than assume.
Step 3: Challenge Every Default the Agent Proposes
The agent will propose reasonable defaults. Your job is to interrogate them. Here are real examples where pushing back on defaults produced better architecture:
Example: Pub/Sub event pruning
The agent’s initial event catalog included ten events:
bill.created, bill.updated, bill.text.available, vote.recorded, member.synced, member.updated, amendment.added, analysis.completed, scoring.completed, user.profile.updated
My pushback:
“I don’t think that we need events for actions that don’t trigger any subsequent events. For example bill.created/bill.updated don’t result in any event being triggered. We should only fire pub sub events for behaviors that trigger downstream action.”
This eliminated six events. The final catalog has only four:
- bill.text.available → triggers LLM analysis
- vote.recorded → triggers scoring
- analysis.completed → triggers scoring
- user.profile.updated → triggers scoring
The agent’s default was technically correct — you could fire events for everything. But the principle of “only fire events with downstream consumers” is a design decision that reduces system complexity. The agent didn’t volunteer that principle. I had to state it.
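For illustration, the pruned catalog maps naturally onto a small event ADT. The type and field names below are assumptions for the sketch, not the project’s actual definitions.

```scala
// Hypothetical sketch of the four remaining events (names and fields assumed)
sealed trait RepCheckEvent

object RepCheckEvent:
  final case class BillTextAvailable(billId: String)  extends RepCheckEvent // triggers LLM analysis
  final case class VoteRecorded(voteId: String)       extends RepCheckEvent // triggers scoring
  final case class AnalysisCompleted(billId: String)  extends RepCheckEvent // triggers scoring
  final case class UserProfileUpdated(userId: String) extends RepCheckEvent // triggers scoring
```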
Example: Prompt engine content location
The agent initially proposed that base prompt fragments would be defined in Scala code, with additional fragments loaded from GCS at runtime. My correction:
“Even the initial prompt blocks we create should be created as GCS entities.”
One sentence. But it fundamentally changed the prompt engine architecture. Instead of a hybrid code-plus-config system, the prompt engine repos became purely loaders and assemblers — zero prompt content in application code. All prompt tuning becomes a GCS operation. No redeployment needed to adjust how the LLM interprets bills.
Example: Missing dependency awareness
The agent drew a dependency graph for repcheck-llm-analysis that only included prompt-engine-bills, pipeline-models, and shared-models. I caught it:
“The llm-analysis is missing all the other gov ingested things like members, votes, and amendments. The scoring engine likely has the same requirements.”
The analysis pipeline can’t analyze a bill in isolation — it needs member context, voting records, amendment history. The agent’s graph was structurally valid but functionally incomplete. This is the kind of error that would cascade into broken implementations if not caught at the design stage.
Takeaway: The agent optimizes for structure. You optimize for behavior. Every default deserves a “why?”
Step 4: Make Every Architectural Decision Explicit Through Specific Questions
After the system design was drafted, we moved to defining Scala code patterns. This was the most intensive Q&A phase — 21 questions, each requiring a specific architectural choice. Here’s what that looked like in practice:
Q8: Where do DTOs and DOs live?
The agent presented three options:
- A) ApiDTOs and DbDTOs in the consuming repo
- B) All DTOs in shared-models alongside DOs
- C) ApiDTOs in consuming repo, DbDTOs in shared-models
My answer:
“DTO and DOs should live in the same repo as the service that uses them but should be separated into its own project within the repo and that project should be published as a library for use by other repositories.”
This created the models/ + app/ sub-project pattern:
repcheck-data-ingestion/
├── models/ ← published as "repcheck-data-ingestion-models"
│ ├── api/dtos/
│ ├── db/dtos/
│ └── domain/
└── pipelines/ ← application code, depends on models/
Other repos can depend on repcheck-data-ingestion-models to read bill types without pulling in the ingestion pipeline code.
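As a rough sketch of what this split could look like in build.sbt (project names and settings here are assumptions, not the repo’s actual build file):

```scala
// Hypothetical build.sbt shape for the models/ + app/ split (names assumed)
lazy val models = (project in file("models"))
  .settings(
    name := "repcheck-data-ingestion-models" // published so other repos can depend on the types alone
  )

lazy val pipelines = (project in file("pipelines"))
  .dependsOn(models)
  .settings(
    name           := "repcheck-data-ingestion-pipelines",
    publish / skip := true // application code stays unpublished
  )

lazy val root = (project in file("."))
  .aggregate(models, pipelines)
  .settings(publish / skip := true)
```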
Q9: Error handling — hierarchy or flat?
The agent offered:
- A) Sealed hierarchies per domain for pattern matching
- B) Flat standalone case classes extending Exception
My answer:
“Let’s go with the flat option. Since each ingestion should be its own application and project. The level of BillIngestion is implied by the executing application in the system.”
This meant error types stay generic and contextless — FetchFailed, DecodeFailed, PersistFailed — because the application name already tells you what was being processed. The same error classes work across all four ingestion pipelines.
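A minimal sketch of those flat error types, using the class names from above but with assumed fields:

```scala
// Flat, standalone errors extending Exception directly: no per-domain sealed hierarchy.
// The executing application (e.g. bill ingestion) supplies the missing context.
final case class FetchFailed(url: String, cause: Throwable)
    extends Exception(s"Fetch failed for $url", cause)

final case class DecodeFailed(detail: String)
    extends Exception(s"Decode failed: $detail")

final case class PersistFailed(documentId: String, cause: Throwable)
    extends Exception(s"Persist failed for $documentId", cause)
```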
Q10: How should batch processing failures be handled?
The agent presented three patterns. My answer introduced constraints the agent hadn’t considered:
“Let’s go with the ProcessingResult approach. However, it is important to note that we need to move into streaming individual items, and writing them out immediately and not storing them in memory. So the result should be stored somewhere and then we move on to the next item while letting go of all the things we knew about the previous items in memory. We may need to write an aggregator after all processing is done to summarize the results, and leaving out memory heavy aspects of the results.”
The agent had offered ProcessingResult as an in-memory accumulator. My answer turned it into an externalized, stream-and-forget pattern:
Stream item → Process → Write ProcessingResult to Firestore → Release from memory
↓
After stream completes → Aggregator reads results → Writes PipelineRunSummary
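In FS2 terms, the shape is roughly the following. This is a sketch under assumed type and function names, not the project’s actual pipeline code:

```scala
import cats.effect.Async
import cats.syntax.all._
import fs2.Stream

final case class ProcessingResult(itemId: String, succeeded: Boolean, error: Option[String])

// Process each item, persist its result immediately, and keep nothing in memory;
// a separate aggregator summarizes the persisted results afterwards.
def runPipeline[F[_]: Async, A](
    items: Stream[F, A],
    process: A => F[ProcessingResult],
    persistResult: ProcessingResult => F[Unit], // e.g. a Firestore write
    aggregate: F[Unit]                          // reads persisted results, writes a PipelineRunSummary
): F[Unit] =
  items
    .evalMap(process)
    .evalMap(persistResult)
    .compile
    .drain *> aggregate
```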
Q13: Tagless final or concrete IO?
Three options: tagless for libraries / IO for apps, concrete IO everywhere, or tagless everywhere.
My answer: tagless everywhere. Two words. But it has implications for every trait, every service interface, every type signature in the project.
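Concretely, “tagless everywhere” means every service is written against an abstract effect. A sketch with assumed names:

```scala
import cats.effect.Async

final case class Bill(id: String, title: String)

// Abstract over the effect type rather than committing to IO
trait BillRepository[F[_]]:
  def fetch(billId: String): F[Option[Bill]]
  def save(bill: Bill): F[Unit]

// Implementations pick up only the capabilities they need (here, Async for the Firestore SDK)
final class FirestoreBillRepository[F[_]: Async] extends BillRepository[F]:
  def fetch(billId: String): F[Option[Bill]] = Async[F].pure(Option.empty[Bill]) // stubbed for the sketch
  def save(bill: Bill): F[Unit]              = Async[F].unit
```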
Q14: ID strategy
Three options: natural keys everywhere, generated UUIDs everywhere, or a hybrid.
My answer: natural keys for legislative data, generated IDs for RepCheck-specific entities. Congress.gov already has stable identifiers (bill IDs, bioguide IDs). There’s no reason to generate our own for data that comes with keys. But user profiles, alignment scores, and pipeline runs need generated IDs because they’re our domain.
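As a small illustration, the hybrid could be expressed with Scala 3 opaque types; the names and formats below are assumptions:

```scala
import java.util.UUID

object Ids:
  // Natural key: Congress.gov already issues stable bill identifiers, so reuse them as-is
  opaque type BillId = String
  object BillId:
    def fromCongressGov(raw: String): BillId = raw

  // Generated key: alignment scores are RepCheck's own domain, so RepCheck mints its own IDs
  opaque type AlignmentScoreId = UUID
  object AlignmentScoreId:
    def random: AlignmentScoreId = UUID.randomUUID()
```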
Q15: Prompt engine — how should fragments be represented?
I chose composable, serializable PromptFragment traits with a PromptBuilder that assembles them by priority. But I added a critical requirement:
“Let’s ensure that we can serialize and deserialize them. Additionally, we want to be able to add any number of prompt additions to the chain from GCS.”
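A sketch of the fragment/builder idea, simplified to a single concrete fragment type rather than a full trait hierarchy; every name here is an assumption:

```scala
import io.circe.Codec

// Serializable so fragments can live in GCS as JSON and be loaded at runtime
final case class PromptFragment(id: String, priority: Int, text: String) derives Codec.AsObject

final case class PromptBuilder(fragments: List[PromptFragment]):
  // Any number of GCS-loaded additions can be appended to the chain
  def add(extra: List[PromptFragment]): PromptBuilder =
    copy(fragments = fragments ++ extra)

  // Assemble by priority into the final prompt text
  def build: String =
    fragments.sortBy(_.priority).map(_.text).mkString("\n\n")
```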
Q16: LLM client abstraction
The agent offered direct SDK integration vs. an abstracted client. My answer created a three-layer architecture:
“Direct SDK integration, however, we will create an abstraction that enables us to create our prompt and questions first. Then we will have functions that are part of the abstraction that are pluggable to convert the abstraction into a Claude or GPT specific DTO.”
This produced:
- Vendor-neutral types — LlmRequest, LlmResponse (in repcheck-llm-client/models/)
- Pluggable adapters — ClaudeAdapter, GptAdapter (in repcheck-llm-client/adapters/)
- Prompt engines — build LlmRequest without knowing which vendor will execute it
Then I added another requirement the agent hadn’t anticipated:
“We will potentially send the same prompt and requests to multiple LLMs to be able to give users different insights.”
This added an LlmDispatcher that fans out to multiple adapters and a provider field on every response. Analyses are now keyed by (billId, provider) instead of just billId.
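Putting those pieces together, here is a sketch of the three layers plus the fan-out dispatcher, with names assumed to be consistent with the decisions above:

```scala
import cats.effect.Async
import cats.syntax.all._

// Vendor-neutral types: prompt engines build these without knowing who executes them
final case class LlmRequest(prompt: String, questions: List[String])
final case class LlmResponse(provider: String, content: String)

// Pluggable adapter: converts the neutral request into a Claude- or GPT-specific call
trait LlmAdapter[F[_]]:
  def provider: String
  def complete(request: LlmRequest): F[LlmResponse]

// Fans the same request out to every configured provider;
// downstream, analyses are keyed by (billId, provider)
final class LlmDispatcher[F[_]: Async](adapters: List[LlmAdapter[F]]):
  def dispatch(request: LlmRequest): F[List[LlmResponse]] =
    adapters.traverse(_.complete(request))
```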
Q17: Where does configuration live?
My answer split configuration by concern:
“User specific preferences related to prompt configuration should be stored in Cloud SQL. However, in general, prompt configuration should all live in GCS and be dynamically read as the analysis application needs it.”
Q19: Cloud SQL client library
When the agent asked about Doobie vs. Skunk vs. raw JDBC, I was honest:
“I’m not familiar with Doobie or Skunk. Provide me with more context.”
The agent gave me a focused comparison with code examples, pros/cons, and a recommendation. I chose Doobie in under a minute. Don’t pretend to know things you don’t — the agent can educate you quickly, and you make a better decision with context.
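For flavor, here is a minimal Doobie query in the tagless style; the table, columns, and types are hypothetical, not the project’s schema:

```scala
import cats.effect.Async
import doobie._
import doobie.implicits._

final case class UserProfile(id: String, displayName: String)

// Look up a single user profile by ID, running the query through the given transactor
def findProfile[F[_]: Async](xa: Transactor[F], id: String): F[Option[UserProfile]] =
  sql"select id, display_name from user_profiles where id = $id"
    .query[(String, String)]
    .map { case (pid, name) => UserProfile(pid, name) }
    .option
    .transact(xa)
```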
Q21: Should every repo have the models/app split?
“Only use it when it is appropriate.”
Not every repo needs sub-projects. repcheck-shared-models is already purely a library. repcheck-api-server is purely an app. The split only matters when a repo has both publishable types AND application code.
Takeaway: Each of these 21 questions required a specific human decision. The agent can present options, explain trade-offs, and implement your choice. But it cannot make these decisions for you, and if you skip them, the resulting code will be generic rather than tailored to your system.
Step 5: Build Enforcement Before Implementation
After defining all the patterns, the natural question was:
“Is there a way to ensure that we enforce these patterns through tooling that errors on compile?”
This produced:
- WartRemover — 11 error rules: no null, var, .get, .head, asInstanceOf, isInstanceOf, mutable collections, return, Try.get, or String + Any (see the sketch after this list)
- Scalafix — import ordering rules (java → scala → cats → circe → http4s → fs2 → google → project)
- tpolecat — strict compiler flags with -Xfatal-warnings
- GitHub Actions CI — runs compile, scalafix check, and tests on every PR
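As a sketch, the WartRemover slice of that configuration might look like this in build.sbt. I am mapping the banned constructs onto wart names, so treat the exact set as an assumption:

```scala
// Approximate wartremover configuration for the rules listed above (exact set assumed)
wartremoverErrors ++= Seq(
  Wart.Null,                  // no null
  Wart.Var,                   // no var
  Wart.OptionPartial,         // no Option#get
  Wart.IterableOps,           // no .head and other partial collection ops
  Wart.AsInstanceOf,
  Wart.IsInstanceOf,
  Wart.MutableDataStructures, // no mutable collections
  Wart.Return,                // no return
  Wart.TryPartial,            // no Try#get
  Wart.StringPlusAny          // no String + Any
)
```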
Then we ran it. And it immediately found violations in the existing code:
- FormatType.scala had a bare throw new IllegalArgumentException — WartRemover flagged it
- PagingApiBase.scala had a silent HTTPS fallback — should have been a raised error
- Import ordering was wrong across nearly every file
- The CI workflow didn’t have SBT installed on the runner
Each of these required fixes. Some were straightforward (run sbt scalafixAll to auto-fix imports). Some required design decisions (should FormatType.fromString return Either or Option?). The agent made the fixes, but I had to verify them.
For example, the agent changed PagingApiBase to silently fall back to HTTPS when it encountered an unknown protocol. I had to question that:
“Educate me on how this change is the same behavior?”
It wasn’t the same behavior. The original threw an exception. The agent’s “fix” silently swallowed a potential misconfiguration. We corrected it to Async[F].raiseError(InvalidProtocol(...)) — fail fast through the effect system, which matches our established error handling pattern.
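The corrected shape, roughly; InvalidProtocol comes from the actual fix, while the method name and surrounding structure are illustrative:

```scala
import cats.effect.Async

final case class InvalidProtocol(protocol: String)
    extends Exception(s"Unknown protocol: $protocol")

// Fail fast through the effect system instead of silently defaulting to HTTPS
def resolveProtocol[F[_]: Async](raw: String): F[String] =
  raw.toLowerCase match
    case "http"  => Async[F].pure("http")
    case "https" => Async[F].pure("https")
    case other   => Async[F].raiseError(InvalidProtocol(other))
```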
Takeaway: Enforcement tooling catches violations from both humans and agents. But setting it up requires running it, reading the failures, and making judgment calls about what the fixes should be. The agent can configure the tools, but you verify the configuration catches what it should.
Step 6: Manage Git Workflow Actively — The Agent Will Get Lost
This was a recurring friction point throughout our sessions. The agent does not naturally track which branches are active, which PRs are merged, or where it should be working. Real examples:
Pushing to a merged PR’s branch:
After PR #8 was merged, I asked the agent to commit new changes. It pushed to claude/enforcement-tooling — the branch from the already-merged PR. I had to redirect:
“PR #8 is merged so we need a new branch and PR.”
Not tracking base branches:
The project uses feature/createDesignPlan as the integration branch, not main. Every new branch needs to be based on feature/createDesignPlan and every PR needs to target it. The agent sometimes needed reminding.
Encoding issues surfacing on GitHub: At one point the README was saved in UTF-16 encoding, causing GitHub to display “Binary file not shown.” I noticed this on GitHub — the agent didn’t. Later, the same README showed blank in IntelliJ’s preview (turned out to be an IDE cache issue, not a real problem).
CI failures from missing tools:
The GitHub Actions workflow failed because the Ubuntu runner didn’t have SBT installed. The agent had written run: sbt compile assuming SBT would be available. It required adding setup-java and setup-sbt actions.
Takeaway: Treat the agent like a contributor who doesn’t read PR status emails. Tell it explicitly: what branch to work from, what branch to target, whether the previous PR is merged. Check CI results yourself.
Where We Are Now — and What’s Still Missing
After all of this work, we have:
- A system design document with Mermaid diagrams covering 9 repositories
- A 19-section Scala code patterns document covering every major technical decision
- Compile-time enforcement that catches violations before code reaches review
- A CI pipeline that runs on every PR
And we are still at the beginning.
The design doc, when evaluated as an instructional tool for coding agents, has significant gaps:
- No concrete Scala signatures — the patterns doc has code templates, but the design doc describes components in prose. An agent implementing repcheck-data-ingestion would need case class definitions, trait signatures, and FS2 stream type signatures
- No Congress.gov API specs — endpoint URLs, query parameters, response schemas for votes, members, and amendments
- No build.sbt scaffolding — agents need to know exactly what a new repo’s build file looks like
- No error handling specifics — which operations retry, which fail fast, what retry backoff looks like
- No testing guidance — what gets unit tested, what gets integration tested, how to mock Firestore/GCS/Pub/Sub
- No acceptance criteria — how does an agent know when a component is “done”?
- No GCP integration patterns — actual Pub/Sub publisher/subscriber code, GCS read/write operations
- No Docker or deployment templates
Each of these gaps means a coding agent would need to stop and ask clarifying questions — or worse, make assumptions that may not match the intended design.
Filling these gaps is the next phase. It will require the same structured Q&A process described in this article, applied to each component individually.
The Replicable Process
If you want to prepare your own codebase for agent-driven implementation, here’s the process distilled:
Phase 1: Baseline
- Have the agent read and summarize your existing codebase
- Identify what exists vs. what needs to be built
Phase 2: System Design (Q&A)
- Describe your system at the highest level
- Let the agent ask questions — answer every one specifically
- Challenge proposed defaults against your actual requirements
- Iterate until you have component diagrams and event flows
- Apply design principles actively (e.g., “only fire events with downstream consumers”)
Phase 3: Code Patterns (Q&A)
- For every technical decision (error handling, serialization, streaming, storage, testing), have the agent present options
- Choose explicitly — don’t let the agent default for you
- Demand concrete code templates, not prose descriptions
- When you don’t know something, ask the agent to educate you, then decide
Phase 4: Enforcement
- Configure compile-time tools to enforce your patterns
- Run them against existing code
- Fix violations — verify that fixes match your intended patterns, not just that they compile
- Set up CI to run enforcement on every PR
Phase 5: Gap Analysis
- Evaluate your design docs as if you were a coding agent reading them for the first time
- Identify every place where an agent would need to guess or ask
- Fill those gaps with the same Q&A process
Phase 6: Implementation (not yet reached)
- Hand individual components to agents with their specific design docs as context
- Review output against your patterns and enforcement
- Iterate
The Honest Summary
Building sufficient context for AI coding agents to work independently is itself a substantial engineering effort. It requires the same architectural thinking, the same design rigor, and the same attention to detail as building the system yourself — just applied to documentation and tooling instead of application code.
The agent accelerates every step. Questions get surfaced faster. Options get compared faster. Patterns get documented faster. Enforcement gets configured faster. But the decisions, the corrections, and the quality bar remain entirely yours.
We produced 20 merged PRs and 2,500+ lines of architectural context. The implementation hasn’t started yet. That’s not a failure — it’s the reality of what “agent-ready” actually requires.
Project Repositories
All code for RepCheck is on GitHub:
- votr: Main monorepo. Pipelines, migrations, infrastructure code, acceptance criteria, and this blog
- repcheck-shared-models: Shared models library. DTOs, domain objects, Circe codecs, Doobie codecs
- repcheck-pipeline-models: Pipeline models library. Events, workflow schemas, error handling, configuration
- repcheck-ingestion-common: Ingestion common library. API client, XML parsing, change detection, event publishing, repository base, placeholders, execution helpers, structured logging
- repcheck-g8: Giter8 template for scaffolding new RepCheck Scala repositories
- tf-repcheck-infra: Terraform infrastructure-as-code for GCP (dev/staging/prod)