700 commits in 6 days: how to trust LLM-generated code
fakecloud went from first commit to 17 AWS services, 1,000+ tests, and 700 commits in 6 days. Almost all of the code was generated by LLMs.
The obvious question: how do you trust any of it?
The problem with vibe coding
Andrej Karpathy coined "vibe coding" in early 2025 — the idea of just accepting whatever the LLM gives you without really looking at it. A year later, the results are in: a CodeRabbit study of 470 GitHub repositories found that AI co-authored code contains 1.7x more bugs than human-written code, with logic errors up 75%, security vulnerabilities 2.74x higher, and concurrency bugs doubled.
That's what happens without guardrails. The answer isn't to stop using LLMs — it's to build the systems that make LLM output trustworthy.
The short answer
Tests. Not unit tests that mock everything — real end-to-end tests that spin up the server, make actual AWS SDK calls, and verify the responses. If the LLM generates something that doesn't match real AWS behavior, a test fails. And if no test would catch it, the missing test is the first bug to fix.
That's the foundation. The rest is process.
The workflow
Here's what building a new feature in fakecloud actually looks like:
1. Evaluate. Start by looking at the current conformance state. What's missing? What's the next highest-value thing to implement? The LLM helps here — it can cross-reference the AWS API models with what's already implemented and suggest what to tackle next.
2. Plan. This is the most important step, and it's mostly human. The LLM proposes an approach, but there's a lot of back and forth. Multiple alternatives get explored. Architecture decisions get debated. The right technical direction matters more than speed — a bad foundation generates bad code forever. This is what some are calling spec-driven development — writing a real plan with goals, constraints, and implementation notes before any code gets generated.
3. Break it down. The work gets split into well-scoped pieces. Each piece includes E2E tests — not just relying on the conformance framework, but dedicated tests that exercise real workflows through the AWS SDK.
4. Implement. The LLM writes the code and the tests, runs them locally, iterates until green. By the time a PR opens, the tests are already passing.
5. CI. The PR hits CI — clippy, fmt, the full test suite. If something fails, the LLM iterates until it's green. No human intervention needed for mechanical fixes.
6. Code review. Two layers: an AI code reviewer catches surface issues first, then a human reviews. Addy Osmani put it well: "AI didn't eliminate code review — it made verification mandatory."
What the human actually reviews
Not every line the same way.
Tests get reviewed exhaustively. Every line. Every assertion. Does this test actually verify the right behavior? Does it match what real AWS would do? Could it pass while the implementation is wrong? The tests are the source of truth for the project, so they have to be right.
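The "could it pass while the implementation is wrong?" question is the concrete one. As a sketch — the queue type and method names here are hypothetical, not fakecloud's actual code — compare an assertion that only checks that something came back with one that pins the behavior the real service exhibits:

```rust
// Hypothetical in-memory queue, used only to illustrate test review;
// this is not fakecloud's actual implementation.
struct Queue {
    messages: Vec<String>,
}

impl Queue {
    fn new() -> Self {
        Queue { messages: Vec::new() }
    }

    fn send(&mut self, body: &str) {
        self.messages.push(body.to_string());
    }

    // Deliberate bug: returns the newest message instead of the oldest,
    // breaking FIFO delivery.
    fn receive(&mut self) -> Option<String> {
        self.messages.pop()
    }
}

// Weak assertion: passes even though delivery order is wrong.
fn weak_test() -> bool {
    let mut q = Queue::new();
    q.send("first");
    q.send("second");
    q.receive().is_some() // only checks that *something* came back
}

// Strong assertion: pins the behavior the real service would show.
fn strong_test() -> bool {
    let mut q = Queue::new();
    q.send("first");
    q.send("second");
    q.receive().as_deref() == Some("first") // oldest message first
}

fn main() {
    assert!(weak_test());    // green despite the bug
    assert!(!strong_test()); // red: the bug is visible
    println!("weak test passed; strong test caught the bug");
}
```

Both tests were "written against" the same implementation; only the second one would survive the review questions above.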
Implementation gets scanned. A top-to-bottom read looking for architectural issues, weird patterns, unnecessary complexity. Sometimes things get sent back for refactoring — DRY violations, odd abstractions, things that don't fit the codebase style.
The mental model: if the tests are comprehensive and correct, the implementation is constrained to be correct. It might not be elegant, but it works. Elegance can be fixed later. Correctness can't be faked.
This aligns with what Gergely Orosz describes as the shift in engineering — from writing every line to "assessing quality and architectural soundness." The engineer becomes a supervisor and decision-maker, not a typist.
Why this only works with real tests
Most codebases test with mocks. The test asserts that function A calls function B with the right arguments. That tells you nothing about whether the actual behavior is correct — it tells you the plumbing is connected.
Mocked tests are particularly dangerous with LLM-generated code because the LLM writes both the implementation and the test. If the implementation has a bug, the LLM will happily write a test that asserts the buggy behavior. The mock doesn't care — it returns whatever it's told to. This is likely a contributor to the 75% increase in logic errors found in AI-generated PRs — the tests pass, but they're testing the wrong thing.
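A minimal sketch of the failure mode, with hypothetical types (none of these names come from fakecloud): the handler has a bug that drops message attributes, but because the mock was configured without attributes in the first place, the mocked test asserts only the plumbing and stays green.

```rust
use std::collections::HashMap;

// Hypothetical message type, for illustration only.
struct Message {
    body: String,
    attributes: HashMap<String, String>,
}

trait Backend {
    fn fetch(&self) -> Message;
}

// Buggy handler: forwards the body but silently drops the attributes
// that a real service would have returned.
fn receive_message(backend: &dyn Backend) -> Message {
    let msg = backend.fetch();
    Message { body: msg.body, attributes: HashMap::new() } // bug
}

// Mock that returns whatever it is told to — here, no attributes.
struct MockBackend {
    canned: String,
}

impl Backend for MockBackend {
    fn fetch(&self) -> Message {
        Message { body: self.canned.clone(), attributes: HashMap::new() }
    }
}

// The mocked test: asserts the plumbing only, so the bug is invisible.
fn mocked_test() -> bool {
    let backend = MockBackend { canned: "hello".to_string() };
    receive_message(&backend).body == "hello"
}

// A backend that behaves like the real service, attributes included.
struct RealishBackend;

impl Backend for RealishBackend {
    fn fetch(&self) -> Message {
        let mut attrs = HashMap::new();
        attrs.insert("SentTimestamp".to_string(), "1700000000".to_string());
        Message { body: "hello".to_string(), attributes: attrs }
    }
}

// Against real behavior, the dropped-attributes bug surfaces.
fn behavior_test() -> bool {
    receive_message(&RealishBackend).attributes.contains_key("SentTimestamp")
}

fn main() {
    assert!(mocked_test());   // green: mock and bug agree with each other
    assert!(!behavior_test()); // red: real behavior disagrees
}
```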
fakecloud's E2E tests use the official aws-sdk-rust crate. They spin up a real fakecloud server, make real API calls, and verify real responses. If CreateQueue followed by SendMessage followed by ReceiveMessage doesn't return the right message with the right attributes, the test fails. No mocks. No fakes. Real behavior.
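The shape of such a test can be shown with nothing but the standard library — a toy line-protocol server standing in for the fakecloud process, and a client doing a real network roundtrip. The real tests speak HTTP through the official aws-sdk-rust client, which this sketch deliberately does not reproduce:

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

// Toy queue server on a real TCP socket; a stand-in for spinning up
// a fakecloud process in an E2E test. Protocol: "SEND <body>" / "RECV".
fn spawn_server() -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    thread::spawn(move || {
        let mut queue: Vec<String> = Vec::new();
        for stream in listener.incoming() {
            let mut stream = stream.unwrap();
            let mut reader = BufReader::new(stream.try_clone().unwrap());
            let mut line = String::new();
            while reader.read_line(&mut line).unwrap() > 0 {
                let cmd = line.trim_end();
                if let Some(body) = cmd.strip_prefix("SEND ") {
                    queue.push(body.to_string());
                    writeln!(stream, "OK").unwrap();
                } else if cmd == "RECV" {
                    let reply = if queue.is_empty() {
                        "EMPTY".to_string()
                    } else {
                        queue.remove(0) // FIFO: oldest message first
                    };
                    writeln!(stream, "{}", reply).unwrap();
                }
                line.clear();
            }
        }
    });
    addr
}

// The E2E test: real connection, real requests, real responses.
fn roundtrip(addr: SocketAddr) -> String {
    let mut stream = TcpStream::connect(addr).unwrap();
    let mut reader = BufReader::new(stream.try_clone().unwrap());
    let mut resp = String::new();

    writeln!(stream, "SEND hello").unwrap();
    reader.read_line(&mut resp).unwrap(); // server acknowledges with "OK"

    writeln!(stream, "RECV").unwrap();
    resp.clear();
    reader.read_line(&mut resp).unwrap();
    resp.trim_end().to_string()
}

fn main() {
    let addr = spawn_server();
    assert_eq!(roundtrip(addr), "hello");
    println!("end-to-end roundtrip verified");
}
```

Nothing here is mocked: if the server mishandled ordering or dropped the message, the assertion fails.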
That's 650+ E2E tests, plus 600+ conformance tests that validate operations against official AWS Smithy API models using checksums. Together, they form a safety net tight enough to let LLMs write code at speed without sacrificing correctness.
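The post doesn't detail how the checksum validation works internally, but one plausible shape — sketched here with a deterministic FNV-1a hash and hypothetical model snippets — is pinning a checksum of the Smithy model an operation was implemented against, so that upstream model drift fails the conformance test and flags the operation for re-review:

```rust
// Deterministic FNV-1a hash, so checksums are stable across runs
// (std's DefaultHasher is not guaranteed stable between Rust versions).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// Conformance-style check: compare the checksum of the model snapshot
// the implementation was written against with the current model.
fn model_unchanged(pinned_model: &str, current_model: &str) -> bool {
    fnv1a(pinned_model.as_bytes()) == fnv1a(current_model.as_bytes())
}

fn main() {
    // Hypothetical model fragments, not real Smithy JSON.
    let pinned = r#"{"operation":"SendMessage","input":["QueueUrl","MessageBody"]}"#;
    let same = pinned.to_string();
    let drifted =
        r#"{"operation":"SendMessage","input":["QueueUrl","MessageBody","MessageGroupId"]}"#;

    assert!(model_unchanged(pinned, &same));    // no drift: test passes
    assert!(!model_unchanged(pinned, drifted)); // drift: re-review needed
}
```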
What tests don't catch
To be clear: this isn't bulletproof. E2E tests are great at catching behavioral bugs, but they're weaker at catching security vulnerabilities, performance issues, and concurrency bugs — exactly the categories where AI-generated code is worst (2.74x more security issues, 8x more excessive I/O operations).
That's where the other layers matter. Static analysis (clippy, in Rust's case) catches entire categories of bugs at compile time. AI code reviewers catch patterns that tests miss. The human review catches architectural problems. No single layer is sufficient — it's the combination that makes the system trustworthy.
Making a project LLM-friendly
This approach isn't specific to fakecloud. Any project can get here:
Start with tests. If the project doesn't have comprehensive E2E or integration tests, that's the first thing to build — and LLMs are great at helping with that. Build the test infrastructure first, then use it as the guardrail for everything that follows. This is the single highest-leverage investment for LLM-assisted development.
Invest in CI. Linting, formatting, full test suites, automated code review. Every automated check is one less thing that can slip through. If speed goes up, the number of tests should go up proportionally.
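For a Rust project, the CI gate described here can be as small as three commands. This is a generic cargo-based sketch, not fakecloud's actual pipeline:

```shell
# Fail fast: any non-zero exit or unset variable aborts the run.
set -euo pipefail

cargo fmt --all -- --check                 # formatting: fail on any diff
cargo clippy --all-targets -- -D warnings  # lint: treat warnings as errors
cargo test --all                           # full suite, E2E tests included
```

Because each command is mechanical, the LLM can iterate against failures here without human involvement, exactly as in the workflow above.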
Steer the architecture. The LLM is a force multiplier, not an architect. Discuss the plan. Explore alternatives. Push back when the approach doesn't feel right. The human decides the direction; the LLM executes it.
Review what matters. Tests get the most scrutiny. Implementation gets scanned for patterns. Don't try to read every line like you wrote it — that doesn't scale. Trust the tests, verify the tests are right.
The real insight
The question isn't "can you trust LLM-generated code?" The answer is obviously no — not by default. The 1.7x bug rate proves it.
The real question is: can you build a system where LLM-generated code is continuously verified? Where every PR runs against hundreds of tests before a human ever sees it? Where the test suite is comprehensive enough that "tests pass" is a meaningful signal? Where static analysis, AI review, and human review each catch what the others miss?
If yes, then the LLM is just a very fast, very tireless junior developer that needs good guardrails. And 700 commits in 6 days stops being scary — it's just what happens when the guardrails are in place.