Look, the AI-generated code is humming along just fine on your dev box. It compiles, maybe even passes a few perfunctory tests. But if you’re thinking that means it’s ready to face the harsh glare of production, think again.
I’ve been staring at Silicon Valley’s shiny new toys for twenty years, and this feels like déjà vu, just with more buzzwords. Shipping features with tools like Claude Code or Cursor might be lightning fast now, but getting that code to hold up in the unforgiving wilderness of production is a separate, and frankly, much harder problem. AI is brilliant at accelerating the implementation phase. It’s abysmal at producing true production engineering.
I spent time sifting through eight AI-generated production applications, and guess what? They all shared a remarkably consistent, and alarming, set of flaws. We’re talking misconfigured Supabase RLS, secrets just hanging out in the codebase like they’re on vacation, a complete absence of rate limiting or caching, some truly questionable data structures, components that seemed to re-render themselves into oblivion, AI features wide open to prompt injection and RAG attacks, and — this is the kicker — virtually no meaningful tests around anything that actually mattered.
Most of them worked, sure. But were they production ready? Almost none. A year ago, the bottleneck was writing the darn code. Now? It’s the painstaking process of reviewing and hardening what the AI spat out. That requires a fundamentally different skill set, one that most development teams haven’t even begun to cultivate.
Why does this keep happening? Simple. AI excels at extending local patterns, at mimicking what it sees right in front of it. It’s a digital parrot. But it’s god-awful at grasping long-term system boundaries, understanding the nuances of scaling behavior, or contemplating operational risk. It churns out code that looks plausible in a vacuum, but buckles under real-world conditions.
Six Things to Scrutinize Before Calling AI Code ‘Production Ready’
This is where things quietly, insidiously go wrong. The code compiles. Tests, likely written by the AI itself, pass with flying colors. And then, six months down the line, some poor soul stumbles upon a misconfigured authentication check, a gaping hole in your security.
Happy path works fine. The edges are where it falls apart.
The Security Chasm: Auth, Secrets, and Injection Risks
Let’s talk about the obvious vulnerabilities. Are your authentication flows actually strong? Does every protected route rigorously verify the session? Are role checks happening server-side, where they belong, or are they just a flimsy facade on the client? Then there are those lingering secrets: API keys tucked away in frontend code, .env values used as insecure fallbacks, or worse, secrets being logged during error handling. It’s a hacker’s dream. And don’t even get me started on injection risks – SQL, command, and path injection are all still very much alive and kicking wherever user-controlled input flows. But the new frontier is LLM prompt injection and RAG document injection. Can a user rewrite your AI’s behavior with a cleverly crafted input or an uploaded document? Most AI-generated code doesn’t even consider these possibilities.
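To make the server-side point concrete, here’s a minimal sketch of an authorization check that lives entirely on the server. The session shape, role names, and function name are all hypothetical, not from any particular framework; the point is that the decision happens where the client can’t tamper with it.

```typescript
// Hypothetical session shape; field and role names are illustrative.
type Session = { userId: string; role: "admin" | "member" } | null;

// Server-side authorization: the client never gets to decide.
// A hidden button in the UI is UX, not security.
function canAccess(session: Session, requiredRole: "admin" | "member"): boolean {
  if (!session) return false; // no verified session, no access
  if (requiredRole === "admin") return session.role === "admin";
  return true; // any authenticated user may pass a "member" gate
}
```

The flimsy-facade version of this does the same comparison in client-side React and simply hides the link; anyone with a network tab walks straight past it.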
Architectural Drift and Performance Nightmares
Beyond outright security flaws, AI code often suffers from a lack of architectural integrity. You’ll find dead code and unused imports aplenty, as AI generates confidently, including tangential bits it never actually connects. Weak typing, often using any to paper over its own uncertainties, missing null checks, and unsafe type assertions are rampant. Anti-patterns like misused hooks or unnecessary useEffect calls pop up like weeds. After dozens of prompts, does the codebase still resemble the original architectural vision, or has it devolved into a chaotic mess? Probably the latter.
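The typing problem is worth a tiny illustration. The names below are made up, but the shape is one I see constantly: an `any` plus a non-null assertion papering over a value that can genuinely be absent, versus a type that admits the uncertainty and handles it.

```typescript
// The pattern AI output loves (commented out because it's the bug):
//   function getName(user: any) { return user.profile!.name; }  // throws at runtime when profile is null

// Safer: model the uncertainty in the type and handle the null path explicitly.
type User = { profile: { name: string } | null };

function getName(user: User): string {
  return user.profile?.name ?? "anonymous"; // optional chaining + explicit fallback
}
```

The difference isn’t cosmetic: the first version compiles happily and then crashes on the first real user with an incomplete profile.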
And then there’s performance. AI-generated code tends to duplicate logic instead of abstracting correctly, completely overlooks caching layers, and spits out database access patterns that are fine for local development but will crumble under any real load. Think slow queries due to missing indexes or N+1 patterns, cold starts on serverless functions that balloon with heavy dependencies, and render cascades where unmemoized components trigger wave after wave of pointless re-renders. Heavy bundles, pulling in entire libraries when only a single function was needed, are also commonplace.
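The N+1 fix is simple enough to sketch. Everything below is hypothetical (the `fetchUsersByIds` stand-in would be a real `WHERE id IN (...)` query in your database client), but the shape is the whole lesson: collect the ids, deduplicate, make one round trip instead of one per row.

```typescript
type Order = { id: number; userId: number };

// Stand-in for a single batched query, e.g. `SELECT id, name FROM users WHERE id IN (...)`.
// A real app would call its database client here; this fake map keeps the sketch runnable.
async function fetchUsersByIds(ids: number[]): Promise<Map<number, string>> {
  const fake = new Map([[1, "Ada"], [2, "Grace"]]);
  return new Map(ids.map((id) => [id, fake.get(id) ?? "unknown"]));
}

// The N+1 version would await fetchUsersByIds([o.userId]) inside a loop: one
// query per order. This version makes exactly one lookup regardless of row count.
async function userNamesForOrders(orders: Order[]): Promise<string[]> {
  const ids = [...new Set(orders.map((o) => o.userId))]; // dedupe: one id per user
  const users = await fetchUsersByIds(ids);              // single round trip
  return orders.map((o) => users.get(o.userId) ?? "unknown");
}
```

On a dev box with ten rows, both versions feel instant, which is exactly why the slow one ships.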
The Unseen Costs: PII, Payments, and Global Compliance
Some of these issues are stack-specific, but PII handling is a universal headache. Are payment flows correctly processing Stripe webhooks? Is any sensitive card data being stored inappropriately? For mobile apps, are in-app purchases routed correctly, avoiding App Store rejection? On a global scale, are you even thinking about GDPR basics – deletion, consent, data residency for EU users? And the most insidious: are you inadvertently sending user data to third-party AI APIs without proper consent or agreements? It’s a minefield most AI-generated code blithely ignores.
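On the webhook point, it’s worth seeing what verification actually involves. The sketch below shows the idea behind Stripe’s v1 signature scheme as I understand it (HMAC-SHA256 over the timestamp and raw body, joined by a period, keyed with your endpoint secret); in a real app you’d call Stripe’s own `stripe.webhooks.constructEvent`, which also rejects stale timestamps. Secret and payload values here are invented for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of webhook signature verification in the style of Stripe's v1 scheme:
// expected = HMAC-SHA256(secret, `${timestamp}.${rawBody}`), hex-encoded.
// Prefer the official library's constructEvent in production.
function verifySignature(rawBody: string, timestamp: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so guard first; constant-time compare otherwise.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The AI-generated version I keep finding skips verification entirely and trusts the JSON body, which means anyone who can find your endpoint can mark their own orders as paid.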
Testing: The Illusion of Validation
Sure, there are usually tests. But most of them test the wrong things. They might check that a function runs without throwing an error, but do they actually validate critical paths like authentication, payments, or crucial data writes? Are edge cases even considered, or is it just the happy path the AI was spoon-fed in the prompt?
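Here’s the difference in miniature, with an invented guard function standing in for any critical-path check. The test that matters is the one exercising the edge the prompt never mentioned: the denial path.

```typescript
// Hypothetical guard under test; names are illustrative.
function canDeleteAccount(requesterId: string, ownerId: string, isAdmin: boolean): boolean {
  return isAdmin || requesterId === ownerId;
}

// A "runs without throwing" test would only cover the first row.
// The rows worth having are the refusals.
const cases: Array<[string, string, boolean, boolean]> = [
  ["u1", "u1", false, true],  // owner may delete their own account
  ["u1", "u2", false, false], // a stranger must be refused
  ["u1", "u2", true, true],   // admin override
];
for (const [requester, owner, admin, expected] of cases) {
  if (canDeleteAccount(requester, owner, admin) !== expected) {
    throw new Error(`guard failed for ${requester} acting on ${owner}'s account`);
  }
}
```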
Observability: Flying Blind in Production
This is perhaps the most damning indictment: most AI-generated codebases have virtually zero observability built in. Everything is fine until it isn’t, and then you’re left scrambling in the dark. Are exceptions being captured, or are they silently swallowed? Is your logging structured and useful, or just a chaotic mess of console.log statements? Do you find out when something breaks via alerts, or do users have to tell you? And for complex AI calls or external API interactions, can you actually trace a request end-to-end?
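The structured-logging fix costs almost nothing. A minimal sketch, with field names that are a common convention rather than any standard: emit one JSON object per event instead of free-form strings, so your log tooling can actually filter and aggregate.

```typescript
// One JSON object per event, instead of an unparseable console.log string.
// Field names (ts, level, msg) are conventional, not mandated by any spec.
function logEvent(
  level: "info" | "error",
  msg: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), level, msg, ...fields });
  console.log(line); // in production this would feed your log shipper, not stdout
  return line;
}

logEvent("error", "payment webhook failed", { orderId: "o_123", attempt: 2 });
```

Six months later, “show me every payment failure for this order” is a one-line query instead of a grep through prose.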
The fundamental issue here is that most teams are conflating code generation with code review. Are they the same problem? Absolutely not. The faster teams ship AI-generated code, the faster review debt accumulates, and most teams have no existing process to manage it. This isn’t about stopping AI; it’s about understanding that the human element – experienced engineering judgment – is now more critical than ever to ensure what ships is actually built to last.