Auditing a Data Platform with AI: Iterative Security Hardening
Four audit rounds with escalating AI personas found 42 security issues and removed 6,500 lines of dead code.
We used Claude Code to run 4 audit rounds, fix 42 security issues, remove 6,500 lines of dead code, and push test coverage from 40% to 60%. The prompts matter more than you think.
The starting point
We had a data platform with a React 19 frontend and a Python FastAPI backend. About 17,000 lines of backend code and 15,000 lines of frontend code. The platform handles data catalog browsing, SQL query execution, dbt model management, AI-assisted flows, OPA policy management, and ODCS data contracts. It worked. But we hadn’t done a proper security audit.
Round 1: “Check for security issues”
Our first prompt was straightforward:
“Do an audit for security, best practices, performance, logic, quality”
This found 14 issues. The obvious ones:
- SQL injection in the quality check compiler (f-string interpolation)
- Missing auth on the OPA proxy endpoint
- Sync `requests` calls blocking the async event loop
- Unbounded database queries
These are the issues any security scanner would flag. The AI found them quickly because the patterns are well-known. It missed deeper problems.
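To make the first finding concrete, here is a minimal sketch of the f-string injection pattern and an allowlist-plus-parameters fix. All names (`build_check_query`, the allowlist contents) are illustrative assumptions, not the platform's actual code:

```python
# Sketch of the f-string SQL injection and its fix. Identifiers and
# allowlist contents are hypothetical, not the platform's actual code.

ALLOWED_TABLES = {"orders", "customers"}
ALLOWED_COLUMNS = {"quality_score", "row_count"}

def build_check_query_unsafe(table: str, column: str, threshold: float) -> str:
    # VULNERABLE: caller-controlled strings interpolated straight into SQL.
    return f"SELECT COUNT(*) FROM {table} WHERE {column} > {threshold}"

def build_check_query(table: str, column: str, threshold: float):
    # Identifiers cannot be bound as query parameters, so validate them
    # against an allowlist; bind the value itself with a named placeholder.
    if table not in ALLOWED_TABLES or column not in ALLOWED_COLUMNS:
        raise ValueError(f"identifier not allowed: {table}.{column}")
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} > :threshold"
    return sql, {"threshold": threshold}
```

Note the allowlist: this is exactly the kind of fix that Round 2 later found incomplete, where a forgotten column name turns every legitimate run into an error.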
What we learned: A generic “check for security” prompt produces generic results. The AI scans for known vulnerability patterns but doesn’t reason deeply about the application’s specific architecture.
Round 2: “Check logic, typing, data consistency”
After fixing Round 1, we changed the prompt:
“Check many things more, usage, quality, typing, logic”
This shifted the AI’s focus from pattern-matching to reasoning about the code’s behavior. It found 20 more issues, including:
- Column allowlists we just added were incomplete. `quality_score` wasn't in the allowlist, so every quality run crashed. The security fix broke functionality.
- Flow router path shadowing. `/{flow_id}` was registered before `/categories`, making the categories endpoint unreachable.
- dbt session IDOR. Any user could read/write any other user's session just by knowing the UUID.
- ComposeExecutor memory leak. Process references and log data accumulated forever.
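The path-shadowing bug generalizes: FastAPI (via Starlette) matches routes in registration order, so a literal path registered after a dynamic one is unreachable. A minimal first-match matcher shows the effect without needing FastAPI itself:

```python
import re

# Minimal first-match router, a sketch of how FastAPI/Starlette resolve
# paths in registration order.
def compile_route(pattern: str) -> re.Pattern:
    # Convert "/{flow_id}" into a regex with a named group.
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern)
    return re.compile(f"^{regex}$")

def match(routes: list[str], path: str) -> str:
    # First registered route that matches wins.
    for pattern in routes:
        if compile_route(pattern).match(path):
            return pattern
    return "<no match>"

# Wrong order: the dynamic segment swallows /categories.
assert match(["/{flow_id}", "/categories"], "/categories") == "/{flow_id}"
# Fixed order: literal routes registered before dynamic ones.
assert match(["/categories", "/{flow_id}"], "/categories") == "/categories"
```

The fix is a one-line reorder, which is part of why this class of bug is easy to ship and hard to spot in review.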
The key insight: Round 2 found bugs that Round 1 introduced. The column allowlists were a security fix that created a runtime crash. Without a second pass, we’d have shipped broken code.
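The ComposeExecutor leak also follows a common shape: a per-process log registry that only ever grows. Capping retention fixes it; this is a sketch with hypothetical names, not the platform's actual executor:

```python
from collections import deque

class ExecutorLogs:
    # Sketch of the leak fix: an unbounded list of log lines grows
    # forever, so cap it. Names are illustrative, not the platform's
    # ComposeExecutor.
    def __init__(self, max_lines: int = 1000):
        self._lines: deque[str] = deque(maxlen=max_lines)

    def append(self, line: str) -> None:
        # Oldest lines are dropped automatically once maxlen is reached.
        self._lines.append(line)

    def tail(self, n: int) -> list[str]:
        return list(self._lines)[-n:]
```

The same idea applies to the accumulated process references: hold them in a structure with an explicit eviction policy instead of a plain dict that nothing ever cleans up.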
Round 3: “You are a senior security engineer doing a final sign-off”
After fixing Round 2, we escalated the persona:
“You are a senior security engineer performing a FINAL sign-off audit”
This produced a qualitatively different result. The AI:
- Verified previous fixes were correctly implemented (spot-checked 6 critical ones)
- Found 5 more issues the previous rounds missed:
  - Incomplete IDOR fix (only write endpoints were protected; read/execute endpoints were still open)
  - Missing `encodeURIComponent` in 2 admin API calls that the bulk fix missed
  - SQL injection in the suggestion engine's single-quote handling
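The incomplete IDOR fix is worth a sketch: the safe pattern routes every access, read or write, through one ownership check. Names here (`Session`, `get_session`) are hypothetical, not the platform's actual code:

```python
from dataclasses import dataclass, field

class Forbidden(Exception):
    pass

@dataclass
class Session:
    id: str
    owner: str
    payload: dict = field(default_factory=dict)

SESSIONS: dict[str, Session] = {}

def get_session(session_id: str, user: str) -> Session:
    # Knowing the UUID is not enough; the caller must own the session.
    session = SESSIONS[session_id]
    if session.owner != user:
        raise Forbidden(f"user {user!r} does not own session {session_id}")
    return session

def update_session(session_id: str, user: str, payload: dict) -> None:
    # Reuse the same guarded lookup so reads and writes stay consistent;
    # the incomplete fix protected only the write path.
    session = get_session(session_id, user)
    session.payload = payload
```

Centralizing the check in one lookup function is what makes "read endpoints were still open" structurally impossible rather than something each endpoint must remember.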
The “final sign-off” framing changed the AI’s behavior from “find new issues” to “verify everything is actually fixed and nothing was missed.” A fundamentally different task.
Round 4: The bank-grade prompt
The biggest shift came when we changed the prompt to:
“You are a Senior Full-Stack Security & Architecture Engineer with 15+ years of experience. You are a software and security auditor for a bank. Make sure the software is safe, secure, and works as expected.”
This produced the most thorough results yet:
- 3 new critical issues found. An open redirect in the GitHub OAuth callback, missing auth on the same endpoint, and error details leaking in responses.
- Comprehensive PASS/FAIL/WARN categorization for every security area.
- Explicit approval statement with caveats. The AI actually committed to “APPROVED FOR PRODUCTION” or “NOT YET APPROVED” instead of hedging.
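For the open-redirect finding, a typical guard validates the callback's redirect target before issuing the redirect. This is a generic sketch; the allowlisted host and function name are assumptions, not the platform's config:

```python
from urllib.parse import urlparse

# Hosts we control; an illustrative assumption, not the platform's config.
ALLOWED_HOSTS = {"app.example.com"}

def safe_redirect_target(raw: str, fallback: str = "/") -> str:
    parsed = urlparse(raw)
    # Relative paths (no scheme, no host) are safe, but reject
    # protocol-relative "//evil.com" paths, which browsers treat
    # as absolute URLs.
    if (not parsed.scheme and not parsed.netloc
            and raw.startswith("/") and not raw.startswith("//")):
        return raw
    # Absolute URLs must point at a host we control, over https.
    if parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS:
        return raw
    return fallback
```

Anything else, including `javascript:` URLs and attacker-controlled hosts, falls through to the safe fallback.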
The bank auditor framing did three things:
- Raised the bar. The AI treated everything as potentially production-critical.
- Demanded completeness. It checked EVERY router file, not just the ones we pointed at.
- Required sign-off. It gave a clear pass/fail verdict instead of an open-ended list.
The dead code surprise
After 4 security rounds, we ran one more prompt:
“Final audit: dead code, weird constructions, unsafe patterns”
This found 45 dead files totaling 6,500 lines. Entire component directories (text2sql/, chat/), old replaced components, deprecated hooks, pages with removed routes, and about 800 lines of unused error handling classes in the backend.
None of the security prompts found this. Dead code isn’t a security vulnerability, but it increases bundle size, confuses new developers, creates false positives in grep searches, and makes coverage numbers look worse than they are.
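Dedicated tools (vulture for Python, coverage reports, ts-prune on the frontend) do this properly; as a sketch of the idea, a few lines of `ast` can flag top-level functions a module defines but never references:

```python
import ast

# Naive dead-code sniff: top-level functions defined in a module but
# never named anywhere in the same source. Real tools resolve imports,
# attributes, and dynamic access; this is only a sketch of the idea.
def unreferenced_functions(source: str) -> set[str]:
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return defined - used
```

A heuristic like this produces false positives (exported APIs, route handlers registered via decorators), which is exactly why the AI pass, which can read the routing and import graph, found files that grep-level tooling would hedge on.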
What we learned about prompts
Small changes, big impact
| Prompt style | Issues found | Character |
|---|---|---|
| ”Check for security issues” | 14 | Pattern matcher |
| ”Check logic, typing, quality” | 20 | Reasoning engine |
| ”Senior engineer final sign-off” | 5 | Verifier |
| ”Bank security auditor” | 3 | Adversarial thinker |
| ”Dead code, weird constructions” | 45 dead files | Code janitor |
The same codebase, audited 5 times with slightly different prompts, produced 42 security issues and 45 dead files with zero overlap between rounds. Each prompt activated a different mode of analysis.
The persona matters more than the instructions
Telling the AI “check for SQL injection” finds SQL injection. Telling it “you are a bank auditor who must sign off on this for production” finds SQL injection AND the subtle edge cases around it, like the suggestion engine’s single-quote escaping that the direct prompt missed. The persona activates broader reasoning.
Iterative beats comprehensive
A single “audit everything” prompt produces a mediocre result across all categories. Running focused rounds (security first, then logic, then verification, then dead code) produces dramatically better results because:
- Each round starts fresh (no context pollution from previous findings)
- Each round can verify the previous round’s fixes
- Each round has a clear, narrow objective
The “fix then re-audit” loop catches its own bugs
Round 2 found bugs introduced by Round 1’s fixes. This is the most valuable pattern: every fix is a potential new bug. The iterative approach naturally catches these regressions.
The numbers
| Metric | Before | After |
|---|---|---|
| Security issues | Unknown | 42 found and fixed |
| Backend tests | 1,147 | 1,732 (+585) |
| Frontend tests | 403 | 464 (+61) |
| Backend coverage | ~40% | 60%+ |
| Frontend coverage | ~52% | 65%+ |
| Dead code removed | 0 | 6,502 lines |
| Dead files removed | 0 | 45 files |
The recommended prompt
After iterating through 5 prompt styles, here is what we would use for a comprehensive audit from the start:
You are a Senior Full-Stack Security & Architecture Engineer with 15+ years of experience in React/TypeScript and Python backends. You are performing a production sign-off audit for a regulated financial services platform.
Audit the codebase across these categories, in order of priority:
1. Security — Auth bypass, injection (SQL/shell/path/SSRF), IDOR, CSRF, XSS, secrets, session management, CORS, rate limiting, error information leaking, dependency vulnerabilities.
2. Runtime correctness — Code that will crash in production: missing imports, type mismatches, unhandled null/undefined, race conditions, resource leaks, infinite loops, deadlocks.
3. Logic bugs — Data consistency issues, state management bugs, stale closures, incorrect cascading behavior, edge cases with empty/null data, off-by-one errors.
4. Dead code — Unused files, unreachable functions, deprecated modules, commented-out blocks, test-only production code.
5. Code quality — Duplicated logic, inconsistent patterns, overly complex abstractions, missing error handling, insufficient test coverage for critical paths.
For each finding, provide:
- Severity: Critical / High / Medium / Low
- File:line reference
- What’s wrong (1-2 sentences)
- How to fix (concrete, not vague)
- Why it matters (impact if unfixed)
After listing all findings, provide a PASS / FAIL verdict answering the question: “Would you approve this for production at a bank?”
Do NOT report: style preferences, IDE warnings for unresolved venv imports, or issues that only affect developer experience without production impact.
The key elements that make this prompt effective:
- Persona. “Bank auditor” raises the standard of what’s acceptable.
- Ordered priorities. Security first, then correctness, then quality.
- Concrete output format. file:line + what + fix + why.
- Explicit exclusions. No style nits, no IDE noise.
- Binary verdict. Forces the AI to commit to pass/fail instead of endless hedging.
The takeaway
AI-assisted code auditing works. But the quality of the audit is directly proportional to the quality of the prompt and the number of iterations. A single “check my code” pass finds the obvious issues. Four focused rounds with escalating personas find the subtle ones. The iterative fix-then-reaudit loop catches the bugs that the fixes themselves introduce.
The most surprising finding: the dead code round removed more lines (6,500) than all security fixes combined added. Cleaning house turned out to be the highest-impact improvement for long-term maintainability.
This audit was performed on the Chameleon Data Platform using Claude Code (Anthropic) across multiple sessions in April 2026. All findings and fixes are tracked in docs/ISSUES.md.