Iterative Security Audit: 45 Probes, 0 Critical, 6 Regression Tests Kept

Throughout this series, I've shared patterns discovered during a security audit on a Go authentication service: PKCS#12, timing oracle, lockout, CSRF, mTLS, CRL, CQRS. Let's now talk about the methodology itself: how the audit was conducted, and how we turned results into permanent tests.

The audit in successive passes

An audit isn't a single scan. It's an iterative process of successive passes, each with its own objective and a decreasing yield:

  1. Static analysis — read the code, identify security patterns (or their absence), note questions
  2. Reconnaissance — understand flows, dependencies, trust boundaries, TLS/session/auth configurations
  3. Critical analysis — challenge each finding from pass 1, verify whether mitigations exist
  4. Stabilization — fix real findings, document false positives, prepare for runtime
  5. Active runtime probes — send real requests against a test instance

Each pass yields new insights. Pass 1 finds missing patterns ("no CSRF token"). Pass 2 discovers the CSRF token lives in a middleware you hadn't seen. Pass 3 verifies that middleware runs in the right order. Pass 4 documents the finding as a false positive. Pass 5 confirms it holds under a real request.

Verify findings before panicking

I learned this the hard way: an analysis tool (or an audit sub-agent) reports "CRITICAL: timing leak on login endpoint". You panic. You spend 4 hours instrumenting the code. Result: the "timing leak" is a 5ms delta caused by network latency, not by the code.

The discipline: each finding must go through a verification cycle before escalation.

  1. Reproduce — is the finding reliably reproducible?
  2. Isolate — is the problem in the code or in the test environment?
  3. Measure — for timing findings, 50 measurements minimum, not just one
  4. Contextualize — is the finding exploitable in the real context (with mTLS, rate limiting, etc.)?

On this audit, the static analysis pass produced 23 potential findings. After verification: 8 true findings, 15 false positives. A 35% true positive ratio. That's normal — an audit that produces no false positives probably didn't search wide enough.

How you generate your own false positives

The most insidious trap: fooling yourself. A concrete example from this audit:

I measured the login endpoint timing with active lockout. The 6th failure took 55ms, the 7th took 52ms. Delta: 3ms. My first instinct: "lockout bypass — timing varies, so we can distinguish locked attempts from unlocked ones".

Except no. The 3ms delta was statistical noise. Over 100 measurements, the average was identical (54ms +/- 4ms) whether the account was locked or not. The dummy hash was working perfectly. I almost created a finding from nothing.

The lesson: the human brain is an excellent pattern matcher, including on noise. Timing measurements must always be statistical (N > 30, mean comparison) and never based on one or two observations.
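
To make that discipline concrete, here is a minimal sketch of the statistical check in Go: collect N latency samples per case, then compare the means against the combined noise instead of eyeballing two observations. The endpoint URL, request payloads, and sample size are illustrative assumptions, not the audited service's actual interface.

package main

import (
	"bytes"
	"fmt"
	"log"
	"math"
	"net/http"
	"time"
)

const n = 100 // well above the N > 30 rule of thumb

// sample sends the same login request n times and returns per-request latency in ms.
func sample(url string, body []byte) ([]float64, error) {
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
		out = append(out, float64(time.Since(start).Microseconds())/1000.0)
	}
	return out, nil
}

// meanStd returns the sample mean and standard deviation.
func meanStd(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(std / float64(len(xs)-1))
}

func main() {
	url := "https://auth.test.local/login" // assumed test instance, not a real deployment

	locked, err := sample(url, []byte(`{"username":"locked-account","password":"wrong"}`))
	if err != nil {
		log.Fatal(err)
	}
	unlocked, err := sample(url, []byte(`{"username":"normal-account","password":"wrong"}`))
	if err != nil {
		log.Fatal(err)
	}

	lm, ls := meanStd(locked)
	um, us := meanStd(unlocked)
	fmt.Printf("locked:   %.1fms +/- %.1fms\n", lm, ls)
	fmt.Printf("unlocked: %.1fms +/- %.1fms\n", um, us)

	// Escalate only if the difference in means exceeds the combined noise,
	// never because two single observations happened to be 3ms apart.
	if math.Abs(lm-um) > ls+us {
		fmt.Println("delta exceeds noise: worth investigating")
	}
}

This is the kind of comparison that, over 100 measurements, showed 54ms +/- 4ms on both sides and killed the lockout finding.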

From static analysis to runtime probes

When to switch from static passes to active probes? When you've exhausted what the code can tell you and need to see how the system actually behaves.

Signals to switch:

  • Static passes haven't produced new findings for 2 iterations
  • You have hypotheses that only runtime can confirm (timing, race conditions, behavior under load)
  • Mitigations identified in static analysis need validation under real conditions

On this audit, we switched to runtime after 4 static passes. The runtime pass confirmed 7 of the 8 findings and added 1 new one (a CRL reload edge case that the code didn't make obvious).

45 probes to 6 regression tests

During the runtime phase, we launched 45 probes against a real instance. Result: 0 Critical or High vulnerabilities. All findings were either already fixed or informational observations.

The question: which probes to keep in the permanent regression suite?

Keep: patterns not covered by existing E2Es

  • Timing consistency — measure |unknown - known| < threshold on login
  • Unicode homoglyphs — attempt login with visually identical Unicode characters (e.g., Cyrillic 'a' vs Latin 'a'); see the sketch after this list
  • Multi-CSRF fields — send multiple CSRF tokens in the same request to verify the server only accepts one
  • Host header injection — verify the 421 Misdirected Request when Host != SNI
  • Duplicate Origin header — send two Origin headers to test CORS resistance
  • Conditional GET ETag — verify authenticated responses aren't cached by a proxy via ETag
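
To show what a kept probe looks like once promoted to CI, here is a minimal sketch of the Unicode homoglyph regression test. It assumes the test instance URL comes from an environment variable and that the login endpoint accepts JSON credentials; the variable name, path, and payload shape are illustrative, not the service's documented interface.

package security_test

import (
	"net/http"
	"os"
	"strings"
	"testing"
)

func TestLoginRejectsHomoglyphUsername(t *testing.T) {
	// Target URL of the test instance; the variable name is an assumption.
	baseURL := os.Getenv("AUDIT_TARGET_URL")
	if baseURL == "" {
		t.Skip("AUDIT_TARGET_URL not set")
	}

	// "admin" with the Latin 'a' replaced by the JSON escape for Cyrillic 'а' (U+0430).
	body := `{"username":"\u0430dmin","password":"irrelevant"}`

	resp, err := http.Post(baseURL+"/login", "application/json", strings.NewReader(body))
	if err != nil {
		t.Fatalf("probe request failed: %v", err)
	}
	defer resp.Body.Close()

	// The homoglyph username must never authenticate as the Latin-letter account.
	if resp.StatusCode == http.StatusOK {
		t.Fatalf("homoglyph username was accepted (status %d)", resp.StatusCode)
	}
}

The other kept probes can follow the same shape: one targeted request, one assertion about a property the framework doesn't enforce on its own.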

Drop: patterns already covered

  • Rate limiting (already tested in rate limiter E2Es)
  • Path traversal (covered by router tests)
  • XSS (covered by templating tests + CSP headers)
  • SQL injection (covered by query builder tests)
  • RBAC (covered by permission E2Es)

The ratio: 45 probes to 6 regression tests kept. 13% retention. The remaining 87% are either redundant with existing E2Es or one-shot verifications that only make sense during the initial audit.

The selection criterion

For each probe, the question to ask:

Could a future code change, made in good faith by a developer who doesn't know about this finding, reintroduce the vulnerability?

If yes: regression test. If no (because the framework prevents it, or because it would require a deliberate and visible change): no test.

The timing test is the perfect example: a login handler refactoring could easily forget the dummy hash. The path traversal test, however, would only break from a router change — a change so visible it would be reviewed by the entire team.
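
For context, here is a minimal sketch of the dummy-hash pattern that such a refactoring could silently drop. The store interface and names are illustrative, not the audited service's actual code.

package auth

import (
	"errors"

	"golang.org/x/crypto/bcrypt"
)

// dummyHash is computed once so that login attempts for unknown users
// still pay the same bcrypt cost as attempts for known users.
var dummyHash, _ = bcrypt.GenerateFromPassword([]byte("not-a-real-password"), bcrypt.DefaultCost)

var ErrInvalidCredentials = errors.New("invalid credentials")

// UserStore stands in for whatever backs the real user lookup.
type UserStore interface {
	PasswordHash(username string) (hash []byte, found bool)
}

func Login(store UserStore, username, password string) error {
	hash, found := store.PasswordHash(username)
	if !found {
		// Unknown user: compare against the dummy hash anyway, then fail.
		// Dropping this branch in a refactoring is exactly the regression
		// the kept timing test is there to catch.
		_ = bcrypt.CompareHashAndPassword(dummyHash, []byte(password))
		return ErrInvalidCredentials
	}
	if err := bcrypt.CompareHashAndPassword(hash, []byte(password)); err != nil {
		return ErrInvalidCredentials
	}
	return nil
}

The kept timing test then asserts that known and unknown usernames stay within a few milliseconds of each other, so forgetting the dummy comparison fails in CI instead of shipping a timing oracle.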

Conclusion

A security audit isn't a scan. It's an iterative process where each pass refines understanding, and where the discipline of verifying findings prevents wasting time on ghosts.

The real deliverable of an audit isn't the report — it's the 6 regression tests that survive in CI and prevent silent regressions. The report is read once. The tests run on every commit.

One question remains: after the audit, how do you document all this for the AI agents that will touch the code? The CLAUDE.md discipline — from 296 lines down to 142, and my agent codes better than before. That's the subject of the final article in this series.
