Each roast was taking 50 seconds per upload. Quality was unknown — we had a feeling, not data. The prompt had been written "by instinct" and never seriously evaluated. The question was simple: how do you know if a prompt is good, and how do you improve it without spending the whole day reading roasts manually?
The answer: automate the evaluation work using AI itself, in a loop. Write a tool that sends 30 photos to Claude, measures quality metrics, and produces a report. Modify the prompt, rerun, compare. Five iterations later, here's what we learned.
Context: RateMyFace
RateMyFace is a side project in progress — an AI roast-by-photo site: the user uploads a photo, Claude analyzes it and generates satirical text along with a score and a "tier label" (e.g. "WiFi Signal With Legs"). The result is rendered as a collectible trading card.
The stack: Go monolith, SQLite, Claude CLI (claude --print) called as a subprocess. The prompt asked Claude to produce 5 roast styles (standard, rap, Shakespeare, passive-aggressive mom, Gordon Ramsay) plus a score and a label, all in JSON.
Two concrete problems: roasts were taking ~50 seconds (too slow for interactivity) and their quality was opaque. We knew we were generating something, not whether it was good.
The idea: measure before optimizing
The usual reflex in prompt engineering is to iterate manually — modify, test on 2-3 examples, estimate if it's better. The problem: you optimize on the examples you chose, not on the real distribution. And "it seems better" isn't a metric.
Alternative approach: define what "good" means in a measurable way, generate enough examples to have stable statistics, and automate the evaluation. Metrics chosen:
- Average length — target < 150 chars. A viral roast is short.
- Score variance — target > 2.0. If everyone scores 5-6, the score is useless.
- Fallback rate — how often Claude fails and we return the default text.
- Score distribution — 1-10 histogram, to visualize biases.
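Computing these metrics over a run's results takes only a few lines. A minimal sketch, assuming a Result struct whose field names are illustrative rather than the project's actual types:

```go
package main

import "fmt"

// Result holds the fields the harness needs from one test case.
// Field names are illustrative; the real struct may differ.
type Result struct {
	RoastStandard string
	Score         float64
	Fallback      bool
}

// metrics aggregates the report numbers from a slice of results:
// average roast length, score variance, and fallback rate.
func metrics(results []Result) (avgLen, variance, fallbackRate float64) {
	var lenSum, scoreSum float64
	fallbacks := 0
	for _, r := range results {
		lenSum += float64(len([]rune(r.RoastStandard))) // rune count, not bytes
		scoreSum += r.Score
		if r.Fallback {
			fallbacks++
		}
	}
	n := float64(len(results))
	mean := scoreSum / n
	for _, r := range results {
		variance += (r.Score - mean) * (r.Score - mean)
	}
	variance /= n
	avgLen = lenSum / n
	fallbackRate = float64(fallbacks) / n
	return
}

func main() {
	rs := []Result{
		{RoastStandard: "short roast", Score: 4},
		{RoastStandard: "another roast here", Score: 7},
	}
	avgLen, v, fb := metrics(rs)
	fmt.Printf("avg chars: %.1f, variance: %.2f, fallbacks: %.0f%%\n", avgLen, v, fb*100)
}
```

Rune counting matters here: French roasts contain accented characters, and byte length would overstate them.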
The tool: an evaluation harness in Go
A standalone binary in cmd/prompttest/main.go. No HTTP server, just a direct call to the Claude CLI. 30 fixed test cases (photos from randomuser.me — men and women, FR and EN), run sequentially with per-case duration measurement.
```go
func callClaude(ctx context.Context, photoURL, lang string) (*RoastResult, string, time.Duration, error) {
	prompt := buildPrompt(lang)
	fullPrompt := fmt.Sprintf("First, read the image file at %s and look at it carefully. Then:\n\n%s", photoURL, prompt)

	args := []string{
		"--print",
		"--model", "sonnet",
		"--effort", "low", // reduces time from ~50s to ~27s
		"--allowedTools", "Read",
		"--dangerously-skip-permissions",
		"-p", fullPrompt,
	}

	start := time.Now()
	out, err := exec.CommandContext(ctx, "claude", args...).Output()
	dur := time.Since(start)
	// ...
}
```
The --effort low flag is the first speed optimization: it cuts response time from ~50s to ~27s. It isn't officially documented, but its behavior has been stable across our runs.
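The CLI output also has to be parsed: --print can surround the JSON object with extra prose, so a defensive extraction step helps before unmarshaling. A sketch, with RoastResult field names that are assumptions rather than the project's actual schema:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// RoastResult mirrors the JSON shape the prompt asks for.
// Field names here are assumptions, not the project's real schema.
type RoastResult struct {
	Standard string  `json:"standard"`
	Score    float64 `json:"score"`
	Tier     string  `json:"tier"`
}

// parseRoast extracts the outermost {...} span from the CLI output and
// unmarshals it, tolerating any text the model prints around the JSON.
func parseRoast(out string) (*RoastResult, error) {
	start := strings.Index(out, "{")
	end := strings.LastIndex(out, "}")
	if start == -1 || end <= start {
		return nil, errors.New("no JSON object in output")
	}
	var r RoastResult
	if err := json.Unmarshal([]byte(out[start:end+1]), &r); err != nil {
		return nil, err
	}
	return &r, nil
}

func main() {
	out := "Here is the roast:\n{\"standard\":\"...\",\"score\":6,\"tier\":\"WiFi Signal With Legs\"}\nDone."
	r, err := parseRoast(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Tier, r.Score) // prints: WiFi Signal With Legs 6
}
```

A parse failure here is exactly what the fallback-rate metric counts.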
End-of-run report:
╔══════════════════════════════════════════════╗
║ PROMPTTEST v4 — 30 tests
╠══════════════════════════════════════════════╣
║ Avg chars standard : 128 (target < 200)
║ Score variance : 1.09 (target > 2.0)
║ Fallbacks : 0/30 (target < 5%)
║ Avg score : 5.8
║ Avg duration : 33.859s
╠══════════════════════════════════════════════╣
║ Score distribution:
║ 4: █████ (5)
║ 5: ████ (4)
║ 6: ███████ (7)
║ 7: ██████████████ (14)
╠══════════════════════════════════════════════╣
║ Sample roasts (first 5 valid):
║ [homme FR 1] score=4.0 tier="PDG de Rien du Tout"
║ → Le costume + les joues de bébé + le regard vide — t'es le seul
║ mec à avoir l'air d'un enfant ET d'un PDG raté en même temps.
╚══════════════════════════════════════════════╝
(Sample roast, translated: the tier "PDG de Rien du Tout" is roughly "CEO of Nothing at All", and the roast reads "The suit, the baby cheeks, the empty stare: you're the only guy who manages to look like a kid AND a failed CEO at the same time.")
Five versions, five lessons
| Version | Avg chars | Score variance | Fallbacks | Main issue |
|---|---|---|---|---|
| v1 | 216 | 0.44 | 0 | "T'as [X] de quelqu'un qui" — 17/30 roasts identical in structure |
| v2 | 110 | 0.58 | 0 | "[item] dit/says [X]" — new dominant cliché |
| v3 | 95 | 0.67 | 0 | "C'est la photo LinkedIn de..." — third cliché |
| v4 | 128 | 1.09 | 0 | Best version — specific, varied roasts |
| v5 | 146 | 0.90 | 1 | 3 scores of 8 appeared, but overall variance dropped |
Lesson 1 — Every positive example creates a cliché
In v1, the prompt gave examples of good roasts. Claude immediately copied the structure of those examples on 17 of the 30 cases. We banned that pattern, gave new examples — and Claude used the new examples as its new cliché. Three times in a row.
The solution (v4): drop positive structure examples entirely. Instead, describe the emotional target ("a roast that a stranger would screenshot and forward to a group chat") and only accumulate negative examples (explicitly banned patterns).
BANNED STARTERS (these patterns are overused trash):
- "[item] dit/crie/says [X]" → BANNED
- "T'as [X] de quelqu'un qui..." → BANNED
- "C'est la photo de profil LinkedIn de..." → BANNED
- Any sentence starting with "C'est la photo" → BANNED
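Banning patterns in the prompt only helps if regressions get caught by the harness, not by eyeballing 30 roasts. A sketch of such a check; the regexes are rough approximations of the banned list above, not the project's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// bannedPatterns approximates the clichés the v4 prompt bans.
// These regexes are illustrative, not the harness's real list.
var bannedPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)^t'as .+ de quelqu'un qui`),
	regexp.MustCompile(`(?i)^c'est la photo`),
	regexp.MustCompile(`(?i)\b(dit|crie|says)\b`),
}

// violatesBan reports whether a roast matches any banned pattern,
// so the run report can count cliché regressions per prompt version.
func violatesBan(roast string) bool {
	for _, re := range bannedPatterns {
		if re.MatchString(roast) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(violatesBan("C'est la photo LinkedIn de ton burnout.")) // true
	fmt.Println(violatesBan("Le front avance plus vite que ta carrière.")) // false
}
```

Counting violations per run is what surfaced the "17/30 identical structure" problem in the first place; a check like this turns it into a number the report can track.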
Lesson 2 — Score variance has a natural ceiling
No matter how we phrased the scoring instruction, variance plateaued around 1.1 with randomuser.me photos. These photos are intentionally "average" — they serve as generic profile photos. You can't extract a variance of 2.0 from a distribution that's naturally compressed between 4 and 7.
This isn't a prompt problem; it's a constraint of the input data itself. With real user photos (which include genuinely ugly or beautiful people), variance should be naturally higher. The v4 prompt is optimal for what this test set allows.
Lesson 3 — Claude is conservative with low scores
Even when explicitly asking for scores of 2-3 for "objectively difficult to look at" people, Claude resists. Anthropic's safety mechanisms push it to avoid saying a real person is ugly. We rarely got below 4.0 despite repeated instructions.
For a use case like ours (consented humorous roasting), this is slightly frustrating but understandable. The real question: does the user who uploads a photo expect a 2/10? Probably not, even if it's "more honest."
Lesson 4 — Text quality improves dramatically
This is the real gain from iteration. Between v1 and v4, the quality of the roasts is incomparable:
v1: "T'as la tête de quelqu'un qui a mis 'passionné par les synergies' dans son bio LinkedIn — le bâtiment derrière toi est plus intéressant que toi." ("You've got the face of someone who put 'passionate about synergies' in his LinkedIn bio; the building behind you is more interesting than you are.")
v4: "Le front avance plus vite que ta carrière et le regard est resté coincé à la page de chargement." ("Your forehead is moving faster than your career and your stare got stuck on the loading page.")
Same subject, same person. v4 is twice as short, twice as specific, three times funnier. Iterating on measurable metrics (length, fallbacks) forced prompt changes that had an indirect effect on subjective quality.
Limits of the approach
The autonomous iteration loop has important limits to keep in mind.
No ground truth. The metrics (length, variance) measure properties of the text, not its quality. A 90-char roast isn't necessarily funnier than a 180-char one. You're optimizing proxies, not the actual target.
The test set doesn't represent real users. randomuser.me means generic, neutral, well-lit profile photos. Real users upload party photos, blurry selfies, people in costume. The real distribution is different.
Each run takes ~15 minutes. 30 calls × ~30s = 15 min of waiting per iteration. We ran 5 iterations = 75 minutes of runs + analysis time. This isn't real-time optimization.
Score variance plateaus, and that's okay. We tried for 3 iterations to improve variance without major success. Recognizing the plateau and stopping is a skill in itself.
What the loop actually enables
The main value isn't reaching the "perfect prompt". It's making visible what's invisible. Without the tool, we wouldn't have known that 17/30 roasts had the exact same structure. We would have kept thinking the roasts were "pretty good" based on the 3 examples we tested manually.
The loop forces you to define "good" before optimizing. That's the real work: not writing the prompt, but deciding which proxy metrics are relevant. Once the metrics are defined, the AI does the rest — generating tests, measuring, revealing patterns.
This approach is reproducible for any text generation with measurable properties: length, presence of certain patterns, JSON parsing failure rate, distribution of a numeric value produced. If you can write it in a 5-line report, you can automate it.
Conclusion
We started with a prompt written by instinct, roasts averaging 216 characters, and a dominant cliché covering 56% of cases. We ended with a prompt averaging 128 characters, 0 fallbacks, specific and varied roasts — and most importantly, a clear understanding of why each version was better or worse than the previous one.
What surprised us: the most effective iteration (v4) wasn't the one that gave the AI the most instructions. It was the one that gave it the fewest — describe the emotional target, ban the failed patterns, and trust the model to find something else. Fewer positive constraints, more creative freedom within negative constraints.
The quality plateau exists. At some point, iterations no longer improve anything measurable. That's the signal to stop — not because the prompt is perfect, but because the marginal gains aren't worth the time invested anymore. Knowing when to stop is just as important as knowing how to iterate.