Quantifiers and greediness

When repeating by hand gets ridiculous

In lesson 2, to match a postcode, you wrote \d\d\d\d\d. It works. But it's ugly, and the day you need fifteen digits, you cry. Repeating a pattern by hand doesn't scale.

Regexes have an answer: quantifiers. A small symbol stuck after a pattern that says "repeat this, this many times". \d\d\d\d\d becomes \d{5}. Shorter, clearer, and you change the count in one keystroke.

Here's the full toolbox. Each quantifier goes right after the character or class to repeat:

{5} — exactly 5 times.
{2,4} — between 2 and 4 times.
{3,} — at least 3 times, no upper limit.
* — 0 times or more (shorthand for {0,}).
+ — 1 time or more (shorthand for {1,}).
? — 0 or 1 time. In other words: optional.

Above all, remember the three famous ones: *, + and ?. You'll meet them everywhere. The nuance that matters: + requires at least one occurrence, * accepts zero. A field that may be empty is *; a field that must contain something is +.
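Here's a minimal sketch of the toolbox in Python, using the standard re module. The helper name matches_whole and the sample strings are made up for illustration. It relies on re.fullmatch so that the whole string has to fit the pattern, which makes the counts easy to check.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# {n}, {n,m}, {n,}: explicit counts
exact_five = matches_whole(r"\d{5}", "69007")          # True: five digits
too_short = matches_whole(r"\d{5}", "123")             # False: only three digits
two_to_four = matches_whole(r"\d{2,4}", "123")         # True: three is within 2..4
at_least_three = matches_whole(r"\d{3,}", "1234567")   # True: no upper limit

# * accepts zero occurrences, + needs at least one
star_on_empty = matches_whole(r"\d*", "")              # True: zero digits is fine
plus_on_empty = matches_whole(r"\d+", "")              # False: one digit required
```

The last two lines are the nuance that matters: on an empty string, * is satisfied and + is not.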

Your turn: count with {n} and make optional with ?

First lab. Match a French postcode: exactly five digits. The pattern \d\d\d\d\d would work, but we want the short version. Type the pattern, watch the digits highlight live.

🎯 Regex lab · five digits, without repeating yourself

Solution

Pattern: \d{5}. The class \d (a digit, from lesson 2) followed by the quantifier {5} (exactly five times). The string 123 has only three digits: no match, as expected. And Lyon 69007 centre matches because it contains five digits in a row (we'll anchor this kind of pattern in lesson 4).
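If you want to replay this lab outside the page, here's a rough Python equivalent. The name first_match is hypothetical and the string 75001 is an extra sample, not one from the lab. re.search looks for the pattern anywhere in the string, which is roughly what the highlight does.

```python
import re

POSTCODE = re.compile(r"\d{5}")

def first_match(text: str):
    """Return the first run of five digits found in text, or None."""
    m = POSTCODE.search(text)
    return m.group() if m else None

plain = first_match("75001")                 # '75001'
too_short = first_match("123")               # None: only three digits
embedded = first_match("Lyon 69007 centre")  # '69007': found inside the string
```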

Second lab, the optional letter. We want to match "cat" in the singular and the plural "cats", with a single pattern. That's the job of ?: the trailing s becomes optional.

🎯 Regex lab · the optional plural

Solution

Pattern: cats?. The ? applies to the character right before it, here the s. It makes it optional: zero or one s. So cat and cats both match.

The "contains" trap: beware, cats? also matches "catty", because "catty" contains the run cat. Until we've seen boundaries (lesson 4), a pattern is happy just being present somewhere in the string. That's why the string to exclude here is "dog": it doesn't contain cat at all.
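The same behaviour can be sketched in Python. The helper name found is an assumption for this example. Watch what comes back for catty: the pattern is satisfied by the first three letters and never looks at the rest.

```python
import re

def found(text: str):
    """Return what cats? grabs in text, or None when there is no match."""
    m = re.search(r"cats?", text)
    return m.group() if m else None

singular = found("cat")    # 'cat'  (zero s)
plural = found("cats")     # 'cats' (one s)
contains = found("catty")  # 'cat'  (the "contains" trap)
excluded = found("dog")    # None
```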

Trap number one: the engine is greedy

Now the mission that trips up everyone. You have this sentence:

a "cat" and a "dog"

You want to extract what's between quotes. Logical: "a quote, then some text, then a quote". The obvious pattern is " followed by .+ (at least one character, any character) followed by ". That is ".+".

Before testing it, predict. The lab below will highlight exactly what the pattern grabs.

Predict before reading on

On the sentence a "cat" and a "dog", what will the pattern ".+" highlight? The word "cat" alone? The two words separately? Or something else?

Answer

One big block: "cat" and a "dog", from one quote to the other. The engine grabs the first quote, then .+ swallows everything up to the last quote on the line, the middle quotes included. You wanted two little pieces, you get one huge one. Play the lab below and watch the highlight: it's blatant.

The lab below has its pattern pre-filled with ".+". Its goal is not to get the ✓: it already has it. Its goal is to make you see the highlight. Look: on the first string, the engine highlights a giant block, from one quote to the other.

🎯 Regex lab · watch the engine swallow everything

The ✓ is misleading. The pattern "matches", but it matches too wide. Why? Because the regex engine is greedy. By default, .+ grabs as many characters as possible. It runs to the end, then backs up just enough to find the closing quote. Result: it stops at the last quote, not the first one it meets.
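To see the greedy grab without the lab, here's a small Python sketch on the same sentence. re.findall returns every match as a list, so a list with a single long item is the "one giant block".

```python
import re

sentence = 'a "cat" and a "dog"'

# Greedy: .+ runs to the end of the line, then backs up to the last quote
greedy_matches = re.findall(r'".+"', sentence)
print(greedy_matches)  # ['"cat" and a "dog"']
```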

Two cures: lazy and exclusion

There are two ways to tell the engine "stop earlier". Compare them honestly, because they're not equal in quality.

Cure 1: lazy with ?

Stick a ? right after the +: ".+?". That turns the quantifier lazy: instead of grabbing as much as possible, it grabs as little as possible. It stops at the first quote it meets. On our sentence, it finally highlights "cat" then "dog" separately.
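A quick Python check of the lazy version, on the same sentence as before:

```python
import re

sentence = 'a "cat" and a "dog"'

# Lazy: .+? stops at the first closing quote it can reach
lazy_matches = re.findall(r'".+?"', sentence)
print(lazy_matches)  # ['"cat"', '"dog"']
```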

Cure 2: explicit exclusion (often better)

Remember lesson 2: the negated class [^...] means "anything but this". So say exactly what you want: between the quotes, accept anything but a quote. That is "[^"]+": a quote, then one or more characters that are not a quote, then a quote. Impossible to swallow the middle quote: it's explicitly forbidden.

Prefer exclusion to lazy when you can. A lazy quantifier advances one step at a time: it takes one character, checks whether the rest of the pattern can match, and takes one more if it cannot, repeating this at every character. The exclusion [^"]+ instead states your intent directly ("no quote in here"): it's usually faster in backtracking engines, more readable, and can't be stretched across a quote when a later part of the pattern fails to match. Lazy gets you out of a jam; exclusion expresses what you actually want.
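Here's a sketch of the exclusion in Python, plus one edge case where the two cures stop agreeing. The edge case is invented for illustration: it adds a ! after the closing quote, which the lesson's labs don't do. When the ! is missing after the first quoted word, the lazy .+? is allowed to keep growing across the middle quotes, while [^"]+ is not.

```python
import re

sentence = 'a "cat" and a "dog"'

# On the lesson's sentence, exclusion gives the two pieces we wanted
exclusion_matches = re.findall(r'"[^"]+"', sentence)
print(exclusion_matches)  # ['"cat"', '"dog"']

# Hypothetical edge case: we only want a quoted word followed by "!"
tricky = '"cat" and "dog"!'

lazy_hit = re.search(r'".+?"!', tricky).group()
exclusion_hit = re.search(r'"[^"]+"!', tricky).group()

print(lazy_hit)       # "cat" and "dog"!  (lazy stretched across the quotes)
print(exclusion_hit)  # "dog"!
```

Lazy means "as little as possible while still matching", not "never cross a quote". The negated class is the one that actually forbids the quote.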

Watch out: ? wears two hats

You've just seen ? in two completely different roles. Never mix them up — it's a classic confusion.

The ? has two meanings depending on its position:

After a character or group → optional. cats? = the s is optional (zero or one).
Right after another quantifier (+, *, ?, {n,m}, {n,}) → lazy. .+? = take as little as possible.

The rule to avoid mistakes: look at what's to the left of the ?. A letter or a group → optional. A quantifier → lazy. Same symbol, two jobs.
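The two hats side by side, as a small Python sketch. The strings catss and aaaa are made-up samples. re.fullmatch demands that the whole string fit; re.match only anchors at the start, which makes the greedy/lazy difference visible.

```python
import re

# Hat 1: ? after a character makes that character optional
optional_s = [re.fullmatch(r"cats?", w) is not None for w in ("cat", "cats", "catss")]
print(optional_s)  # [True, True, False]

# Hat 2: ? after a quantifier makes that quantifier lazy
greedy = re.match(r"a+", "aaaa").group()   # 'aaaa': as many as possible
lazy = re.match(r"a+?", "aaaa").group()    # 'a': as few as possible
```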

Greedy versus lazy, seen from above

Keep this picture in mind. On the same sentence, the greedy pattern highlights one long block; the lazy pattern (or the exclusion) highlights two small pieces, exactly what we wanted.

Same sentence, two patterns. Greedy takes it all; lazy and exclusion take just enough.

Your turn: fix the greedy pattern

Take the pattern ".+" again, already in place. The ✓ may show, but remember: on the first string, the highlighted block is too wide. Replace it with "[^"]+" (the exclusion, the best version) or with ".+?" (the lazy one), and watch the highlight split cleanly into two pieces.

Observation lab: here, the ✓/✗ drill can't tell the greedy pattern from the fixed ones, because on these strings all three patterns give the same match / no-match verdict. The real difference is the highlight, not the ✓. Keep your eye on what lights up yellow: you're after two small blocks, not one big one.

🎯 Regex lab · clean split between quotes

Solution

Best version: "[^"]+". A quote, then "one or more characters that are NOT a quote", then a quote. The engine can't run past the first closing quote: it's excluded by [^"].

Acceptable lazy version: ".+?". The ? after the + says "take the minimum", so the engine stops at the first quote it meets. Note the g flag (global) in the right-hand box: it highlights all occurrences, not just the first. Without g, you'd only see one piece lit up.
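Python's re module has no g flag, so treat this as an analogy rather than the lab's own mechanism: re.search stops at the first match (like running without g), and re.findall collects all of them (like running with g).

```python
import re

QUOTED = re.compile(r'"[^"]+"')
sentence = 'a "cat" and a "dog"'

# First match only: comparable to the lab without the g flag
first_only = QUOTED.search(sentence).group()
print(first_only)  # "cat"

# Every match: comparable to the lab with the g flag
all_pieces = QUOTED.findall(sentence)
print(all_pieces)  # ['"cat"', '"dog"']
```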

Next step

Your pattern matches, but anywhere in the string: "Lyon 69007 centre" passed when we wanted a postcode on its own. Lesson 4: anchors and boundaries, to demand a match at the start, at the end, or on a whole word.

Lesson 4: Anchors and boundaries →
