Anchors and boundaries

The trap we planted back in lesson 1

Remember the very first lesson. You wrote /chat/ and the pattern proudly matched the word "chat". But it also matched INSIDE "chaton". And INSIDE "achat". The pattern doesn't look for a whole word: it looks for the sequence of letters c-h-a-t anywhere in the string.
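To see the trap in action, here is a minimal sketch in JavaScript. The sample list (including "chien") is an illustrative assumption, not part of the lab.

```javascript
// An unanchored pattern looks for the letters c-h-a-t anywhere in the string.
const unanchored = /chat/;

// Illustrative samples: the last one contains no "chat" at all.
const samples = ["chat", "chaton", "achat", "chien"];

const unanchoredResults = samples.map((s) => unanchored.test(s));
console.log(unanchoredResults); // [ true, true, true, false ]
```

"chaton" and "achat" both pass, which is exactly the behaviour the anchors are meant to rule out.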

Sometimes that's exactly what you want: find "error" in the middle of a log. Often, it isn't. When you validate a postcode, a username or a heading line, you don't want "somewhere inside", you want exactly that, with nothing else around it.

That's the whole point of this lesson. We're going to anchor the pattern to a precise spot: the start, the end, or the boundary between two words. You'll stop saying "this text contains" and start saying "this text IS".

^ and $: the start and the end

Two symbols, two anchors. The caret ^ says "here the string begins". The dollar $ says "here it ends". They're not characters to match: they're positions, invisible landmarks.

Compare. /chat/ matches "chat" hidden inside "achat". But /^chat$/ demands that the string start with chat AND end right after that same chat, with nothing before or after it: the string IS exactly chat. One anchor is sometimes enough: /^#/ matches any string that starts with a hash, whatever follows.

Same symbol, different job: you already met the caret in lesson 2, inside a class: [^aeiou] meant "anything but a vowel", that was negation. Here, outside brackets and at the head of the pattern, ^ has nothing to do with that: it marks the start. Same symbol, two jobs, and position decides: inside [...] right at the front → negation; at the head of the pattern → start anchor. The engine never confuses them; make sure you don't either.
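The two jobs of the caret can be put side by side in a short sketch. The variable names and test strings are illustrative assumptions.

```javascript
// Inside brackets, right at the front: negation ("any character except a lowercase vowel").
const notVowel = /[^aeiou]/;
// Outside brackets, at the head of the pattern: start anchor.
const startsWithA = /^a/;
// Both jobs in one pattern: the string must START with a non-vowel.
const startsWithNonVowel = /^[^aeiou]/;

const caretResults = {
  onlyVowels: notVowel.test("aei"),           // false: every character is a vowel
  hasConsonant: notVowel.test("abc"),         // true: "b" is not a vowel
  abc: startsWithA.test("abc"),               // true: "a" sits at position 0
  cab: startsWithA.test("cab"),               // false: the "a" is not at the start
  chat: startsWithNonVowel.test("chat"),      // true: starts with "c"
  achat: startsWithNonVowel.test("achat"),    // false: starts with "a"
};
console.log(caretResults);
```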

Place the anchors and play with this first lab. The pattern is pre-filled: watch how ^chat$ rejects anything extra, even a single trailing space.

🎯 Regex lab · the string IS exactly "chat"

Show the solution

The pattern ^chat$ anchors both sides. ^ forces it to start with "chat", $ forces it to end right after the "t". "chaton" has an "on" after the "t" → the $ isn't satisfied. "achat" has an "a" before the "c" → the ^ isn't satisfied. "chat" with a trailing space has a character after the "t" → again the $ fails. Only bare "chat" passes.
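Outside the lab, the same check can be reproduced with a few lines of JavaScript. This is only a sketch: the four strings are taken from the explanation above, and the variable names are made up for the example.

```javascript
// Anchored on both sides: the whole string must be exactly "chat".
const exact = /^chat$/;

// The four strings discussed in the solution; the last one has a trailing space.
const labStrings = ["chat", "chaton", "achat", "chat "];

const exactResults = labStrings.map((s) => exact.test(s));
labStrings.forEach((s, i) => {
  console.log(JSON.stringify(s), "->", exactResults[i]);
});
// "chat" -> true
// "chaton" -> false
// "achat" -> false
// "chat " -> false
```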

A second, very common case: spot heading lines in Markdown. A heading line starts with a hash. The pattern ^# is enough: no need for $, we don't care what comes after.

🎯 Regex lab · a line that starts with #

Show the solution

^# demands a hash at the start position. "# Titre" does start with the hash → match. "pas # un titre" contains a hash, but in the middle: the start position is the "p" of "pas". Since ^ only allows the # right at the very start, the string is rejected. That's exactly what tells apart "contains a #" from "starts with a #".
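The difference between "contains a #" and "starts with a #" can be sketched by running both patterns on the same strings. The third string, with a leading space, is an added assumption to show that even one character before the hash is enough to fail.

```javascript
const containsHash = /#/;    // a hash anywhere in the string
const startsWithHash = /^#/; // a hash at the very start only

const lines = ["# Titre", "pas # un titre", " # indented"];

const hashResults = lines.map((line) => ({
  line,
  contains: containsHash.test(line),
  startsWith: startsWithHash.test(line),
}));
console.log(hashResults);
// "# Titre"        -> contains: true, startsWith: true
// "pas # un titre" -> contains: true, startsWith: false
// " # indented"    -> contains: true, startsWith: false
```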

\b: the word boundary

^ and $ anchor at the ends of the string. But how do you say "the word chat, whole, wherever it sits in the sentence"? You need an anchor in the middle of the text: the word boundary, written \b.

/\bchat\b/ matches "chat" in "le chat dort", but refuses "chaton". The reason is precise. \b is not a character, it's a position: exactly the switch point between a word character (\w, seen in lesson 2: letter, digit or underscore) and a character that isn't one (space, punctuation, or the edge of the string). In "chaton", after the "t" comes an "o": two \w stuck together, so no boundary. The final \b fails, and "chaton" is rejected.
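Because \b is a position rather than a character, it can help to list where those positions actually fall. The helper below is a hypothetical illustration (the name boundaryPositions is not from the lesson): it collects the index of every zero-length \b match.

```javascript
// Hypothetical helper: return the index of every word boundary in a string.
function boundaryPositions(text) {
  return [...text.matchAll(/\b/g)].map((m) => m.index);
}

const inSentence = boundaryPositions("le chat dort");
const inChaton = boundaryPositions("chaton");

console.log(inSentence); // [ 0, 2, 3, 7, 8, 12 ]
console.log(inChaton);   // [ 0, 6 ]

// "chat" occupies indexes 3 to 6 in the sentence: there is a boundary at 3 and at 7.
// In "chaton" there is no boundary at index 4, right after the "t".
const wholeWord = /\bchat\b/;
const wholeWordResults = ["le chat dort", "chaton"].map((s) => wholeWord.test(s));
console.log(wholeWordResults); // [ true, false ]
```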

Beware the accent trap: in JavaScript, \b relies on the ASCII-only definition of \w, that is [A-Za-z0-9_]. Accented letters aren't part of it, so the engine treats é as a non-word character. Two nasty consequences follow. First, /\bété\b/ fails on the bare string "été": the start of the string and the é are both non-\w, so there is no boundary there at all. Second, the engine sees a "boundary" between the f and the é of "café", right in the middle of the word, so /\bcaf\b/ matches inside "café". Expect the same surprises with "élève" and "naïf". Keep it in mind: on French text, \b is treacherous. (Source: MDN, Word boundary assertion.)
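The sketch below reproduces the trap, then shows one possible workaround that is not part of this lesson: replacing \b with lookarounds built on Unicode property escapes (\p{L} for letters, \p{N} for digits), which require the u flag. It assumes the accented letters are stored as single precomposed characters; treat it as an illustration rather than the recommended fix.

```javascript
// The trap: é is not a \w, so \b lands in unexpected places.
const asciiBoundary = /\bété\b/;
const trapResults = {
  bareEte: asciiBoundary.test("été"), // false: no boundary between string start and "é"
  cafInCafe: /\bcaf\b/.test("café"),  // true: a "boundary" is seen between "f" and "é"
};
console.log(trapResults);

// One possible workaround (assumption, not from the lesson):
// "not preceded and not followed by a letter, a digit or an underscore".
const wholeWordEte = /(?<![\p{L}\p{N}_])été(?![\p{L}\p{N}_])/u;
const unicodeResults = {
  bareEte: wholeWordEte.test("été"),          // true
  inSentence: wholeWordEte.test("l'été arrive"), // true: apostrophe before, space after
  insideWord: wholeWordEte.test("répétée"),   // false: letters on both sides
};
console.log(unicodeResults);
```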

Your turn. The pattern \bchat\b is pre-filled. Watch the last string especially: punctuation is a boundary too, so "un chat." matches, the full stop acting as a word edge.

🎯 Regex lab · the whole word "chat"

Show the solution

\bchat\b requires a word boundary before AND after the letters "chat". In "le chat dort", the space before and the space after make two boundaries → match. In "chaton", there is a boundary before the "c", but after the "t" comes an "o": two word characters stuck together, so no final boundary → rejection. In "un chat.", the space before and the full stop after both count as boundaries → match. See: punctuation plays the same role as a space.

Predict before reading on

You have a multi-line string: "ligne 1\nligne 2\n# titre". The # is on the last line, not at the very start of the string. You test /^#/. Does it match, or not?

Show the answer

No, it doesn't match. By default, ^ means the very start of the whole string, not the start of each line. And the string starts with "ligne 1", not with #. The hash is indeed at the start of its line, but ^ doesn't care: for it, there's only one start, that of the entire string. For it to recognise the start of each line, you'll need to turn on the m flag, which is exactly what comes next.
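The prediction can be checked directly. In this sketch the variable name doc is an assumption; the string is the one from the question.

```javascript
const doc = "ligne 1\nligne 2\n# titre";

// The hash is there, at the start of the third line...
const hashIndex = doc.indexOf("#");
console.log(hashIndex); // 16

// ...but without any flag, ^ only matches at index 0, where the "l" of "ligne 1" sits.
const matchesWithoutFlag = /^#/.test(doc);
console.log(matchesWithoutFlag); // false
```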

The m flag: ^ and $ on every line

You just saw it in the prediction. On a multi-line string, ^ and $ by default only mean the start and the end of the WHOLE string. The middle line has neither start nor end in the engine's eyes.

The m flag (for multiline) changes that behaviour. With it, ^ matches the start of every line and $ the end of every line. That's exactly what you need to process text line by line: finding every Markdown heading in a document, for instance.
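As a rough sketch of the "every Markdown heading" use case, the m flag can be combined with the g flag to collect each matching line. The sample document and variable names are assumptions made for the example.

```javascript
const multiLine = "ligne 1\nligne 2\n# titre";

// Same pattern, with and without the m flag.
const withoutM = /^#/.test(multiLine); // false: ^ is the start of the whole string
const withM = /^#/m.test(multiLine);   // true: ^ is also the start of each line

// Illustrative document: three headings and one line with a hash in the middle.
const markdown = "# Intro\ntexte\n## Details\npas # un titre\n# Fin";

// g = find every match, m = ^ and $ apply to each line.
// "." never crosses a line break, so ".*$" stops at the end of the line.
const headings = markdown.match(/^#.*$/gm);

console.log(withoutM, withM); // false true
console.log(headings);        // [ '# Intro', '## Details', '# Fin' ]
```

The line "pas # un titre" is left out: its hash is not at the start of its line, so ^ rejects it even in multiline mode.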

Reminder from lesson 1: flags go after the closing slash, and in our lab you type them in the second field. In this last drill, the pattern ^# is already correct, but it fails on the multi-line string. The m flag is the solution: type it in the right-hand field and watch the drill turn green.

🎯 Regex lab · the m flag unlocks the drill

Show the solution

Leave the pattern as is: ^#. Just type m in the flags field (on the right). Without the flag, ^ only points to the very start of the string, that is the "l" of "ligne 1": no hash there, so it fails. With the m flag, ^ points to the start of each line. The third line begins with # → match. The pattern hasn't changed by a single character; it's the flag, and the flag alone, that unlocks the drill.

Next step

You can match exactly WHERE you want. Lesson 5: cut what you match into reusable pieces: groups, and the magic of $1.

Lesson 5: Groups, captures and alternation →
