Word Boundary

A word boundary in regular expressions is a zero-width assertion: it matches the position between a word character (a letter, digit, or underscore) and a non-word character, or between a word character and the start or end of the text. It doesn't match any characters itself, only the position between them.

What is a word character?

In regular expressions, a word character is usually any alphanumeric character (letters and numbers) along with the underscore. Specifically, it includes the following:

  • All uppercase letters (A through Z)
  • All lowercase letters (a through z)
  • All digits (0 through 9)
  • The underscore character (_)
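The list above corresponds to what the \w escape sequence matches by default. As a small sketch (the helper name isWordChar is hypothetical and used only for illustration), you could check individual characters like this:

```php
<?php
// Hypothetical helper: true when $char is exactly one ASCII word character.
function isWordChar(string $char): bool
{
    // \A and \z anchor the match to the whole string.
    return preg_match('/\A\w\z/', $char) === 1;
}

foreach (['a', 'Z', '7', '_', '-', ' ', '!'] as $char) {
    printf("'%s' => %s\n", $char, isWordChar($char) ? 'word' : 'non-word');
}
```

Letters, digits, and the underscore are reported as word characters; the hyphen, space, and exclamation mark are not.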

Word boundaries can be found in three different places:

  1. At the beginning of the text, if the first character is a letter, number, or underscore.
  2. In the middle of the text, right between a letter, number, or underscore and a character that isn't one of those (like a space or punctuation mark).
  3. At the end of the text, if the last character is a letter, number, or underscore.

In PHP, you can use the \b escape sequence to define a word boundary in a regular expression pattern. It is often used to search for whole words rather than substrings within words.
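To see where these positions actually fall, here is a minimal sketch that lists every word boundary in a short sample string. The sample text is an assumption chosen for illustration; preg_match_all() and PREG_OFFSET_CAPTURE are standard PHP.

```php
<?php
$text = 'Hello, world!';

// \b matches empty positions, so PREG_OFFSET_CAPTURE is used to record
// where each zero-width match occurs.
preg_match_all('/\b/', $text, $matches, PREG_OFFSET_CAPTURE);

// Each entry in $matches[0] is [matched text, byte offset].
$offsets = array_column($matches[0], 1);

echo implode(', ', $offsets), "\n"; // 0, 5, 7, 12
```

Offset 0 is the start of the text (before "H"), 5 is between "o" and the comma, 7 is between the space and "w", and 12 is between "d" and "!". There is no boundary at the very end because the last character is not a word character.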

Using Word Boundaries

Let's look at some simple examples using PHP to see where word boundaries come into play. We'll use the preg_match() function to check for matches in different scenarios.

Example 1: Match Found

$string = "I have a cat";
if (preg_match("/\bcat\b/", $string)) {
  echo "Match found!";
} else {
  echo "No match found.";
}

Here, we're looking for the whole word "cat". Since "cat" is a whole word in the string, it prints "Match found!"

Example 2: No Match Found

$string = "I have a category";
if (preg_match("/\bcat\b/", $string)) {
  echo "Match found!";
} else {
  echo "No match found.";
}

In this example, we're looking for "cat" again, but this time "cat" is part of the word "category". Since we're using word boundaries, it won't match "cat" inside "category", so it prints "No match found."

Example 3: Match at the Start

$string = "dog is friendly";
if (preg_match("/\bdog\b/", $string)) {
  echo "Match found!";
} else {
  echo "No match found.";
}

Here, we're looking for the word "dog". Since "dog" is at the beginning of the string and is a whole word, it prints "Match found!"

Example 4: Match at the End

$string = "She has a pet dog";
if (preg_match("/\bdog\b/", $string)) {
  echo "Match found!";
} else {
  echo "No match found.";
}

In this example, we're looking for "dog" at the end of the string. Since "dog" is a whole word and at the end, it prints "Match found!"

The \b in the pattern ensures that you're finding whole words, not parts of other words. It acts like an invisible fence around the word, making sure it's separate from anything else.
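The same idea carries over to counting and replacing. The sketch below compares a pattern with and without \b, and wraps a whole-word replacement in a helper. The function name replaceWholeWord and the sample text are hypothetical; preg_quote() is used on the assumption that the word may come from a variable and could contain regex metacharacters.

```php
<?php
// Hypothetical helper: replace $word only where it appears as a whole word.
function replaceWholeWord(string $word, string $replacement, string $text): string
{
    // preg_quote() escapes regex metacharacters in $word; '/' is the delimiter.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';

    // preg_replace() returns null on error; fall back to the original text.
    return preg_replace($pattern, $replacement, $text) ?? $text;
}

$text = 'cat category concat cat.';

echo replaceWholeWord('cat', 'dog', $text), "\n"; // dog category concat dog.
echo preg_match_all('/\bcat\b/', $text), "\n";    // 2 (whole words only)
echo preg_match_all('/cat/', $text), "\n";        // 4 (any occurrence)
```

Note that this approach assumes the word itself starts and ends with a word character; otherwise \b would not sit where you expect it.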

Only Latin characters are supported

By default, the word boundary \b in PHP regular expressions follows the ASCII definition of a word character: the unaccented Latin letters (A-Z, a-z), digits (0-9), and the underscore (_). It doesn't recognize other characters, including accented Latin letters such as é, as word characters.

If you're working with text that includes such characters (for example Cyrillic, Arabic, or Chinese), a plain \b may not produce the desired result, because it cannot reliably tell where a word starts or ends in these scripts. Adding the u modifier to the pattern, which makes PHP treat the pattern and subject as UTF-8, generally makes \b Unicode-aware.
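One common workaround in PHP is the u pattern modifier, which treats the pattern and the subject as UTF-8. The following is a sketch under the assumptions that the source file is saved as UTF-8 and that the default locale is in use; the sample sentences are illustrative only.

```php
<?php
$text = 'У меня есть кот'; // UTF-8 Cyrillic: "I have a cat"

// Without the u modifier the pattern works on bytes, and the bytes of
// Cyrillic letters are normally not treated as word characters, so this
// is expected to find no match.
var_dump(preg_match('/\bкот\b/', $text));

// With the u modifier, \b sees Cyrillic letters as word characters.
var_dump(preg_match('/\bкот\b/u', $text)); // int(1)

// The boundary still excludes longer words that merely start with "кот".
var_dump(preg_match('/\bкот\b/u', 'У меня есть котёнок')); // int(0)
```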

Exercise

Typically, times are formatted as "hours:minutes". Both hours and minutes have two digits, like 09:00.

For this exercise, create a regexp (short for regular expression) to find the time in a string. Here is an example of a string you should use for this exercise:

  • "Breakfast at 09:00 in the room 123:456."

For simplicity, don't worry about verifying the time is correct. 25:99 can also be a valid result. The regexp shouldn't match 123:456.
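If you want to check your answer, here is one possible solution, offered as a sketch rather than the only correct approach. It relies on \b on both sides so that the digits of 123:456 cannot be partially matched.

```php
<?php
$text = 'Breakfast at 09:00 in the room 123:456.';

// Two digits, a colon, two digits, with a word boundary on each side.
preg_match_all('/\b\d\d:\d\d\b/', $text, $matches);

print_r($matches[0]); // contains only "09:00"
```

Inside 123:456 there is no boundary between "1" and "2" or between "5" and "6", because digits are word characters, so no two-digit pair there can satisfy the pattern.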

Key Takeaways

  • A word boundary is a position between a word character and a non-word character. It's not a character itself but a location in the text.
  • Word characters usually include letters (A-Z, a-z), digits (0-9), and underscores (_).
  • Anything that is not a word character, like spaces, punctuation, or special symbols, is considered a non-word character.
  • Word boundaries are used in regular expressions to find exact whole words, not parts of other words.
  • In many programming languages, including PHP, the \b escape sequence represents a word boundary.
  • Word boundaries can be found at the start of the string if the first character is a word character, between two characters where one is a word character and the other is not, and at the end of the string if the last character is a word character.
  • The standard word boundary definition may not work with non-Latin characters, so custom patterns might be needed for those cases.
