
Sets and Ranges

A set is a group of specific characters that you want to match. It's enclosed in square brackets []. When the regular expression engine sees a set, it says, "I can match any one of these characters inside the brackets."

A range is a shorthand way to specify a sequence of characters in a set. It's denoted with a hyphen -.

Sets and ranges allow you to specify groups of characters that you want to match (or not match). They make your regular expressions more concise and flexible, allowing you to match a variety of characters with a simple pattern.

Sets

Imagine you have a box that contains three letters: 'a', 'e', and 'o'. If you have a pattern that includes [eao], it means you want to find any one of these three letters in the text you're looking at.

We call this box with the three letters a "set." You can use this set in a pattern to find these letters along with any other regular characters or symbols you want to match.

$string = "Mop top";
if (preg_match_all("/[tm]op/i", $string, $matches)) { // PHP has no "g" modifier; preg_match_all already finds every match
  print_r($matches[0]); // Array containing "Mop" and "top"
}

Sets are always written with a pair of [] characters. The list of characters you want to search for must be written inside them, as in the example above.

Think of a set in a pattern like a question with multiple choices. Even though there are several options in the question, you can only pick one answer. In the same way, a set with multiple characters will match only one character in the text you're looking at.

Take the following:

$string = "Voila";
 
if (preg_match("/V[oi]la/", $string, $matches)) {
  print_r($matches[0]); // This part won't be reached in this example
} else {
  echo "no matches"; // no matches
}
  1. The pattern /V[oi]la/ is looking for the letter "V", followed by either "o" or "i", and then "la".
  2. In "Voila", the "V" is followed by both "o" and "i", but the set matches exactly one character. After "V" and "o", the pattern expects "la" and finds "il" instead, so the result is "no matches".

Valid values would be Vola or Vila.
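To see the "pick one answer" rule in action, here is a minimal sketch that runs the same pattern against a few words. The word list is made up for illustration; only the pattern /V[oi]la/ comes from the example above.

```php
<?php
// Hypothetical test words; only the pattern comes from the text above.
$words = ["Vola", "Vila", "Voila", "Vla"];
$results = [];

foreach ($words as $word) {
    // preg_match returns 1 when the pattern is found and 0 when it is not.
    $results[$word] = preg_match("/V[oi]la/", $word) === 1;
}

var_export($results);
// "Vola" and "Vila" are true; "Voila" and "Vla" are false,
// because the set has to consume exactly one character.
```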

Ranges

Inside the square brackets, we've written a list of characters. But what if we want to select every letter in the alphabet?

/[abcdefghijklmnopqrstuvwxyz]/

That looks pretty ridiculous, doesn't it? Luckily, there's a shorter solution that's easier to write. We can use ranges.

A range in a set for regular expressions is a way to specify a sequence of characters that you want to match without having to list them all individually. It's like a shortcut that helps make your pattern more concise.

A range is defined using a hyphen - between two characters inside a set (that is, inside the square brackets). The range matches any character whose code falls between those two characters, inclusive, in character-code order (ASCII order for the letters and digits used here).

Some examples would be:

  • [a-z]: This will match any lowercase letter from a to z.
  • [A-Z]: This will match any uppercase letter from A to Z.
  • [0-9]: This will match any digit from 0 to 9.
  • [a-zA-Z]: This will match any letter, whether uppercase or lowercase.

A range allows you to express a whole series of characters easily, making your patterns more straightforward and easier to read. If you wanted to match any lowercase letter, writing [abcdefghijklmnopqrstuvwxyz] would be quite cumbersome, so [a-z] offers a neater alternative.
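As a quick illustration, the sketch below applies the four ranges from the list to one sample string. The string and the variable names are assumptions chosen for this example.

```php
<?php
// Hypothetical sample text.
$string = "Room 42B, floor 3";

preg_match_all("/[a-z]/", $string, $lower);      // lowercase letters
preg_match_all("/[A-Z]/", $string, $upper);      // uppercase letters
preg_match_all("/[0-9]/", $string, $digits);     // digits
preg_match_all("/[a-zA-Z]/", $string, $letters); // any letter

echo implode("", $lower[0]), "\n";   // oomfloor
echo implode("", $upper[0]), "\n";   // RB
echo implode("", $digits[0]), "\n";  // 423
echo implode("", $letters[0]), "\n"; // RoomBfloor
```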

In the example below, we're searching for "x" followed by two characters, each of which is either a digit from 0 to 9 or a letter from A to F:

$string = "Exception 0xAF";
 
if (preg_match_all("/x[0-9A-F][0-9A-F]/", $string, $matches)) {
  print_r($matches[0]); // Array containing "xAF"
}

Here [0-9A-F] has two ranges: it searches for a character that is either a digit from 0 to 9 or a letter from A to F.
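Ranges are case-sensitive, so [0-9A-F] does not accept lowercase hex digits. One possible way to handle both spellings, sketched here with a made-up input string, is to add a third range, a-f, to each set:

```php
<?php
// Hypothetical input with one lowercase and one uppercase hex value.
$string = "Value 0xaf and 0xAF";

// Uppercase-only ranges, as in the example above: "xaf" is skipped.
preg_match_all("/x[0-9A-F][0-9A-F]/", $string, $upperOnly);

// Adding the a-f range to each set accepts both spellings.
preg_match_all("/x[0-9A-Fa-f][0-9A-Fa-f]/", $string, $anyCase);

print_r($upperOnly[0]); // Array containing "xAF"
print_r($anyCase[0]);   // Array containing "xaf" and "xAF"
```

Using the i modifier instead would also work, but it would make the leading "x" case-insensitive too.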

Character Classes in Ranges

Character classes can also be written inside sets. For example, if we want to find a letter, number, or underscore (which is what we call a "wordly" character, represented by \w) or a dash -, we can write the set as [\w-].

We can even mix different character classes together. So, if we want to find either a whitespace character (which is what \s represents: spaces, tabs, newlines, and so on) or a digit (represented by \d), we can write the set as [\s\d]. It's like saying, "Find me either a whitespace character or a digit."
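Here is a small sketch of both sets from this section. The sample string is an assumption picked so that each set has something to find.

```php
<?php
// Hypothetical sample text.
$string = "user_name-01 x";

// [\w-] : a word character or a dash. Everything except the space matches.
preg_match_all("/[\\w-]/", $string, $wordOrDash);

// [\s\d] : a whitespace character or a digit.
preg_match_all("/[\\s\\d]/", $string, $spaceOrDigit);

echo implode("", $wordOrDash[0]), "\n"; // user_name-01x
var_export($spaceOrDigit[0]);           // "0", "1" and " "
```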

Did you know?

Some character classes are shorthand for character sets. By default (without the u modifier), for example:

  • \d is the same as [0-9].
  • \w is the same as [a-zA-Z0-9_].
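  • \s is the same as [ \t\n\x0B\f\r] (space, tab, newline, vertical tab, form feed, carriage return). The vertical tab is written as \x0B here because in PHP's PCRE engine \v stands for a broader "vertical whitespace" class.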
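These equivalences can be checked with a short sketch like the one below. It assumes an ASCII-only sample string and no u modifier; the string and variable names are made up for the example.

```php
<?php
// Hypothetical ASCII-only sample containing a tab, a space and a newline.
$string = "id_7\tok 42\n";

// Returns true when both patterns produce the same list of matches.
function sameMatches(string $a, string $b, string $subject): bool
{
    preg_match_all($a, $subject, $first);
    preg_match_all($b, $subject, $second);
    return $first[0] === $second[0];
}

$digitsSame = sameMatches("/\\d/", "/[0-9]/", $string);
$wordsSame  = sameMatches("/\\w/", "/[a-zA-Z0-9_]/", $string);
// The vertical tab is written as \x0B, since PCRE treats \v inside a set
// as a broader "vertical whitespace" class.
$spacesSame = sameMatches("/\\s/", "/[ \\t\\n\\x0B\\f\\r]/", $string);

var_dump($digitsSame, $wordsSame, $spacesSame); // bool(true) three times
```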

Excluding Ranges

By default, a set matches any one of the characters listed in it. However, we can invert it so that it matches any character that is not listed by adding the ^ character right after the opening bracket: [^a-z].

Take the following examples:

  • [^aeyo] will match any character except 'a', 'e', 'y', or 'o'.
  • [^0-9] will match any character except a digit, the same as using \D.
  • [^\s] will match any non-whitespace character, the same as using \S.

In the PHP code example below, we're looking for any characters except letters, digits, and spaces:

$string = "john1234@example.com";
 
if (preg_match_all("/[^\\d\\sA-Z]/i", $string, $matches)) {
  print_r($matches[0]); // Array containing "@" and "."
}

The result includes "@" and ".", as these are the characters in the string that are not letters, digits, or spaces.
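The three negated sets from the list above can be tried out the same way. This is a minimal sketch with a made-up sample string; it also compares two of the sets with their shorthand equivalents, \D and \S.

```php
<?php
// Hypothetical sample text.
$string = "yo-yo 42";

preg_match_all("/[^aeyo]/", $string, $notVowelish);
preg_match_all("/[^0-9]/", $string, $notDigit);
preg_match_all("/\\D/", $string, $shortNotDigit);
preg_match_all("/[^\\s]/", $string, $notSpace);
preg_match_all("/\\S/", $string, $shortNotSpace);

var_export($notVowelish[0]);               // "-", " ", "4" and "2"
echo implode("", $notDigit[0]), "|\n";     // yo-yo |
echo implode("", $notSpace[0]), "\n";      // yo-yo42
var_dump($notDigit[0] === $shortNotDigit[0]); // bool(true)
var_dump($notSpace[0] === $shortNotSpace[0]); // bool(true)
```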

Escaping in Sets

Inside a character set (the square brackets []), the usual rules for escaping characters are relaxed, and many characters that are special outside of a set lose their special meaning inside a set.

Here's a comparison:

  • Outside a Set: Characters like ., ?, *, +, (), {}, |, and others have special meanings in regular expressions and need to be escaped if you want to match them literally.
  • Inside a Set: Most of these characters lose their special meaning and do not need to be escaped. For example, if you want to match a literal period or plus sign inside a set, you can simply write [.+?] without needing backslashes.

The exceptions are the backslash itself (\), the caret (^) if it's at the beginning of the set, the hyphen (-) if it's used in a way where it could be interpreted as indicating a range, and the closing square bracket (]).

So inside a set, you generally only need to worry about escaping those specific characters, and the rest can be included literally without the need for escaping.

In the example below, the regular expression [-().^+] looks for one of the characters -().^+:

// No need to escape
$pattern = "/[-().^+]/";
$string = "1 + 2 - 3";
 
if (preg_match_all($pattern, $string, $matches)) {
    print_r($matches[0]); // Matches +, -
}

But if you decide to escape them "just in case", then there would be no harm:

// Escaped everything
$pattern = "/[\\-\\(\\)\\.\\^\\+]/";
$string = "1 + 2 - 3";
 
if (preg_match_all($pattern, $string, $matches)) {
    print_r($matches[0]); // Also works: +, -
}

Note that in these PHP double-quoted strings, each backslash is itself escaped, hence the double backslashes: "\\-" in the PHP source reaches the regular expression engine as \-. In these examples, escaping the characters inside the set isn't necessary, but it doesn't cause any problems if you choose to do so.
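For the characters that do stay special inside a set, escaping matters. The sketch below, using made-up sample strings, first matches the four exceptions literally and then shows how an unescaped hyphen in the middle of a set quietly turns into a range.

```php
<?php
// Hypothetical sample; the text itself is: a-b ^c [d] e\f
$string = "a-b ^c [d] e\\f";

// All four special characters escaped.
// The regex as the engine sees it: [\^\-\]\\]
preg_match_all("/[\\^\\-\\]\\\\]/", $string, $special);
print_r($special[0]); // Array containing "-", "^", "]" and "\"

// An unescaped hyphen between two characters forms a range:
// [+-.] covers every code from "+" to ".", which includes ",".
$list = "1,2";
$asRange   = preg_match_all("/[+-.]/", $list);   // 1: the comma matches
$asLiteral = preg_match_all("/[+\\-.]/", $list); // 0: only +, - and . are allowed

var_dump($asRange, $asLiteral); // int(1) int(0)
```

Placing the hyphen first or last in the set, as in [-().^+] above, is another way to keep it literal.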

Key Takeaways

  • A set is a group of characters enclosed in square brackets [], matching any single character from the set.
  • Characters within a set are matched literally, without special meaning, except for ^, -, ], and \.
  • By placing a caret ^ immediately after the opening bracket, you create a negated set that matches any character except those in the set, e.g., [^aeiou].
  • A hyphen - between two characters creates a range, matching any character between those two characters, inclusive, e.g., [a-z].
  • Special characters such as ^, -, ], and \ may need to be escaped with a backslash within a set, depending on their placement.
  • Predefined character classes like \d, \w, \s, etc., can be used inside a set.
  • Characters that usually need escaping outside of a set (such as ., ?, *, +, (), {}, |) lose their special meaning inside a set and don't need to be escaped.
