Character Classes

A character class is a special notation that matches any single character from a certain set. It's like pointing at one section of a menu and saying, "I'll have anything from here": instead of naming one exact character, you name a whole category, such as digits or whitespace.

Imagine the practical task of converting a formatted phone number like "+7(903)-123-45-67" into digits only: 79031234567. Character classes make this possible.

Digit Class

The digit class in regular expressions, written as \d, matches any single digit from 0 to 9. Use it whenever you need to find digits within a text.

Suppose you have a string with various characters, and you want to extract all the individual digits from it. Here's an example of how you can do this using PHP:

$stringWithDigits = "I have 3 cats and 5 dogs.";
$pattern = '/\d/'; // This pattern means "any digit from 0 to 9."
 
// Using preg_match_all to find all matches
preg_match_all($pattern, $stringWithDigits, $matches);
 
print_r($matches[0]); // Output: Array ( [0] => 3 [1] => 5 )

The $pattern variable holds our pattern, where \d is the digit class representing any digit from 0 to 9. The pattern is enclosed in forward slashes (/), which act as delimiters. The preg_match_all() function finds every match and stores the results in $matches. Printing $matches[0] with print_r() shows the matched digits, which in this case are 3 and 5.
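
Keep in mind that \d matches one digit at a time, so a multi-digit number comes back as several separate matches. The sketch below illustrates this; the sample string is made up for the illustration and is not part of the original example.

```php
<?php
// Hypothetical sample string, chosen only to show a multi-digit number.
$text = "Room 42 is on floor 7.";

// \d matches exactly one digit per match, so "42" yields "4" and "2".
preg_match_all('/\d/', $text, $matches);

print_r($matches[0]); // Output: Array ( [0] => 4 [1] => 2 [2] => 7 )
```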

Space Class

The space class in regular expressions is useful for matching any whitespace characters, such as spaces, tabs, and newline characters. In regular expressions, it's represented as \s.

Imagine you have a string with words separated by different types of whitespace (spaces, tabs, etc.), and you want to find all these whitespace characters.

Here's an example of how you can do this using PHP:

$stringWithSpaces = "I have\t3 cats  and\n5 dogs.";
$pattern = '/\s/'; // This pattern means "any whitespace character."
 
// Using preg_match_all to find all matches
preg_match_all($pattern, $stringWithSpaces, $matches);
 
var_dump($matches[0]);

In this example, \s is the space class, representing any whitespace character such as a space, tab, or newline. The var_dump() output lists every match: the spaces, the tab (written as \t in the source string), and the newline character (written as \n).
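
Because whitespace is invisible when printed, it can help to count the matches and print each one in an escaped form. This is a small sketch that reuses the string from the example above; using json_encode() to make the characters visible is just one convenient option, not the only way to inspect them.

```php
<?php
$stringWithSpaces = "I have\t3 cats  and\n5 dogs.";

preg_match_all('/\s/', $stringWithSpaces, $matches);

echo count($matches[0]), "\n"; // Output: 7

// json_encode() escapes control characters, so a tab is shown as "\t"
// and a newline as "\n" instead of being printed invisibly.
foreach ($matches[0] as $whitespace) {
    echo json_encode($whitespace), "\n";
}
```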

Word Class

The word class in regular expressions is used to match any word characters, which include letters, digits, and underscores. It's represented by \w.

Suppose you have a string with different characters, and you want to find all the word characters within it.

Here's an example of how you can do this using PHP:

$stringWithWords = "I have 3_cats and 5 dogs!";
$pattern = '/\w/'; // This pattern means "any word character."
 
// Using preg_match_all to find all matches
preg_match_all($pattern, $stringWithWords, $matches);
 
print_r($matches[0]);

Here, \w is the word class, representing any letter (uppercase or lowercase), digit, or underscore. The script prints the array of matched word characters: the letters, the digits, and the underscore in "3_cats". The spaces and the exclamation mark are not word characters, so they are skipped.
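
To see at a glance what \w kept and what it skipped, you can join the single-character matches back into one string. This sketch reuses the string from the example above.

```php
<?php
$stringWithWords = "I have 3_cats and 5 dogs!";

preg_match_all('/\w/', $stringWithWords, $matches);

// Join the single-character matches to see which characters survived.
echo implode('', $matches[0]); // Output: Ihave3_catsand5dogs
```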

Combining Symbols and Classes

Typically, you don't use classes on their own; you combine them with other symbols. Combining classes and regular symbols is a common way to build more specific and useful patterns that match complex structures within a text. For example:

Suppose you want to extract all the hashtags from a given text. A hashtag is typically composed of a hash symbol (#) followed by a sequence of word characters (letters, digits, underscores).

$textWithHashtags = "Follow #coding and #PHP_tutorials for more!";
$pattern = '/#\w+/'; // This pattern means "a hash symbol followed by one or more word characters."
 
// Using preg_match_all to find all matches
preg_match_all($pattern, $textWithHashtags, $matches);
 
print_r($matches[0]); // Output: Array ( [0] => #coding [1] => #PHP_tutorials )

The pattern '/#\w+/' we've created combines a regular symbol (#) with a word class (\w) and a quantifier (+).

  • #: Matches the hash symbol exactly.
  • \w: Matches any word character (letters, digits, or underscores).
  • +: Means "one or more of the preceding element" (in this case, one or more word characters).

The result from this pattern should yield both #coding and #PHP_tutorials.
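
The same idea works without a quantifier: you can simply place classes and literal symbols one after another. As an illustrative sketch (the sample text and the time format are assumptions made for this example), the pattern below looks for two digits, a colon, and two more digits.

```php
<?php
// Hypothetical sample text for illustration.
$schedule = "The meeting is at 09:30, lunch at 12:15.";

// Two digits, a literal colon, then two more digits.
$pattern = '/\d\d:\d\d/';

preg_match_all($pattern, $schedule, $matches);

print_r($matches[0]); // Output: Array ( [0] => 09:30 [1] => 12:15 )
```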

Inverse Classes

An inverse class in regular expressions is used to match any characters that are NOT part of a specific set. Inverse classes are often used to exclude certain characters from the match.

Here are the inverse classes for the digit, space, and word classes:

  • Inverse of Digit Class (\d): The inverse is written as \D, and it matches any character that is NOT a digit (0-9).
  • Inverse of Space Class (\s): The inverse is written as \S, and it matches any character that is NOT whitespace (spaces, tabs, newline characters).
  • Inverse of Word Class (\w): The inverse is written as \W, and it matches any character that is NOT a word character (letters, digits, underscores).

Example

Let's use the inverse class for digits to grab only the numbers from a phone number.

Suppose you have a phone number in the format "+7(903)-123-45-67" and you want to extract only the digits from it, turning it into "79031234567".

$phoneNumber = "+7(903)-123-45-67";
$pattern = '/\D/'; // This pattern means "any non-digit character."
 
// Using preg_replace to replace all non-digit characters with an empty string
$onlyDigits = preg_replace($pattern, '', $phoneNumber);
 
echo $onlyDigits; // Output: 79031234567

Here, \D is the inverse class for digits, representing any non-digit character. The preg_replace() function replaces every match of the pattern (every non-digit character) in the given string ($phoneNumber) with an empty string ''. In effect, it removes all non-digit characters from the phone number. Finally, we echo the result.
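
The other two inverse classes can be used in the same way. The following is a minimal sketch with made-up sample strings: \W strips everything that is not a word character, and \S counts the non-whitespace characters.

```php
<?php
// Hypothetical sample strings for illustration.
$greeting = "Hello, World!";
$spaced   = " a b\tc\n";

// \W matches anything that is not a letter, digit, or underscore,
// so replacing it with '' removes the comma, space, and exclamation mark.
$cleaned = preg_replace('/\W/', '', $greeting);
echo $cleaned, "\n"; // Output: HelloWorld

// \S matches every non-whitespace character; preg_match_all() returns
// the number of matches it found.
$count = preg_match_all('/\S/', $spaced, $found);
echo $count, "\n"; // Output: 3
```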

Selecting Any Character

In regular expressions, a dot (.) is a special character that matches any character except for a newline (\n). It's often used when you want to match any character in a particular position within a pattern.

$pattern = '/CS.4/'; // Matches "CS" followed by any character, followed by "4"
 
$strings = array("CSS4", "CS-4", "CS 4");
 
foreach ($strings as $string) {
  if (preg_match($pattern, $string, $match)) {
    echo $match[0] . " "; // Prints each match in turn: CSS4 CS-4 CS 4
  }
}
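
The newline exception is easy to check. In the sketch below, the same pattern fails when the middle character is a newline. The second call uses the s pattern modifier (PCRE_DOTALL), which this section does not otherwise cover; it is shown here only to illustrate that the exception is a default setting that can be switched off.

```php
<?php
// By default the dot does not match "\n", so preg_match() returns 0.
$withoutModifier = preg_match('/CS.4/', "CS\n4");
var_dump($withoutModifier); // Output: int(0)

// With the s modifier (PCRE_DOTALL), the dot also matches a newline.
$withModifier = preg_match('/CS.4/s', "CS\n4");
var_dump($withModifier); // Output: int(1)
```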

Be Careful of Spaces

Typically, spaces are not something we pay much heed to. To most of us, the strings "1-5" and "1 - 5" appear almost the same. However, when working with regular expressions, ignoring spaces can lead to unexpected behavior.

Consider trying to match two digits separated by a hyphen:

preg_match('/\d-\d/', "1 - 5", $match); // Returns no match!
 
print_r($match);

This pattern fails to match the string because it doesn't account for the spaces around the hyphen.

We can correct this by incorporating spaces into the regular expression pattern:

preg_match('/\d - \d/', "1 - 5", $match); // Matches "1 - 5", so it works
// Alternatively, we can use the \s class to represent spaces:
// preg_match('/\d\s-\s\d/', "1 - 5", $match); // Matches "1 - 5", so this also works
 
print_r($match);

In a regular expression, a space is a character as significant as any other. Adding or removing a space changes what the pattern matches, so every character, spaces included, must be handled with care.
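
If the input may or may not contain spaces around the hyphen, one option is to make the whitespace optional. The sketch below relies on the * quantifier ("zero or more of the preceding element"), which is not introduced in this section, so treat it as a preview rather than part of the lesson; the input strings are made up for illustration.

```php
<?php
// \s* means "zero or more whitespace characters", so the pattern
// accepts the hyphen with or without surrounding spaces.
$pattern = '/\d\s*-\s*\d/';

// Hypothetical inputs for illustration.
$inputs = ["1-5", "1 - 5", "1  -5"];

$results = [];
foreach ($inputs as $input) {
    if (preg_match($pattern, $input, $match)) {
        $results[] = $match[0];
        echo "[" . $match[0] . "]\n";
    }
}
// Output:
// [1-5]
// [1 - 5]
// [1  -5]
```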

Key Takeaways

  • Character classes allow you to match any one character from a specified set of characters.
  • \d: Matches any digit (0-9).
  • \s: Matches any whitespace character (including spaces, tabs, line breaks).
  • \w: Matches any alphanumeric character (letters and digits) plus underscore.
  • \D: Matches anything except a digit.
  • \S: Matches anything except a whitespace character.
  • \W: Matches anything except a word character (a letter, digit, or underscore).
  • The dot . is a special character class that matches any character except a newline.
  • You can combine classes and regular symbols to create complex patterns. For example, \d\s matches a digit followed by a whitespace.
  • All characters matter in regular expressions, including spaces, and small differences can lead to different matching behavior.
  • Character classes are handy for diverse tasks, such as validating input, extracting specific parts of a string, or transforming data.
