
Groups

A capture group is a portion of the pattern enclosed in parentheses (). This creates a "sub-pattern" within the full pattern, and anything that matches this sub-pattern is captured as a separate element in the resulting match array.

Capture groups allow you to not only determine whether the full pattern matches the text but also to extract and work with specific portions of the matched text. They're essential in many text processing tasks, enabling more complex pattern matching, replacement, and manipulation within strings.
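As a quick preview of the replacement use case, here is a minimal sketch. The sample text and variable names are made up for illustration. In preg_replace(), the replacement string can refer to captured groups as $1, $2, and so on.

```php
<?php
// Hypothetical sample text in "Last, First" form.
$text = 'Doe, John';

// Group 1 captures the last name, group 2 the first name.
$swapped = preg_replace('/(\w+), (\w+)/', '$2 $1', $text);

echo $swapped; // John Doe
```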

Examples

Basic Example

Let's say we want to match the word "go" repeated one or more times, as in 'gogogo'. A first attempt might be the expression go+. However, a quantifier applies only to the item immediately before it, so this pattern looks for the letter 'g' followed by the letter 'o' repeated one or more times. It matches things like 'go', 'goooo', or 'goooooooo'.

That's not really what we want. When you put 'go' inside parentheses, as in (go)+, the quantifier applies to the whole group. Now the pattern looks for the entire 'go' together, repeated one or more times, so it can match 'go', 'gogo', 'gogogo', etc.

$pattern = '/(go)+/i';
 
$text = 'Gogogo now!';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // Outputs: Array ( [0] => Gogogo )
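To see the difference side by side, the sketch below runs both patterns against the same text. The sample string and the variable names $plain and $grouped are illustrative only. It also shows a detail worth knowing: when a group is repeated, the captured value is the last repetition.

```php
<?php
$text = 'Gogogo now! Goooo!';

// Without a group, + applies only to the letter "o".
preg_match_all('/go+/i', $text, $plain);

// With a group, + applies to the whole sequence "go".
preg_match_all('/(go)+/i', $text, $grouped);

print_r($plain[0]);   // Go, go, go, Goooo
print_r($grouped[0]); // Gogogo, Go
print_r($grouped[1]); // go, Go (last repetition of the group in each match)
```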

Searching for Domains

A common example of using groups is to search for domains in a string. For example, let's say we had the following:

site.com
my.site.com

Domains consist of repeated words separated by dots. The following regular expression can be used for finding domains: (\w+\.)+\w+.

Dashes not supported

The example given does not support domains with dashes, such as my-site.com. An alternative solution would be to use ([\w-]+\.)+\w+.

$pattern = '/(\w+\.)+\w+/';
 
$text = 'site.com my.site.com';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // Outputs: Array ( [0] => site.com [1] => my.site.com )
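For comparison, here is a small sketch of the dash-friendly variant ([\w-]+\.)+\w+ next to the original pattern. The sample domains and variable names are made up for illustration.

```php
<?php
$text = 'site.com my-site.com sub.my-site.com';

// Original pattern: \w does not include the dash, so "my-" is skipped.
preg_match_all('/(\w+\.)+\w+/', $text, $noDash);

// Alternative pattern: the character class [\w-] also accepts dashes.
preg_match_all('/([\w-]+\.)+\w+/', $text, $withDash);

print_r($noDash[0]);   // site.com, site.com, sub.my
print_r($withDash[0]); // site.com, my-site.com, sub.my-site.com
```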
 
 
 

Searching for Emails

Building on the previous example, we can find emails with a few modifications. Emails have the format name@domain. For the name portion, we accept word characters, dots, and hyphens, which we can write as [-.\w]+. The domain portion reuses the dash-friendly domain pattern, with hyphens also allowed in the final part: ([\w-]+\.)+[\w-]+.

The regular expression given is not completely perfect, but it should work for most cases.

$pattern = '/[-.\w]+@([\w-]+\.)+[\w-]+/';
 
$text = 'Contact: john.doe@example.com, jane@mail.example.org';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // Outputs: Array ( [0] => john.doe@example.com [1] => jane@mail.example.org )

Extracting Group Values

When you use parentheses in a regular expression, the groups are numbered from left to right by their opening parenthesis, starting at 1. The regex engine remembers what was matched by each group so you can retrieve it afterwards.

When you use a pattern to search for a match in a string, here's what you get:

  1. The full match at index 0.
  2. The contents of the first parentheses at index 1.
  3. The contents of the second parentheses at index 2.
  4. So on and so forth

For example, let's say we want to find HTML tags like <h1> and grab what's inside the angle brackets. We can put the inner content inside parentheses like this: <(.*?)>.

 
$text = '<h1>Hello, world!</h1>';
$pattern = '/<(.*?)>/';
preg_match($pattern, $text, $matches);
 
echo $matches[0]; // Outputs <h1>
echo $matches[1]; // Outputs h1

In this example, $matches[0] gives us the whole tag <h1>, and $matches[1] gives us just the content inside the brackets h1. By using parentheses, we can separate the parts we want to work with.
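The same numbering applies when collecting every match with preg_match_all(). The sketch below is one way to look at it: in the default order, $matches[1] holds group 1 for each match, while the PREG_SET_ORDER flag returns one array per match. The variable name $sets is illustrative.

```php
<?php
$text = '<h1>Hello, world!</h1>';
$pattern = '/<(.*?)>/';

// Default order: $matches[0] lists full matches, $matches[1] lists group 1.
preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // <h1> and </h1>
print_r($matches[1]); // h1 and /h1

// PREG_SET_ORDER: one array per match, laid out like preg_match() output.
preg_match_all($pattern, $text, $sets, PREG_SET_ORDER);
echo $sets[1][1]; // /h1
```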

Nested Groups

Parentheses in a regular expression can be inside other parentheses. They are still numbered from left to right.

Imagine we want to search for a tag like <span class="my">, and we want to break it down into different parts:

  1. The whole content inside the brackets: span class="my".
  2. The name of the tag: span.
  3. Any attributes inside the tag: class="my".

We can do this by adding parentheses around each part we're interested in: <(([a-z]+)\s*([^>]*))>. Here's a visual example:

[Figure: Nested Groups]
$text = '<span class="my">';
$pattern = '/<(([a-z]+)\s*([^>]*))>/';
preg_match($pattern, $text, $matches);
 
echo $matches[0]; // Outputs <span class="my">
echo $matches[1]; // Outputs span class="my"
echo $matches[2]; // Outputs span
echo $matches[3]; // Outputs class="my"

Here's how the parentheses are numbered:

  • $matches[0] always contains the full match.
  • $matches[1] contains the whole content inside the tag (from the first set of parentheses).
  • $matches[2] contains the tag name (from the second set of parentheses).
  • $matches[3] contains the tag's attributes (from the third set of parentheses).

So the numbering of the parentheses lets us pick out the different parts of the tag, just like breaking a puzzle into pieces to see each part!
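Because the groups arrive in a predictable order, one possible way to work with them is array destructuring. The sketch below assumes PHP 7.1 or later for the [...] syntax. The variable names and the extra '<br>' sample are only illustrative.

```php
<?php
$pattern = '/<(([a-z]+)\s*([^>]*))>/';

preg_match($pattern, '<span class="my">', $matches);

// Unpack the full match and the three groups in numbering order.
[$full, $inner, $tagName, $attributes] = $matches;

echo $tagName . "\n";    // span
echo $attributes . "\n"; // class="my"

// A tag without attributes: group 3 still takes part in the match, but it is empty.
preg_match($pattern, '<br>', $bare);
var_dump($bare[3] === ''); // bool(true)
```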

Optional Groups

Optional groups in regular expressions allow part of the pattern to be matched zero or one times. This can be useful when you want to match something that might or might not be present in the text.

You can make a group optional by following it with a question mark ?.

Suppose we want to match a phone number that may or may not have an area code. The number could take either of these forms:

  • With area code: (123) 456-7890
  • Without area code: 456-7890

We can write a regular expression where the area code is an optional group:

(\(\d{3}\)\s?)?(\d{3}-\d{4})
  • (\(\d{3}\)\s?)? is an optional group (? makes it optional) that matches three digits enclosed in parentheses followed by an optional space.
  • (\d{3}-\d{4}) matches three digits, a dash, and four more digits.

Here's how you might use this pattern in PHP:

$pattern = '/(\(\d{3}\)\s?)?(\d{3}-\d{4})/';
$text1 = '(123) 456-7890';
$text2 = '456-7890';
 
preg_match($pattern, $text1, $matches1);
preg_match($pattern, $text2, $matches2);
 
print_r($matches1); // [0] => (123) 456-7890, [1] => "(123) " (with the trailing space), [2] => 456-7890
print_r($matches2); // [0] => 456-7890, [1] => "" (area code absent), [2] => 456-7890

This pattern can match both phone numbers with and without area codes by making the area code part of the pattern optional.
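When an optional group does not take part in the match, PHP still reserves its index. The sketch below shows one way to check for that, assuming PHP 7.2 or later for the PREG_UNMATCHED_AS_NULL flag. The variable names are illustrative.

```php
<?php
$pattern = '/(\(\d{3}\)\s?)?(\d{3}-\d{4})/';

// By default, a skipped group that precedes a matched group is an empty string.
preg_match($pattern, '456-7890', $default);
var_dump($default[1]); // string(0) ""

// With PREG_UNMATCHED_AS_NULL, the skipped group is reported as null instead.
preg_match($pattern, '456-7890', $asNull, PREG_UNMATCHED_AS_NULL);
var_dump($asNull[1]); // NULL

// Either way, the local number is still available in group 2.
echo $asNull[2]; // 456-7890
```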

Named Groups

Named groups in regular expressions allow you to assign a name to a specific capturing group, making it easier to reference later. Instead of using numerical indices to access the matched portions of the text, you can use more descriptive names. This can make your code more readable and maintainable, especially if the regular expression is complex.

Here's the general syntax for creating a named group in a regular expression pattern:

/(?P<name>pattern)/

Here, name is the name you want to assign to the group, and pattern is the specific pattern you want to match.

Suppose you want to match a date in the format "MM/DD/YYYY" and you want to capture the month, day, and year in separate groups. You could write the pattern like this:

/(?P<month>\d{2})\/(?P<day>\d{2})\/(?P<year>\d{4})/

Then, you could use this pattern in PHP as follows:

$pattern = '/(?P<month>\d{2})\/(?P<day>\d{2})\/(?P<year>\d{4})/';
$text = '07/31/2023';
 
preg_match($pattern, $text, $matches);
 
echo "Month: " . $matches['month']; // Outputs "Month: 07"
echo "Day: " . $matches['day'];   // Outputs "Day: 31"
echo "Year: " . $matches['year'];  // Outputs "Year: 2023"

By using named groups, you can access the matched parts of the date using the descriptive names 'month', 'day', and 'year', rather than numerical indices. This can make the code more self-explanatory, especially if someone else (or even future you) has to read or modify it later.
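One detail to keep in mind: PHP stores each named group under both its name and its numeric index. The sketch below uses the shorter (?<name>...) spelling, which PCRE accepts alongside (?P<name>...). The $iso variable is just an illustrative use of the captured parts.

```php
<?php
$pattern = '/(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{4})/';
$text = '07/31/2023';

preg_match($pattern, $text, $matches);

// Each named group appears twice: once by name, once by number.
print_r(array_keys($matches)); // 0, month, 1, day, 2, year, 3

// Rebuild the date in YYYY-MM-DD order from the named parts.
$iso = "{$matches['year']}-{$matches['month']}-{$matches['day']}";
echo $iso; // 2023-07-31
```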

Exercises

Exercise #1

For this exercise, write a regular expression to detect whether a string is a MAC address. A MAC address consists of six two-digit hexadecimal numbers separated by colons.

A valid example would be: '01:32:54:67:89:AB'.

Here are some examples. Be sure to refer to the comments for the expected outcome:

$regexp = '';
 
echo preg_match($regexp, '01:32:54:67:89:AB') ? 'true' : 'false'; // true
echo preg_match($regexp, '0132546789AB') ? 'true' : 'false'; // false (no colons)
echo preg_match($regexp, '01:32:54:67:89') ? 'true' : 'false'; // false (5 numbers, must be 6)
echo preg_match($regexp, '01:32:54:67:89:ZZ') ? 'true' : 'false'; // false (ZZ at the end)

Exercise #2

Write a regular expression that finds valid CSS colors in hexadecimal format. For this exercise, a valid color starts with # followed by exactly 3 or 6 hexadecimal digits.

$pattern = '';
 
$text = 'color: #3f3; background-color: #AA00ef; and: #abcd';
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // Expected: #3f3 and #AA00ef (#abcd has 4 digits and must not match)

Exercise #3

Create a regular expression that matches all decimal numbers. This includes positive and negative integers, as well as numbers with a decimal point (floating-point numbers).

$pattern = '';
 
$text = "-1.5 0 2 -123.4.";
 
preg_match_all($pattern, $text, $matches);
 
print_r($matches[0]); // Expected: -1.5, 0, 2, -123.4

Key Takeaways

  • Capture groups are portions of a regular expression enclosed in parentheses (). They allow you to "capture" a part of the matched text.
  • Capture groups are numbered from left to right, starting from 1. Group 0 always refers to the entire matched text.
  • Parentheses can be nested, and groups will still be numbered based on the order of the opening parenthesis from left to right.
  • In PHP (PCRE), named capture groups can be written as (?P<name>...) or (?<name>...). This allows you to reference groups by a specific name instead of a number.
  • Capture groups can be quantified, meaning you can apply quantifiers like *, +, or ? to the entire group. E.g., (abc)+ would match abc, abcabc, etc.
