In a previous tutorial we learned how to represent the characters from the Unicode character set, and how to combine them into regular expressions using three basic operators. We also described how to use regexes to match either a text sample, or a sequence of bytes. Now it’s time to see how we can break a regex into groups, and how to capture the substring they match.

Once we’ve defined a regex group, we can refer to the substring it captures in other parts of the regex by using indexed and named references. Once you understand the lessons taught in this tutorial, you can use the syntax diagrams we provide as a quick reference for the topics we describe.


Capturing a Match with Groups

In a previous tutorial we learned the basics of Python regular expressions, and how to match a whole regex to a text sample or to a byte sequence. But Python allows us to break a regex into smaller pieces, called groups, and to capture the part of the sample they represent (which is called a submatch or substring). In this section we will describe how to define a group, and how to retrieve its substring with the match.group() method, referring to it either by index or name.

Indexed Groups

We can create a Python regex group by enclosing part of a regex in parentheses; e.g. (\d{4})-(\d{2})-(\d{2}) is a regex for a YYYY-MM-DD date with three groups, namely (from left to right) (\d{4}), which has index 1 and captures the year, (\d{2}), which has index 2 and captures the month, and another (\d{2}), which has index 3 and captures the day. After matching a text against a regex using one of the searching functions, we can retrieve the substrings of all the groups within that regex by using the groups() method of the match object. For example:

import re

def test_split_color():
    """ Splitting an HTML color. """
    color = '#004488'

    for submatch in re.match(r'''
        \s*(\#
        ([\da-fA-F]{2})    # red sample
        ([\da-fA-F]{2})    # green sample
        ([\da-fA-F]{2})    # blue sample
        )\s*
        ''', color, re.VERBOSE).groups():
        print(submatch)

test_split_color()

In this example we represent an HTML color as a sequence of 6 hexadecimal digits, prefixed by a #; e.g. #006400 (DarkGreen), #FF00FF (Fuchsia), and #FFD700 (Gold) are all valid HTML colors. Our regex uses 4 groups:

  • The outermost group includes the whole HTML color, skipping leading and trailing whitespace. Notice that we prefixed the # symbol with a backslash; otherwise it would have been interpreted as the start of a comment in a verbose regex. This group has index 1.
  • Inside the main group, there are three other groups, each one enclosing a [\da-fA-F]{2} regex, which represents two uppercase or lowercase hexadecimal digits in a row. These groups have the following indices (from left to right): 2, 3, and 4. They capture, respectively, the red, green, and blue components of the HTML color.

The groups() method returns all groups’ substrings as a tuple, sorted in ascending order according to their indices. So, in this tuple the whole HTML color comes first, then the red component, the green component, and, finally, the blue component, as we can see from the output of the test_split_color() function:

#004488
00
44
88

We can also retrieve groups’ submatches one by one, by passing their indices to the group() method of the match object. Once again, groups are numbered starting from 1, in the order in which their opening parentheses appear in the regex. As a result, an outer group comes before the groups nested inside it, and groups at the same level are indexed from left to right. An example should make it clear:

import re

def test_split_filename():
    """ Splitting a file name using groups. """
    filename = '2019-05-22-regex-tutorial-draft.md'
    
    match = re.match(r'''
        \s*(
        ((\d{4})-(\d{2})-(\d{2}))   # date
        -
        ([\w-]+)                    # title
        (\.[a-zA-Z]+)               # extension
        )\s*
        ''', filename, re.VERBOSE)
    
    for i in range(1, 8):
        print('%d: %s' % (i, match.group(i)))
    
test_split_filename()

Here we split a filename into its parts. This regex has 7 groups, and it is more nested than the previous one, so we’ll break it down into its groups (groups are listed in ascending order, according to their indices):

  1. The outermost group includes the whole filename, except for leading and trailing whitespace, which is left out of the parentheses.
  2. Inside that group there are 3 other groups at the same level, so we’ll read them left to right. The leftmost group is ((\d{4})-(\d{2})-(\d{2})), which represents a YYYY-MM-DD date, as we have already seen above. This group will capture the date, i.e. 2019-05-22.
  3. The date group has 3 nested groups, all at the same level. So the next group will be the leftmost, i.e. (\d{4}), which represents the date’s year, i.e. 2019.
  4. The next group is (\d{2}), representing the date’s month, in our case 05 (May).
  5. The last subgroup inside the date group is (\d{2}), which represents the date’s day, i.e. 22.
  6. Since there isn’t any group left inside the date group, we go back to the upper level, and we choose the ([\w-]+) group, which represents a sequence of alphanumeric characters, including -, and _. It captures the title of the document, namely, regex-tutorial-draft.
  7. Finally, the last group is (\.[a-zA-Z]+), which catches the file’s extension, i.e. .md.

The following diagram summarizes everything that we have just said about our example. The scope of each group is delimited by horizontal braces. The parts of the regex with a yellow background don’t belong to any group.

[Figure: Regex for the File Name]

The for loop will print all groups’ substrings, in the exact order described just now:

1: 2019-05-22-regex-tutorial-draft.md
2: 2019-05-22
3: 2019
4: 05
5: 22
6: regex-tutorial-draft
7: .md

Notice that neither the leading and trailing spaces, nor the - between the date and the title of the filename, are included in the output, since they don’t belong to any group.
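
As a quick check of the numbering rule, here is a minimal sketch based on the YYYY-MM-DD regex from the start of this section. The split_date name is our own, hypothetical choice. The sketch also relies on two standard features of match objects that we haven't used so far: group(0), which returns the whole match, and the ability of group() to take several indices at once, in which case it returns a tuple.

```python
import re

def split_date(text):
    """Return (whole match, all groups, (year, month, day)), or None."""
    match = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)
    if match is None:
        return None
    # group(0) is the whole match; groups() starts from group 1.
    return match.group(0), match.groups(), match.group(1, 2, 3)

print(split_date('2019-05-22'))
print(split_date('not a date'))
```

Since re.match() returns None when the sample doesn't match, checking the result before calling group() avoids an AttributeError.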


Named Groups

Referring to a group by its index has two major shortcomings. First, it’s error-prone, since group nesting makes it easy to mistake one index for another. Second, it makes the code handling regexes harder to refactor; e.g. if we swap two groups, or if we add a group before other groups, we may have to change any code relying on the value of match.group() or match.groups(), since the previous indices now refer to different groups. Python comes to our rescue, since it allows us to refer to a group by its name, using the syntax (?P<name>regex), where name is the name of the group, and regex is the regex associated with that name. Let’s start off with a simple example:

import re

def test_split_date_with_groups():
    """ Splitting a YYYY-MM-DD date using named groups. """
    sample = """
        Artist,Album,ReleaseDate
        Pink Floyd,The Wall,1979-11-30
        Spandau Ballet,True,1983-02-28
        Queen,Jazz,1978-11-10
        Johnny Cash,At Folsom Prison,1968-01-13
        Toto,Toto,1978-10-10
        America,Homecoming,1972-11-15
        Bryan Ferry,Boys and Girls,1985-06-03
        Leo Sayer,Living in a Fantasy,1980-08-22
        """  

    regex = re.compile(r'''
        (?P<date>            # group matching the whole date
        (?P<year>\d{4})-     # YYYY year
        (?P<month>\d{2})-    # MM month
        (?P<day>\d{2})       # DD day
        )''', re.VERBOSE)

    print('{:14}{:7}{:7}{:7}'.format(
        'date', 'year', 'month', 'day'))
    for date in regex.finditer(sample):
        print('{date:14}{year:7}{month:7}{day:7}'.format(
            **date.groupdict()))

test_split_date_with_groups()

Here we want to retrieve YYYY-MM-DD dates from the CSV data in sample. Our regex has an outer group, named date, which matches the whole date, and three nested groups, all at the same level, which match the date’s components (i.e. year, month, and day). The groupdict() method of the match object returns a dictionary, whose keys are the groups’ names, and whose values are the strings matched by the respective groups. It is similar to the groups() method, but names are easier to remember than indices. We use groupdict() to print all string matches as a fixed-width table:

date          year   month  day
1979-11-30    1979   11     30
1983-02-28    1983   02     28
1978-11-10    1978   11     10
1968-01-13    1968   01     13
1978-10-10    1978   10     10
1972-11-15    1972   11     15
1985-06-03    1985   06     03
1980-08-22    1980   08     22
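
If you only need one or two fields, you don't have to build the whole dictionary. The following sketch (the parse_release_date helper and its sample row are our own illustration, not part of the example above) reads named groups with group('name') and, from Python 3.6 onwards, with the shorter match['name'] form.

```python
import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

def parse_release_date(row):
    """Return the first YYYY-MM-DD date in a CSV row as integers."""
    match = DATE.search(row)
    if match is None:
        return None
    # match['year'] is shorthand for match.group('year') (Python 3.6+).
    return int(match['year']), int(match.group('month')), int(match['day'])

print(parse_release_date('Queen,Jazz,1978-11-10'))
```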

Once you have defined a name for a group, you can switch between names and indices using the groupindex attribute of a compiled regex, which maps each group’s name into the respective index. Anonymous groups can only be referred to by their indices, so they won’t appear in that map. Let’s see an example involving both indexed and named groups:

import re

def test_currency():
    """ Matching U.S. Currency Values. """
    samples = (
        '$15', '$2.56', '$12.23',
        '$1,000.00', '$11,231.00', 
        '$24,677,333.14' )
    
    regex = re.compile(r'''
        \$                   # dollar sign
        (?P<left>            # start of the left group
        \d{1,3}              # non-grouped digits
        (,\d{3})*)           # grouped digits
        (?P<right>\.\d{2})?  # fractional part (optional)
        ''', re.VERBOSE)
    
    print("Left group's index is  %d" % regex.groupindex['left'])
    print("Right group's index is %d" % regex.groupindex['right'])
    for sample in samples:
        m = regex.match(sample)
        print('%-16s%-16s%s' % (
            # same as m.group(1)
            m.group('left'),
            # same as m.group(3)
            m.group('right') if m.group('right') else 'N/A',
            # the anonymous group can only be reached by index
            m.group(2) if m.group(2) else 'N/A'))
    
test_currency()

In this case we want to represent the U.S. currency format, which has a leading $, followed by one or more decimal digits, and an optional fractional part. The integer and the fractional parts are separated by . (dot). If the integer part has more than three digits, then digits are grouped three by three, separated by a comma; e.g. $12 (only integer part), $2.56 (both integer and fractional part, separated by .), and $1,000.12 (digits grouped 3 by 3) are all valid currency values. We can readily confirm that all strings in samples are valid, too. Now, let’s see how regex has been divided into groups:

  • The left named group catches the whole integer part of the currency value. It has index 1, as we’ll see from the output of regex.groupindex['left'].
  • The left group contains another group, (,\d{3}), which matches a sequence of 3 decimal digits, prefixed by a comma. Since this group has no name, we can only refer to it by the index 2.
  • The right named group captures the fractional part, if any. The \.\d{2} regex matches a dot, followed by exactly 2 decimal digits. It has index 3, as we’ll see from the output of regex.groupindex['right'].

After compiling the regex, we use regex.groupindex to print the indices of both the left and right groups, which are, respectively, 1 and 3. Then, we iterate over the values from samples, and match them against regex. For each match we print the integer part, the fractional part, and the substring matched by the second group, printing N/A for any group that has no match. Notice we used m.group('left') to retrieve the string captured by the left group (we could have used m.group(1) instead), while we had to use m.group(2) for the second group since it has no name. Now, let’s see the output of test_currency():

Left group's index is  1
Right group's index is 3
15              N/A             N/A
2               .56             N/A
12              .23             N/A
1,000           .00             ,000
11,231          .00             ,231
24,677,333      .14             ,333

There is still one thing worth noting. The output of the second group for 24,677,333.14 is ,333. As a matter of fact, while performing the search on that value, the second group matches twice, first with the ,677 substring, then with ,333, which overwrites the previous substring. Since this group only holds temporary values, it makes little sense to give it a name. Moreover, we could discard its substring entirely. We’ll learn how to do that soon.
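
The overwriting behaviour is easy to reproduce in isolation. In the sketch below (thousands_groups is a hypothetical helper of ours) the match object only remembers the last repetition of the anonymous group. To recover every repetition we run a second, simpler regex over the substring captured by left. This is just one possible approach.

```python
import re

CURRENCY = re.compile(r'\$(?P<left>\d{1,3}(,\d{3})*)(?P<right>\.\d{2})?')

def thousands_groups(value):
    """Return (last repetition of group 2, every ',ddd' chunk)."""
    match = CURRENCY.match(value)
    if match is None:
        return None
    # group(2) holds only the last repetition of (,\d{3}).
    last = match.group(2)
    # A second pass over the integer part retrieves all repetitions.
    chunks = re.findall(r',\d{3}', match.group('left'))
    return last, chunks

print(thousands_groups('$24,677,333.14'))
print(thousands_groups('$15'))
```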


Matching an E-Mail Address

Now it’s time for a more challenging example. Let’s suppose we want to split an e-mail address, complying with a subset of the RFC 5322:2008 specification. While defining our syntax rules, we retained the same names from the specification, just in case you want to enhance the regex that we’ll supply, and make it fully compliant with that normative document.

You can skip this syntax diagram and all the technical jargon in the bulleted list, if you’d like. It’s just a fancy way of telling you all the ways an email address can be constructed so we can be sure we’re creating a rigorous regex for capturing the components of an email address.

[Figure: Syntax of an e-mail Address]

Let’s take a closer look at the syntax diagrams:

  • The addr-spec rule is the start symbol of the syntax, i.e. we must read that rule first. An e-mail address has two parts, local-part and domain, separated by the @ symbol.
  • The domain rule represents the e-mail’s domain, either by name (like in wellsr.com) or as a domain literal.
  • The domain-literal rule allows you to insert an IPv4 address, enclosed in square brackets, in place of the domain’s name, e.g. [255.0.0.0].
  • The d-text rule lists all valid characters for a domain literal. We can use any ASCII character from 0x21 (exclamation mark) to 0x5A (uppercase Z), and from 0x5E (^) to 0x60 (backtick). The … symbol in the syntax diagram is a placeholder for all missing characters between its left and its right neighbors, e.g. A … Z stands for all ASCII characters between A and Z (both included).
  • The local-part rule can represent a username, which can include sequences of letters, decimal digits, and the other characters listed by a-text, separated by dots; e.g. john.doe and jane.doe.smith are both valid usernames.
  • The a-text rule lists all valid characters in local-part. The TCK, SQT, and SP symbols stand for, respectively, ` (backtick), ' (single quote), and a space character. We used them only to improve readability of this syntax diagram.

Without further ado, let’s see how to represent these syntax diagrams using Python regexes:

import re

def test_email_address():
    """ Splitting an e-mail address. """
    samples = (
        'john.s.smith@nomail.mmm',
        'Title.Case@useless.uuu',
        'User@[192.168.0.1]'       # using domain literal
    )

    regex = re.compile(
        r'''(?P<AddrSpec>             # the whole address
        (?P<LocalPart>                # the local part
        [\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+
        (?:\.[\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+)*)
        @
        (?P<Domain>                   # domain
        (?P<DotAtom>
        [\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+
        (?:\.[\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+)*)|
        \[(?P<DomainLiteral>
        [\x21-\x5a\x5e-\x60]+)\])     # domain-literal
        )''', re.VERBOSE)

    for sample in samples:
        match = regex.match(sample)
        if match:
            print('{:30}{:16}{}'.format(
                match.group('AddrSpec'),
                match.group('LocalPart'),
                match.group('Domain')))
        else:
            print('No match for: %r' % sample)

test_email_address()

The regex for an e-mail address looks a little bit intimidating, but it is simply a translation of the previous syntax diagrams into the language of Python regexes. In order to ease the translation process, the groups’ names in regex match the respective rules of the syntax diagrams, except that they are in camel case. Since the “-” character is not allowed in a group’s name, the addr-spec rule from the syntax diagrams maps to the AddrSpec group in the Python regex. We won’t go over the meaning of each part of the regex again, since we’ve already explained the syntax diagrams. I hope you’ll try to figure it out by yourself as an exercise - that’s how we all become better programmers. Here, we’ll focus on how the regex has been split into groups:

  • The AddrSpec group captures the whole e-mail address.
  • The LocalPart group matches the username.
  • The Domain group catches the e-mail’s domain. It has two subgroups, DotAtom, and DomainLiteral. We included them just to make the translation from syntax diagrams easier, but we won’t use them.

All e-mail addresses in samples are syntactically valid, though they don’t actually exist. The for loop iterates over the samples, and, if successful, it prints the whole e-mail address, the local part, and the domain. Otherwise, it prints an error message. Let’s see if we got the regex right:

john.s.smith@nomail.mmm       john.s.smith    nomail.mmm
Title.Case@useless.uuu        Title.Case      useless.uuu
User@[192.168.0.1]            User            [192.168.0.1]

As you can see, all addresses have been matched successfully, and they have been split into their username and domain components. Notice that the last example uses an IPv4 address, instead of the domain’s name. Since the square brackets lie inside the Domain group, they are part of its substring, while the DomainLiteral subgroup holds the bare address. With properly structured regular expressions, you’re able to capture nonstandard strings in an email address like this.
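
Although the example above doesn't use the DotAtom and DomainLiteral subgroups, they can tell us which alternative of the Domain group actually matched, because the subgroup belonging to the other alternative is None. The sketch below illustrates the idea under a loud assumption: its character classes are deliberately much narrower than the ones in the full regex, and the domain_kind helper is a name we made up.

```python
import re

# Simplified on purpose: these character classes are narrower than
# the ones used in the tutorial's regex.
ADDRESS = re.compile(r'''
    (?P<LocalPart>[\w.]+)
    @
    (?P<Domain>
        (?P<DotAtom>[\w-]+(?:\.[\w-]+)*)
      | \[(?P<DomainLiteral>[\d.]+)\]
    )
    ''', re.VERBOSE)

def domain_kind(address):
    """Tell which alternative of the Domain group matched."""
    match = ADDRESS.fullmatch(address)
    if match is None:
        return None
    # Only one of the two subgroups takes part in a given match.
    if match.group('DomainLiteral') is not None:
        return 'literal', match.group('DomainLiteral')
    return 'name', match.group('DotAtom')

print(domain_kind('User@[192.168.0.1]'))
print(domain_kind('john.s.smith@nomail.mmm'))
```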


Parentheses

Sometimes we need parentheses not to capture a substring, but simply as a means of delimiting the scope of an operator. For example, the \s*((a|b)?c)\s* regex uses the outer group to match any of the words in the set { ‘ac’, ‘bc’, ‘c’ }, and the (a|b) group to just delimit the scope of the ? operator. We could easily distinguish between these two kinds of use, for example by using named groups when we actually need the substring, while leaving all other groups anonymous. In other words, we could rewrite the previous example as \s*(?P<g>(a|b)?c)\s*, so that the group g captures the whole string, while the second group catches only the optional prefix. But Python offers a better option: with the (?:regex) syntax we tell Python not to capture the substring for regex. Let’s see a trivial example involving floating-point literals:

import re

def test_float_literal():
    """ Literal for floating-point numbers. """
    
    samples = ( '-12', ' +2.54 ', '34  ', ' 3.14' )
    
    # The dot is escaped so that it matches a literal '.' only.
    regex   = re.compile(r'\s*([+-]?\d+(?:\.\d+)?)\s*')
    regex_g = re.compile(r'\s*([+-]?\d+(\.\d+)?)\s*')
    for n in samples:
        print('\t'.join(
            map(str, regex.match(n).groups())))
        print('\t'.join(
            map(str, regex_g.match(n).groups())))
    
test_float_literal()

A float literal has an optional sign, an integer part, and an optional fractional part. There can be leading and trailing whitespace; e.g. 33 (without sign or fractional part), +12 (with sign, but without fractional part), and -2.56 (which has both a sign and a fractional part) are all valid float literals. In order to match float literals, we will use two different regexes, which match the same strings:

  • The regex_g regex captures both the entire float (using the ([+-]?\d+(\.\d+)?) group, which has index 1) and the fractional part (using the (\.\d+) group, which has index 2).
  • On the other hand, regex captures only the entire float, since it replaces the second group with (?:\.\d+), which prevents the regex engine from saving the intermediate substring.

The loop iterates over each number in samples, and tries to match it against both regex and regex_g, retrieving all substrings with the groups() method:

-12
-12      None
+2.54
+2.54    .54
34
34       None
3.14
3.14     .14

As we can see from the output, the first regex has just one submatch, while the second has two, respectively, the whole float and its fractional part. Sometimes the fractional part is missing, so its submatch is None. In practice, you’ll probably just use the regex example, since we’re rarely concerned with the fractional part of the float. Even so, it’s a good demonstration of what you can do with parentheses.
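
The same comparison can be made with the \s*((a|b)?c)\s* regex mentioned at the start of this section. In this minimal sketch (the capturing and non_capturing names are ours), the only difference between the two patterns is the ?: marker, and the groups attribute of a compiled regex confirms how many capturing groups each one has.

```python
import re

capturing     = re.compile(r'\s*((a|b)?c)\s*')
non_capturing = re.compile(r'\s*((?:a|b)?c)\s*')

# Number of capturing groups in each compiled regex.
print(capturing.groups, non_capturing.groups)

for word in ('ac', ' bc ', 'c'):
    print(capturing.match(word).groups(),
          non_capturing.match(word).groups())
```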


Referencing Groups Inside a Regex

In the previous sections we learned how to capture a submatch using both indexed and named groups. Once captured, we can use that substring in one of the following ways:

  • to match the exact same substring elsewhere in the regex;
  • as a test for conditional regexes, as we will see in another tutorial;
  • to replace it with another string; we will deal with string replacement in another tutorial.

In the following subsections we will learn how to refer to a previously captured group either by index or by name.

Referencing a Group by Index

You can refer to a group by prefixing its index with a backslash; e.g. \1 refers to the match of the first group.

Let’s see an example right away. Suppose you have to handle some kind of tabular data. Put simply, your data is organized into rows, with each row having the same number of fields. You know each column is supposed to hold a certain data type, but the field separator can change from row to row, as long as each row uses only one kind of separator; e.g. one row can use a comma as a separator, while the next row can use a semicolon.

import re
import io

def test_tabular_data():
    """ Splitting tabular data. """
    sample = io.StringIO('''
        Title,Duration,Date,Credits
        A Horse With No Name,4:10,1972-01-12,Bunnell
        Ventura Highway;3:32;1972-09-19;Bunnell
        Only In Your Heart|3:16|1973-04-14|Beckley
        Rainbow Song,4:00|1973-11-28,Bunnell
        She's Gonna Let You Down;3:40;1974-02-20,Beckley
        ''')
        
    regex = re.compile(r'''
        \s*(?P<title>[^,;|]+)       # title
        ([,;|])\s*                  # first separator
        (?P<dur>\d:\d\d)            # duration
        \s*\2\s*                    # second separator
        (?P<date>\d{4}-\d{2}-\d{2}) # date
        \s*\2\s*                    # third separator
        (?P<credits>[A-Za-z]+)      # song's credits
        ''', re.VERBOSE)
        
    print('{:24}{:12}{:12}{}'.format(
        'title', 'duration', 'date', 'credits'))
    for line in sample:
        m = regex.search(line)
        if m:
            print('{title:24}{dur:12}{date:12}{credits}'.format(
                **m.groupdict()))
    
test_tabular_data()

In this Python 3 example we use an io.StringIO object to hold our data. This class provides Python strings with a file-like interface, so that we can iterate over the lines of our sample using a for loop. Our data is split into four columns:

  • The first column holds the title of a single by the Anglo-American band America. It can contain any character, other than the three separators { ‘,’, ‘;’, ‘|’ }. It is captured by the title group.
  • The second column holds the song’s duration, in the m:ss format (minutes and seconds, separated by :). It is captured by the dur group.
  • The third column holds the single’s release date, in the YYYY-MM-DD format. It is captured by the date group.
  • The fourth column holds the name of the songwriter, as a string of capital or small letters. It is captured by the credits group.

The field separator can be ,, ;, or |. The ([,;|]) regex (right after the title group) captures the separator between the first and the second field. Then, the two occurrences of \s*\2\s* refer to the first separator (which has index 2) to define, respectively, the separator between the second and the third field, and between the third and fourth field. This way we make sure there is only one kind of separator per row. Finally, the for loop iterates over each line in the sample, and it prints the contents of each field:

title                   duration    date        credits
A Horse With No Name    4:10        1972-01-12  Bunnell
Ventura Highway         3:32        1972-09-19  Bunnell
Only In Your Heart      3:16        1973-04-14  Beckley

Notice the last two records from the sample are missing. As a matter of fact, the fourth record uses , as first and last separator, but it uses | to separate the second field from the third, so not all separators are equal. Similarly, in the fifth record the semicolon separates both the first field from the second, and the second from the third, but the last separator is a comma instead. Though it has just one kind of separator (comma), the header (i.e. Title,Duration,Date,Credits) is also missing, because its second and third fields are invalid according to the rules stated previously.

When you write a regex containing a reference to a group, beware of optional groups, since you can get unexpected results.

import re

def test_optional_group_ref():
    """ Reference to an optional group. """
    samples = ( '\\012\\', '12', '|056|', '56' )
    
    regex = re.compile(r'([\\|])?(\d+)\1')
    
    for sample in samples:
        match = regex.match(sample)
        print(match.group(2) if match \
            else 'No match for: %s' % sample)
    
test_optional_group_ref()

In this case we want to write a regular expression for a decimal integer number, optionally enclosed between \ (like Prolog octal escape sequences) or between | (like the mathematical absolute value); e.g. 12, \12\, and |12| are all proper values. By relying on what we know about group references, we might be tempted to translate the rules above into the ([\\|])?(\d+)\1 regex, where:

  • the ([\\|])? group is in charge of capturing the prefix (if any), which must be either \ or |;
  • the (\d+) group captures the decimal number;
  • the \1 reference represents the (optional) trailing part, which must be equal to the (optional) leading part. We may think that if the leading part is missing, then ([\\|])? matches the empty string, so the trailing part will also be optional.

The loop iterates over the samples, and it tries to match them using regex. If successful, then it prints the decimal number, otherwise it prints an error message. Let’s see if we were right about the group reference:

012
No match for: 12
056
No match for: 56

As we can see from the output, the two integers having both the prefix and suffix parts match successfully, while 12 and 56 fail. The problem is that, when the prefix is omitted, the ([\\|])? group doesn't take part in the match, so its value is None instead of the empty string. That is fine in itself, since the prefix is optional. But when we try to use that group for the trailing part, the match fails, since the \1 expression is not optional (i.e. it is not (\1)?) and a reference to a group without a match never succeeds. We will come back to the problem of matching leading and trailing sequences in our tutorial on conditional regexes.
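
One possible workaround, which is not the conditional-regex technique mentioned above, is to move the ? inside the group. The group then always takes part in the match, and it captures the empty string when the prefix is missing. A reference to an empty submatch matches the empty string, so plain numbers are accepted too. In this sketch the strip_enclosure name is our own, and we use fullmatch() rather than match() so that a stray trailing delimiter is rejected.

```python
import re

# The ? is inside the group, so group 1 is '' rather than None
# when the number has no leading delimiter.
ENCLOSED = re.compile(r'([\\|]?)(\d+)\1')

def strip_enclosure(sample):
    """Return the digits of a bare or properly enclosed number."""
    match = ENCLOSED.fullmatch(sample)
    return match.group(2) if match else None

for sample in ('\\012\\', '12', '|056|', '56', '|7\\'):
    print(repr(sample), '->', strip_enclosure(sample))
```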

Referencing a Group by Name

We can also refer to a group by name, using the slightly more complex syntax (?P=name), where name is the name of a group defined elsewhere in the regex. Let’s try to use named groups to match the opening and closing tags of some HTML elements:

import re

def test_match_tags():
    """ Matching HTML tags. """
    sample = r'''
        <b>Bold</b><i>Italics</i>
        <mod>Mod</mod><em>Emphasis</em>
        <h2>Level 2 Header</h2>
        <code>Bad code match</cod>
        <h1>Bad header match</h3>
        '''
    
    regex_noref = re.compile(r'''
        <(?P<otag>\w+)>        # opening tag
        (?P<text>[^<]*)        # contents
        </(?P<ctag>\w+)>       # closing tag
        ''', re.VERBOSE)
    
    print('Without references:\notag    ctag    text')
    for match in regex_noref.finditer(sample):
        print('{otag:8}{ctag:8}{text:12}'.format(
            **match.groupdict()))
        
    regex = re.compile(r'''
        <(?P<otag>\w+)>        # opening tag
        (?P<text>[^<]*)        # contents
        </(?P<ctag>(?P=otag))> # closing tag
        ''', re.VERBOSE)
        
    print('\nUsing references:\notag    ctag    text')
    for match in regex.finditer(sample):
        print('{otag:8}{ctag:8}{text:12}'.format(
            **match.groupdict()))
    
test_match_tags()

We will try to search for HTML elements inside sample by using two different regexes, such that:

  • Both regexes use the otag group to catch the opening tag, the ctag group to capture the closing tag, and the text group to capture the contents of the element.
  • Both regexes use the \w+ regex to represent the opening tag; e.g. b, h2, and code are all valid opening tags.
  • Both regexes use the [^<]* regex to represent the contents of the HTML element, i.e. a string containing any character other than <.
  • The regex_noref expression uses the same regex for both the opening and the closing tag.
  • The regex expression matches the closing tag by using a reference to the opening tag, namely, (?P=otag).

Finally, regex.finditer(sample) iterates over the matches in sample, and for each one prints the opening tag, the closing tag, and the element’s contents. Let’s see what happens when we run the test_match_tags() function:

Without references:
otag    ctag    text
b       b       Bold        
i       i       Italics     
mod     mod     Mod         
em      em      Emphasis    
h2      h2      Level 2 Header
code    cod     Bad code match
h1      h3      Bad header match

Using references:
otag    ctag    text
b       b       Bold        
i       i       Italics     
mod     mod     Mod         
em      em      Emphasis    
h2      h2      Level 2 Header

The regex_noref regex matches all HTML elements in sample. On the other hand, regex skips the last two elements, since <code>Bad code match</cod> lacks an e in the closing tag, and <h1>Bad header match</h3> has two different header levels (1 in the opening tag, and 3 in the closing tag).
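
Named references are not limited to tags. As a further, hedged illustration (the quoted_strings helper and its sample text are hypothetical), the sketch below uses (?P=quote) to make sure a quoted string is closed by the same kind of quote that opened it.

```python
import re

QUOTED = re.compile(r'''(?P<quote>['"])(?P<body>.*?)(?P=quote)''')

def quoted_strings(text):
    """Return the contents of every properly closed quoted string."""
    return [m.group('body') for m in QUOTED.finditer(text)]

print(quoted_strings("He said 'hi' and \"bye\", but not 'oops\"."))
```

The last fragment is skipped because it opens with a single quote and ends with a double quote, much like the mismatched tags above.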


Introduction to Regex Conditionals

A substring we want to match can sometimes be enclosed between an optional prefix and an optional suffix. For example, both (x+(y * 2)) and x+(y * 2) really mean the same thing, but the second string lacks the outer parentheses. In these cases, conditional regexes come in handy. A conditional regex is a regex in the form (?(test)if-true|if-false), where:

  • The test parameter is either a group’s index or name.
  • The if-true regex will be executed only if test has a match, i.e. match.group(test) is not None, where match is the match object.
  • The if-false regex will be executed only if test has no match. This clause is optional, but if you omit it, you should also omit the | between the two clauses.

This kind of regex resembles Python conditional expressions (e.g. 1 if test else 0), except the second clause is optional. Beware that the group’s index or name you use in a test must be defined elsewhere in the regex, on the left side of the conditional. Otherwise Python will raise an exception while compiling the regex; e.g. (\()?([^)]*)(?(3)\() will fail to compile since the conditional refers to group 3, but there are only two groups in this regex, namely, (\() and ([^)]*). Presenting conditional regexes with Python can get pretty complex, so we’re going to dedicate an entire tutorial to them.
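
To give a first taste of the syntax, here is a minimal sketch of a conditional without the if-false clause. It accepts a word that is either bare or enclosed in parentheses; the unwrap name and the choice of \w+ for the word are our own assumptions.

```python
import re

# (?(1)\)) requires a closing parenthesis only if group 1,
# the optional opening parenthesis, took part in the match.
OPTIONAL_PARENS = re.compile(r'(\()?(\w+)(?(1)\))')

def unwrap(text):
    """Return the word inside optional, balanced parentheses."""
    match = OPTIONAL_PARENS.fullmatch(text)
    return match.group(2) if match else None

for sample in ('(abc)', 'abc', '(abc', 'abc)'):
    print(sample, '->', unwrap(sample))
```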

Closing Thoughts

In this tutorial we learned how to use groups to break a regex into smaller units and to capture their substrings. Groups come in two flavors: indexed groups, which are numbered using integers (starting from 1), and named groups, for which we provide a name using the (?P<name>regex) syntax. Sometimes we just want to delimit the scope of an operator, rather than capturing a substring, so we use the (?:regex) syntax instead.

Recall that a regex stands for a set of words. When we match it against a sample, we select the one word from that set that is actually found in the sample; e.g. a|b stands for the set { ‘a’, ‘b’ }, but when we call re.match('a|b', 'a'), we select the word a among the possible matches for that regex. That’s where group references come in handy: they allow us to refer to the submatch of a group elsewhere in the regex, even though we don’t know it before executing the regex. We can refer to a group either by index or by name.
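
To make the a|b example concrete, here is a small sketch comparing two hypothetical regexes: one that repeats the alternation, and one that refers back to whatever group 1 actually matched.

```python
import re

# Two independent choices: any of 'aa', 'ab', 'ba', 'bb'.
no_reference = re.compile(r'(a|b)(a|b)')

# \1 must repeat the word selected by group 1: only 'aa' or 'bb'.
with_reference = re.compile(r'(a|b)\1')

for sample in ('aa', 'bb', 'ab'):
    print(sample,
          no_reference.fullmatch(sample) is not None,
          with_reference.fullmatch(sample) is not None)
```

Both regexes accept aa and bb, but only no_reference accepts ab, since \1 stands for the substring group 1 captured rather than for the regex a|b itself.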

We can use the submatch of a group to choose one of the two regexes of a conditional regex, which we’ll describe in more detail in another tutorial. For now, just know conditional regexes are much like Python conditional expressions and can be nested at will. In other words, the clauses of a conditional regex can themselves be conditional regexes.
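
As a rough illustration of nesting, the sketch below places a second conditional in the if-false clause of the first. The pattern and the is_wrapped helper are hypothetical: a word may be wrapped in <...>, in [...], or in nothing at all.

```python
import re

# Group 1 is an optional "<", group 2 an optional "[".
# If group 1 matched, require ">"; otherwise, if group 2 matched, require "]".
wrapped = re.compile(r'(<)?(\[)?\w+(?(1)>|(?(2)\]))')

def is_wrapped(text):
    """Return True if text is a word with a matching pair of delimiters, or none."""
    return wrapped.fullmatch(text) is not None

for sample in ('<word>', '[word]', 'word', '<word', '[word', 'word]'):
    print(sample, is_wrapped(sample))
```

The first three samples should be accepted and the last three rejected, since each closing delimiter is only allowed when its opening counterpart was captured.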

Syntax Diagrams

This section summarizes what we’ve learned about grouping and conditional regexes. I encourage you to read our previous tutorial on basic regex operators to jog your memory.

The following syntax diagrams are grouped into sections. Each section gives a brief explanation of the rules, and provides links to parts of the tutorial where those rules were first defined.

Definition of Group

The term rule, which represents a part of a regular expression, extends the rule with the same name defined in the previous tutorial by adding grouping and conditional regexes. A letter is any character c from the Unicode character set for which c.isalpha() is True.

[Syntax diagram: Definition of Group]

Grouping by Index and by Name. Parentheses

A group captures the submatch of the regex it encloses. Each group is assigned an integer index, starting from 1. We can also assign a name to a group. You can map a group’s name to its index using regex.groupindex[g], where regex is the compiled regex to which the group belongs, and g is the name of the group. If you just need to delimit the scope of an operator, you can use non-capturing parentheses instead of a group: enclose the regex between (?: and ). For example, (\d+(?:\.\d+)?) represents a decimal number with an optional fractional part, like 12.2 or 34. Here, the parentheses around \.\d+ delimit the scope of the ? operator in order to make the fractional part optional. Since we are only interested in capturing the entire number, not both the number and its fractional part, we enclosed the fractional part between (?: and ).
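
The sketch below puts these pieces together. It uses the number regex with the dot escaped as \. so that it matches only a literal decimal point, and it gives the capturing group the hypothetical name number so that groupindex has something to map.

```python
import re

# One named capturing group; the fractional part is non-capturing.
number = re.compile(r'(?P<number>\d+(?:\.\d+)?)')

print(number.groups)                 # how many capturing groups there are
print(number.groupindex['number'])   # index of the group named "number"

match = number.fullmatch('12.2')
print(match.groups())                # only the whole number is captured

print(number.fullmatch('12x2'))      # None: "x" is not a decimal point
```

Because the fractional part is enclosed in (?: and ), the regex has a single capturing group, so the name number maps to index 1.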

[Syntax diagram: Grouping by Index and by Name. Parentheses]

Group References

Once you have defined a group, you can refer to it inside the same regex by using its index (e.g. \1 is a reference to the group 1, defined elsewhere in the regex), or by name (e.g. (?P=g) is a reference to a group named g).
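
Here is a short sketch of both reference styles, applied to a quoted string that must close with the same quote character it opened with. The regexes and the group names quote and body are hypothetical, chosen only for this illustration.

```python
import re

# Reference by index: \1 repeats whatever quote group 1 captured.
by_index = re.compile(r'''(['"])(.*?)\1''')

# Reference by name: (?P=quote) repeats the group named "quote".
by_name = re.compile(r'''(?P<quote>['"])(?P<body>.*?)(?P=quote)''')

print(by_index.fullmatch("'hi'").group(2))
print(by_name.fullmatch('"hi"').group('body'))

# Mismatched quotes are rejected by both regexes.
print(by_index.fullmatch('"hi\''))
print(by_name.fullmatch('"hi\''))
```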

[Syntax diagram: Referencing a Group]

Conditional Regexes

You can use a group’s match as a test for a conditional regex in the form (?(test)if-true|if-false), where the if-true regex is executed if test has a match; otherwise, the if-false regex is executed. The if-false clause is optional.
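
For a conditional that uses both clauses, the sketch below borrows the deliberately simple email pattern from the Python re documentation: if the address opens with <, it must close with >; otherwise it must run to the end of the string. The is_address helper is a hypothetical name.

```python
import re

# if-true clause: ">"   (group 1 captured "<")
# if-false clause: "$"  (no "<", so the address must end the string)
address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

def is_address(text):
    """Return True if text matches the simplified address pattern."""
    return address.match(text) is not None

for sample in ('<user@host.com>', 'user@host.com',
               '<user@host.com', 'user@host.com>'):
    print(sample, is_address(sample))
```

The first two samples should be accepted and the last two rejected. This is a poor email validator and is used here only to show how the two clauses are selected.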

[Syntax diagram: Conditional Regexes]


List of All Examples in This Tutorial