In a previous tutorial we learned how to represent the characters from the Unicode character set, and how to combine them into regular expressions using three basic operators. We also described how to use regexes to match either a text sample, or a sequence of bytes. Now it’s time to see how we can break a regex into groups, and how to capture the substring they match.
Once we’ve defined a regex group, we can refer to the substring it captures in other parts of the regex by using indexed and named references. Once you understand the lessons taught in this tutorial, you can use the syntax diagrams we provide as a quick reference for the topics we describe.
- Capturing a Match with Groups
- Parenthesis
- Referencing Groups Inside a Regex
- Introduction to Regex Conditionals
- Closing Thoughts
- Syntax Diagrams
Capturing a Match with Groups
In a previous tutorial we learned the basics of Python regular expressions, and how to match a whole regex to a text sample or to a byte sequence. But Python allows us to break a regex into smaller pieces, called groups, and to capture the part of the sample they represent (which is called a submatch or substring). In this section we will describe how to define a group, and how to retrieve its substring with the match.group() method, referring to it either by index or name.
Indexed Groups
We can create a Python regex group by enclosing part of a regex between parentheses; e.g. (\d{4})-(\d{2})-(\d{2}) is a regex for a YYYY-MM-DD date with three groups, namely (from left to right): (\d{4}), which has index 1 and captures the year; (\d{2}), which has index 2 and captures the month; and another (\d{2}), which has index 3 and captures the day. After matching a text to a regex using one of the searching functions, we can retrieve the substrings of all groups within that regex by using the groups() method of the match object. For example:
import re

def test_split_color():
    """ Splitting an HTML color. """
    color = '#004488'
    for submatch in re.match(r'''
            \s*(\#
            ([\da-fA-F]{2})  # red sample
            ([\da-fA-F]{2})  # green sample
            ([\da-fA-F]{2})  # blue sample
            )\s*
            ''', color, re.VERBOSE).groups():
        print(submatch)

test_split_color()
In this example we represent an HTML color as a sequence of 6 hexadecimal digits, prefixed by a #; e.g. #006400 (DarkGreen), #FF00FF (Fuchsia), and #FFD700 (Gold) are all valid HTML colors. Our regex uses 4 groups:
- The outermost group includes the whole HTML color, skipping leading and trailing whitespace. Notice that we prefixed the # symbol with a backslash, otherwise it would have been interpreted as the start of a comment in a verbose regex. This group has index 1.
- Inside the main group, there are three other groups, each one enclosing a [\da-fA-F]{2} regex, which represents two uppercase or lowercase hexadecimal digits in a row. These groups have the following indices (from left to right): 2, 3, and 4. They capture, respectively, the red, green, and blue components of the HTML color.
The groups() method returns all groups’ substrings as a tuple, sorted in ascending order according to their indices. So, in this tuple the whole HTML color comes first, then the red component, the green component, and, finally, the blue component, as we can see from the output of the test_split_color() function:
#004488
00
44
88
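The YYYY-MM-DD regex mentioned at the start of this section can be tried the same way. The following is a minimal sketch; the names DATE_RE and split_date are ours, introduced only for illustration, and we assume the whole string must be a date (hence fullmatch() rather than match()).

```python
import re

# Three indexed groups: year (index 1), month (index 2), day (index 3).
DATE_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

def split_date(text):
    """Return (year, month, day) as strings, or None if text is not a date."""
    match = DATE_RE.fullmatch(text)
    return match.groups() if match else None

print(split_date('2019-05-22'))   # a tuple with one string per group
print(split_date('22/05/2019'))   # None: the sample does not fit the regex
```

Checking the match object before calling groups() avoids an AttributeError when the sample doesn’t match, something the shorter examples in this tutorial skip for brevity.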
Code More, Distract Less: Support Our Ad-Free Site
You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.
We can also retrieve groups’ submatches one by one, by passing their indices to the group() method of the match object. Once again we point out that groups are numbered starting from 1, so that the outermost group comes first, and that the groups at the same level are indexed from left to right. An example should make it clear:
import re

def test_split_filename():
    """ Splitting a file name using groups. """
    filename = '2019-05-22-regex-tutorial-draft.md'
    match = re.match(r'''
        \s*(
        ((\d{4})-(\d{2})-(\d{2}))  # date
        -
        ([\w-]+)                   # title
        (\.[a-zA-Z]+)              # extension
        )\s*
        ''', filename, re.VERBOSE)
    for i in range(1, 8):
        print('%d: %s' % (i, match.group(i)))

test_split_filename()
Here we split a filename into its parts. This regex has 7 groups, and it is more nested than the previous one, so we’ll break it down into its groups (groups are listed in ascending order, according to their indices):
- The outermost group includes the whole filename, except for leading and trailing whitespaces, which are left out of the parentheses.
- Inside that group there are 3 other groups at the same level, so we’ll read them left to right. The leftmost group is ((\d{4})-(\d{2})-(\d{2})), which represents a YYYY-MM-DD date, as we have already seen above. This group will capture the date, i.e. 2019-05-22.
- The date group has 3 nested groups, all at the same level. So the next group will be the leftmost, i.e. (\d{4}), which represents the date’s year, i.e. 2019.
- The next group is (\d{2}), representing the date’s month, in our case 05 (May).
- The last subgroup inside the date group is (\d{2}), which represents the date’s day, i.e. 22.
- Since there isn’t any group left inside the date group, we go back to the upper level, and we choose the ([\w-]+) group, which represents a sequence of alphanumeric characters, including - and _. It captures the title of the document, namely, regex-tutorial-draft.
- Finally, the last group is (\.[a-zA-Z]+), which catches the file’s extension, i.e. .md.
The following diagram summarizes everything that we have just said about our example. The scope of each group is delimited by horizontal braces. The parts of the regex with a yellow background don’t belong to any group.
The for loop will print all groups’ substrings, in the exact order described just now:
1: 2019-05-22-regex-tutorial-draft.md
2: 2019-05-22
3: 2019
4: 05
5: 22
6: regex-tutorial-draft
7: .md
Notice that neither the leading and trailing whitespace, nor the - between the date and the title of the filename, are included in the output, since they don’t belong to any group.
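Besides single indices, the match object offers a few related accessors. The sketch below reuses only the date part of the filename regex, on a sample string of our own, to show group(0) (the entire match), group() with several indices at once, and span() for the offsets of a group.

```python
import re

# Only the date part of the filename regex, with surrounding whitespace.
match = re.match(r'\s*((\d{4})-(\d{2})-(\d{2}))\s*', '  2019-05-22  ')

whole = match.group(0)        # the entire match, whitespace included
date = match.group(1)         # the outermost group only
parts = match.group(2, 3, 4)  # several groups at once, returned as a tuple
span = match.span(1)          # (start, end) offsets of group 1 in the sample

print(repr(whole), date, parts, span)
```

Index 0 is not a group we defined: it always stands for the whole match, which is why our own groups start at 1.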
Named Groups
Referring to a group by its index has two major shortcomings. First, it’s error-prone, since group nesting makes it easy to mistake one index for another. Second, it makes the code handling regexes harder to refactor; e.g. if we swap two groups, or if we add a group before other groups, we may have to change any code relying on the value of match.group() or match.groups(), since the previous indices now refer to different groups. Python comes to our rescue, since it allows us to refer to a group by its name, using the syntax (?P<name>regex), where name is the name of the group and regex is the pattern it encloses. For example:
import re

def test_split_date_with_groups():
    """ Splitting a YYYY-MM-DD date using named groups. """
    sample = """
        Artist,Album,ReleaseDate
        Pink Floyd,The Wall,1979-11-30
        Spandau Ballet,True,1983-02-28
        Queen,Jazz,1978-11-10
        Johnny Cash,At Folsom Prison,1968-01-13
        Toto,Toto,1978-10-10
        America,Homecoming,1972-11-15
        Bryan Ferry,Boys and Girls,1985-06-03
        Leo Sayer,Living in a Fantasy,1980-08-22
        """
    regex = re.compile(r'''
        (?P<date>               # group matching the whole date
        (?P<year>\d{4})-        # YYYY year
        (?P<month>\d{2})-       # MM month
        (?P<day>\d{2})          # DD day
        )''', re.VERBOSE)
    print('{:14}{:7}{:7}{:7}'.format(
        'date', 'year', 'month', 'day'))
    for date in regex.finditer(sample):
        print('{date:14}{year:7}{month:7}{day:7}'.format(
            **date.groupdict()))

test_split_date_with_groups()
Here we want to retrieve YYYY-MM-DD dates from the CSV data in sample. The groupdict() method of the match object returns a dictionary, whose keys are the groups’ names, and whose values are the strings matched by the respective groups. It is similar to the groups() method, but names are easier to remember than indices. We use groupdict() to print all string matches as a fixed-width table:
date          year   month  day
1979-11-30    1979   11     30
1983-02-28    1983   02     28
1978-11-10    1978   11     10
1968-01-13    1968   01     13
1978-10-10    1978   10     10
1972-11-15    1972   11     15
1985-06-03    1985   06     03
1980-08-22    1980   08     22
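When a named group is optional, groupdict() reports None for it unless we pass a default. The sketch below uses a hypothetical variant of the date regex, in which the day may be missing, to show the difference; the pattern and the '??' placeholder are our own choices.

```python
import re

# Hypothetical variant: the day is optional, so 'day' may not take part in the match.
regex = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?')

full = regex.fullmatch('1979-11-30').groupdict()
partial = regex.fullmatch('1979-11').groupdict()
filled = regex.fullmatch('1979-11').groupdict(default='??')

print(full)     # every group has a string value
print(partial)  # 'day' maps to None
print(filled)   # 'day' maps to the default we supplied
```

Supplying a default is handy when the dictionary is passed straight to str.format(), as in the example above, since formatting None with a width specification like {day:7} would raise an exception.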
Once you have defined a name for a group, you can switch between names and indices using the groupindex attribute of the compiled regex, which maps each group’s name to its index:
import re

def test_currency():
    """ Matching U.S. Currency Values. """
    samples = (
        '$15', '$2.56', '$12.23',
        '$1,000.00', '$11,231.00',
        '$24,677,333.14')
    regex = re.compile(r'''
        \$                      # dollar sign
        (?P<left>               # start of the left group
        \d{1,3}                 # non-grouped digits
        (,\d{3})*)              # grouped digits
        (?P<right>\.\d{2})?     # fractional part (optional)
        ''', re.VERBOSE)
    print("Left group's index is %d" % regex.groupindex['left'])
    print("Right group's index is %d" % regex.groupindex['right'])
    for sample in samples:
        m = regex.match(sample)
        print('%-16s%-16s%s' % (
            # equals to m.group(1)
            m.group('left'),
            # equals to m.group(3)
            m.group('right') if m.group('right') else 'N/A',
            m.group(2) if m.group(2) else 'N/A'))

test_currency()
In this case we want to represent the U.S. currency format, which has a leading $, followed by one or more decimal digits, and an optional fractional part. The integer and the fractional parts are separated by . (dot). If the integer part has more than three digits, then digits are grouped three by three, separated by a comma; e.g. $12 (only integer part), $2.56 (both integer and fractional part, separated by .), and $1,000.12 (digits grouped 3 by 3) are all valid currency values. We can readily confirm that all strings in samples are valid currency values. Our regex uses three groups:
- The left named group catches the whole integer part of the currency value. It has index 1, as we’ll see from the output of regex.groupindex['left'].
- The left group contains another group, (,\d{3}), which matches a sequence of 3 decimal digits, prefixed by a comma. Since this group has no name, we can only refer to it by the index 2.
- The right named group captures the fractional part, if any. The \.\d{2} regex matches a dot, followed by exactly 2 decimal digits. It has index 3, as we’ll see from the output of regex.groupindex['right'].
After compiling the regex, we use regex.groupindex to print the indices of both the left and the right groups. Then we use m.group('left') to retrieve the string captured by the left group (we could have used m.group(1) instead), while we had to use m.group(2) for the second group since it has no name. Now, let’s see the output of test_currency():
Left group's index is 1
Right group's index is 3
15              N/A             N/A
2               .56             N/A
12              .23             N/A
1,000           .00             ,000
11,231          .00             ,231
24,677,333      .14             ,333
There is still one thing worth noting. The output of the second group for 24,677,333.14 is ,333. As a matter of fact, while performing the search on that value, the second group matches twice, first with the ,677 substring, then with ,333, which overwrites the previous substring. Since this group only holds temporary values, it makes little sense to give it a name. Moreover, we could dispose entirely of its substring. We’ll learn how to do that soon.
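The overwriting behavior is easy to observe directly. The sketch below condenses the currency regex onto one line and, as one possible workaround of our own (not the technique this tutorial introduces next), runs re.findall() over the text captured by the left group to recover every comma-separated chunk.

```python
import re

# The currency regex from above, written on a single line.
regex = re.compile(r'\$(?P<left>\d{1,3}(,\d{3})*)(?P<right>\.\d{2})?')
m = regex.match('$24,677,333.14')

# A repeated group keeps only the substring of its last repetition.
last_only = m.group(2)

# A second pass over the captured integer part recovers every repetition.
all_chunks = re.findall(r',\d{3}', m.group('left'))

print(last_only, all_chunks)
```

The second pass works because the pattern passed to findall() has no groups, so it returns the full text of each match.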
Matching an E-Mail Address
Now it’s time for a more challenging example. Let’s suppose we want to split an e-mail address, complying with a subset of the RFC 5322:2008 specification. While defining our syntax rules, we retained the same names from the specification, just in case you want to enhance the regex that we’ll supply, and make it fully compliant with that normative document.
You can skip this syntax diagram and all the technical jargon in the bulleted list, if you’d like. It’s just a fancy way of telling you all the ways an email address can be constructed so we can be sure we’re creating a rigorous regex for capturing the components of an email address.
Let’s take a closer look at the syntax diagrams:
- The addr-spec rule is the start symbol of the syntax, i.e. we must read that rule first. An e-mail address has two parts, local-part and domain, separated by the @ symbol.
- The domain rule represents the e-mail’s domain, either by name (like in wellsr.com) or as a domain literal.
- The domain-literal rule allows you to insert an IPv4 address in place of the domain’s name, e.g. 255.0.0.0.
- The d-text rule lists all valid characters for a domain literal. We can use any ASCII character from 0x21 (exclamation mark) to 0x5A (uppercase Z), and from 0x5E (^) to 0x60 (backtick). The … symbol in the syntax diagram is a placeholder for all missing characters between its left and its right neighbors, e.g. A … Z stands for all ASCII characters between A and Z (both included).
- The local-part rule can represent a username, which can include sequences of lowercase letters and decimal digits, separated by dots; e.g. john.doe and jane.doe.smith are both valid usernames.
- The a-text rule lists all valid characters in local-part. The TCK, SQT, and SP symbols stand for, respectively, ` (backtick), ' (single quote), and a space character. We used them only to improve readability of this syntax diagram.
Without further ado, let’s see how to represent these syntax diagrams using Python regexes:
import re

def test_email_address():
    """ Splitting an e-mail address. """
    samples = (
        'john.s.smith@nomail.mmm',
        'Title.Case@useless.uuu',
        'User@[192.168.0.1]'    # using domain literal
    )
    # Note: '-' is escaped inside the character classes, otherwise '+-/'
    # would be read as a range that also includes ',' and '.'.
    regex = re.compile(
        r'''(?P<AddrSpec>       # the whole address
        (?P<LocalPart>          # the local part
        [\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+
        (?:\.[\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+)*)
        @
        (?P<Domain>             # domain
        (?P<DotAtom>
        [\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+
        (?:\.[\dA-Za-z!#$%&'*+\-/=?^ `{}|~]+)*)|
        \[(?P<DomainLiteral>
        [\x21-\x5a\x5e-\x60]+)\])   # domain-literal
        )''', re.VERBOSE)
    for sample in samples:
        match = regex.match(sample)
        if match:
            print('{:30}{:16}{}'.format(
                match.group('AddrSpec'),
                match.group('LocalPart'),
                match.group('Domain')))
        else:
            print('No match for: %r' % sample)

test_email_address()
The regex for an e-mail address looks a little bit intimidating, but it is simply a translation of the previous syntax diagrams into the language of Python regexes. In order to ease the translation process, the groups’ names in regex mirror the names of the rules in the syntax diagrams:
- The AddrSpec group captures the whole e-mail address.
- The LocalPart group matches the username.
- The Domain group catches the e-mail’s domain. It has two subgroups, DotAtom and DomainLiteral. We included them just to make the translation from syntax diagrams easier, but we won’t use them.
All e-mail addresses in samples comply with our syntax rules. The for loop iterates over the samples and tries to match each one; if the match is successful, it prints the whole e-mail address, the local part, and the domain. Otherwise, it prints an error message. Let’s see if we got the regex right:
john.s.smith@nomail.mmm       john.s.smith    nomail.mmm
Title.Case@useless.uuu        Title.Case      useless.uuu
User@[192.168.0.1]            User            [192.168.0.1]
As you can see, all addresses have been matched successfully, and they have been split into their username and domain components. Notice that the last example uses an IPv4 address, instead of the domain’s name. With properly structured regular expressions, you’re able to capture nonstandard strings in an email address like this.
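If we did want to know which form of domain was used, we could check which of the two alternative subgroups took part in the match. The sketch below uses a deliberately simplified pattern of our own (it is not the RFC-based regex above), and the group names local, name, and literal, as well as the function domain_kind, are hypothetical.

```python
import re

# Simplified, hypothetical pattern: NOT the RFC-based regex from this section.
regex = re.compile(r'''
    (?P<local>[^@\s]+) @
    (?: (?P<name>[\w.-]+)             # domain given by name
      | \[ (?P<literal>[\d.]+) \] )   # domain given as a literal
    ''', re.VERBOSE)

def domain_kind(address):
    """Return 'name', 'literal', or None if the address does not match."""
    match = regex.fullmatch(address)
    if match is None:
        return None
    # Only one alternative can take part in a match; the other group is None.
    return 'literal' if match.group('literal') is not None else 'name'

for address in ('john.s.smith@nomail.mmm', 'User@[192.168.0.1]', 'no-at-sign'):
    print(address, domain_kind(address))
```

Testing against None rather than truthiness matters in general, since a group that did take part in the match can still capture an empty string.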
Parenthesis
Sometimes we need parentheses not to capture a substring, but simply as a means of delimiting the scope of an operator. For example, the \s*((a|b)?c)\s* regex uses the outer group to match any of the words in the set { ‘ac’, ‘bc’, ‘c’ }, and the (a|b) group to just delimit the scope of the ? operator. We could easily distinguish between these two kinds of use, for example by using named groups when we actually need the substring, while leaving all other groups anonymous. In other words, we could rewrite the previous example as \s*(?P<g>(a|b)?c)\s*, so that the group g is the only one we read. The anonymous group, however, would still capture a substring and take up an index. To prevent that, we can enclose the regex between (?: and ) instead, as in the following example:
import re

def test_float_literal():
    """ Literal for floating-point numbers. """
    samples = ('-12', ' +2.54 ', '34 ', ' 3.14')
    # The dot is escaped so that it matches a literal '.', not any character.
    regex = re.compile(r'\s*([+-]?\d+(?:\.\d+)?)\s*')
    regex_g = re.compile(r'\s*([+-]?\d+(\.\d+)?)\s*')
    for n in samples:
        print('\t'.join(
            map(str, regex.match(n).groups())))
        print('\t'.join(
            map(str, regex_g.match(n).groups())))

test_float_literal()
A float literal has an optional sign, an integer part, and an optional fractional part. There can be leading and trailing whitespace; e.g. 33 (without sign or fractional part), +12 (with sign, but without fractional part), and -2.56 (which has both a sign and a fractional part) are all valid float literals. In order to match float literals, we will use two different (but equivalent) regexes:
- The regex_g regex captures both the entire float (using the ([+-]?\d+(\.\d+)?) group, which has index 1) and the fractional part (using the (\.\d+) group, which has index 2).
- On the other hand, regex captures only the entire float, since it replaces the second group with (?:\.\d+), which prevents the regex engine from saving the intermediate substring.
The loop iterates over each number in samples and, for both regexes, prints the submatches returned by the groups() method:
-12
-12     None
+2.54
+2.54   .54
34
34      None
3.14
3.14    .14
As we can see from the output, the first regex has just one submatch, while the second has two, namely the whole float and its fractional part. When the fractional part is missing, its submatch is None.
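The difference between the two kinds of parentheses also shows up outside the match object. The following sketch, on a sample string of our own, compares the groups attribute of the compiled pattern (the number of capturing groups) and the result of findall(), which returns the groups’ substrings whenever the pattern has any.

```python
import re

capturing = re.compile(r'[+-]?\d+(\.\d+)?')
non_capturing = re.compile(r'[+-]?\d+(?:\.\d+)?')

text = '-12 +2.54 34'

# With one capturing group, findall() returns that group's substring
# for each match (an empty string when the group did not take part).
print(capturing.groups, capturing.findall(text))

# With no capturing groups, findall() returns the whole matches.
print(non_capturing.groups, non_capturing.findall(text))
```

This is a common source of surprise: adding a pair of plain parentheses just to scope an operator silently changes what findall() returns.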
Referencing Groups Inside a Regex
In the previous sections we learned how to capture a submatch using both indexed and named groups. Once captured, we can use that substring in one of the following ways:
- to match the exact same substring elsewhere in the regex;
- as a test for conditional regexes, as we will see in another tutorial;
- to replace it with another string; we will deal with string replacement in another tutorial.
In the following subsections we will learn how to refer to a previously captured group either by index or by name.
Referencing a Group by Index
You can refer to a group by prefixing its index with a backslash; e.g. \1 refers to the match of the first group.
Let’s see an example right away. Suppose you have to handle some kind of tabular data. Simply said, your data is organized into rows with each row having the same number of fields. You know each column is supposed to hold a certain data type, but the field separator can change from row to row, as long as there is only one separator per row; e.g. one row can use comma as a separator, while the next row can use semi-colon.
import re
import io

def test_tabular_data():
    """ Splitting tabular data. """
    sample = io.StringIO('''
        Title,Duration,Date,Credits
        A Horse With No Name,4:10,1972-01-12,Bunnell
        Ventura Highway;3:32;1972-09-19;Bunnell
        Only In Your Heart|3:16|1973-04-14|Beckley
        Rainbow Song,4:00|1973-11-28,Bunnell
        She's Gonna Let You Down;3:40;1974-02-20,Beckley
        ''')
    regex = re.compile(r'''
        \s*(?P<title>[^,;|]+)           # title
        ([,;|])\s*                      # first separator
        (?P<dur>\d:\d\d)                # duration
        \s*\2\s*                        # second separator
        (?P<date>\d{4}-\d{2}-\d{2})     # date
        \s*\2\s*                        # third separator
        (?P<credits>[A-Za-z]+)          # song's credits
        ''', re.VERBOSE)
    print('{:24}{:12}{:12}{}'.format(
        'title', 'duration', 'date', 'credits'))
    for line in sample:
        m = regex.search(line)
        if m:
            print('{title:24}{dur:12}{date:12}{credits}'.format(
                **m.groupdict()))

test_tabular_data()
In this Python 3 example we use an io.StringIO object to hold our data. This class provides Python strings with a file-like interface, so that we can iterate over the lines of our sample using a for loop. Our data is split into four columns:
- The first column holds the title of a single by the Anglo-American band America. It can contain any character, other than the three separators { ‘,’, ‘;’, ‘|’ }. It is captured by the title group.
- The second column holds the song’s duration, in the m:ss format (minutes and seconds, separated by :). It is captured by the dur group.
- The third column holds the single’s release date, in the YYYY-MM-DD format. It is captured by the date group.
- The fourth column holds the name of the songwriter, as a string of capital or small letters. It is captured by the credits group.
The field separator can be a comma (,), a semi-colon (;), or a vertical bar (|). The ([,;|]) regex (right after the title group) captures the separator between the first and the second field. Then, the two occurrences of \s*\2\s* refer to the first separator (which has index 2) to define, respectively, the separator between the second and the third field, and between the third and fourth field. This way we make sure there is only one kind of separator per row. Finally, the for loop iterates over each line in the sample, and it prints the contents of each field:
title                   duration    date        credits
A Horse With No Name    4:10        1972-01-12  Bunnell
Ventura Highway         3:32        1972-09-19  Bunnell
Only In Your Heart      3:16        1973-04-14  Beckley
Notice the last two records from the sample are missing. As a matter of fact, the fourth record uses a comma as first and last separator, but it uses | to separate the second field from the third, so not all separators are equal. Similarly, in the fifth record the semi-colon separates both the first field from the second, and the second from the third, but the last separator is a comma instead. Though it has just one separator (comma), the header (i.e. Title,Duration,Date,Credits) is also missing, because its second and third fields are invalid according to the rules stated previously.
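Indexed references are not limited to separators. As a further illustration, the sketch below uses \1 to spot accidentally doubled words in a sentence; the regex, the function name find_doubled_words, and the sample sentence are our own.

```python
import re

# \1 must repeat exactly the word captured by group 1.
doubled = re.compile(r'\b(\w+)\s+\1\b')

def find_doubled_words(text):
    """Return every word that is immediately followed by itself."""
    return [m.group(1) for m in doubled.finditer(text)]

print(find_doubled_words('This is is a test of the the parser'))
```

The \b word boundaries keep the regex from matching inside longer words, e.g. the "is" at the end of "This" is not taken as the first half of a doubled word.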
When you write a regex containing a reference to a group, beware of optional groups, since you can get unexpected results.
import re

def test_optional_group_ref():
    """ Reference to an optional group. """
    samples = ('\\012\\', '12', '|056|', '56')
    regex = re.compile(r'([\\|])?(\d+)\1')
    for sample in samples:
        match = regex.match(sample)
        print(match.group(2) if match \
            else 'No match for: %s' % sample)

test_optional_group_ref()
In this case we want to write a regular expression for a decimal integer number, optionally enclosed between \ (like Prolog octal escape sequences) or between | (like the mathematical absolute value); e.g. 12, \12\, and |12| are all proper values. By relying on what we know about group references, we might be tempted to translate the rules above into the ([\\|])?(\d+)\1 regex, where:
- the ([\\|])? group is in charge of capturing the prefix (if any), which must be either \ or |;
- the (\d+) group captures the decimal number;
- the \1 reference represents the (optional) trailing part, which must be equal to the (optional) leading part. We may think that if the leading part is missing, then ([\\|])? matches the empty string, so the trailing part will also be optional.
The loop iterates over the samples, and it tries to match them using regex. It prints the number when the match succeeds, and an error message otherwise:
012
No match for: 12
056
No match for: 56
As we can see from the output, the two integers having both the prefix and suffix parts match successfully, while 12 and 56 fail. The problem is that, when the prefix is omitted, ([\\|])? evaluates to None instead of the empty string, which is fine, since the prefix is optional. But when we try to use that group for the trailing part, the match fails, since the \1 expression is not optional (i.e. is not (\1)?). We will come back to the problem of matching leading and trailing sequences in our tutorial on conditional regexes.
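Until then, one possible workaround (our own suggestion, not the conditional-regex approach announced above) is to move the ? inside the group. The group then always takes part in the match, possibly capturing the empty string, and \1 can repeat whatever it captured. The function name enclosed_number is hypothetical.

```python
import re

# The ? is inside the group, so group 1 always participates in the match
# (capturing '' when there is no prefix), and \1 can then match it.
regex = re.compile(r'([\\|]?)(\d+)\1')

def enclosed_number(text):
    """Return the digits if text is a bare or properly enclosed number."""
    match = regex.fullmatch(text)
    return match.group(2) if match else None

for sample in ('12', '\\012\\', '|056|', '|056', '\\12|'):
    print(sample, enclosed_number(sample))
```

We use fullmatch() here so that a sample with a prefix but no matching suffix is rejected outright, instead of matching only part of the string.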
Referencing a Group by Name
We can also refer to a group by name, using the slightly more complex syntax (?P=name), where name is the name of a group defined earlier in the regex. For example:
import re

def test_match_tags():
    """ Matching HTML tags. """
    sample = r'''
        <b>Bold</b><i>Italics</i>
        <mod>Mod</mod><em>Emphasis</em>
        <h2>Level 2 Header</h2>
        <code>Bad code match</cod>
        <h1>Bad header match</h3>
        '''
    regex_noref = re.compile(r'''
        <(?P<otag>\w+)>         # opening tag
        (?P<text>[^<]*)         # contents
        </(?P<ctag>\w+)>        # closing tag
        ''', re.VERBOSE)
    print('Without references:\notag ctag text')
    for match in regex_noref.finditer(sample):
        print('{otag:8}{ctag:8}{text:12}'.format(
            **match.groupdict()))
    regex = re.compile(r'''
        <(?P<otag>\w+)>             # opening tag
        (?P<text>[^<]*)             # contents
        </(?P<ctag>(?P=otag))>      # closing tag
        ''', re.VERBOSE)
    print('\nUsing references:\notag ctag text')
    for match in regex.finditer(sample):
        print('{otag:8}{ctag:8}{text:12}'.format(
            **match.groupdict()))

test_match_tags()
We will try to search HTML elements inside sample using two different regexes:
- Both regexes use the otag group to catch the opening tag, the ctag group to capture the closing tag, and the text group to capture the contents of the element.
- Both regexes use the \w+ regex to represent the opening tag; e.g. b, h2, and code are all valid opening tags.
- Both regexes use the [^<]* regex to represent the contents of the HTML element, i.e. a string containing any character other than <.
- The regex_noref expression uses the same regex for both the opening and the closing tag.
- The regex expression matches the closing tag by using a reference to the opening tag, namely, (?P=otag).
Finally, regex.finditer(sample) iterates over the matches in sample. Here is the output of the test_match_tags() function:
Without references:
otag ctag text
b       b       Bold
i       i       Italics
mod     mod     Mod
em      em      Emphasis
h2      h2      Level 2 Header
code    cod     Bad code match
h1      h3      Bad header match

Using references:
otag ctag text
b       b       Bold
i       i       Italics
mod     mod     Mod
em      em      Emphasis
h2      h2      Level 2 Header
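The regex_noref expression accepts the two malformed elements, whose closing tags (cod and h3) differ from their opening tags (code and h1), while the regex expression rejects them, since its closing tag must repeat the opening tag exactly.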
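The same idea applies to any pair of delimiters that must agree. The following sketch, with a regex and sample text of our own, uses a named reference to match strings enclosed in either single or double quotes, requiring the closing quote to be the same character as the opening one.

```python
import re

# (?P=quote) must repeat the character captured by the 'quote' group.
quoted = re.compile(r'''(?P<quote>['"])(?P<body>.*?)(?P=quote)''')

def quoted_strings(text):
    """Return the contents of every properly quoted string in text."""
    return [m.group('body') for m in quoted.finditer(text)]

print(quoted_strings('He said "hi" and \'bye\' but not "oops\''))
```

The body uses the lazy .*? quantifier so that each match stops at the first closing quote instead of running on to the last one in the text.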
Introduction to Regex Conditionals
A substring we want to match can sometimes be enclosed between an optional prefix and an optional suffix. For example, both (x+(y * 2)) and x+(y * 2) really mean the same thing, but the second string lacks the outer parentheses. In these cases, conditional regexes come in handy. A conditional regex is a regex in the form (?(test)if-true|if-false), where:
- The test parameter is either a group’s index or name.
- The if-true regex will be executed only if test has a match, i.e. match.group(test) is not None.
- The if-false regex will be executed only if test has no match. This clause is optional, but if you omit it, you should also omit the | between the two clauses.
This kind of regex resembles Python conditional expressions (e.g. 1 if test else 0), except the second clause is optional. Beware that the group’s index or name you use in a test must be defined elsewhere in the regex, on the left side of the conditional. Otherwise Python will raise an exception while compiling the regex; e.g. (\()?([^)]*)(?(3)\() will fail to compile since the conditional refers to group 3, but there are only two groups in this regex, namely, (\() and ([^)]*). Presenting conditional regexes with Python can get pretty complex, so we’re going to dedicate an entire tutorial to them.
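As a small preview, here is a minimal sketch of a conditional regex that accepts a string with or without one pair of enclosing parentheses. The regex and the function name unwrap are our own; to keep things short the contents may not contain parentheses, so nested expressions like the one above are out of scope.

```python
import re

# Group 1 is the optional opening parenthesis. The conditional (?(1)\))
# requires a closing parenthesis only if group 1 took part in the match.
regex = re.compile(r'(\()?([^()]*)(?(1)\))')

def unwrap(text):
    """Return the contents without the optional enclosing parentheses."""
    match = regex.fullmatch(text)
    return match.group(2) if match else None

for sample in ('(x+y)', 'x+y', '(x+y', 'x+y)'):
    print(sample, unwrap(sample))
```

Here the if-false clause is omitted, so when there is no opening parenthesis the conditional matches the empty string, and fullmatch() rejects any stray closing parenthesis.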
Closing Thoughts
In this tutorial we learned how to use groups to break a regex into smaller units and to capture their substrings. Groups come in two flavors: indexed groups, which are numbered using integers (starting from 1), and named groups, for which we provide a name using the (?P<name>regex) syntax. Sometimes we just want to delimit the scope of an operator, rather than capturing a substring, so we use the (?:regex) syntax instead.
Recall, a regex stands for a set of words. When we match it to a sample, we select the only word from that set which can be found in the sample; e.g. a|b stands for the set { ‘a’, ‘b’ }, but when we call re.match('a|b', 'a'), we select the word a among the possible matches for that regex. That’s where group references come in handy. Group references allow us to refer to the submatch of a group elsewhere in the regex, even if we don’t know it before executing the regex. We can refer to a group either by index or by name.
We can use the submatch of a group to choose one of the two regexes of a conditional regex, which we’ll describe in more detail in another tutorial. For now, just know conditional regexes are much like Python conditional expressions and can be nested at will. In other words, the clauses of a conditional regex can themselves be conditional regexes.
Syntax Diagrams
This section summarizes what we’ve learned about grouping and conditional regexes. I encourage you to read our previous tutorial on basic regex operators to jog your memory.
The following syntax diagrams are grouped into sections. Each section gives a brief explanation of the rules, and provides links to parts of the tutorial where those rules were first defined.
Definition of Group
The c.isalpha()
is True
.
Grouping by Index and by Name. Parenthesis
A group captures the submatch of the regex it encloses. Each group is assigned an integer index, starting from 1. We can also assign a name to a group. You can map a group’s name to its index using regex.groupindex[g], where g is the group’s name. When we only need parentheses to delimit the scope of an operator, without capturing a substring, we enclose the regex between (?: and ). For example, (\d+(?:\.\d+)?) represents a decimal floating point number, like 12.2 or 34. Here, the parentheses around \.\d+ delimit the scope of the ? operator in order to make the fractional part optional. Since we are only interested in capturing the entire number, not both the number and its fractional part, we enclosed the fractional part between (?:...).
Group's References
Once you have defined a group, you can refer to it inside the same regex by using its index (e.g. \1 is a reference to the group 1, defined elsewhere in the regex), or by name (e.g. (?P=g) is a reference to a group named g).
Conditional Regexes
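You can use a group’s match as a test for a conditional regex in the form (?(test)if-true|if-false), where test is either a group’s index or name, the if-true regex is used when that group has a match, and the optional if-false regex is used when it doesn’t.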