This tutorial will guide you through using regular expressions (regexes) in Python. Regexes are a language which describes sets of words over an alphabet. First, we’ll learn how to represent all characters from the alphabet using just sequences of ASCII graphic characters. Then, we’ll describe how to build regexes from these characters using three simple operators, and we’ll explain what they mean, i.e. the set of words that they stand for. Once we know how to write regexes, we’ll use them for searching strings inside a text, or a byte sequence.

If you are already familiar with regular expressions, and you are itching for action, take a look at the summaries for character representation and for regex operators, and then dive right into string matching the Python way. You can download the Python source code for this tutorial from this link.

Regular Expressions' Alphabet

Regular Expressions (regexes for short) are a language used to define sets of words. When we say word we mean a sequence of characters over an alphabet, e.g. both cats and var = value are words over the ASCII character set.

Regular expressions can be defined from the characters in the alphabet by means of three operators: concatenation, alternation, and repetition. We will study these operators in depth in the next section. Right now we will focus on the regex alphabet.

Representation of Characters

Before starting to define regular expressions, we should know how to represent each character from the alphabet, which includes all Unicode characters. As long as we concern ourselves only with alphanumeric characters from the ASCII character set, we can use them directly, e.g. LATIN SMALL LETTER A is represented by a.

But regexes are a language, and as such they give a special meaning to some characters (called metacharacters), from which all operators are built: .^$*+?{}[]|(). We can’t use these characters directly in a regex without first “escaping” them, i.e. preceding them with a \ (backslash). For example, (a*b)+2 may look like a Python arithmetic expression, meaning «multiply a by b, then add 2», but in the regex language, it really stands for the set { ‘b2’, ‘ab2’, ‘aab2’, … }, as we will soon see. In order to represent the arithmetic expression, you should write (a\*b)\+2 instead.

Since both Python strings and Python regexes use escape sequences with a leading \, all regexes will be written as raw strings. A raw string literal looks like a string literal, except that it has an r prefix, which prevents the Python interpreter from replacing the escape sequences within it; e.g. while ‘dog\tcat’ will replace \t with a U+0009 tab character, so that the string object will have 7 characters (dog cat), the expression r’dog\tcat’ will create an 8-character string including the \ and t characters (dog\tcat). Then, the regex engine will take care of replacing escape sequences in the raw string.
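
The character counts above are easy to check in the interpreter. The following is a minimal sketch; the variable names are only illustrative, and fullmatch() (not covered in this tutorial) succeeds only when the whole string matches the regex.

```python
import re

plain = 'dog\tcat'    # Python replaces \t with a tab: 7 characters
raw = r'dog\tcat'     # the raw string keeps the \ and the t: 8 characters
print(len(plain), len(raw))

# The regex engine interprets the \t that is still present in the raw
# string, so the pattern matches the sample containing a real tab.
tab_match = re.fullmatch(raw, plain)
print(tab_match is not None)
```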

Representing Unicode Characters

Since Python 3.3, there are two ways of representing an arbitrary character using its Unicode code point: \uhhhh (for code points between U+0000 and U+FFFF), and \Uhhhhhhhh (for any code point between U+0000 and U+10FFFF), where h is a hexadecimal digit; e.g. \u03A9 and \U000003A9 both represent GREEK CAPITAL LETTER OMEGA. The Python regex engine will raise an exception in the following scenarios:

  • if the sequence is too short; e.g. \u123 has 3 digits instead of 4, \U12345 has 5 digits instead of 8.
  • if the hexadecimal value is out of bounds; e.g. \U12345678 is not a valid Unicode code point, since it is greater than U+10FFFF.
  • if it includes some character that is not a hexadecimal digit; e.g. \unoun, where none of the characters are valid hexadecimal digits.
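
The three failure scenarios can be reproduced with a short script. This is a sketch under the assumption that the malformed escapes surface as re.error when the pattern is compiled; the compiles() helper is hypothetical and exists only for this illustration.

```python
import re

def compiles(pattern):
    """ Return True if re.compile() accepts `pattern` (hypothetical helper). """
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

results = {p: compiles(p) for p in (
    r'\u03A9',        # well-formed: 4 hexadecimal digits
    r'\U000003A9',    # well-formed: 8 hexadecimal digits
    r'\u123',         # too short
    r'\U12345678',    # greater than U+10FFFF
    r'\unoun')}       # no hexadecimal digits at all

for pattern, ok in results.items():
    print('%-12s %s' % (pattern, 'ok' if ok else 're.error'))
```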

Python also provides a shorthand representation for the most common non-graphic characters, as shown in the following table:

Sequence  Code Point  Meaning
\a        U+0007      Alert (bell)
\b        U+0008      Backspace
\t        U+0009      Horizontal tab
\n        U+000A      Line feed
\v        U+000B      Vertical tab
\f        U+000C      Form feed
\r        U+000D      Carriage return

We can only use \b inside character classes, since the backspace escape has a different meaning outside a character class. The alternate meaning relates to regexes’ anchoring, which we will talk about in another tutorial.
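
A small sketch of the two meanings of \b, using the backspace character written as '\x08' in a normal Python string. The variable names are illustrative, and fullmatch() (not covered in this tutorial) requires the whole sample to match.

```python
import re

# Inside a character class, \b is the backspace character (U+0008).
backspace = re.fullmatch(r'[\b]', '\x08')

# Outside a class, \b is an anchor: it consumes no characters,
# so it can never match the backspace character itself.
anchor = re.fullmatch(r'\b', '\x08')

print(backspace is not None, anchor is not None)
```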

Remember that regexes are still Python strings, so you can put Unicode characters directly into them, e.g. \bβιο.*\b is also a valid regex, which represents the set of all words starting with the βιο prefix (this example should give you a hint of the use of \b as an anchor). Inserting plain Unicode characters means that the Python module containing the regex will be encoded in one of the Unicode encoding forms. On the other hand, escape sequences use only ASCII characters.

We’ve thrown a lot of technical jargon in this tutorial so far. Let’s take a step back and recap all the ways of representing a single character with regex:

Notation     Code Point  Unicode Name                         Description
a            U+0061      LATIN SMALL LETTER A                 Plain ASCII character insertion.
ä            U+00E4      LATIN SMALL LETTER A WITH DIAERESIS  Plain Unicode character insertion.
\.           U+002E      FULL STOP                            Escaping a metacharacter (see above).
\a           U+0007      BELL                                 Escape sequence for non-graphic characters (see above).
\u03BB       U+03BB      GREEK SMALL LETTER LAMBDA            The \uhhhh Unicode escape sequence.
\U00010600   U+10600     LINEAR A SIGN AB001                  The \Uhhhhhhhh Unicode escape sequence.
\N{TILDE}    U+007E      TILDE                                The \N{NAME} Unicode escape sequence (see below).
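
The notations in the table can be checked against the characters they stand for. The sketch below assumes Python 3.8 or later, since that is when the re module started to accept \N{NAME}; the checks dictionary is illustrative and maps each regex to the one-character sample it should match.

```python
import re

# Each key is a regex, each value is the single character it stands for.
checks = {
    r'a': 'a',                      # plain ASCII character
    'ä': 'ä',                       # plain Unicode character
    r'\.': '.',                     # escaped metacharacter
    r'\a': '\x07',                  # escape sequence for BELL
    r'\u03BB': '\u03bb',            # 4-digit Unicode escape
    r'\U00010600': '\U00010600',    # 8-digit Unicode escape
    r'\N{TILDE}': '~',              # named escape (assumes Python 3.8+)
}

# fullmatch() succeeds only if the whole sample matches the regex.
all_match = all(re.fullmatch(p, s) for p, s in checks.items())
print(all_match)
```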

Basic Operators of Regular Expressions

Now that we know how to represent all Unicode characters, it’s time to get in touch with the three basic operators that allow us to build regular expressions from the alphabet. We will describe both the syntax of each operator, and its meaning, that is, the set of words it stands for. Once you have a good grasp of the contents of this section, you can either refer to the operators’ summary or to the syntax diagrams for a quick reference. The three basic operators are:

  1. Concatenation
  2. Alternation
  3. Repetition

We’ll explain each operator in detail.


Concatenation

The simplest kind of regular expression is created by putting characters from the alphabet one after another, e.g. dog is a 3-character regex that represents the set { ‘dog’ }. The concatenation operator combines the three characters (d, o, and g) from left to right. Notice that the concatenation operator is not commutative, e.g. god is built from the same three characters, but it stands for a different set of words, namely { ‘god’ }.


Alternation

We can also create a regular expression by listing all words that it stands for, separated by |, which is the alternation operator; e.g. the regex red|green|blue represents the set of all primary colors, that is, { ‘red’, ‘green’, ‘blue’ }. Notice that in the last example we used both concatenation (for assembling words from characters) and alternation (for assembling the set).
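
To see the two operators at work, we can ask Python whether a word belongs to the set described by a regex. This is a minimal sketch: fullmatch() is not covered in this tutorial, it succeeds only when the whole word matches, and the sample words are made up for the illustration.

```python
import re

# Concatenation: the order of the characters matters.
dog = re.compile(r'dog')
print(bool(dog.fullmatch('dog')), bool(dog.fullmatch('god')))

# Alternation: a single regex for the whole set of words.
colors = re.compile(r'red|green|blue')
candidates = ('red', 'green', 'blue', 'yellow', 'redgreen')
primary = [w for w in candidates if colors.fullmatch(w)]
print(primary)
```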

Defining Sets of Single Characters

Python provides a shortcut to define sets whose elements are all single characters (such sets are known as character classes). Instead of listing them one after the other, separated by |, you enclose them between [ and ]; e.g. r|g|b is equivalent to [rgb]. All metacharacters inside a set definition are taken literally, except for \ and ], which are represented as \\ and \], respectively (^ and - also play a special role inside a set, as we will see shortly); e.g. we can simply write [ab()+*] for the set { ‘a’, ‘b’, ‘(’, ‘)’, ‘*’, ‘+’ } (where (, ), +, and * are metacharacters), but we must write [[\]\\] for the set { ‘[’, ‘]’, ‘\’ }.

You can use any character representation to define a regex set, including escape sequences for ASCII non-graphic characters and Unicode characters; e.g. [\u0391\u0392\u0393\u0394] is the set of the first four Greek capital letters, i.e. { ‘Α’, ‘Β’, ‘Γ’, ‘Δ’ }.

If that’s still too much typing for you, Python provides an even shorter notation for set definition, using l-u, which stands for the set of all characters from l to u (bounds included), according to their respective Unicode code points; e.g. [0-7] is the set of all octal digits, i.e., { ‘0’, ‘1’, …, ‘7’ }. You can place more than one of these pairs in your set definitions, mixed with other characters; e.g. [0-9A-Fa-f] is the set of all hexadecimal digits.
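
Here is a short sketch of ranges in set definitions. The sample string is made up, and the Greek letters are written as escapes so that the example does not depend on the encoding of the source file.

```python
import re

hex_digit = re.compile(r'[0-9A-Fa-f]')
octal_digit = re.compile(r'[0-7]')
greek = re.compile(r'[\u0391-\u0394]')    # same set as [\u0391\u0392\u0393\u0394]

# The sample ends with the first five Greek capital letters (alpha to epsilon).
sample = 'C0FFEE at 8 pm, \u0391\u0392\u0393\u0394\u0395'

hex_found = hex_digit.findall(sample)
octal_found = octal_digit.findall(sample)
greek_found = greek.findall(sample)
print(hex_found, octal_found, greek_found)
```

The range l-u follows code point order, so U+0395 (epsilon) falls outside [\u0391-\u0394] and is not reported.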

And what about the - character (HYPHEN-MINUS)? If you want it included in your set, you can either place it at the start or at the end of the set definition, or you can place it anywhere in your set definition by escaping it with \-. Let’s see if all this pans out:

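import re

def test_using_hyphen():
    """ Using hyphen in a set definition. """
    sample = "In this sample I'll surely find 1 or more " \
        "lowercase hexadecimal digits - right?"
    # Hyphen at the start, at the end, and escaped in the middle.
    regexes = (
        re.compile(r'[-0-9a-f]'),
        re.compile(r'[0-9a-f-]'),
        re.compile(r'[0-9\-a-f]'))
    for regex in regexes:
        print('Using regex %r:' % regex.pattern)
        for m in regex.finditer(sample):
            print(m.group(), end=" ")
        print()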

This example shows you how to use regexes to search a text. We will discuss all these functions soon. For now, let’s stick to the results:

Using regex '[-0-9a-f]':
a e e f d 1 e e c a e e a d e c a d -
Using regex '[0-9a-f-]':
a e e f d 1 e e c a e e a d e c a d -
Using regex '[0-9\-a-f]':
a e e f d 1 e e c a e e a d e c a d -

As we can see from the output, the three regexes [-0-9a-f], [0-9a-f-], and [0-9\-a-f] are equivalent: each one represents the set of all lowercase hexadecimal digits, plus the hyphen -, that is, the set { ‘0’, …, ‘9’, ‘a’, …, ‘f’, ‘-’ }.

Negated Character Classes

Sometimes we want to define a set of characters including all characters from the alphabet, except for a few exceptions. For example, the contents of an XML element can be any Unicode character except for &, <, and >. In the previous section we learned the l-u shorthand notation, but even using this shorthand, defining such a set would be impractical.

Fortunately, by providing negated character classes, Python comes to the rescue. Negated character classes look just like normal set definitions, except the first character after the left bracket is ^; e.g. [^<>&] stands for the set of all characters over the alphabet, except for &, <, and >.

Let’s look at another example. The expression [^^] is the set of all characters in the alphabet, except for ^. Only the first ^ after the left bracket negates the set; the second one is taken literally.
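
A minimal sketch of negated character classes in action. The markup sample is made up, and the + operator used in the first regex (one or more repetitions) is described later in this tutorial.

```python
import re

# Runs of characters that are none of <, >, and &.
text_run = re.compile(r'[^<>&]+')
pieces = text_run.findall('<b>bold & brave</b>')
print(pieces)

# [^^] matches every character except the caret itself.
no_caret = re.findall(r'[^^]', 'a^b')
print(no_caret)
```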

Predefined Character Classes in Python

Python defines a shorthand notation for a bunch of recurring character classes, which are reported in the following table along with the equivalent set definition and a description of their meaning:

Shorthand  Regex            Meaning
\w         [a-zA-Z0-9_]     All alphanumeric characters, plus _. An alphanumeric character is any Unicode character c for which c.isalnum() evaluates to True; e.g. every character in the string étranger is an alphanumeric character, even é.
\W         [^a-zA-Z0-9_]    Neither an alphanumeric character, nor _. See the previous table entry for more information.
\d         [0-9]            Decimal digits.
\D         [^0-9]           All characters other than decimal digits.
\s         [ \t\n\r\f\v]    White spaces.
\S         [^ \t\n\r\f\v]   All characters but white spaces.

As we can see from the table, uppercase notations stand for negated character classes. All these predefined classes can be used inside a set definition, e.g. [0-9a-fA-F] and [\da-fA-F] both represent the set of all hexadecimal digits.
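
The sketch below illustrates the predefined classes. It relies on the re.ASCII flag, which this tutorial does not cover: with that flag, \w is restricted to [a-zA-Z0-9_], while by default it also accepts non-ASCII alphanumeric characters such as é (written here as the escape \u00e9).

```python
import re

sample = '\u00e9tranger_42!'

# Default (Unicode) matching: the accented letter is a word character.
unicode_word = re.findall(r'\w', sample)
# ASCII-only matching: \w behaves exactly like [a-zA-Z0-9_].
ascii_word = re.findall(r'\w', sample, re.ASCII)
print(len(unicode_word), len(ascii_word))

# A predefined class used inside a set definition.
hex_found = re.findall(r'[\da-fA-F]', 'x1Fg')
print(hex_found)
```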

The Set of All Characters

What if we want to represent any character in the alphabet? It would be tedious to list all characters in the set, even with the [] syntax. Luckily, Python represents the set of all characters with . (a single dot). This symbol actually means the whole alphabet, except for newline. If we want to include newline, we must add re.DOTALL to the re.compile() flags, as we will see in the next section.
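
A quick sketch of how the dot treats newline, with and without re.DOTALL. The two-line sample is made up, and the + operator (one or more repetitions) is described later in this tutorial.

```python
import re

sample = 'one\ntwo'

# By default the dot stops at the newline, so we get one match per line.
default_dot = re.findall(r'.+', sample)
# With re.DOTALL the dot also matches newline: a single match.
dotall = re.findall(r'.+', sample, re.DOTALL)
print(default_dot, dotall)
```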


Repetition

Sometimes we want to describe a pattern that repeats more than once in a row, e.g. abbbbba, where b repeats 5 times. Python represents repetition with the {l,u} operator, where l is the minimum number of repetitions (lower bound), while u is the maximum number of repetitions (upper bound); e.g. ab{2,4}a means that b can be repeated 2, 3, or 4 times in a row, i.e., the regex stands for the set { ‘abba’, ‘abbba’, ‘abbbba’ }.

If l is omitted, it defaults to 0. Notice that this is our first encounter with the empty string, since matching zero instances of a character really means including the empty string in the set of words; e.g. a{0,3} represents the set { ‘’, ‘a’, ‘aa’, ‘aaa’ }, where '' is the empty string. On the other hand, if u is omitted, the regex will try to match as many occurrences of the expression as it can; e.g. b{2,} means that there must be at least two consecutive occurrences of b, like in bb, but there may be even more, like in bbbbbb (which has 6 occurrences of b).

When l and u are the same, you can simply write {l}, meaning that there are exactly l repetitions of that expression; e.g. the regexes abbbbba, ab{5}a, and ab{5,5}a are all equivalent, and stand for the set { ‘abbbbba’ }.
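
The bounds of the repetition operator can be explored by testing a few candidate words. This is a minimal sketch: the word lists are made up, and fullmatch() (not covered in this tutorial) accepts a word only if it matches the regex from start to end.

```python
import re

words = ('aa', 'aba', 'abba', 'abbba', 'abbbba', 'abbbbba')

bounded = [w for w in words if re.fullmatch(r'ab{2,4}a', w)]   # 2 to 4 b's
at_least = [w for w in words if re.fullmatch(r'ab{2,}a', w)]   # 2 or more b's
exact = [w for w in words if re.fullmatch(r'ab{5}a', w)]       # exactly 5 b's

# Omitted lower bound: the empty string belongs to the set.
up_to = [w for w in ('', 'a', 'aaa', 'aaaa') if re.fullmatch(r'a{,3}', w)]

print(bounded, at_least, exact, up_to, sep='\n')
```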

Repetition's Shorthand Notations

Python provides some shorthand notations for common types of repetition. You should use these notations anywhere you can, since they make regexes easier to understand.

The ? symbol is equivalent to {0,1}. It means the expression is optional; e.g. docx? means that we can omit the x character at the left of ?, so this regex stands for the set { ‘doc’, ‘docx’ }. This operator (and all other operators listed here) is postfix. That means it applies to the regex at its left. For example, a?b stands for the set { ‘b’, ‘ab’ }, and not for the set { ‘a’, ‘ab’ }, since ? applies to a and not to b.

The * symbol is equivalent to {0,}. It means that the expression is optional, but there may also be one or more instances of it; e.g. [a-zA-Z]\w* is a common definition for an identifier in many programming languages, i.e. an alphabetic character, followed by zero or more alphanumeric characters or _, like in test_func, CONST_PI, and namedtuple. Recall we defined \w earlier.

The + symbol is equivalent to {1,}. It means that there is one or more occurrences of that expression; e.g. [A-Z][a-z]+ is the set of all words starting with a capital letter, followed by at least one lowercase letter, such as John, Mark, and Linda.
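
The three shorthand notations can be tried out on a few words. This is a sketch with made-up samples; fullmatch() (not covered in this tutorial) accepts a word only if the whole word matches.

```python
import re

# ? makes the preceding character optional.
doc_names = [w for w in ('doc', 'docx', 'docxx') if re.fullmatch(r'docx?', w)]

# * allows zero or more word characters after the first letter.
identifier = re.compile(r'[a-zA-Z]\w*')
names = ('test_func', 'CONST_PI', 'namedtuple', '2fast', 'x')
valid = [n for n in names if identifier.fullmatch(n)]

# + requires at least one lowercase letter after the capital.
capitalized = [w for w in ('John', 'J', 'mark', 'Linda')
               if re.fullmatch(r'[A-Z][a-z]+', w)]

print(doc_names, valid, capitalized, sep='\n')
```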

Operators' Precedence

In the previous sections we applied all operators in a precise order, but we never spelled out those rules explicitly. Here we list all operators in order of precedence (highest to lowest): repetition, concatenation, alternation. For example, a|bc* is equivalent to (a|(b(c*))), that is:

  1. First, we apply repetition to c, since repetition has the highest precedence. c* stands for the set { ‘’, ‘c’, ‘cc’, ‘ccc’, … }.
  2. Then, we concatenate b to the previous set, and we obtain the new set { ‘b’, ‘bc’, ‘bcc’, … }.
  3. Finally, we apply the alternation operator to a and to the previous set, so that the overall regex stands for { ‘a’, ‘b’, ‘bc’, ‘bcc’, … }.
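
The steps above can be checked by comparing the regex with its fully parenthesized form, and with a different grouping. This is a sketch with made-up samples; fullmatch() (not covered in this tutorial) accepts a sample only if the whole sample matches.

```python
import re

samples = ('a', 'b', 'bcc', 'acc', 'abc')

implicit = [s for s in samples if re.fullmatch(r'a|bc*', s)]
explicit = [s for s in samples if re.fullmatch(r'(a|(b(c*)))', s)]
# Moving the parentheses changes the set: now c* follows both a and b.
regrouped = [s for s in samples if re.fullmatch(r'(a|b)c*', s)]

print(implicit, explicit, regrouped, sep='\n')
```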

Let’s take a look at a bunch of examples, just to make sure that we fully understand. Try to figure these examples out on your own before checking our solutions.

  1. a*ba*ba* stands for the set of all words over the alphabet { ‘a’, ‘b’ } which contain exactly two occurrences of b, with possible occurrences of a at the beginning, at the end, or in the middle of the word; e.g. bb, abba, abaaaaab, and so on.
  2. [ab]*(aa|bb)[ab]* stands for the set of all words over { ‘a’, ‘b’ } containing at least two consecutive as or two consecutive bs, e.g. aa, abba, babaa, and so on.
  3. 1?(01)*0? is the set of all words over { 0, 1 } which have alternating 0s and 1s; e.g. 0101, 10, and so on.
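
If you want to test your answers, the three regexes can be checked against a few words. This is a sketch: the belongs() helper and the counter-examples are hypothetical, chosen only for this illustration.

```python
import re

two_bs = re.compile(r'a*ba*ba*')
doubled = re.compile(r'[ab]*(aa|bb)[ab]*')
alternating = re.compile(r'1?(01)*0?')

def belongs(regex, word):
    """ True if the whole word is in the set described by regex. """
    return regex.fullmatch(word) is not None

print([w for w in ('bb', 'abba', 'abaaaaab', 'aba', 'bbb') if belongs(two_bs, w)])
print([w for w in ('aa', 'abba', 'babaa', 'abab') if belongs(doubled, w)])
print([w for w in ('0101', '10', '1010', '0110') if belongs(alternating, w)])
```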

As in mathematical expressions, Python allows you to use parentheses to delimit the scope of an operator; e.g. [\w-]+\.(svg|xml|docx) can represent a set of file names with one of the following extensions: svg, xml, or docx. Parentheses make it clear that the alternation operator applies only to the file’s extension. We will revisit this topic in another tutorial, since parentheses assume a special meaning in regexes, related to grouping.
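
A sketch of what the parentheses buy us in the file name regex. The file names are made up, and fullmatch() (not covered in this tutorial) accepts a name only if the whole name matches.

```python
import re

file_name = re.compile(r'[\w-]+\.(svg|xml|docx)')
names = ('logo.svg', 'my-report.docx', 'notes.txt', 'data.xml')
accepted = [n for n in names if file_name.fullmatch(n)]
print(accepted)

# Without parentheses, alternation has the lowest precedence, so the regex
# means "a name ending in .svg", or the word "xml", or the word "docx".
loose = re.compile(r'[\w-]+\.svg|xml|docx')
print(bool(loose.fullmatch('xml')), bool(loose.fullmatch('data.xml')))
```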

After this whirling tour of symbols, it’s time to take a break and recap the regex operators we’ve learned so far:


Notation     Operation      Meaning
ab           Concatenation  { 'ab' }
a|b          Alternation    { 'a', 'b' }
[ab]         Alternation    { 'a', 'b' }
[a-d]        Alternation    { 'a', 'b', 'c', 'd' }
[^ab]        Alternation    All characters in the alphabet, except for a and b
\w, \d, \s   Alternation    Predefined character classes, see above
\W, \D, \S   Alternation    Predefined negated character classes, see above
.            Alternation    All characters in the alphabet, except for newline (unless re.DOTALL is set)
a{1,3}       Repetition     { 'a', 'aa', 'aaa' }
a{,3}        Repetition     { '', 'a', 'aa', 'aaa' }
a{2,}        Repetition     { 'aa', 'aaa', 'aaaa', ... }
a{2}         Repetition     { 'aa' }. Same as a{2,2}
a?           Repetition     { '', 'a' }
a+           Repetition     { 'a', 'aa', 'aaa', ... }
a*           Repetition     { '', 'a', 'aa', 'aaa', ... }

Matching a String to a Regex

Up to this point, we have only concerned ourselves with learning the basic syntax of regular expressions, and what a regex means, i.e. the set of words it describes. Now it’s time to put all this into practice with some Python Regex examples. The most common use for regex is to find which parts of a text sample match the set of words described by a regular expression.

In Python 3, regexes are implemented by the re module. The re.compile(regex) function compiles regex. That means it generates bytecode for the underlying regex engine and returns an object that can be used to search our text samples. In the following sections we’ll describe some of the methods of a compiled regex. For each method of the compiled regex, there is a module-level function with the same name and signature, except that it requires the regex to be passed as its first parameter. For example, we can either call r = re.compile(r'\d+') and then r.match('123') to search the sample, or re.match(r'\d+', '123'), where the regex is passed directly to re.match() as the first parameter. According to the official documentation, there isn’t a noteworthy performance penalty in using module-level functions instead of compiled regexes, since the former use a cache to avoid compiling frequently used regexes over and over again. We’ll mainly use compiled regexes in our examples.
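
The two calling styles side by side, as a minimal sketch. The sample string is made up, and group() (which returns the matched text) is described in the following sections.

```python
import re

# Style 1: compile once, then call the method of the compiled regex.
r = re.compile(r'\d+')
compiled_result = r.match('123abc').group()

# Style 2: pass the regex to the module-level function on each call.
module_result = re.match(r'\d+', '123abc').group()

print(compiled_result, module_result)
```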

Using match() to Search from the Start of the Sample

Once we have compiled the regex, we can use its match() method to apply the regex from the first character of the sample. It returns a match object if the search is successful, None otherwise. Let’s suppose we want to check if our sample starts with a YYYY-MM-DD date:

import re

def test_check_date():
    """ Match a YYYY-MM-DD date. """
    samples = ( '99-12-03', '   2019-05-09', '2019-05-22  ' )
    regex = re.compile(r'\d{4}-\d{2}-\d{2}')
    print('Using %s' % regex.pattern)
    for sample in samples:
        match = regex.match(sample)
        # match is None when the sample doesn't start with a date.
        print('%-20r: %s' % (sample,
            match.group() if match else 'No match'))

The test_check_date() function shows a common code pattern for handling regexes that you’ll see over and over again throughout this tutorial:

  1. We use a samples variable to hold all our tests.
  2. We compile a regex object outside the test loop. In this case \d is equivalent to [0-9], i.e. the set of all decimal digits, and the {4} and {2} repetition operators state that there must be, respectively, 4 (e.g. 2019) and 2 (e.g. 06) decimal digits one after another. The pattern attribute of the compiled regex holds the first argument of re.compile().
  3. We iterate over the samples, and use the precompiled regex to search them.
  4. We use match to hold the search results. Since this object may be None, we test it with an if statement before trying to fetch results from it.
  5. Finally, we retrieve the matched string using match.group(), and we print it.

Let’s see the results:

Using \d{4}-\d{2}-\d{2}
'99-12-03'          : No match
'   2019-05-09'     : No match
'2019-05-22  '      : 2019-05-22

The regex doesn’t match the first sample, since the year field in that date only has two digits (99). The second sample includes a valid date, but there are some leading spaces, so our test once again fails. Finally, the third sample matches 2019-05-22 from its first character, so the test succeeds.

Using search() to Search Anywhere in the Sample

The second string from the previous example was a proper date, and yet the test was not successful. Python provides the search() method to start matching the regex anywhere in the sample:

import re

def test_search_date():
    """ Searching a YYYY-MM-DD date anywhere in a sample. """
    samples = (
        '1999-09-28: "Showbiz" by Muse was released in the US.',
        '"Jazz" by Queen was released on 1978-10-10 in the US.',
        'Today is 2019-05-22')

    print('%-6s %-6s %s' % ('start', 'end', 'match'))
    regex = re.compile(r'\d{4}-\d{2}-\d{2}')
    for sample in samples:
        match = regex.search(sample)
        print('%-6d %-6d %s' \
            % (match.start(), match.end(), match.group()))

This example looks like the previous one, except we used search() instead of match(). All three samples contain a valid date, but the date is stored in different places in the strings. We use the start() and end() methods to retrieve, respectively, the start and end positions of the match. The match object also provides a span() method that returns the start and end position of the match as a tuple. The search() method finds the date in every sample, as we can see from the output:

start  end    match
0      10     1999-09-28
32     42     1978-10-10
9      19     2019-05-22

If you edited the Python script to print the span() tuple, your output might look like this:

import re

def test_search_date():
    """ Searching a YYYY-MM-DD date anywhere in a sample. """
    samples = (
        '1999-09-28: "Showbiz" by Muse was released in the US.',
        '"Jazz" by Queen was released on 1978-10-10 in the US.',
        'Today is 2019-05-22')

    print('%-8s %s' % ('span', 'match',))
    regex = re.compile(r'\d{4}-\d{2}-\d{2}')
    for sample in samples:
        match = regex.search(sample)
        print('%-8s %s' \
            % (match.span(), match.group()))

span     match
(0, 10)  1999-09-28
(32, 42) 1978-10-10
(9, 19)  2019-05-22

Multiple Matches within a Sample

There may be multiple matches for a regex within the same text sample. Looking at the example below, the regex \d+, which represents a sequence of decimal digits, has two matches in the sample (12 + a)*2, namely 12 and 2, but search() would only return the first occurrence of the pattern, i.e. 12. To retrieve all matches from a sample using the search() function, we can try something like this:

import re

def test_search_all_matches():
    """ Use `search()` to find all matches in a pattern. """
    sample = '((12 + a)*2) + 3 * b'
    regex = re.compile(r'\d+')

    match = regex.search(sample)
    while match:
        print(match.group(), end='  ')
        # Drop everything up to the end of the last match.
        sample = sample[match.end() : ]
        match = regex.search(sample)

First, we try to find the first occurrence of a decimal number in sample. If there is one, we enter the while loop, where we retrieve the match with match.group(). Then, we remove all characters from the start of the sample to the end of the last match (marked by match.end()). Finally, we search again for the pattern in the shortened sample. The loop will exit as soon as the search() method fails to provide another match (i.e. when it returns None). The output of our test_search_all_matches() Python regex script is:

12  2  3 

This works perfectly fine, but there’s a better way for small samples. To handle cases where we want to find all the matches in a small string, Python provides a findall() method. This method returns a list containing all the matches of the regex. The following example will search all decimal integer numbers within our sample:

import re

def test_find_decimal():
    """ Finding all decimal integer numbers. """
    for word in re.findall(r'\d+', 
        "[ 'a', 12, 34.1, 'b', { 'x':2, 'y':3 }]"):
        print(word, end="  ")

Our sample looks like a Python list filled with different types of data (strings, integers, floats, and dictionaries). The findall() method retrieves the following numbers:

12  34  1  2  3

Notice the float literal 34.1 has been split into 34 and 1. This is because our regular expression recognizes sequences of decimal digits, not floating point literals. We can use \d+(?:\.\d*)? to prevent the float from being split into two parts (note the escaped dot, which matches only a literal decimal point), but we’ll come back to this topic in another tutorial, after discussing grouping.
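
As a preview, here is a sketch of the float-aware regex applied to the same sample. The decimal point is written as \. so that it matches only a literal dot, and the (?:...) syntax is a non-capturing group, which belongs to the grouping topic deferred to another tutorial.

```python
import re

sample = "[ 'a', 12, 34.1, 'b', { 'x':2, 'y':3 }]"

# One or more digits, optionally followed by a dot and more digits.
numbers = re.findall(r'\d+(?:\.\d*)?', sample)
print(numbers)
```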

When there are lots of matches, it’s not advisable to store all of them in memory, like findall() does when it builds the result list. Python provides a finditer() method, which returns an iterator that yields one match object at a time. This way, you can process each match on the fly, without storing intermediate results. Suppose you want to report all integer constants within a Python code snippet:

import re
import io

def test_python_numeric_const():
    """ Finding all integer constants within Python code. """
    pycode = io.StringIO("""
ADAM7_MAP = [
   [ 1, 6, 4, 6, 2, 6, 4, 6 ],
   [ 7, 7, 7, 7, 7, 7, 7, 7 ],
   [ 5, 6, 5, 6, 5, 6, 5, 6 ],
   [ 7, 7, 7, 7, 7, 7, 7, 7 ],
   [ 3, 6, 4, 6, 3, 6, 4, 6 ],
   [ 7, 7, 7, 7, 7, 7, 7, 7 ],
   [ 5, 6, 5, 6, 5, 6, 5, 6 ],
   [ 7, 7, 7, 7, 7, 7, 7, 7 ] ]

def adam7(r, c):
    return ADAM7_MAP[r % 7, c % 7]

print(adam7(123, 13))
""")

    regex = re.compile(r'\d+')
    n, j = 0, 0    # n counts the lines, j counts the matches
    for line in pycode:
        n += 1
        for m in regex.finditer(line):
            j += 1
            print('%-6s%-6d%-6d%-6d' \
                % (m.group(), n, m.start(), m.end()),
                end=" | " if j % 2 else '\n')

Here we use an io.StringIO object to hold the Python sample. This object provides Python strings with a file-like interface; in particular, we can iterate over the lines in the sample using the for loop, just like we would do with a file.

In case you were wondering, the Python sample code that we will feed to the regex is the Adam-7 algorithm for interlacing PNG images. Given a pixel’s coordinates, the adam7() function returns the number of the pass that will transmit that pixel to the PNG datastream. In particular, adam7(123, 13) evaluates to 4, meaning the pixel (123, 13) will be transmitted during the fourth pass. It doesn’t matter if none of this makes sense to you: we just needed a big matrix of integers to test the regex (the more integers, the merrier).

The main loop of our test function scans each line of the Python source using the finditer() method. For each match, it prints the integer constant, the line number, the start column of the match (returned by the start() method of the match object), and its end column (returned by the end() method of the match object). We’ll get two matches per line of output, separated by a |. This is the complete output (for those of you who fancy long sequences of numbers):

Int    Line   Start  End     | Int    Line   Start  End
7      2      4      5       | 1      3      5      6
6      3      8      9       | 4      3      11     12
6      3      14     15      | 2      3      17     18
6      3      20     21      | 4      3      23     24
6      3      26     27      | 7      4      5      6
7      4      8      9       | 7      4      11     12
7      4      14     15      | 7      4      17     18
7      4      20     21      | 7      4      23     24
7      4      26     27      | 5      5      5      6
6      5      8      9       | 5      5      11     12
6      5      14     15      | 5      5      17     18
6      5      20     21      | 5      5      23     24
6      5      26     27      | 7      6      5      6
7      6      8      9       | 7      6      11     12
7      6      14     15      | 7      6      17     18
7      6      20     21      | 7      6      23     24
7      6      26     27      | 3      7      5      6
6      7      8      9       | 4      7      11     12
6      7      14     15      | 3      7      17     18
6      7      20     21      | 4      7      23     24
6      7      26     27      | 7      8      5      6
7      8      8      9       | 7      8      11     12
7      8      14     15      | 7      8      17     18
7      8      20     21      | 7      8      23     24
7      8      26     27      | 5      9      5      6
6      9      8      9       | 5      9      11     12
6      9      14     15      | 5      9      17     18
6      9      20     21      | 5      9      23     24
6      9      26     27      | 7      10     5      6
7      10     8      9       | 7      10     11     12
7      10     14     15      | 7      10     17     18
7      10     20     21      | 7      10     23     24
7      10     26     27      | 7      12     8      9
7      13     15     16      | 7      13     25     26
7      13     32     33      | 7      15     10     11
123    15     12     15      | 13     15     17     19

Case-Insensitive Search

Sometimes we want to match a regex to a sample ignoring the difference between lowercase and uppercase characters. Python provides the re.IGNORECASE flag for the re.compile() function, which enables case-insensitive matching:

import re

def test_caseinsensitive_match():
    """ Case-insensitive match. """
    tests = { 'USA', 'U.S.A.', 'usa', 'U.s.a.' }
    regex = re.compile(r'usa|u\.s\.a\.', re.IGNORECASE)
    for t in tests:
        print('regex matches %s: %s' \
            % (t, regex.match(t).group()))

The usa|u\.s\.a\. regex stands for the set { ‘usa’, ‘u.s.a.’ }. But, since we compiled it with the re.IGNORECASE flag, all mixed-case versions of these two words will also be included into the set, e.g. Usa, U.S.A., and so on. As we can see from the output, regex matches all test strings, regardless of the case:

regex matches usa: usa
regex matches U.s.a.: U.s.a.
regex matches USA: USA
regex matches U.S.A.: U.S.A.

Beware that you need to compile the regex with the re.UNICODE flag if you want a Unicode-aware comparison, otherwise all non-ASCII characters will fail to work properly. Don’t worry, though. In Python 3, the re.UNICODE flag is enabled by default:

import re

def test_unicode_match():
    """ Unicode-aware case-insensitive match. """
    tests = { 'coast', 'COAST', 'Côte', 'CÔTE' }
    print("Without UNICODE flag:")
    regex = re.compile(r'coast|c\u00f4te', re.IGNORECASE)
    for t in tests:
        match = regex.match(t)
        print('regex matches %-6s: %s' % (t,
            match.group() if match else 'no match'))
    print("\nWith UNICODE flag:")
    regex = re.compile(r'coast|c\u00f4te',
        re.IGNORECASE | re.UNICODE)
    for t in tests:
        print('regex matches %-6s: %s' \
            % (t, regex.match(t).group()))

The coast|c\u00f4te regex stands for the set { ‘coast’, ‘côte’ }. The second word (which means coast in French) contains ô (U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX), which is outside the ASCII range. Our samples include both the lowercase and the uppercase version of this letter. We try to match all our samples twice, with and without the re.UNICODE flag. As we can see from the output, since we executed the script on a Python 3 interpreter, even the version without the re.UNICODE flag matches all samples, because that flag is enabled by default:

Without UNICODE flag:
regex matches coast : coast
regex matches CÔTE  : CÔTE
regex matches COAST : COAST
regex matches Côte  : Côte

With UNICODE flag:
regex matches coast : coast
regex matches CÔTE  : CÔTE
regex matches COAST : COAST
regex matches Côte  : Côte
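
In Python 3, the two behaviours only differ if Unicode matching is switched off explicitly with the re.ASCII flag. The following is a minimal sketch of that comparison, not part of the original example; the helper name matches_ignoring_case is hypothetical.

```python
import re

def matches_ignoring_case(pattern, text, ascii_only=False):
    """Case-insensitive match, optionally restricted to ASCII case folding."""
    flags = re.IGNORECASE | (re.ASCII if ascii_only else 0)
    return re.compile(pattern, flags).match(text) is not None

for sample in ('COAST', 'CÔTE'):
    print('%-6s default: %-5s  re.ASCII: %s' % (
        sample,
        matches_ignoring_case('coast|c\u00f4te', sample),
        matches_ignoring_case('coast|c\u00f4te', sample, ascii_only=True)))
```

With re.ASCII, the letters ô and Ô are no longer treated as case variants of each other, so only the pure ASCII sample still matches.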

Matching a Byte Sequence to a Regex

The previous sections only look at text samples, i.e. sequences of code points from the Unicode character set. This approach doesn’t account for all those files that are mere sequences of bytes, like audio files and images. Fortunately, the re module can search byte sequences, too! This section will spell out the differences between str and bytes searching. First, let’s see an example.

Listing All PNG Chunks' Names

PNG (Portable Network Graphics) is a widespread image format with a permissive license (we all like freebies!), maintained by the W3C. We don’t need to know this format in depth. For our purposes, it suffices to say that:

  • A PNG file has a signature, followed by a sequence of chunks.
  • Each chunk includes (in this exact order) the size of the data field (4 bytes), the chunk’s name (4 bytes), the data (which spans size bytes), and the CRC code (4 bytes).

That’s a lot to take in, but the following diagram will help you understand how our sample PNG file looks. The PNG signature has a blue background, the size field has an orange background, the name has a yellow background, and the CRC code has a green background. The numbers on the left side of the diagram are the hexadecimal offset from the start of the file of each chunk. You should compare them with the output of our example, which we’ll get to momentarily.

Internal Structure of the favicon PNG Image

The test_list_chunks() function opens a PNG image (the current favicon of the website, which is included in the source file link at the top of this tutorial), it detects the file signature, and then it reads all chunks, starting from the first, and using the offset to jump to the next chunk. Regexes will help to validate each chunk’s name.

import re

def test_list_chunks():
    """ Listing all chunk names from a PNG file. """
    def chunk_size(b):
        """ Converts the size from bytes to int. """
        return (b[0]<<24) | (b[1]<<16) | (b[2]<<8) | b[3]

    regex = re.compile(rb'[a-zA-Z]{4}')
    n = 0
    with open('favicon.png', 'rb') as img:
        # Skip signature
        signature = img.read(8)
        chunk = img.read(8)
        # For each chunk:
        n = 8
        while len(chunk) == 8:
            # Read the chunk's header
            name = regex.match(chunk[4:]).group()
            size = chunk_size(chunk)
            # Print the chunk's name and offset
            print('%-8s%x' % (name, n))
            # Jump to the next chunk (skip the data and the 4-byte CRC)
            img.seek(size + 4, 1)
            n += size + 12
            chunk = img.read(8)


The chunk_size() function converts a sequence of 4 bytes (representing an unsigned integer in big-endian byte order) into a Python integer. Chunk names are 4-letter words from the ASCII alphabet (both uppercase and lowercase), e.g. IDAT, IEND, and tEXt are all valid chunk names. For each chunk, we read 8 bytes. The first 4 bytes are the size of the data section of the chunk, and the other 4 bytes are the name. The offset to the next chunk is size + 12 bytes, since the size, name, and CRC fields are all 4 bytes long. Here is the list of chunks in favicon.png:

b'IHDR'   8
b'IDAT'   21
b'IEND'   17fe

You can check the accuracy of the output by reading the PNG file with a hex editor. Bigger images may have several IDAT chunks, which store the PNG compressed datastream. Notice we compiled the regular expression using a raw binary string, namely rb’[a-zA-Z]{4}’, which tells the Python regex engine to use all bytes from 0x00 to 0xFF as the alphabet, instead of Unicode characters. Every time we use a bytes object to store the regex, the sample that we pass to regex.match() must be a bytes object, too.
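
If you don’t have favicon.png at hand, the same walk over the chunks can be tried on a byte string built in memory. The sketch below is an illustration only: make_chunk() and list_chunks() are hypothetical helpers, and the sample is a structurally correct chunk sequence (signature, IHDR, IEND) with dummy IHDR data, not a displayable image.

```python
import re
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
CHUNK_NAME = re.compile(rb'[a-zA-Z]{4}')

def make_chunk(name, data=b''):
    """Builds one chunk: size, name, data, and CRC (computed over name + data)."""
    crc = zlib.crc32(name + data) & 0xFFFFFFFF
    return struct.pack('>I', len(data)) + name + data + struct.pack('>I', crc)

def list_chunks(buf):
    """Returns (name, offset) pairs for every chunk in a PNG byte string."""
    chunks = []
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(buf):
        # The size field is a 4-byte big-endian unsigned integer
        size = int.from_bytes(buf[offset:offset + 4], 'big')
        # Validate the 4-byte name in place, without slicing the buffer
        match = CHUNK_NAME.fullmatch(buf, offset + 4, offset + 8)
        if match is None:
            raise ValueError('invalid chunk name at offset %#x' % offset)
        chunks.append((match.group(), offset))
        # Skip size (4), name (4), data (size), and CRC (4)
        offset += size + 12
    return chunks

sample = PNG_SIGNATURE + make_chunk(b'IHDR', bytes(13)) + make_chunk(b'IEND')
for name, offset in list_chunks(sample):
    print('%-10r%x' % (name, offset))
```

Here int.from_bytes() plays the role of chunk_size(), and the pos and endpos arguments of fullmatch() restrict the match to the name field.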

Differences Between Searching Strings and Byte Sequences

The first noteworthy difference is that we use binary raw strings for our regexes, instead of raw strings. Notice the rb in the regex rb’[a-z]’, which is used to match a single ASCII lowercase letter.

Another major difference is that everything we said about Unicode characters doesn’t apply here, since the words described by a binary regex are really sequences of bytes, not sequences of characters. So \w means exactly [0-9A-Za-z_], i.e. a set of 63 bytes (10 digits, 26 uppercase letters, 26 lowercase letters, and the underscore), while the string version includes many other characters, like GREEK CAPITAL LETTER ALPHA, and so on.
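
A quick way to see the difference is to run the same \w+ search on a string and on its UTF-8 encoding. This is a small sketch with a made-up sample text:

```python
import re

text = 'Αθήνα_1 abc'

# str pattern: Greek letters count as word characters
print(re.findall(r'\w+', text))

# bytes pattern: the UTF-8 bytes of the Greek letters are all >= 0x80,
# so only the ASCII letters, digits, and underscore are matched
print(re.findall(rb'\w+', text.encode('utf-8')))

# Count the byte values that \w accepts in a binary regex
word_bytes = [b for b in range(256) if re.fullmatch(rb'\w', bytes([b]))]
print(len(word_bytes))
```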

Unicode escape sequences are forbidden in binary regexes, and Python will raise an exception if you try to include such characters in a binary regex. On the other hand, binary regexes do offer two other escape sequences: \xhh, where h is a hexadecimal digit (i.e. [0-9A-Fa-f]), and \ooo, where o is an octal digit (i.e. [0-7]). You can type all ASCII graphic characters directly in the regex and use one of these two escape sequences for all other characters. Beware that all escape sequences must be in the range 0x00-0xFF. That’s always true for hexadecimal sequences, since two hexadecimal digits can represent exactly all numbers in that range, but octal sequences must not exceed \377 (decimal 255), otherwise Python will raise an exception while trying to compile the regex.
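
These rules are easy to check by trying to compile a few patterns. The compiles() helper below is hypothetical and only reports whether re.compile() accepts a pattern:

```python
import re

def compiles(pattern):
    """Returns True if re.compile() accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(rb'\xf4'))    # hexadecimal escape in a binary regex
print(compiles(rb'\377'))    # largest valid octal escape (decimal 255)
print(compiles(rb'\400'))    # octal escape out of range
print(compiles(rb'\u00f4'))  # Unicode escape in a binary regex
print(compiles(r'\u00f4'))   # the same escape in a string regex

# A hexadecimal escape matches the raw byte, whatever it encodes
print(re.match(rb'\xf4', 'ô'.encode('latin-1')) is not None)
```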

If you haven’t realized it by now, using regex to navigate a binary file is neat, but it’s also a royal pain.

Caveats of Using Regular Expressions

Here we collect some features of regular expressions that you should probably be aware of. These features are fairly advanced, so you can skip them if you’d like and instead focus only on previous sections of this tutorial.

Documenting Your Regexes

When we deal with really long regular expressions, it would be nice to break them down into smaller pieces and add some comments. Suppose you want to match a URI (Uniform Resource Identifier) using the regex described by RFC 3986:2005. A URI includes the following parts, ordered from left to right: scheme, authority, path, query, and fragment. All parts are optional, except for path. Split into one string literal per part, each followed by an ordinary Python comment, the regex would be:

import re

def test_uri():
    """ Splitting a URI into its components. """
    tests = (
        # sample URIs go here
    )
    regex = re.compile(
        r'^(([^:/?#]+):)?'  # scheme
        r'(//([^/?#]*))?'   # authority
        r'([^?#]*)'         # path
        r'(\?([^#]*))?'     # query
        r'(#(.*))?')        # fragment
    for t in tests:
        print('regex matches %r: %s' \
            % (t, True if regex.match(t) else False))

Parentheses have a precise function in regexes, but for the time being let’s just pretend their sole purpose is to delimit the scope of operators. As a visual aid, we split the regex’s string into many logical lines, adding a comment before the line breaks. The parts of a URI are delimited as follows:

  1. The scheme (optional) part starts at the beginning of the URI and ends with the first occurrence of one of the following characters: :, /, ?, or #; e.g. https.
  2. The authority (optional) part ends with the first occurrence of one of the following characters: /, ?, or #.
  3. The path is a series of locations, separated by a / character. It goes on until the first occurrence of ? or #, or until the end of the URI.
  4. The query (optional) starts with a ?, and goes on until the first #, or until the end of the URI; e.g. q=PDF.
  5. The fragment (optional) starts with a #, and goes on until the end of the URI; e.g. appendix-A.

As you can see from the output, all tests are successful:

regex matches '': True
regex matches '': True
regex matches '': True
regex matches '': True
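
To see which part of a URI each pair of parentheses captures, we can peek ahead at groups. The sketch below is an illustration, not part of the original example: split_uri() is a hypothetical helper, the group numbers (2, 4, 5, 7, 9) follow Appendix B of RFC 3986, and the sample URI is the one used in that appendix.

```python
import re

URI = re.compile(
    r'^(([^:/?#]+):)?'  # scheme
    r'(//([^/?#]*))?'   # authority
    r'([^?#]*)'         # path
    r'(\?([^#]*))?'     # query
    r'(#(.*))?')        # fragment

def split_uri(uri):
    """Splits a URI into its five components (None if a part is absent)."""
    match = URI.match(uri)
    return {
        'scheme': match.group(2),
        'authority': match.group(4),
        'path': match.group(5),
        'query': match.group(7),
        'fragment': match.group(9),
    }

for part, value in split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related').items():
    print('%-10s%r' % (part, value))
```

Since every part except the path is optional and the path may be empty, the regex matches any string, so split_uri() never has to deal with a failed match.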

Python allows you to add comments within a regex by enclosing them in (?#comm), where comm is the comment. We can rework the regex from the previous example as follows:

import re

regex = re.compile(
    r'^(([^:/?#]+):)?(?#scheme)'
    r'(//([^/?#]*))?(?#authority)'
    r'([^?#]*)(?#path)'
    r'(\?([^#]*))?(?#query)'
    r'(#(.*))?(?#fragment)')

This alternative makes the regex a little bit easier to grasp, but there is an even better option. Python provides a re.VERBOSE flag for the re.compile() function. All whitespace within the regex is ignored, except when in a set definition. Python also ignores all characters from a # which is neither preceded by a backslash (\#) nor included in a set definition ([#]), up to the end of the line. So, the following definition of regex is the same as the two previous ones:

import re 

regex = re.compile(r'''
    ^(([^:/?#]+):)?    # scheme
    (//([^/?#]*))?     # authority
    ([^?#]*)           # path
    (\?([^#]*))?       # query 
    (\#(.*))?          # fragment
    ''', re.VERBOSE)

This is the cleanest of the alternatives. As the complexity of regexes increases, we will often use these kinds of comments in our examples. Be aware that in this version of the regex, we use a multiline raw string (enclosed within triple single quotes). Moreover, in the fragment part of the URI, we escaped the # character outside a set definition by preceding it with a backslash, since the pound sign would otherwise be interpreted as the start of an inline comment. In that case the Python regex engine would have raised an exception stating that the regex is not complete, because the inline comment would have stripped the closing parenthesis from the regex.
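
Both claims can be checked with a short experiment: the verbose regex captures the same groups as the one-line version, and leaving the # unescaped makes compilation fail. This is only a sketch; compiles_in_verbose_mode() is a hypothetical helper and the sample URI comes from Appendix B of RFC 3986.

```python
import re

COMPACT = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

VERBOSE = re.compile(r'''
    ^(([^:/?#]+):)?    # scheme
    (//([^/?#]*))?     # authority
    ([^?#]*)           # path
    (\?([^#]*))?       # query
    (\#(.*))?          # fragment
    ''', re.VERBOSE)

def compiles_in_verbose_mode(pattern):
    """Returns True if the pattern compiles with the re.VERBOSE flag."""
    try:
        re.compile(pattern, re.VERBOSE)
        return True
    except re.error as err:
        print('Compilation failed:', err)
        return False

uri = 'http://www.ics.uci.edu/pub/ietf/uri/#Related'
print(COMPACT.match(uri).groups() == VERBOSE.match(uri).groups())

# The unescaped # starts a comment that swallows the closing parenthesis
print(compiles_in_verbose_mode(r'''
    (#(.*))?   # fragment
    '''))
```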

Now that you know all alternative ways of commenting your regexes in place, choose the one that best suits your needs. As a rule of thumb:

  • Short regexes should be self-explanatory, so there’s no need to comment them;
  • The (?#…) comment style may be better for medium-length regexes;
  • You can use the commenting version with logical lines if you don’t want to use compilation flags;
  • In all other cases, use the verbose version.

Using Unicode Characters' Names

Python 3.8 introduces yet another way of representing Unicode characters. Instead of referring to them by code point, you can use \N{name}, where name is the Unicode name for the character; e.g. \N{GREEK CAPITAL LETTER OMEGA} is easier to grasp than \u03A9, though they both stand for the same character. You can use Unicode Code Charts to map all Unicode code points to their names (and the other way around), and the Chart Index to find which chart includes a certain character. This representation surely requires some additional typing, but it makes your regex more readable. On the other hand, it becomes somewhat inconvenient for long sequences of Unicode characters.
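
The standard unicodedata module offers the same mapping between code points and names from within Python, which is handy when the code charts are not at hand. The sketch below is an addition to the tutorial; it also builds a regex from character names without relying on \N{name} support in the regex engine.

```python
import re
import unicodedata

# From character to name, and from name to character
print(unicodedata.name('\u03a9'))
print(unicodedata.lookup('GREEK SMALL LETTER BETA'))

# Build the prefix from Unicode names, then append \w*
names = ('GREEK SMALL LETTER BETA',
         'GREEK SMALL LETTER IOTA',
         'GREEK SMALL LETTER OMICRON')
prefix = ''.join(unicodedata.lookup(name) for name in names)
regex = re.compile(re.escape(prefix) + r'\w*')
print(regex.findall('βιολογία, βιογραφία, βιος'))
```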

So, let’s recap all ways of inserting Unicode characters in a regex:

import re

def test_representing_unicode_chars():
    """ Equivalent representations of Unicode characters. """
    sample = "There are several Greek words starting with " \
        "'βιο', for example 'βιολογία' (biology) and " \
        "'βιογραφία' (biography). " \
        "They all stem from the word 'βιος' (life)."
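    regexes = (
        # Using Unicode characters directly
        r'βιο\w*',
        # Using \u escape sequence (since Python 3.3)
        r'\u03B2\u03B9\u03BF\w*',
        # Using \N escape sequence (since Python 3.8)
        r'\N{GREEK SMALL LETTER BETA}\N{GREEK SMALL LETTER IOTA}'
        r'\N{GREEK SMALL LETTER OMICRON}\w*',
    )
    for r in regexes:
        try:
            regex = re.compile(r)
            print('Searching with %r:' % r)
            for match in regex.finditer(sample):
                # Print each matching word (output format assumed)
                print('  %s' % match.group())
        except re.error:
            print('Error while compiling %r.' % r)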

In this example, we want to define a regex which represents the set of all words containing βιο. So, the first three characters of the regex will be exactly βιο, which we represent in three ways:

  1. Using Unicode characters directly: βιο.
  2. Using Unicode escape sequences: \u03B2\u03B9\u03BF.
  3. Using Unicode names: \N{GREEK SMALL LETTER BETA}….

The trailing part of the regex is \w*, which stands for the set of all words (including the empty string) which contain only decimal digits, Unicode alphabetic characters (which include all letters from the Greek script), and _, as we have already described above. Since some of these ways of representing Unicode characters may not be available in your version of the Python interpreter, the code has been protected by an except clause, printing the regex which failed to compile. Each regex should match all words with a βιο prefix in the sample text:

Searching with 'βιο\\w*':
Searching with '\\u03B2\\u03B9\\u03BF\\w*':

Because we tested this script with Python 3.7, the last test raised an exception: Unicode names require Python 3.8. Beware that, at the time of writing, Python 3.8 is still under development, and it will not be released until September 2019.

Nonsense Repetitions

Be careful when you write regexes, since even a misplaced space can result in nonsense and unexpected results:

import re

def test_nonsense_repetitions():
    """ Nonsense uses of the repetition operator. """
    # Omitting both parameters, but without comma
    print(re.match('a{}'    , 'a{}').group()     == 'a{}')
    # Non-numeric parameters for the repetition operator
    print(re.match('a{x,y}' , 'a{x,y}').group()  == 'a{x,y}')
    # Space within the repetition operator
    print(re.match('a{1, 2}', 'a{1, 2}').group() == 'a{1, 2}')
    # Spaces are not allowed even in VERBOSE mode
    print(re.match('a{1, 2}', 'a{1,2}',
        re.VERBOSE).group() == 'a{1,2}')
    # Lower bound is greater than the upper bound
    try:
        re.match('a{4,2}', 'aa').group()
    except re.error as err:
        print(err)

All the regexes above, except for the last one, will execute without issues, but yielding unexpected results:

min repeat greater than max repeat at position 2

Let’s investigate what went wrong:

  1. In the first case, we wanted to write the equivalent of *, but we omitted the comma. Both {,} and {0,} are equivalent to *, i.e. a{,} stands for the infinite set { ‘’, ‘a’, ‘aa’, … }, but a{} stands for the set { ‘a{}’ }.
  2. In the second case, we used two non-numeric parameters for the repetition, namely, x and y. Once again, this regex stands for a one-word set, namely, { ‘a{x,y}’ }.
  3. In the third case, we put an extra space between the two parameters of the regex, which seems harmless enough, but the resulting regex represents { ‘a{1, 2}’ } instead of { ‘a’, ‘aa’ }, as we meant. We’ll only notice our mistake when the regex doesn’t match the pattern. Putting unwanted spaces in a regex is actually a common error, which applies to other operators as well; e.g. r | g | b doesn’t stand for the set { ‘r’, ‘g’, ‘b’ }, but for the set { ‘r ‘, ‘ g ‘, ‘ b’ }.
  4. You might expect VERBOSE mode to ignore the extra space and restore the repetition operator. The space is indeed dropped, but the braces are still read as plain characters, so in VERBOSE mode the a{1, 2} regex stands for neither { ‘a’, ‘aa’ } nor { ‘a{1, 2}’ }, but rather for { ‘a{1,2}’ }!
  5. In the last case, the lower bound of the repetition is greater than the upper bound. Python checks the boundaries of a repetition and raises an exception, which we catch with the except clause.
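
One way to catch these mistakes early is to list which candidate words a regex accepts in full. The sketch below uses a hypothetical accepted() helper built on re.fullmatch():

```python
import re

def accepted(pattern, candidates, flags=0):
    """Returns the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c, flags)]

repeats = ['a', 'aa', 'a{1, 2}', 'a{1,2}']
print(accepted('a{1,2}', repeats))               # proper repetition
print(accepted('a{1, 2}', repeats))              # stray space: literal braces
print(accepted('a{1, 2}', repeats, re.VERBOSE))  # space dropped, braces still literal

colours = ['r', 'g', 'b', 'r ', ' g ', ' b']
print(accepted('r|g|b', colours))
print(accepted('r | g | b', colours))              # spaces belong to the words
print(accepted('r | g | b', colours, re.VERBOSE))  # spaces ignored
```

Unlike the repetition operator, alternation does recover in VERBOSE mode, because the spaces around | are simply discarded.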

Raw Strings and Predefined Character Classes

Now it’s time for us to come clean and admit we were not completely straightforward about the use of the backslash character in regexes. You are free to put it before any ASCII graphic character that is not alphanumeric (i.e. which is not a letter or a decimal digit), in which case the regex engine simply skips it; e.g. \#[\da-fA-F]{6} and #[\da-fA-F]{6} both represent an HTML color, like #E6E6FA (Lavender). The \ preceding # in the first regex is ignored. A backslash followed by a decimal digit (e.g. \1) is reserved for regex grouping.

Remember that Python uses the backslash for escape sequences in string literals. If Python doesn’t recognize the escape sequence, the backslash and subsequent character are included in the resulting string; e.g. ‘a\tb’ creates a 3-character str object (a, horizontal tab, and b), while ‘a\mb’ results in a 4-character string (a, backslash, m, and b), since \m is not a valid escape sequence. However, if you want to prevent Python from replacing any escape sequence, you must place the raw prefix (r) before the string; e.g. r’a\tb’ is a 4-character string (a, backslash, t, and b), regardless of the fact that \t is a valid escape sequence. Since we use many escape sequences and predefined character classes in regexes, some of which overlap with string escape sequences, it is strongly encouraged to use raw strings (or binary raw strings) for regular expressions.

Starting with Python 3.6, any escape sequence consisting of a \ followed by an ASCII letter which is not a predefined character class raises an exception. This makes forgetting about the r prefix for raw strings even more problematic, e.g. ‘\m’ before Python 3.6 was a valid regex, representing { ‘m’ }, but now r'\m' will raise an exception, since \m is not a valid escape sequence.
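
The overlap between string escapes and regex escapes is easiest to see with \b, which is a backspace in a string literal but a word boundary in a regex. The following sketch illustrates the points above; is_valid_regex() is a hypothetical helper.

```python
import re

# The same characters, with and without the raw prefix
print(len('a\tb'), len(r'a\tb'))

# Without the r prefix, \b becomes a backspace before the regex engine sees it
print(re.search('\bcat\b', 'a cat'))
print(re.search(r'\bcat\b', 'a cat').group())

def is_valid_regex(pattern):
    """Returns True if the pattern compiles, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \d is a predefined character class, \m is an unknown escape
print(is_valid_regex(r'\d'), is_valid_regex(r'\m'))
```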

Closing Thoughts

In this tutorial we introduced Python regular expressions and presented several regex examples. First, we learned how to represent all Unicode characters, since they are the basic components of all regular expressions. Then, we described three operators on regexes: concatenation, alternation, and repetition. Syntax diagrams presented below will help freshen your memory. Finally, after much talking about regexes, we got our hands dirty and we used them to search text samples. We can search byte sequences with the same functions and methods, but we must be aware there are small differences between the two approaches. As you dig deeper into this topic, regexes will become more and more sophisticated and harder to read, so it is very important to document them properly as you write them.

Regular Expressions' Syntax So Far

In this section we use syntax diagrams to sum up what we have learned so far about the grammar of regular expressions. A syntax diagram is a way of representing a grammar rule by means of a graph. All symbols with a yellow background are terminal symbols, i.e. characters that are part of a regular expression. All symbols with an orange background are non-terminal symbols, i.e. names of grammar rules. You should read syntax diagrams from left to right, starting from the rule’s name, and following one of the arrows. When you get to a non-terminal symbol, you must read the diagram for that rule. Depending on your browser, if you click on a non-terminal symbol, you’ll jump straight to the rule which it represents. This feature of syntax diagrams makes it easier for you to browse through the grammar.

Syntax diagrams are grouped into sections. Each section gives a brief explanation of the rules, and provides links to parts of the tutorial where those rules were first defined. It also describes those non-terminal symbols which have no syntax diagram.

Regular Expression's Definition

The regex rule is the starting symbol of our grammar, i.e. the basic rule that we use to create all regular expressions. All regexes are created from single characters or from other regexes by applying one of the three basic operators: concatenation, alternation, or repetition.

Definition of Regular Expression

Term of a Regex

Characters' Representation

This section collects all rules about characters’ representation. By simple-char we mean an ASCII graphic character which is not a metacharacter. As we have seen above, we can type it directly in a regex. Python provides escape sequences for all common non-graphic ASCII characters, as you can see from the escaped-char rule. The esc-metachar rule teaches you how to escape metacharacters, i.e. characters that represent the operators of regex. To cut a long story short, take a look at the following table, where non-graphic characters are in green background (those characters for which Python provides an escape sequence are written in red), and metacharacters are in light blue background; anything else is a simple character:

Summary of Simple Characters

Remember you can insert Unicode characters either directly or by using hexadecimal or named escape sequences, which only require ASCII characters. A unicode-name is the name of a Unicode Character, as specified by the Unicode standard. A Unicode name can only include Latin uppercase letters from A to Z, decimal digits, spaces, and hyphens. See the Name section from chapter 4 of the standard for further information.

Notice that both DOT and SPACE are terminal symbols, standing for, respectively, . and a single space character. We use their names instead of their glyphs just to improve readability of syntax diagrams.

Characters' Representation

Escape Sequences for Non-Graphic Characters

Escape Sequences for Bytes

Escaped Metacharacters for Bytes

Unicode Characters

Octal Digit

Decimal Digit

Hexadecimal Digit

Basic Operators of Regular Expressions

This section describes the three basic operators for regular expressions. The concatenation rule simply puts one term after the other, so that all terms are joined from left to right, as we have seen above.

The alternation rule specifies three ways of representing alternation: using the | operator, using a predefined character class, or defining our own classes either by listing all characters that belong to them or just the ones they lack.

Finally, the repetition rule puts none, one, or more instances of a term one after another. We can either specify the minimum and maximum number of repetitions, or use the ?, +, and * shorthand notations for the most common cases.
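
As a quick refresher, the sketch below checks that each shorthand accepts the same words as its explicit form, and that the different notations for alternation describe the same set. The language() helper and the word lists are made up for this illustration.

```python
import re

def language(pattern, candidates):
    """Returns the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Repetition: shorthand notation versus explicit bounds
words = ['', 'a', 'aa', 'aaa', 'b']
for short, explicit in (('a?', 'a{0,1}'), ('a+', 'a{1,}'), ('a*', 'a{0,}')):
    same = language(short, words) == language(explicit, words)
    print('%-3s %-7s %-5s %s' % (short, explicit, same, language(short, words)))

# Alternation: the | operator, a set, a range, and a complemented set
digits = list('0123456789')
for pattern in ('0|1|2', '[012]', '[0-2]', '[^3-9]'):
    print('%-7s %s' % (pattern, language(pattern, digits)))
```

The complemented set [^3-9] only agrees with the others because the candidates are restricted to digits; on arbitrary text it would also accept letters and punctuation.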

The Concatenation Operator

The Alternation Operator

Character Classes

Set Definition

The Repetition Operator

List of All Examples in This Tutorial