Introduction to VBA RegEx
A regular expression, or RegEx, is a sequence of characters that defines a search pattern which is used for finding a specific pattern of characters in a string. This tutorial will show you a VBA RegEx pattern example and will demonstrate why using regular expressions in VBA is so powerful.
Regular expressions are a very large topic and many books have been written about it. It’s also a topic that many people find intimidating due to the rather cryptic and compact syntax and because the syntax varies considerably from language to language.
The VBA RegEx syntax differs quite a lot from other RegEx syntaxes, such as those used in Java, Perl or Python. This is primarily because the VBA RegEx object comes with optional parameters that you must sometimes specify to obtain the same results as you would by using standard RegEx syntax in other languages. I know. This explanation even sounds weird.
For these reasons, we’ll focus on VBA RegEx in this article. There are a few sections where we’ll assume you already have a basic understanding of regular expressions, but don’t worry if you’re new to RegEx! If you’re completely new to the topic and you find it slightly intimidating or difficult, you should still be able to grasp the basic concepts after reading this, so stick with us!
We’ll start by presenting the essential Regex syntax and then we’ll show you how to instantiate the RegEx object in VBA. Next, we’ll describe the parameters of the VBA RegEx object and present a general VBA RegEx function which returns substring matches from the input text passed to the function. We’ll also show you a fairly advanced example which incorporates many of the RegEx features we’ve previously presented and then we’ll finish off with a brief discussion on application ideas and when not to use VBA RegEx.
VBA RegEx syntax essentials
A RegEx search pattern is a sequence of characters consisting of one or more of the following components:
- Literal characters.
- Metacharacters.
Literal characters simply match their counterparts in the input text. For instance, the literal character sequence abc in a search pattern will simply match all occurrences of abc in the input text. Metacharacters, on the other hand, are far more abstract. We’ll cover the most important ones next.
Common metacharacters
A metacharacter is a character, or combination of characters, with a special meaning in a RegEx pattern. Unlike literal characters, metacharacters can match several different characters in the input text or they can even represent something other than a character. I know some of these descriptions may be a bit confusing, but I encourage you to take a moment to familiarize yourself with the following metacharacters:
Metacharacter | Description |
---|---|
. | "Wildcard." The unescaped period matches any character, except a new line. |
^ | "Beginning of a string or line" or "negation." The significance of the caret in search patterns is context-dependent.
1) If the VBA MultiLine parameter is set to True, the caret means "beginning of line," rather than "beginning of string."
2) If the caret is placed within a character class [^], it acts as a negation, i.e. "match all characters NOT matched by the other parameters of the character class." |
$ | "End of a string." The significance of the dollar sign in search patterns is context-dependent. If the VBA MultiLine parameter is set to True, the dollar sign means "end of line," rather than "end of string." |
\ | "Escape." The backslash in front of a metacharacter turns it into a literal character. |
\b | "Word boundary" or "backspace character." Outside character classes, \b matches a position before or after a word within the text source. Within character classes, \b denotes the backspace character. |
\B | "Not a word boundary." |
\d | "Digit." Matches any digit from 0-9. |
\D | "Not digit." Matches any character that's not a digit. |
\s | "Whitespace." Matches a space, newline or tab character. |
\S | "Not whitespace." Matches a character that's not a space, newline or tab. |
The characters
Operators
Some metacharacters change how one or more of the other components in the search pattern is interpreted, i.e. they perform operations on these other components. VBA RegEx comes with three types of operators. The last one in the table is actualy a group of metacharacter expressions:
Operator | Meta-character | Description | Example |
---|---|---|---|
Boolean "or" | | | The vertical bar denotes the boolean "or" operator. | a|b matches either "a" or "b". |
Grouping | () | Parentheses are used for several purposes: 1) to define the scope and precedence of operators.2) to group characters and remember text. | h(a|e)y matches either "hay" or "hey". |
Quantification | ? | Zero or one occurrences of the preceding element. | colou?r matches both "color" and "colour". |
* | Zero or more occurrences of the preceding element. | ab*c matches "ac", "abc", "abbc", "abbbc", and so on. | |
+ | One or more occurrences of the preceding element. | ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". | |
{n} | The preceding item is matched exactly n times. | a{3} matches "aaa". | |
{min,} | The preceding item is matched min or more times. | a{1,} matches "a", "aa", "aaa" and so on. | |
{min, max} | The preceding item is matched at least min times, but not more than max times. | a{1,3} matches "a", "aa" and "aaa", but not "aaaa". |
Character classes
Character classes or character sets are specified with square brackets [ ]. Some of the most common ones are:
The set of lower-case letters ranging from a to z.
The set of upper-case letters ranging from A to Z.
The set of single digits ranging from 0 to 9.
Character classes are frequently used in conjunction with operators in the search pattern. For instance,
After this rather lengthy intro on the VBA RegEx syntax essentials, let’s now turn to the VBA RegEx setup!
VBA RegEx function
Start by opening the Visual Basic Editor and then go to Tools > References… and set a reference to the Microsoft VBScript Regular Expressions 5.5 Object Library.
VBA RegEx Reference
Next, paste the code below into a standard code module:
Sub CallRegEx()
'Must add reference (Tools > References) to the
' "Microsoft VBScript Regular Expressions 5.5" Object Library
Dim r As Match
Dim mcolResults As MatchCollection
Dim strInput As String, strPattern As String
strInput = "The email address,'Mary-Jo.T.Williamson_01+test@test-site.com', is contained within this string"
strPattern = "[a-z0-9-.+_]+@[a-z-]+\.[a-z]+"
Set mcolResults = RegEx(strInput, strPattern, , , True)
'print the returned results to the immediate window
If Not mcolResults Is Nothing Then
For Each r In mcolResults
'***********************************
'Insert your code here
'***********************************
Debug.Print r ' remove in production
Next r
End If
End Sub
Function RegEx(strInput As String, strPattern As String, _
Optional GlobalSearch As Boolean, Optional MultiLine As Boolean, _
Optional IgnoreCase As Boolean) As MatchCollection
Dim mcolResults As MatchCollection
Dim objRegEx As New RegExp
If strPattern <> vbNullString Then
With objRegEx
.Global = GlobalSearch
.MultiLine = MultiLine
.IgnoreCase = IgnoreCase
.Pattern = strPattern
End With
If objRegEx.Test(strInput) Then
Set mcolResults = objRegEx.Execute(strInput)
Set RegEx = mcolResults
End If
End If
End Function
Make powerful macros with our free VBA Developer Kit It’s easy to copy and paste a macro like this, but it’s harder make one on your own. To help you make macros like this, we built a free VBA Developer Kit and wrote the Big Book of Excel VBA Macros full of hundreds of pre-built macros to help you master file I/O, arrays, strings and more - grab your free copy below.
Run the
Inside the
The first two parameters of our RegEx function are (1) the input string and (2) the search pattern variables we just discussed. These parameters are mandatory, and it should be self-evident why that is. Conversely, the next three parameters are optional:
Optional Parameters | Type | Description |
---|---|---|
Global | Boolean | If set to True, all matches will be returned ("greedy matching"). If set to False, only the first match will be returned ("lazy matching"). The default value is False. |
IgnoreCase | Boolean | If set to True, the pattern matching will be case-insensitive. The default value is False (case-sensitive). |
MultiLine | Boolean | If set to True, it changes the interpretation of the ^ ("beginning of string") and $ ("end of string") meta-characters in the search pattern so that they match the beginning and end of a line instead. The default value is False. |
In the CallRegEx
example, we’ve set the CallRegEx
procedure would have been considerably longer. As a matter of fact, this is what we would have had to type if we didn’t set the
That’s pretty ugly, isn’t it? Just in case you’re new to RegEx and this search pattern is complete gobbledygook to you, let’s break it down in more detail!
The bracketed terms in the pattern above denote character sets and the plus signs at the end of them are quantifiers specifying how many characters we’re looking for. Finally, the “at” character (@) in the middle denotes a literal search for this character (In other words, we’re looking for an @ symbol). The dot (.) between the last two pair of square brackets is also searching for a literal dot, but this character has been escaped by a backslash, since the dot is a reserved symbol in RegEx syntax.
Reading from left to right, we could translate the search pattern example to English like this:
Find one or more occurrences of either a lower-case or upper-case letter ranging from a to z, a number ranging from 0 to 9, a hyphen (-), dot (.), plus (+) or an underscore sign (_)
AND
an “at” sign (@)
AND
one or more occurrences of either a lower-case or upper-case letter ranging from a to z, or a hyphen (-)
AND
a dot (.)
AND
one or more occurrences of either a lower-case or upper-case letter ranging from a to z
If (and only if) all the conditions are met and all the matching characters in the input string are contiguous and in the right order, the RegEx function will return a match. This means whitespace characters within search patterns radically alter the meaning of them. If you include a whitespace character in your search pattern by mistake the RegEx engine will look for a whitespace amid the other characters you’ve specified!
Application Ideas for VBA RegEx
In this tutorial, we’ve focused on the “find” functionality of regular expressions, but it should be mentioned that they’re often used for “find and replace “ too. The latter is what the
- Validation of user input . For instance, you could use a RegEx function for validating passwords or email addresses.
- Finding and deleting duplicate words . Most text processing software, such as MS Word, can locate duplicate word occurrences for you, but if it’s too time-consuming to remove the duplicates manually, RegEx may be the answer.
- Localization. In case you need to translate a text into another language, the date and number formats of the two languages might differ from one another. For instance, if the input text contains thousands of numbers, it can be well worthwhile to write a RegEx algorithm, rather than editing manually.
Even though there are many legitimate RegEx uses, there are also instances where you probably shouldn’t use this approach:
-
When other functions do the job. VBA RegEx is extremely powerful because it allows you to extract exactly the information you want from an input text, but it’s overkill if the native Excel functions and operators such as Like, Find or Replace can do the job.
-
Think twice before using RegEx to parse html. Sure, you could get lucky and get the results you want. Just realize that going this route means you could be asking for a world of hurt where you’ll keep getting weird errors that you won’t be able to fix properly. If that happens, the only stable solution to the problem is to refactor your code without RegEx. Take my word for it, I’ve been there!
That’s all for this tutorial. When you’re ready to take your VBA to the next level, subscribe using the form below.