Regular Expressions in Python

Regular expressions let us find content inside strings matching a particular format.

By formulating a regular expression with a special syntax, you can

  • search text in a string
  • replace substrings in a string
  • extract information from a string

The re Python standard library module gives us a set of tools to work with regular expressions.
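
As a rough sketch of the three tasks listed above, the snippet below searches, replaces, and extracts with the re module. It uses re.search(), re.sub() and re.findall(), which are standard re functions; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
text = 'Order 66 shipped on 2024-05-01'

# Search: find the first run of digits anywhere in the text
found = re.search(r'\d+', text)
print(found.group())  # 66

# Replace: mask every digit with '#'
masked = re.sub(r'\d', '#', text)
print(masked)  # Order ## shipped on ####-##-##

# Extract: collect every run of digits as a list of strings
numbers = re.findall(r'\d+', text)
print(numbers)  # ['66', '2024', '05', '01']
```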

Among others, it offers the following functions:

  • re.match() checks for a match at the beginning of the string
  • re.search() checks for a match anywhere in the string

Both take 3 parameters: the pattern, the string to search in, and optional flags.
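
To make the difference concrete, here is a minimal sketch that runs the same pattern through re.match() and re.search(). The variable names are only illustrative.

```python
import re

text = 'My dog name is Roger'

# re.match() only succeeds if the pattern matches at position 0
at_start = re.match(r'dog', text)
print(at_start)  # None

# re.search() scans the string and returns the first match it finds
anywhere = re.search(r'dog', text)
print(anywhere.span())  # (3, 6)
```

Both functions return None when nothing matches, so it is worth checking the result before calling methods on it.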

Before talking about how to use them, let’s introduce the basics of a regular expression pattern.

The pattern is usually written as a raw string, with the r'' prefix, so that backslashes reach the regular expression engine unchanged. Inside it, we can use special combinations of characters to describe the values we want to match.

For example:

  • . matches any single character (except the newline character)
  • \w matches any alphanumeric character or an underscore ([a-zA-Z0-9_] for ASCII text)
  • \W matches any character that \w does not match
  • \d matches any digit
  • \D matches anything that’s not a digit
  • \s matches whitespace
  • \S matches anything that’s not whitespace
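
A quick way to get a feel for these classes is to run them over a short string. The sketch below uses re.findall(), a standard re function that returns every match as a list; the sample string is invented for the example.

```python
import re

# Hypothetical sample string
sample = 'Room 42, floor_3!'

digits = re.findall(r'\d', sample)
print(digits)  # ['4', '2', '3']

non_word = re.findall(r'\W', sample)
print(non_word)  # [' ', ',', ' ', '!']

spaces = re.findall(r'\s', sample)
print(spaces)  # [' ', ' ']

# . matches any single character, so the first match is the first character
first_char = re.search(r'.', sample).group()
print(first_char)  # R
```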

Square brackets define a set of characters, any one of which can match: [\d\sa] matches a digit, a whitespace character, or the character a. [a-z] matches any single character from a to z.
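
The two sets mentioned above can be tried out like this. It is a small sketch with made-up input strings.

```python
import re

# [a-z] matches one lowercase letter at a time
lowercase = re.findall(r'[a-z]', 'aB3c')
print(lowercase)  # ['a', 'c']

# [\d\sa] matches a digit, a whitespace character, or the letter a
mixed = re.findall(r'[\d\sa]', 'a1 b2')
print(mixed)  # ['a', '1', ' ', '2']
```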

\ is used to escape special characters. For example, to match a literal dot, use \. in your pattern.

| means “or”: a|b matches either a or b.
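
Escaping and alternation are easy to mix up at first, so here is a short sketch of both. The file names and words are invented for the example.

```python
import re

# An unescaped dot matches any character, so 'fileXtxt' matches too
loose = re.search(r'file.txt', 'fileXtxt')
print(loose is not None)  # True

# An escaped dot only matches a literal '.'
strict = re.search(r'file\.txt', 'fileXtxt')
print(strict is not None)  # False

# | matches either the left or the right alternative
animals = re.findall(r'cat|dog', 'a cat and a dog')
print(animals)  # ['cat', 'dog']
```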

Then we have anchors:

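  • ^ matches the beginning of the string (with the re.M flag, also the beginning of each line)
  • $ matches the end of the string, or just before a trailing newline (with the re.M flag, also the end of each line)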
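
The sketch below shows how the anchors behave on a two-line string, with and without the re.MULTILINE flag (re.M for short). The sample text and variable names are made up for illustration.

```python
import re

text = 'first line\nsecond line'

# By default, ^ only matches at the very start of the string
start_default = re.findall(r'^\w+', text)
print(start_default)  # ['first']

# With re.MULTILINE, ^ also matches right after each newline
start_multi = re.findall(r'^\w+', text, re.MULTILINE)
print(start_multi)  # ['first', 'second']

# By default, $ only matches at the end of the string
end_default = re.findall(r'\w+$', text)
print(end_default)  # ['line']

# With re.MULTILINE, $ also matches right before each newline
end_multi = re.findall(r'\w+$', text, re.MULTILINE)
print(end_multi)  # ['line', 'line']
```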

Then we have quantity modifiers:

  • ? means “zero or one” occurrences
  • * means “zero or more” occurrences
  • + means “one or more” occurrences
  • {n} means “exactly n” occurrences
  • {n,} means “at least n” occurrences
  • {n,m} means “at least n and at most m” occurrences (written without a space after the comma)
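
Here is a small sketch that exercises each quantity modifier with re.findall(). The input strings are invented for the example.

```python
import re

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)  # ['color', 'colour']

# * allows zero or more b's, + requires at least one
zero_or_more = re.findall(r'ab*', 'a ab abbb')
print(zero_or_more)  # ['a', 'ab', 'abbb']
one_or_more = re.findall(r'ab+', 'a ab abbb')
print(one_or_more)  # ['ab', 'abbb']

# {n}, {n,} and {n,m} count repetitions of the preceding item
exactly_four = re.findall(r'\d{4}', '12 1234 123456')
print(exactly_four)  # ['1234', '1234']
at_least_two = re.findall(r'\d{2,}', '1 12 12345')
print(at_least_two)  # ['12', '12345']
two_to_three = re.findall(r'\d{2,3}', '1 12 12345')
print(two_to_three)  # ['12', '123', '45']
```

Note that quantifiers are greedy by default: they take as many characters as they can, which is why \d{2,3} picks '123' before '45'.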

Parentheses, (<expression>), create a group. Groups are interesting because we can capture the content of a group.

These two examples match the whole string:

import re

re.match(r'^.*Roger', 'My dog name is Roger')
re.match(r'.*', 'My dog name is Roger')

Printing the result of either call shows a match object like this:

<re.Match object; span=(0, 20), match='My dog name is Roger'>

If you assign the result to a result variable and call group() on it, you will see the matched text:

result = re.match(r'^.*Roger', 'My dog name is Roger')
print(result.group())  # My dog name is Roger

Let’s try to get the name of the dog. If you don’t know what the name is going to be, you can look for “name is ” and then add a group, like this:

result = re.search(r'name is (.*)', 'My dog name is Roger')

result.group() will print “name is Roger”, and result.group(1) will print the content of the group, “Roger”:

print(result.group())   # name is Roger
print(result.group(1))  # Roger
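
A pattern can contain more than one group. As a sketch of how that looks, the example below captures both the animal and its name; the pattern and the pair variable are illustrative choices, while group() and groups() are standard match object methods.

```python
import re

pair = re.search(r'My (\w+) name is (\w+)', 'My dog name is Roger')

# group(0) is the whole match, group(1) and group(2) are the groups in order
print(pair.group(0))  # My dog name is Roger
print(pair.group(1))  # dog
print(pair.group(2))  # Roger

# groups() returns all captured groups as a tuple
print(pair.groups())  # ('dog', 'Roger')
```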

I mentioned that re.search() and re.match() take flags as the 3rd parameter. Several flags are available; the most commonly used is re.I, which performs a case-insensitive match.
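
As a minimal sketch of the flags parameter, the same lowercase pattern is tried below with and without re.I. The sample text is made up for the example.

```python
import re

text = 'My dog name is ROGER'

# Without flags the match is case-sensitive, so nothing is found
sensitive = re.search(r'roger', text)
print(sensitive)  # None

# With re.I (short for re.IGNORECASE), letter case is ignored
insensitive = re.search(r'roger', text, re.I)
print(insensitive.group())  # ROGER
```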

This is just an introduction to regular expressions. Starting from here, there are a lot of rabbit holes you can go down.

I recommend trying your regular expressions in an online regex tester to check them for correctness. Make sure you choose the Python flavor if the tester offers one.