Python Regex

Compiling a pattern

import re
pattern = re.compile(r'ab*')
pattern = re.compile(r'ab*', re.IGNORECASE)

Some useful compilation flags from the re module:

IGNORECASE, I : case-insensitive matching.
DOTALL, S     : make . match any character, including newlines.
LOCALE, L     : locale-aware matching for \w, \b and case folding (in Python 3, only allowed with bytes patterns).
MULTILINE, M  : make ^ and $ match at the start and end of each line, not just of the whole string.
VERBOSE, X    : allow the regex to be written across multiple lines, with whitespace and comments, for clarity.
UNICODE, U    : make \w, \b, \s and \d follow the Unicode character database (already the default for str patterns in Python 3).

To combine flags, use the bitwise OR operator (|):

pattern = re.compile('ab*', re.I|re.U|re.X)
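A minimal sketch of combined flags in action. The hex_number pattern is a made-up example, not part of the notes above: VERBOSE allows the layout and comments, IGNORECASE lets the uppercase form match.

```python
import re

# Hypothetical pattern: a hexadecimal literal such as 0xff.
hex_number = re.compile(r"""
    0x          # literal prefix
    [0-9a-f]+   # one or more hex digits
""", re.IGNORECASE | re.VERBOSE)

# Flags behave like integers, so | merges them into a single value.
combined = re.I | re.X

# Thanks to IGNORECASE, the uppercase spelling matches too.
matched = hex_number.match("0XFF").group()
print(matched)
```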

Matching with compiled pattern

# match at the beginning of the string
match = pattern.match(subject)

# scan through the string for any matches
match = pattern.search(subject)

# find all substrings where re matches, return them as a list
match_list = pattern.findall(subject)

# find all substrings where re matches, return them as an iterator
match_iter = pattern.finditer(subject)

match() and search() return None if there is no match, otherwise a MatchObject (named re.Match in recent Python 3 versions).
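A short comparison of the four methods on one subject string. The subject value is an arbitrary example chosen for illustration.

```python
import re

pattern = re.compile(r"ab*")
subject = "xx abbb a ab"   # example input, chosen for illustration

# match() only tries at position 0, so it fails here.
at_start = pattern.match(subject)

# search() returns the first match found anywhere in the string.
first = pattern.search(subject)

# findall() returns the matched substrings as a list.
match_list = pattern.findall(subject)

# finditer() yields match objects, which also carry positions.
spans = [m.span() for m in pattern.finditer(subject)]

# Always guard against None before using a match object.
if first is not None:
    print(first.group())
```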

The re module also provides functions that mirror the compiled pattern's methods above, but take a regex string as their first parameter:

import re
match = re.match(re_string, subject)
match = re.search(re_string, subject)
match_list = re.findall(re_string, subject)

Using the MatchObject

Retrieving details about matched string

# whole string matched
match.group()

# named group match e.g. (?P<name>\w+)
match.group('name')

# starting position of the match
match.start()

# end position of the match
match.end()

# tuple containing start and end positions
match.span()
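The calls above, applied to one concrete match. The pattern and the subject string are invented for the example.

```python
import re

# Hypothetical pattern with two named groups.
match = re.search(r"(?P<name>\w+)@(?P<host>\w+)", "mail: bob@example now")

if match is not None:
    whole = match.group()          # whole matched text
    name = match.group("name")     # text captured by the group 'name'
    start = match.start()          # index of the first matched character
    end = match.end()              # index just past the last matched character
    span = match.span()            # (start, end) as a tuple
    print(whole, name, span)
```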

Search and Replace

new_string = re.sub(search, replace, old_string)
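A few common forms of re.sub, as a sketch. The date string and the patterns are example data only. The replacement can refer to captured groups, be limited with count, or be a function that receives each match object.

```python
import re

old_string = "2024-01-15 and 2023-12-31"   # example data

# Backreferences in the replacement reorder the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", old_string)

# count limits the number of replacements made.
once = re.sub(r"\d{4}", "YYYY", old_string, count=1)

# A function can compute each replacement from its match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(swapped)
print(once)
print(doubled)
```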

The backslash problem

backslash in regex

The backslash can be used by the regex engine in 2 special contexts:

  • to indicate a special form
    • match any digits: \d
    • match any alphanumeric (and underscore): \w
    • match any whitespace: \s
  • to escape special characters and allow them to be used without invoking their special meaning:
    • match an actual dot: \.
    • match an actual pipe: \|
    • match an actual parenthesis: \(
    • match an actual bracket: \[
    • match an actual backslash: \\

backslash in strings

The backslash also happens to be the escape character for string literals in Python.

# strings that can output:
# a new line
"\n"
# a backslash character 
"\\"
# the letter n.
"n"
# a string to output all 3 characters
"\n\\n"

These two uses of the backslash, by the regex engine and by Python's string literals, conflict.
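To see what Python does with the literals above before any regex engine is involved, compare what is typed with what the string actually contains. The variable names are only for illustration.

```python
newline = "\n"       # 1 character: a line feed
backslash = "\\"     # 1 character: a single backslash
letter_n = "n"       # 1 character: the letter n
all_three = "\n\\n"  # 3 characters: line feed, backslash, n

# The literal is 5 characters of source text but only 3 in the string.
print(len(all_three))
print(list(all_three) == [newline, backslash, letter_n])
```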

The solution

When writing a regular expression as a string literal, you must:

  1. write the pattern you need to match
  2. rewrite the pattern expression so that the regex engine can understand it
  3. write a string literal that can evaluate to that specific expression (i.e. backslashes might need to be escaped again).

Example: Let's say we want to match this series of characters: \section

  1. \section is our base pattern. In other words, it's the pattern we're searching for in some string: a backslash character \, followed by s, e, c, t, i, o and n.

  2. We need to rewrite this so that the regex engine understands it. To match a backslash in a regular expression, we must remove its special meaning for the regex engine; otherwise the engine pairs the backslash with the following character s and tries to match \s, which matches any whitespace character. To escape the leading backslash in \section, we put another backslash in front of it, and the pattern becomes \\section. Note that this is not yet a Python string; it is simply the pattern the regex engine needs in order to match the characters \section.

  3. Now we need to write the actual Python string literal that evaluates to that expression. Since the backslash also has a special meaning in Python string literals, we need to escape each instance of it. \\section contains 2 backslashes, so after escaping them for Python we get \\\\section, and that's what we compile.

pattern = re.compile('\\\\section')

So we wanted to match \section and we have to create the string as \\\\section. Cumbersome indeed!
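The three steps can be checked directly. In this sketch the subject string is an invented example. pattern.pattern shows what the regex engine actually receives, and re.escape builds the step 2 expression from the literal text.

```python
import re

subject = "see \\section{Intro}"   # example text containing one backslash

pattern = re.compile("\\\\section")

# What the regex engine receives: two backslashes followed by 'section'.
engine_view = pattern.pattern
print(engine_view)

# What gets matched in the subject: one backslash followed by 'section'.
found = pattern.search(subject).group()
print(found)

# re.escape performs step 2 for us, starting from the literal text.
escaped = re.escape("\\section")
```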

A simpler solution: raw strings

Python raw strings are strings in which backslashes don't have any special meaning, i.e. we only need to escape the backslash for the regex engine, not for the string literal. Put an r before the opening quote to turn a string into a raw string, e.g. r'this is so RAW!!!'.

Regular String  | Raw String
----------------------------
"ab*"           | r"ab*"
"\\\\section"   | r"\\section"
"\\w+\\s+\\1"   | r"\w+\s+\1"

Compiling raw strings

pattern = re.compile(r'\\section')

If your raw strings need to include quotes within, triple quote them.

r"""Use the door that says "EXIT"."""
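A quick check that each row of the table above holds the same string either way, plus the triple-quoted form. The subject text is an invented example.

```python
import re

# Each pair from the table evaluates to the same string object value.
same_1 = "ab*" == r"ab*"
same_2 = "\\\\section" == r"\\section"
same_3 = "\\w+\\s+\\1" == r"\w+\s+\1"

# A triple-quoted raw string can contain double quotes unescaped.
quoted = r"""Use the door that says "EXIT"."""

# The raw pattern matches a literal backslash followed by 'section'.
found = re.search(r"\\section", "see \\section{Intro}").group()
print(same_1, same_2, same_3)
```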

Grouping

  • conditional: (?(id/name)yes-pattern|no-pattern)
    • the group to test can be given by number or by name.
    • no-pattern is optional. e.g. if 'overdue' is matched, retrieve the rest of the record, else only match the name:
      r'(?P<status>overdue)?\s+(?P<first_name>\w+)\s+(?P<last_name>\w+)(?(status).*)'
  • non-capturing: (?:...)
  • positive lookahead: (?=...)
  • negative lookahead: (?!...)
  • positive lookbehind: (?<=...)
  • negative lookbehind: (?<!...)
  • comment: (?# this is ignored )
  • named capture (Python specific): (?P<mymatch>...)
    • accessible in the rest of the expression with the name mymatch
    • is also a regular numbered group as if it wasn't named
  • group recall:
    • numbered: \1 \2 ...
    • named, in a replacement string: \g<mymatch>
  • backref to a named group (Python specific): (?P=myname)
    • matches whatever text was matched by the earlier group named myname
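A sketch exercising several of the constructs listed above. The record pattern is the conditional example from the list. All subject strings are invented. Note that the record pattern requires whitespace before the first name even when 'overdue' is absent.

```python
import re

# Conditional: consume the rest of the record only if 'status' matched.
record = re.compile(
    r'(?P<status>overdue)?\s+(?P<first_name>\w+)\s+(?P<last_name>\w+)(?(status).*)'
)
with_status = record.match("overdue John Smith 42 days").group()
without_status = record.match(" John Smith 42 days").group()

# Named backreference: find a word that is immediately repeated.
repeated = re.compile(r'(?P<word>\w+)\s+(?P=word)')
dup = repeated.search("it was was fine")

# \g<word> recalls the named group inside a replacement string.
fixed = repeated.sub(r'\g<word>', "it was was fine")

# Lookahead and lookbehind assert context without consuming it.
keys = re.findall(r'\w+(?=:)', "key: value")
amounts = re.findall(r'(?<=\$)\d+', "cost $42 or 17")

print(with_status, "|", without_status, "|", fixed)
```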

Usage example from the Werkzeug routing library

rule = re.compile(r'''
    (?P<static>[^<]*)                           # static rule data
    <
    (?:
        (?P<converter>[a-zA-Z_][a-zA-Z0-9_]*)   # converter name
        (?:\((?P<args>.*?)\))?                  # converter arguments
        \:                                      # variable delimiter
    )?
    (?P<variable>[a-zA-Z][a-zA-Z0-9_]*)         # variable name
    >
''', re.VERBOSE)

m1 = rule.match('/article/<id>/orderby/<order>')
(m1.start(), m1.end())
# (0, 13)

m1.group()
# '/article/<id>'

m2 = rule.match('/article/<id>/orderby/<order>', 9)
(m2.start(), m2.end())
# (9, 13)

m2.group()
# '<id>'
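Passing a start position to match(), as in m2 above, makes it possible to walk through a whole rule one placeholder at a time. The helper below is a hypothetical sketch of that idea and not Werkzeug's actual code. The name iter_parts and the example routes are assumptions.

```python
import re

rule = re.compile(r'''
    (?P<static>[^<]*)                           # static rule data
    <
    (?:
        (?P<converter>[a-zA-Z_][a-zA-Z0-9_]*)   # converter name
        (?:\((?P<args>.*?)\))?                  # converter arguments
        \:                                      # variable delimiter
    )?
    (?P<variable>[a-zA-Z][a-zA-Z0-9_]*)         # variable name
    >
''', re.VERBOSE)


def iter_parts(route):
    """Yield (static, converter, variable) for each <...> placeholder.

    Hypothetical helper: any trailing text without a placeholder is
    yielded as (text, None, None).
    """
    pos = 0
    while pos < len(route):
        m = rule.match(route, pos)
        if m is None:
            # No further placeholder: the remainder is plain static text.
            yield route[pos:], None, None
            break
        yield m.group('static'), m.group('converter'), m.group('variable')
        pos = m.end()   # continue right after the matched placeholder


parts = list(iter_parts('/article/<int:id>/orderby/<order>'))
for part in parts:
    print(part)
```

The converter group is None when a placeholder has no converter prefix, so callers can substitute a default of their own.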

References

http://docs.python.org/howto/regex.html

http://docs.python.org/library/re.html