Regular Expressions - How they work and where they are used
Many programs and databases or data processing systems support regular expressions, also known as regex or regexp. But what do these regular expressions mean and how do they work? These questions will be answered here and you will learn how the expressions can be used in practice.
1. A brief introduction to regular expressions
1.1 What are regular expressions?
Simply put, a regular expression is a pattern for text. Regular expressions are used for different purposes:
- To search texts for a pattern (possibly also to replace parts)
- To validate a text using a pattern
- To extract information within a text with a specific structure
Of course there are other uses as well. Usually, however, a pattern is used to analyze a text. The pattern is also provided as (usually) small text in which certain characters have special meanings.
1.2 A simple regular expression example
The easiest way to show the application is with an example. A regular expression (and for that matter the pattern) might look something like this:
hallo|hello
Dieser reguläre Ausdruck würde nun sowohl auf “hallo” als auch auf “hello” zutreffen, jedoch nicht auf einen anderen Text, wie beispielsweise “salut”. Über den | Operator kann aber im Muster angegeben werden, welche Möglichkeiten erlaubt sind.
This regular expression would now match both “hallo” and “hello”, but not any other text, such as “hey”. The pipe operator “|” is used in the pattern to separate all instances that should be matched. In this case, both the words “hello” and “hallo” would be matched by this pattern.
1.3 The syntax of regular expressions
There are various regex implementations, and the syntax can sometimes differ slightly. This article focuses on some very common elements:
Example | Description |
---|---|
abcdef | Matches the text “abcdef” |
. | Matches any character |
\. | Matches the character “.” |
abc|def | Matches the text “abc” or “def” |
[a-z0-9] | Matches any character from the range “a-z0-9”. |
a? | Matches the character “a” zero or one time |
a* | Matches the character “a” zero, one, or more times |
a+ | Matches the character “a” one, or more times |
a{2} | Matches the character “a” exactly two times |
a{2,4} | Matches the character “a” between two and four times |
a{2,} | Matches the character “a” at least twice |
a{,2} | Matches the character “a” at most twice |
abc+ | Matches the text “abc” one or more times |
(abc)+ | Matches the text “abc” at least once |
^abc | Matches the text “abc” at the beginning of a line |
abc$ | Matches the text “abc” at the end of a line |
Regular expressions are usually not bound to the beginning and end of text. That is, the pattern searches the entire text for a match. In order for the text to match exactly, it must be anchored with the ˆ
and $
operators. For example, the expression ˆ(hallo|hello)$
only matches the terms “hallo” and “hello”, but not longer text that contains the term “hello”. With the pattern “hallo|hello” without an anchor, the pattern would also match it. The elements from the table can be listed to compose the pattern so that it fulfills its intended purpose. A pattern is therefore able to combine several elements of the table. For example, [abc]+ can be used to find arbitrarily consecutive a, b, and c, like the words bac
or cab
.
2. How Regular Expressions Work
So far it has been explained what a regular expression is. But how do they work and how are they implemented?
2.1 Regular expressions as a Finite Automaton
As a reminder, regular expressions are provided as text. This makes it very easy to enter these patterns on a keyboard. However, regular expressions can also be described and represented graphically as a “finite automaton” or as a “state machine”, because every regular expression can be converted into a state machine. The conversion is easy to implement and helps to understand how an algorithm uses regular expressions.
A state machine for the regular expression “hallo|salut” could look like this:
Figure 1: State machine for the regular expression “hallo|salut”
An algorithm always starts in the initial state “start” and takes the first character from a text to be analyzed in order to compare it with all possible state transitions. If no state transition matches the character, then there is no hit at the position. However, if it is, the process is repeated based on the new state, this time with the second character. The whole thing is continued until either no hit or no state transition is found or the end state “end” is reached.
2.2 State machine for regular expression operators
The possibilities that are available in a regular expression can basically be divided into three different categories:
- Symbols
- Repetition
- Alternation
Some implementations of regular expressions allow additional elements. Symbols, repetitions and alternations are explained below.
2.2.1 Symbols as a state machine
Symbols mean characters (letters, numbers, special characters, etc…). Every regular expression has characters in it because the ultimate goal of any expression is to match the symbols to the text. With a pattern “abc” there are three consecutive characters that have to be matched.
Symbols, on the other hand, are simply represented in the state machine as consecutive states, with the respective symbols as state transitions.
This can be shown using the regular expression “abc”:
Figure 2: State machine for the regular expression “abc”
2.2.2 A repetition as a state machine
Basically, the operators *
, +
and {a,b}
are used to introduce repetitions. Actually, the two operators *
and +
can also be represented using the {a,b}
operator. The *
operator is used for 0 or more repetitions. This can also be achieved with {0,}
. The +
operator stands for 1 or more repetitions, which can accordingly be represented with {1,}
. In practice, however, these elements appear very often, so these short versions are pleasant to work with.
But what does a state machine for a repetition look like? As an example, here is the regular expression (abc)+d
(The string “abc” one or more times in a row and then a “d”):
Figure 3: State machine for the regular expression “(abc)+d”
2.2.3 An alternation as a state machine
There are two types of alternations in a regular expression. On the one hand there are the character classes, for example [abc]
and one of the letters “a”, “b” or “c”. On the other hand there is the |
Operator, for example abc|def
, which lists the text in “abc” or “def”. Here, too, the first variant, the character class, can be rewritten: a|b|c
, which is basically the same as [abc]
.
Here is another example to illustrate this: a[bcd]e
:
Figure 4: State machine for the regular expression “a[bcd]e”
2.3 Backtracking vs. Non-Backtracking
There are several regular expression algorithms. There are two main types, each with advantages and disadvantages.
A backtracking algorithm tests every single possible path of a regular expression until a match is found. This is not the most efficient solution available, but it offers the possibility of so-called backreferences, for which no efficient algorithm is known so far. If an implementation implements these backreferences, one can refer to previous parts within a regular expression.
For example, the regular expression (ab|cd)\1
matches both abab and cdcd, but not abcd.
Other implementations that don’t rely on backtracking can be much more efficient. Not all possible paths are checked, only all possible states.
2.3.1 Efficiency of Backtracking Algorithms
Most of the time, it goes unnoticed that backtracking algorithms are inefficient. In addition, backtracking algorithms are very common, for the simple reason that they offer more possibilities than other options. The following example shows what consequences this can have:
The simple regular expression (a|aa)*b
targeting specific text takes some time, often more than a minute. This is relatively easy to determine in Chrome and with JavaScript:
|
|
Task:
|
|
Conclusion
A simple regular expression can cause significant performance problems for backtracking algorithms and make them lengthy. Nevertheless, the method is used by many frameworks: because they are widespread, there is no quick implementation, or because more efficient methods are simply not known. Ideally, a combination would be such that the slow variant is only used when backlinks are included in the regular expression. If this is not the case, the more efficient method would come into play - however, it would also depend on the application purpose, because there are definitely situations in which it would be security-critical to support the more time-consuming method at all. For example, when a system is deliberately brought to a standstill by a cyber attack with a slow regular expression.
📝 Like what you read?
Then let’s get in touch! Our engineering team will get back to you as soon as possible.