Regex course – part four. Avoiding catastrophic backtracking using lookahead

JavaScript Regex

Regular expressions can help us solve many different problems. They can also be the source of our headaches. A recent Cloudfare outage happened due to a regular expression that caused CPU to spike to 100% on (…) machines worldwide. In this article, we go through situations we need to watch out for like the catastrophic backtracking. To help us understand the problem, we also define what greedy and lazy quantifiers are and why the lookahead might be helpful.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Jamie Zawinski

Regex problems

xkcd.com

Diving deeper into quantifiers

The Regular Expression engines are quite complex. We can work wonders with regexp, but we need to take into account some problems that we might encounter. To do that, we need to dive deeper into how some regular expressions are executed.

Greedy quantifiers

In the previous parts of the course, we use quantifiers such as  . It tells the engine to match at least one token.

Let’s tale a closer look at the second example:  . We might wonder how many letters we match using the above expression.

Because quantifiers are greedy by default, we match as many letters as possible. We can confirm this by using the match function.

Another good example is the use of some HTML tags:

The first guess might be that it matches something like  . Not quite!

 

As you can see, the greedy quantifier matched the longest possible string!

Lazy quantifiers

In this series, we also cover the   quantifier. It means to match zero or one time.

The interesting thing is that by adding it to a greedy quantifier, we tell it to repeat as few times as possible and therefore make it lazy.

 

Catastrophic backtracking

To understand how quantifiers can affect our performance, we need to take a closer look into a process called backtracking.

Let’s take a look at this seemingly innocent piece of code!

At first glance, it successfully detects a series of digits. Let’s break down how it works.

  • First, the engine processes  . It is greedy, so it starts by attempting to match as many digits as possible. The first match is 
    • then the engine tries to apply the * quantifier, but there are no more digits left
    • since we use the  sign, we want the string to end with our digits – this does not happen due to the  sign
  • Now the amount of matched characters in   decreases due to backtracking. It matches  .
    • then the   quantifier applies and therefore   results in two substrings:   and 
    • since no of the above substrings are at the end of the string, matching with   fails
  • The engine keeps backtracking, by decreasing the number of digits the   matches

There are multiple different combinations the above process produces.

Our string ends with the  sign. Because of that, the regex engine attempts to backtrack until it finds a substring of digits at the end of a provided string.

After a lot of calculations, no match is found. It might lead to a big performance drop. If you use a really long string, the browser might hang, destroying the user experience.

The performance can sometimes be improved by changing the greedy quantifiers into lazy ones, but that is not the case in this particular example.

Lookahead

The most straightforward way to fix the above issue is to rewrite the regular expression completely. The above is not always easy and may result in quite a pain. A solution to the above problem can be the use of lookahead.

In its most basic form, it states that x matches only if it is followed by y.

We refer to it to as positive lookahead. The negative lookahead matches x only if x is not followed by ‘y’

The cool thing about lookahead is that it is atomic. As soon as its condition is satisfied, the engine will not backtrack to try different permutations.

Backreference

Another thing that we need to use here is a backreference.

Above,   means the contents of the first capturing group, and  is the contents of the second capturing group.

We can combine the use of the lookahead and the backreference to deal with the backtracking issue!

This looks quite complex; let’s break it down part by part.

  •  looks for the longest string of digits, since   is greedy
    • the engine does not backtrack looking for different combinations
  •  the backreference states that the content found by the lookahead needs to appear in the string

Thanks to all of the above, we can safely test even very long strings without any hiccups.

Summary

In this article, we’ve dived deeper into quantifiers. We’ve learned that we can divide them into greedy and lazy quantifiers and that it can have an impact on the performance. We’ve also covered another issue that quantifiers can cause called catastrophic backtracking. We’ve also learned how to use lookahead to improve the performance instead off simply rewriting our expressions. With all that knowledge we can write even better code and avoid issues like the one that Cloudflare had.

Series Navigation<< Regex course – part three. Grouping and using ES6 features.
avatar
  Subscribe  
Notify of