Regex course - part two. Writing more elegant and precise patterns. - wanago.io

May 7, 2018

1. Regex course – part one. Basic concepts.
2. Regex course – part two. Writing more elegant and precise patterns.
3. Regex course – part three. Grouping and using ES6 features.
4. Regex course – part four. Avoiding catastrophic backtracking using lookahead

Hello again! Today we are back to regular expressions in JavaScript. If you are new to them, check out the first part of the course. This time we will learn how to write more elegant patterns and define the position of searched strings.

The shorter way to define repetitions

Recently we learned that the asterisk, * , makes the preceding expression match 0 or more times. This is equivalent to {0,} . There are more shorthand forms like this, and using them can make your patterns shorter and more elegant.

One or more repetitions

With the plus sign, + , we can indicate that the expression has to be matched one or more times. It is similar to the asterisk, but this time the expression has to match at least once. It is equivalent to {1,} .

/1+23/.test('123'); // true
/1+23/.test('111123'); // true
/1+23/.test('23'); // false

It means that /.+/ will match any string that contains at least one character (by default, the dot does not match line breaks), while /.*/ matches even an empty string.

/.+/.test(''); // false
/.*/.test(''); // true
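To see the equivalences side by side, here is a small sketch. The helper name compareQuantifiers is made up for this illustration and is not part of the course:

```javascript
// Hypothetical helper: runs each shorthand quantifier and its explicit
// {min,} form against the same string so the results can be compared.
function compareQuantifiers(string) {
  return {
    plus: /1+23/.test(string),
    explicitPlus: /1{1,}23/.test(string),
    star: /1*23/.test(string),
    explicitStar: /1{0,}23/.test(string),
  };
}

console.log(compareQuantifiers('23'));
// plus and explicitPlus are false, star and explicitStar are true

console.log(compareQuantifiers('111123'));
// all four are true
```

Each shorthand always agrees with its explicit form; the only difference between the plus sign and the asterisk shows up when the repeated character is missing.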

Imagine, for example, checking if a string contains a question mark that is followed by at least one more character:

// checks if the string contains a question mark
// that is followed by at least one more character
function hasQuestionMarkBeforeEnd(string) {
  return /\?.+/.test(string);
}

hasQuestionMarkBeforeEnd('Do you know regex yet?'); // false
hasQuestionMarkBeforeEnd('Do you know regex yet? Yes, I do!'); // true

Please note that a question mark is a special character and because of that we need to precede it with a backslash.

We can take it a little further and write a more generic function:

function containsPatternBeforeEnd(string, pattern) {
  // pattern is inserted as regex source, so special characters keep their meaning
  return new RegExp(`${pattern}.+`).test(string);
}

containsPatternBeforeEnd('cat, dog', 'cat'); // true
containsPatternBeforeEnd('cat, dog', 'dog'); // false
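Because the second argument is treated as a pattern, passing plain text that contains special characters (a lone question mark, for instance) would throw a syntax error or match something unexpected. One possible way around it, sketched below, is to escape the text first. Both escapeRegExp and containsTextBeforeEnd are hypothetical helpers introduced only for this example:

```javascript
// Hypothetical helper: puts a backslash in front of every character
// that has a special meaning in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical variant of the function above that treats its second
// argument as literal text instead of a pattern.
function containsTextBeforeEnd(string, text) {
  return new RegExp(`${escapeRegExp(text)}.+`).test(string);
}

containsTextBeforeEnd('Really? Yes!', '?'); // true
containsTextBeforeEnd('Really?', '?'); // false
```

Whether to escape depends on the intent: if callers are supposed to pass real patterns, the unescaped version is the right one.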

Optional character

As mentioned above, the question mark is a special character. Placed after a character, it makes that character optional. It is equivalent to {0,1} .

function wereFilesFound(string) {
  return /[1-9][0-9]* files? found/.test(string);
}

wereFilesFound('0 files found'); // false
wereFilesFound('No files found'); // false
wereFilesFound('1 file found'); // true
wereFilesFound('2 files found'); // true
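Another common use of the optional character is accepting two spellings of the same word. The function name below is an assumption made for this sketch:

```javascript
// Hypothetical helper: accepts both the American and the British spelling.
function mentionsColor(string) {
  // the question mark applies only to the "u" right before it
  return /colou?r/.test(string);
}

mentionsColor('What a nice color'); // true
mentionsColor('What a nice colour'); // true
mentionsColor('colouur'); // false, "u" may appear at most once
```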

The shorter way to define a set of possible characters

Previously we used square brackets, [ ] , to define a set of possible characters. Regex provides shorthand notations for a few commonly used sets.

Alphanumeric character

If you ever wanted to match any alphanumeric character (or an underscore), you would need a pattern like this: /[A-Za-z0-9_]/ . Quite a complex one, isn’t it? There is a shorter way to do this, though: \w . Watch out: neither of them will match language-specific characters such as Ó!

Non-alphanumeric character

It is the opposite of the pattern described above: /[^A-Za-z0-9_]/ . Its equivalent is \W . It shares the same flaw, because language-specific characters are treated as non-alphanumeric:

function isAlphanumeric(string) {
  return /\w/.test(string);
}

function isNotAlphanumeric(string) {
  return /\W/.test(string);
}

isAlphanumeric('Ó'); // false
isNotAlphanumeric('Ó'); // true
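If language-specific characters matter, one possible workaround is to use Unicode property escapes together with the u flag. This is only a sketch: it assumes a JavaScript engine that supports them (they were standardized in ES2018), and the function name is made up for this example:

```javascript
// Hypothetical helper: similar to \w, but based on Unicode categories
// (\p{L} for letters, \p{N} for numbers) instead of ASCII ranges.
function containsUnicodeWordCharacter(string) {
  return /[\p{L}\p{N}_]/u.test(string);
}

containsUnicodeWordCharacter('Ó'); // true
containsUnicodeWordCharacter('ż'); // true
containsUnicodeWordCharacter('!'); // false
```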

Handling digits

Previously, we’ve learned that to match any digit we can use a pattern like this: [0-9] . We can also make use of \d . It will match any digit:

function isItADigit(string) {
  return /\d/.test(string);
}

isItADigit('5'); // true
isItADigit('a'); // false

A more general regex note: in some implementations (JavaScript included) \d means just [0-9] . In others, it might match any Unicode digit character, such as Eastern Arabic numerals. An example of such an implementation is the one in Python 3.

Using \D will match any non-digit characters.
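A typical use of \D is stripping everything that is not a digit. The sketch below assumes the global flag g and the replace method of strings; the function name and the sample phone number are made up for illustration:

```javascript
// Hypothetical helper: removes every character that \d would not match.
function keepOnlyDigits(string) {
  return string.replace(/\D/g, '');
}

keepOnlyDigits('+48 123-456-789'); // '48123456789'

// In JavaScript, \d does not match digits from other scripts:
/\d/.test('\u0663'); // false, U+0663 is the Arabic-Indic digit three
/\D/.test('\u0663'); // true
```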

Dealing with whitespaces

In strings, there are a few types of whitespace characters:

space “ ”
tab “\t”
new line “\n”
carriage return “\r”

To create a pattern that matches every one of them, we would need something like this: /[ \t\n\r]/ . There is an easier way, though, and it involves using \s , which also covers less common whitespace such as the form feed and the non-breaking space:

function containsWhitespace(string) {
  return /\s/.test(string);
}

containsWhitespace('Lorem ipsum'); // true
containsWhitespace('Lorem_ipsum'); // false

Conversely, using \S would match any non-whitespace character.
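As a quick illustration, \S can tell us whether a string holds anything besides whitespace. The function name below is an assumption made for this sketch:

```javascript
// Hypothetical helper: true if the string has at least one character
// that is not whitespace.
function hasVisibleCharacters(string) {
  return /\S/.test(string);
}

hasVisibleCharacters(' \t\n'); // false
hasVisibleCharacters('  a  '); // true

// \s is broader than [ \t\n\r]: it also matches a non-breaking space.
/\s/.test('\u00a0'); // true
/[ \t\n\r]/.test('\u00a0'); // false
```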

Specifying the position

So far, we’ve been just writing patterns that will be matched if they are found anywhere in the string. We can specify the position in order to be more precise.

Caret sign

If you add the ^ sign at the very beginning of your pattern, it will match only if the tested string begins with that pattern:

/^dog/.test('dog and cat'); // true
/^dog/.test('cat and dog'); // false

Please note that the caret sign serves a different purpose when used inside square brackets, as discussed in the previous part of the course.

Dollar sign

Adding a dollar sign at the end of your pattern will make it match only if it appears at the end of the string:

/dog$/.test('dog and cat'); // false
/dog$/.test('cat and dog'); // true

Combining both signs

If you begin your pattern with ^ and end it with $ it will match only if the tested string matches as a whole:

/success/.test('Unsuccessful operation'); // true
/^success$/.test('Unsuccessful operation'); // false

Even if the string “success” can be found in the tested string, enclosing the pattern in ^ and $ would make it match only if the whole string matches.

Check one more example:

function areAllCharactersDigits(string) {
  return /^[0-9]+$/.test(string);
}

Here, we are checking if the string contains only digits. Using the plus sign will make it work for one or more digits. Enclosing the pattern in the ^ and $ signs will make sure that the expression matches only if there are digits from the beginning to the end of the string, and nothing else.

areAllCharactersDigits('123'); // true
areAllCharactersDigits('Digits: 123'); // false

Without the mentioned signs, the second string would match:

/[0-9]+/.test('Digits: 123'); // true
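The same approach works for validating a fixed format. The sketch below combines \d , exact repetition counts, and both position signs; the function name and the two-digits, hyphen, three-digits format are assumptions chosen only for illustration:

```javascript
// Hypothetical validator: exactly two digits, a hyphen, and three digits,
// with nothing before or after them.
function isPostalCode(string) {
  return /^\d{2}-\d{3}$/.test(string);
}

isPostalCode('00-950'); // true
isPostalCode('Code: 00-950'); // false, extra text at the beginning
isPostalCode('00-9500'); // false, extra digit at the end
```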

Multiline mode

We’ve already learned that there are additional flags that we can add to our patterns. One of them is the multiline flag, represented by the letter m. It changes the meaning of the caret and the dollar sign. In multiline mode, they stand for the beginning and end of a line and not of the whole string.

const pets = `
dog
cat
parrot and other birds
`;

/^dog$/m.test(pets); // true
/^cat$/m.test(pets); // true
/^parrot$/m.test(pets); // false

I’ve used a template literal here to add newlines. We could also do something like this:

const pets = 'dog\ncat\nparrot and other birds';

/^dog$/m.test(pets); // true
/^cat$/m.test(pets); // true
/^parrot$/m.test(pets); // false

Thanks to the multiline flag, only the lines are being tested as a whole, and not the entire string. You can observe that the last test still won’t pass because the last line contains more than just a “parrot”.
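Combined with the global flag g, the multiline flag can be used to collect every line that matches as a whole. This is a sketch that assumes the match method of strings; the function name is made up for this example:

```javascript
// Hypothetical helper: returns the lines that consist of a single word.
function findSingleWordLines(text) {
  // g collects every match, m lets ^ and $ match at line boundaries
  return text.match(/^\w+$/gm) || [];
}

findSingleWordLines('dog\ncat\nparrot and other birds'); // ['dog', 'cat']

// Without the m flag, ^ and $ refer to the whole string again:
/^cat$/.test('dog\ncat\nparrot and other birds'); // false
```

The fallback to an empty array is there because match returns null when nothing is found.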

Summary

This time we’ve learned how to use more of the special characters as shorter forms of more complex patterns. You can now also make your patterns more precise by specifying the position of what you are looking for: the beginning of a string, the end of it, or both, resulting in a pattern that has to match the whole string (or the whole line, if in multiline mode). Our patterns are getting more and more complex, so I encourage you to put them to use. More parts of the course will come soon. Take care!
