Regex course – part two. Writing more elegant and precise patterns.

JavaScript Regex

Hello again! Today we are back to regular expressions in JavaScript. If you are new to them, check out the first part of the course. This time we will learn how to write more elegant patterns and define the position of searched strings.

The shorter way to define repetitions

In the previous part we learned that the asterisk, * , makes the preceding expression match zero or more times. This is equivalent to {0,} . There are other shorthand forms too, and using them can make your patterns shorter and more elegant.

One or more repetitions

With the plus sign, + , we indicate that the preceding expression must match one or more times. It is similar to the asterisk, but this time at least one match is required. It is equivalent to {1,} .

It means that /.+/ will match one or more characters of any kind (by default, the dot does not match line breaks).
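Here is a small sketch comparing the plus sign with the asterisk. The sample words are made up purely for illustration:

```javascript
// "o+" needs at least one "o", "o*" is happy with none
const onePlus = /go+gle/;
const zeroPlus = /go*gle/;

console.log(onePlus.test('gogle'));   // true
console.log(onePlus.test('gooogle')); // true
console.log(onePlus.test('ggle'));    // false
console.log(zeroPlus.test('ggle'));   // true

// /.+/ needs at least one character, /.*/ also accepts an empty string
console.log(/.+/.test('')); // false
console.log(/.*/.test('')); // true
```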

Imagine, for example, checking whether a string contains a substring that is followed by at least one more character, so that the substring is not the last thing in the string.

Please note that the question mark is a special character, so we need to precede it with a backslash to match it literally.
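A minimal sketch of such a check could look like this. The sample sentences are assumptions chosen for illustration, not the ones from the original example:

```javascript
// An escaped question mark followed by one or more characters
const questionThenMore = /\?.+/;

console.log(questionThenMore.test('Are you sure? Yes.')); // true
console.log(questionThenMore.test('Are you sure?'));      // false, nothing follows the "?"
console.log(questionThenMore.test('No questions here.')); // false, there is no "?" at all
```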

We can take it a little further and write a more generic function:
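One way such a function might look is sketched below. The name isFollowedByMore and its signature are hypothetical, and the escaping step is an addition that keeps special characters in the substring from being treated as part of the pattern:

```javascript
// Hypothetical helper: the name and signature are assumptions for this sketch.
function isFollowedByMore(text, substring) {
  // Escape characters that have a special meaning in a pattern,
  // so that the substring is matched literally.
  const escaped = substring.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // The substring has to be followed by at least one more character.
  return new RegExp(escaped + '.+').test(text);
}

console.log(isFollowedByMore('Are you sure? Yes.', '?')); // true
console.log(isFollowedByMore('Are you sure?', '?'));      // false
console.log(isFollowedByMore('1+1=2', '1+1'));            // true
```

Building the pattern with the RegExp constructor lets us use a value that is only known at runtime, which a regex literal cannot do.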

Optional character

As mentioned above, the question mark is a special character. Placed after an expression, it makes that expression optional: it may appear once or not at all. This is equivalent to {0,1} .
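As a quick illustration, here is a pattern with an optional letter. The example words are assumptions picked for this sketch:

```javascript
// The "u" may appear once or not at all
const colour = /colou?r/;

console.log(colour.test('color'));   // true
console.log(colour.test('colour'));  // true
console.log(colour.test('colouur')); // false, two "u" letters are too many
```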

The shorter way to define a set of possible characters

Previously we used square brackets, [ ] , to define a set of possible characters. Regex also has a few predefined sets that you can refer to with a short escape sequence.

Alphanumeric character

If you ever wanted to match any alphanumeric character (or an underscore), you would need a pattern like this: /[A-Za-z0-9_]/ . Quite a complex one, isn’t it? There is a shorter way to do this, though: \w . Watch out: neither of them will match language-specific characters such as “ż” or “é”!
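A short sketch of how \w behaves, using sample characters chosen for illustration:

```javascript
const wordCharacter = /\w/;

console.log(wordCharacter.test('a')); // true
console.log(wordCharacter.test('7')); // true
console.log(wordCharacter.test('_')); // true
console.log(wordCharacter.test('-')); // false

// Language-specific letters are not covered by either form
console.log(wordCharacter.test('ż'));    // false
console.log(/[A-Za-z0-9_]/.test('ż'));   // false
```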

Non-alphanumeric character

It is the opposite of the pattern described above: /[^A-Za-z0-9_]/ . Its shorthand is \W . It shares the same flaw: since \w does not cover language-specific characters, \W treats them as non-alphanumeric.
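The sketch below shows the effect, with sample strings that are assumptions made for illustration:

```javascript
const nonWordCharacter = /\W/;

console.log(nonWordCharacter.test('abc_123')); // false, only letters, digits and "_"
console.log(nonWordCharacter.test('abc 123')); // true, the space is not alphanumeric

// A language-specific letter counts as "non-alphanumeric" here
console.log(nonWordCharacter.test('ż'));      // true
console.log(/[^A-Za-z0-9_]/.test('ż'));       // true
```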

Handling digits

Previously, we learned that to match any digit we can use a pattern like this: [0-9] . We can also use the shorthand \d , which matches any digit.

A more general regex note: in some implementations (JavaScript included) \d means just [0-9] . In others, it might match any Unicode digit character, such as Eastern Arabic numerals. An example of such an implementation is the one in Python 3.

Using \D will match any non-digit character.
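For instance, a minimal sketch with made-up sample strings:

```javascript
const digit = /\d/;

console.log(digit.test('Room 101'));  // true
console.log(digit.test('no digits')); // false

// Combined with +, \d picks up the first run of digits
const firstNumber = 'Call 555-0199'.match(/\d+/)[0];
console.log(firstNumber); // "555"
```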
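To see both points in JavaScript, here is a small sketch. The Eastern Arabic digit is written as a Unicode escape, and the other strings are assumptions for illustration:

```javascript
const easternArabicThree = '\u0663'; // the digit "٣"

// In JavaScript \d covers only 0-9, so this digit is treated as a non-digit
console.log(/\d/.test(easternArabicThree)); // false
console.log(/\D/.test(easternArabicThree)); // true

const nonDigit = /\D/;
console.log(nonDigit.test('2024'));  // false
console.log(nonDigit.test('20x24')); // true
```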

Dealing with whitespaces

In strings, there are a few types of whitespace characters:

  • space “ ”
  • tab “\t”
  • new line “\n”
  • carriage return “\r”

To create a pattern that matches every one of them, we would need something like this: /[ \t\n\r]/ . There is an easier way, though: \s , which matches all of the above along with a few rarer whitespace characters, such as form feed and vertical tab.

Conversely, \S matches any non-whitespace character.
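A quick sketch of \s in action, with sample strings invented for this example:

```javascript
const whitespace = /\s/;

console.log(whitespace.test(' '));       // true
console.log(whitespace.test('\t'));      // true
console.log(whitespace.test('\n'));      // true
console.log(whitespace.test('\r'));      // true
console.log(whitespace.test('nospace')); // false

// Splitting on any single whitespace character
const parts = 'a b\tc\nd'.split(/\s/);
console.log(parts); // [ 'a', 'b', 'c', 'd' ]
```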
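A short sketch of the non-whitespace shorthand, again with made-up strings:

```javascript
const nonWhitespace = /\S/;

console.log(nonWhitespace.test(' \t\n')); // false, whitespace only
console.log(nonWhitespace.test('  x '));  // true

// \S+ picks up the first run of non-whitespace characters
const firstWord = '  padded  '.match(/\S+/)[0];
console.log(firstWord); // "padded"
```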

Specifying the position

So far, we’ve been just writing patterns that will be matched if they are found anywhere in the string. We can specify the position in order to be more precise.

Caret sign

If you add the ^ sign at the very beginning of your pattern, it will match only if the tested string begins with that pattern.

Please note that the caret serves a different purpose inside square brackets, as mentioned in the previous part of the course.
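As an illustration, with sample strings that are assumptions for this sketch:

```javascript
const startsWithDog = /^dog/;

console.log(startsWithDog.test('dog and cat')); // true
console.log(startsWithDog.test('cat and dog')); // false

// Without the caret, the position does not matter
console.log(/dog/.test('cat and dog')); // true
```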

Dollar sign

Adding a dollar sign at the end of your pattern will make it match only if it appears at the end of the string:
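A minimal sketch, using made-up sample strings:

```javascript
const endsWithDog = /dog$/;

console.log(endsWithDog.test('cat and dog')); // true
console.log(endsWithDog.test('dog and cat')); // false
```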

Combining both signs

If you begin your pattern with  ^ and end it with  $ it will match only if the tested string matches as a whole:
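Here is a sketch of how that could look. The tested strings are assumptions, built around the word “success”:

```javascript
const exactlySuccess = /^success$/;

console.log(exactlySuccess.test('success'));      // true
console.log(exactlySuccess.test('unsuccessful')); // false

// The unanchored pattern finds "success" inside the longer word
console.log(/success/.test('unsuccessful')); // true
```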

Even if the string “success” can be found in the tested string, enclosing the pattern in ^  and $  would make it match only if the whole string matches.

Check one more example:
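A sketch of such a check, with sample strings chosen for illustration:

```javascript
const onlyDigits = /^\d+$/;

console.log(onlyDigits.test('123456'));    // true
console.log(onlyDigits.test('123abc456')); // false
console.log(onlyDigits.test(''));          // false, + needs at least one digit
```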

Here, we are checking if the string contains only digits. Using the plus sign makes it work for one or more digits. Enclosing the pattern in ^ and $ makes sure that the expression matches only if there are digits from the beginning to the end of the string, and nothing else.

Without the mentioned signs, the second string would match:
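For comparison, a sketch of the same pattern without the anchors. The strings are assumptions for illustration, where the second one mixes digits with letters:

```javascript
const hasDigits = /\d+/;

console.log(hasDigits.test('123456'));    // true
console.log(hasDigits.test('123abc456')); // true, a run of digits is found inside
```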

Multiline mode

We’ve already learned that there are additional flags that we can add to our patterns. One of them is the multiline flag, represented by the letter m . It changes the meaning of the caret and the dollar sign: in multiline mode, they stand for the beginning and end of a line rather than of the whole string.
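A sketch of the multiline flag at work. The animal names and the variable animals are assumptions made for this example:

```javascript
// A template literal keeps the line breaks exactly as typed
const animals = `dog
cat
parrot and other birds`;

console.log(/^dog$/m.test(animals));    // true
console.log(/^cat$/m.test(animals));    // true
console.log(/^parrot$/m.test(animals)); // false

// Without the m flag, ^ and $ refer to the whole string
console.log(/^cat$/.test(animals)); // false
```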

I’ve used template literals here to add newlines. We could also do something like that:
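One possible alternative, assuming the newlines are written explicitly as \n in an ordinary string (the variable name is hypothetical):

```javascript
// The same three lines, with explicit newline characters
const animalsWithEscapes = 'dog\ncat\nparrot and other birds';

console.log(/^dog$/m.test(animalsWithEscapes));    // true
console.log(/^cat$/m.test(animalsWithEscapes));    // true
console.log(/^parrot$/m.test(animalsWithEscapes)); // false
```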

Thanks to the multiline flag, each line is tested as a whole, rather than the entire string. Note that the last test still won’t pass, because the last line contains more than just “parrot”.

Summary

This time we’ve learned how to use more of the special characters as shorter forms of more complex patterns. You can now also make your patterns more precise by specifying the position of what you are looking for: the beginning of a string, the end of it, or both, which makes the pattern match the whole string (or the whole line in multiline mode). Our patterns are getting more and more complex, so I encourage you to put them to use. More parts of the course will come soon. Take care!
