Master regex hands-on

Brice Vergnou
8 min read · Jan 27, 2022

Finally, master that skill by experimenting yourself

Regex example
Image by author

Disclaimer: This tutorial is built around experimentation, so it deliberately takes a less direct route to let you discover how regex behaves. Don’t hesitate to tinker with the expressions as you go.

Table of contents

  • What is regex?
  • Basics
  • Ranges
  • Quantifiers
  • Selecting groups
  • Extra-curriculum

What is Regex?

Regex is a very powerful tool when you’re working with data. Do you need to filter text to validate an email on a website? Go with Regex. Do you need to clean text for an NLP project? Regex is your friend! The possibilities are endless.

But you quickly realize it’s not going to be that trivial, because these piles of weird symbols don’t make much sense at first, and that’s normal. That’s why we’re going to demystify them step by step through experiments.

For this purpose, we will be using Regex101 to be able to test our regex in a sandbox.

To keep it simple, regex lets you describe and match complex string patterns, so you can accept various formats when you ask for a date, for example. It also makes it easy to save each part (the month group, the day group, the year group; we’ll come to that later) and manipulate it.

How does it work?

Basics

For this part, paste the following text in the test string box:

easily enough, I will learn regex.

Starting from the basics: if you type a plain string of characters, regex matches it wherever it appears in the text, with the characters in that exact order. For example, if you type ea in the regular expression box, you get two matches:

ea is in both easily and learn

Notice that the match can be in the middle of a word.
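If you want to reproduce this outside the sandbox, here is a minimal sketch. It assumes Python and its built-in re module, which the tutorial itself doesn’t use; the variable names are only for illustration.

```python
import re

text = "easily enough, I will learn regex."

# finditer yields one match object per non-overlapping occurrence of "ea"
matches = [(m.start(), m.group()) for m in re.finditer(r"ea", text)]

# Each tuple is (position in the string, matched text)
print(matches)  # [(0, 'ea'), (23, 'ea')]
```

The second match starts at position 23, right in the middle of learn.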

But what if you wanted to select ea only when it begins the string? The ^ symbol does exactly that: place it right before the expression you want to match ( ^ea ).

only the ea in easily is matched, as it begins the string

You can do the same thing for the end of the string with the $ symbol, except that it goes right after the expression. In this case, to match ex. at the end of the string, I’d type ex.$ . Let’s try this:

It worked! But not for the reason you think. As proof, change the period in the test string to any other character: it still matches.

notice the . became a !

Huh?? Why would it work if we don’t have a period in our test string? Simply because the period has its own meaning in regex: it’s a wildcard that matches any character ( except a newline ). So our expression doesn’t match ex. at the end of the string, but ex followed by any character at the end of the string. To match an actual period, escape it by putting the escape character ( the backslash \ ) right before it: ex\.$

the ! is no longer filtered as we use \.
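The same experiment can be sketched in code. This assumes Python’s re module (not something the tutorial uses), with ^ placed before the expression and $ placed after it; the variable names are made up for the example.

```python
import re

text = "easily enough, I will learn regex."
text_with_bang = text.replace(".", "!")  # same sentence, ending with "!"

# ^ anchors the pattern at the start of the string: only the first "ea" matches
start_matches = [m.start() for m in re.finditer(r"^ea", text)]

# Unescaped period: a wildcard, so both endings match
wildcard_on_period = re.search(r"ex.$", text) is not None
wildcard_on_bang = re.search(r"ex.$", text_with_bang) is not None

# Escaped period: only a literal "." matches
literal_on_period = re.search(r"ex\.$", text) is not None
literal_on_bang = re.search(r"ex\.$", text_with_bang) is not None

print(start_matches)                         # [0]
print(wildcard_on_period, wildcard_on_bang)  # True True
print(literal_on_period, literal_on_bang)    # True False
```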

Ranges

For this part, the text we’ll be working with is

learning data science doesn’t take 1 day

Alright, you learned how to filter a simple piece of text. But what if I ask you to show every character that is not a number?

Well, in regex it’s actually simple. Square brackets [] match any single character listed between them. What’s the difference with just typing the letters? Well, abc matches the exact sequence abc, while [abc] matches any one character that is either a, b, or c.

Back to our problem: if we want to match any lowercase letter, we’d have to type [abcdefghijklmnopqrstuvwxyz]. Fortunately for us, some shortcuts exist. For lowercase letters, it’s [a-z]. Let’s try this with our sentence:

only the spaces, the apostrophe, and the 1 are not selected

Similarly, there are other shortcuts :

  • [A-Z] for all uppercase letters
  • [a-q] for all lowercase letters from a to q
  • [a-zA-Z0-9] for every alphanumeric character ( \w is a close shortcut, but it also matches the underscore _ )
  • [^a-z] to select everything that is not a lowercase letter ( the ^ within square brackets indicates a negation )
  • …plenty of custom ones you can make
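To check these ranges against the sentence, here is a small sketch, again assuming Python’s re module rather than the sandbox. The variable names are hypothetical.

```python
import re

text = "learning data science doesn’t take 1 day"

# Every lowercase letter, one match per character
lowercase = re.findall(r"[a-z]", text)

# The negated range: everything that is NOT a lowercase letter
not_lowercase = re.findall(r"[^a-z]", text)

# Digits only
digits = re.findall(r"[0-9]", text)

# A partial range: the "r" of "learning" falls outside a-q
a_to_q = "".join(re.findall(r"[a-q]", "learning"))

# \w accepts the underscore, the explicit range does not
underscore_in_w = re.fullmatch(r"\w", "_") is not None
underscore_in_range = re.fullmatch(r"[a-zA-Z0-9]", "_") is not None

print(sorted(set(not_lowercase)))  # the space, the digit 1 and the apostrophe
print(digits)                      # ['1']
print(a_to_q)                      # leaning
```

Every character of the sentence lands in either lowercase or not_lowercase, which is a quick way to convince yourself that [^a-z] really is the complement of [a-z].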

Quantifiers

For this part, the sentence will be

In 2022, I would like to fix my sleep schedule

Quantifiers let you match a character repeated a specific number of times, depending on your needs. They can be used after both single characters and ranges.

  • * matches the previous character between 0 and unlimited times

With s[le]*, we indicate we want an s, followed by any number ( including zero ) of characters that are each either l or e: s, sl, se, seel, seele, slee…

  • ? matches the previous character 0 or 1 times. It is useful when you’re trying to match something that is optional ( for example the day in a date, since you could say January 2022 without specifying the day )

With s[le]?, we indicate we want the s, and maybe one of the characters between brackets ( only one! ). You could then have: s, se, or sl.

  • + matches the previous character at least once

With s[le]+, we indicate we want the s, followed by at least one character that is either l or e. It could then be sl, se, sle, sel, seel, slee…

  • {x} means you want a character to repeat x times, whereas {x,y} means you want it to repeat between x and y times

e{2} means we want the e to appear exactly twice (ee). If we wanted it to appear once or twice, we would have written e{1,2}. Try it yourself!
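Here is how these quantifiers behave on the sentence above, as a sketch that assumes Python’s re module ( the tutorial itself only uses the sandbox ). Quantifiers are greedy, so each one grabs as many characters as it is allowed to.

```python
import re

text = "In 2022, I would like to fix my sleep schedule"

# * : an s followed by zero or more l/e characters
star = re.findall(r"s[le]*", text)

# ? : an s followed by at most one l/e character
optional = re.findall(r"s[le]?", text)

# + : an s followed by at least one l/e character
plus = re.findall(r"s[le]+", text)

# {2} : exactly two e in a row; {1,2} : one or two
exactly_two = re.findall(r"e{2}", text)
one_or_two = re.findall(r"e{1,2}", text)

print(star)         # ['slee', 's']
print(optional)     # ['sl', 's']
print(plus)         # ['slee']
print(exactly_two)  # ['ee']
print(one_or_two)   # ['e', 'ee', 'e', 'e']
```

The lone s comes from schedule: it is followed by a c, so * and ? still accept it with zero repetitions, while + rejects it.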

Note: when using the curly brackets with two arguments ( {x,y} ), there are two things to remember:

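  • If you don’t specify x ( {,y} ), it would mean you’re looking for your character up to y times. Support for this form depends on the regex flavor: Python accepts it, while JavaScript reads it as literal text, so {0,y} is the safer spelling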
  • If you don’t specify y ( {x,} ), it would mean you’re looking for your character at least x times
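A quick sketch of the two open-ended forms, assuming Python’s re module, where a missing lower bound means 0. Other flavors may treat {,y} differently, so take this as an illustration of Python’s behavior only.

```python
import re

# {,2} in Python: between 0 and 2 repetitions
up_to_two = [bool(re.fullmatch(r"e{,2}", s)) for s in ["", "e", "ee", "eee"]]

# {2,}: at least 2 repetitions, no upper limit
at_least_two = [bool(re.fullmatch(r"e{2,}", s)) for s in ["e", "ee", "eeee"]]

print(up_to_two)     # [True, True, True, False]
print(at_least_two)  # [False, True, True]
```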

Selecting groups

For this part, we’ll use this date as our test string :

27/01/2022

Okay, we learned how to filter what we wanted, but how do we retrieve it in groups? For example, if we’re working with dates, how do we separate the day part, the month part, and the year part?

In Regex, we capture strings of characters between parentheses (). Let’s type this regex code :

(\d{2})\/(\d{2})\/(\d{4})

But before I tell you what it is, try to analyze it yourself ( hint: \d stands for any digit ). It may look scary, but it’s actually easy if you break it down.

So (\d{2}) means we want two digits, captured in a group. It appears twice, once for the day and once for the month, and the two are separated by a slash /. The backslash just before each slash escapes it: by default, Regex101 uses / as the delimiter around the expression, so a bare slash would end the pattern. Escaping tells regex “don’t read the next character as a special one”. We then do the same thing again with a 4 instead, (\d{4}), which takes 4 digits for the year.

Now look at the right part of your page: you can see the match information. The digits are separated just as we wanted.
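In code, the groups come back as separate values. This is a minimal sketch assuming Python’s re module; Python has no / delimiter, so escaping the slash is not required there, but the escaped form from the tutorial is still valid and is kept as-is.

```python
import re

date = "27/01/2022"

# Same expression as above: three capturing groups separated by slashes
match = re.fullmatch(r"(\d{2})\/(\d{2})\/(\d{4})", date)

# groups() returns the captured parts in order
day, month, year = match.groups()

print(day, month, year)  # 27 01 2022
```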

But imagine we wanted to make the day an optional component of the date. When making groups, if you put ?: at the beginning of your parentheses like so (?:…), it creates a “non-capturing group” ( which means it won’t show up as a group of its own in the match information tab ).

Why is that interesting? Because you can put a quantifier on this group without adding an extra entry to the results, and which one have we been talking about for this occasion? That’s right, the question mark.

This expression makes the day optional:

(?:(\d{2})\/)?(\d{2})\/(\d{4})

We removed the day

And it no longer appears, without crashing!
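To see what happens to the day group when it is missing, here is a sketch assuming Python’s re module, where a group that didn’t take part in the match comes back as None. Other tools may display this differently.

```python
import re

# The non-capturing group (?:...)? wraps the day and its slash, making both optional
pattern = re.compile(r"(?:(\d{2})\/)?(\d{2})\/(\d{4})")

with_day = pattern.fullmatch("27/01/2022").groups()
without_day = pattern.fullmatch("01/2022").groups()

print(with_day)     # ('27', '01', '2022')
print(without_day)  # (None, '01', '2022')
```

Only three groups are reported in both cases: the outer (?:…) never shows up in the results, which is the point of making it non-capturing.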

Extra-curriculum

Why not try to apply all of this knowledge to a short project to test your skills? In this article, I give you one problem to solve with a detailed solution in case you get stuck. You should give it a try, it’s by practicing that we retain things ;)

Conclusion

And with this, you’ll be able to handle about 90% of what you need regex for. For more specific scenarios, Google and the sandbox we’ve been using are your friends.

If you don’t remember some syntax or want to find something more specific, have a look at this cheat sheet.

Thanks for reading this article, I hope you found the information you were looking for. I wish you good luck with your data science or programming journey ❤
