Master regex hands-on
Finally, master that skill by experimenting yourself
Disclaimer: This tutorial is based on experimentation, so it isn’t always straightforward; the point is for you to discover how regex reacts. Don’t hesitate to tinker with the expressions during the tutorial.
Table of contents
- What is regex?
- Basics
- Ranges
- Quantifiers
- Selecting groups
- Extra-curriculum
What is Regex?
Regex is a very powerful tool when you’re working with data. Do you need to filter text to validate an email on a website? Go with Regex. Do you need to clean text for an NLP project? Regex is your friend! The possibilities are endless.
But you quickly realize it’s not going to be that trivial, because these piles of weird symbols don’t make much sense at first, and that’s normal. That’s why we’re going to demystify them step by step by running experiments.
For this purpose, we will be using Regex101 to be able to test our regex in a sandbox.
To keep it simple, regex lets you match complex patterns in strings, so you can accept various formats when you ask for a date, for example. It also makes it easy to save each part ( month group, day group, year group, but we’ll come to that later ) so you can manipulate it.
How does it work?
Basics
For this part, paste the following text in the test string box :
easily enough, I will learn regex.
Starting from the basics, if you type a plain string of characters, regex will match it wherever those characters appear in that exact order. For example, if you add ea in the regular expression box, you get two matches:
Notice the expression can be in the middle of the word
But what if you wanted to only select ea when it begins the string? Well, the ^ symbol allows you to only select what’s at the beginning. You just have to place it right before the expression you want to filter
You can do the same thing if you want to match something at the end of the string, using the $ symbol, which goes right after the expression. In this case, if I want to match ex. at the end, I’d have to type ex.$ . Let’s try this :
It worked! But not for the reason you think. As proof, change the period in the test string to any other character: it still works.
Huh ?? Why would it work if we don’t have a period in our test string ?? It’s simply because the period in regex has its own meaning: it’s a wildcard. That means it matches any character ( except a new line ). In this case, our expression doesn’t match ex. at the end of the string, but ex followed by any character at the end of the string. If you want to search for an actual period, just escape it by putting an escape character ( the backslash \ ) before the period: ex\.$
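If you’d rather check these experiments outside the sandbox, here is a minimal sketch using Python’s built-in re module ( the choice of Python and the variable names are mine, not part of the sandbox ). It replays the literal match, the ^ anchor, the wildcard period, and the escaped period on our test string:

```python
import re

text = "easily enough, I will learn regex."

# A plain pattern matches wherever it occurs, even in the middle of a word.
literal_matches = re.findall(r"ea", text)   # "easily" and "learn"

# ^ anchors the pattern to the start of the string.
start_matches = re.findall(r"^ea", text)    # only the "ea" in "easily"

# An unescaped period is a wildcard, so ex.$ also accepts "ex!" at the end.
wildcard_match = re.search(r"ex.$", "I will learn regex!")

# Escaping the period makes it literal: ex\.$ needs a real period.
escaped_hit = re.search(r"ex\.$", text)
escaped_miss = re.search(r"ex\.$", "I will learn regex!")

print(literal_matches, start_matches)
print(wildcard_match.group(), escaped_hit.group(), escaped_miss)
```

The last print shows that the escaped version finds nothing once the period is replaced, which is exactly the behaviour the unescaped version was hiding.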
Ranges
For this part, the text we’ll be working with is
learning data science doesn’t take 1 day
Alright, you learned how to filter a simple piece of text. But what if I ask you to show every character that is not a number?
Well, in regex it’s actually simple. Square brackets [] match any single character listed between them. What’s the difference with just typing the letters we want to match? Well, abc matches the exact sequence abc, while [abc] matches any single character that is either a, b or c.
Back to our problem, if we want to match any lowercase letter, we’d have to type [abcdefghijklmnopqrstuvwxyz]. Fortunately for us, some shortcuts exist. For lowercase letters, it’s [a-z]. Let’s try this with our sentence :
Similarly, there are other shortcuts :
- [A-Z] for all caps
- [a-q] for all letters from a to q
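- [a-zA-Z0-9] for every alphanumeric character ( \w is a close shortcut, but it also matches the underscore _ )
- [^a-z] to select everything that is not a lowercase letter ( the ^ within square brackets indicates a negation ), so [^0-9] answers our earlier question about characters that are not numbers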
- …plenty of custom ones you can make
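Here is a small sketch of these ranges in Python’s re module, as one possible way to try them outside the sandbox. I typed the sentence with a plain apostrophe to keep things simple, and the variable names are only illustrative:

```python
import re

text = "learning data science doesn't take 1 day"

# [a-z] matches one lowercase letter at a time.
lowercase = re.findall(r"[a-z]", text)

# [^a-z] matches every character that is NOT a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", text)

# [0-9] matches digits, and [^0-9] matches everything else.
digits = re.findall(r"[0-9]", text)
not_digits = re.findall(r"[^0-9]", text)

# \w is close to [a-zA-Z0-9], but it also accepts the underscore.
word_chars = re.findall(r"\w", "a_1")
alnum_chars = re.findall(r"[a-zA-Z0-9]", "a_1")

print(digits, sorted(set(not_lowercase)))
print(word_chars, alnum_chars)
```

In this sentence, the only characters that are not lowercase letters are the spaces, the apostrophe, and the digit 1.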
Quantifiers
For this part, the sentence will be
In 2022, I would like to fix my sleep schedule
Quantifiers help you match a character occurring a specific number of times, depending on your needs. They can be used after both single characters and ranges.
- * matches the previous character between 0 and unlimited times
With s[le]* , we indicate we want an s, followed by any number of l’s and e’s in any order ( including none at all ): s, sl, se, seel, seele, slee…
- ? matches the previous character 0 or 1 times. It is useful when you’re trying to match something that is optional ( for example the day in a date: you could say January 2022 without specifying the day )
With s[le]?, we indicate we want the s, and maybe one of the characters between brackets ( only one ! ). You could then have: s, se, or sl.
- + matches the previous character at least once
With s[le]+, we indicate we want the s, followed by at least one character that is either l or e, in any order. It could then be sl, se, sle, sel, seel, slee…
- {x} means you want a character to repeat x times, whereas {x,y} means you want it to repeat between x and y times
e{2} means we want the e to appear twice (ee). If we wanted it to appear between one and two times, we would have written e{1,2}. Try it yourself!
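To see the quantifiers side by side, here is a minimal sketch with Python’s re module on the sentence from this part ( Python and the variable names are my own choice for illustration ):

```python
import re

text = "In 2022, I would like to fix my sleep schedule"

# * : an s followed by zero or more l/e characters.
star = re.findall(r"s[le]*", text)

# ? : an s followed by at most one l/e character.
optional = re.findall(r"s[le]?", text)

# + : an s followed by at least one l/e character.
plus = re.findall(r"s[le]+", text)

# {2} : exactly two e's in a row; {1,2} : one or two e's in a row.
exactly_two = re.findall(r"e{2}", text)
one_or_two = re.findall(r"e{1,2}", text)

print(star, optional, plus)
print(exactly_two, one_or_two)
```

Notice that the s of schedule matches with * and ? but disappears with +, because the next character ( c ) is neither l nor e.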
Note: when using the curly brackets with two arguments ( {x,y} ), there are two things to remember:
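- If you don’t specify x ( {,y} ), you’re looking for your character up to y times. Not every flavor supports this form ( Python does, JavaScript reads it as literal text ), so {0,y} is the safer way to write it
- If you don’t specify y ( {x,} ), you’re looking for your character at least x times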
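As a quick sketch of these open-ended bounds, here is how they behave in Python’s re module specifically ( other flavors may treat {,y} differently, so take this as an illustration for Python only ):

```python
import re

# {2,} means "at least two" e's in a row.
at_least_two = re.findall(r"e{2,}", "e ee eee")

# {,2} means "zero to two" in Python's re module.
# fullmatch checks that the WHOLE string fits the pattern.
candidates = ["", "e", "ee", "eee"]
up_to_two = [bool(re.fullmatch(r"e{,2}", s)) for s in candidates]

print(at_least_two)
print(up_to_two)
```

Only the three-letter candidate is rejected by e{,2}, since it goes over the upper bound.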
Selecting groups
For this part, we’ll use this date as our test string :
27/01/2022
Okay, we learned how to filter what we wanted, but how do we retrieve it in groups? For example, if we’re working with dates, how do we separate the day part, the month part, and the year part?
In Regex, we capture strings of characters between parentheses (). Let’s type this regex code :
(\d{2})\/(\d{2})\/(\d{4})
But before I tell you what it is, try to analyze it yourself ( hint: \d stands for any digit ). It may look scary, but it’s actually easy if you break it down.
So (\d{2}) means we want two digits and put them into a group. We do this twice, for the day and the month, separating the groups with a slash /. The backslash just before each slash is there to escape it: in the sandbox’s default flavor the slash is used as the pattern delimiter, so it can’t appear alone inside your regex code. By escaping it, you’re telling regex “don’t read the next character as a special one”. We then do the same thing once more, but with a 4 instead: (\d{4}). It means we’re taking 4 digits, for the year.
Now look at the right side of your page: you can see the match information. The digits are separated just as we wanted.
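The same capture can be sketched in Python’s re module, where the groups come back as a tuple. The names date_pattern, day, month and year are mine; also note that Python has no pattern delimiter, so the slash doesn’t need escaping there:

```python
import re

# Three capturing groups: day, month, year.
date_pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

match = date_pattern.fullmatch("27/01/2022")

# groups() returns the captured pieces in order.
day, month, year = match.groups()

print(day, month, year)
```

Each piece is still a string at this point, so you would convert it with int() before doing any arithmetic on it.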
But imagine we wanted to make the day an optional component of the date? When making groups, if you put ?: at the beginning of your parentheses like so (?:…), it creates a “non-capturing group” ( which means it won’t be in the match information tab ).
Why is that interesting? Because it lets us wrap the day and its slash together and put a quantifier on the whole thing, without adding an extra group to the results. And which quantifier have we been talking about for this occasion? That’s right, the question mark.
Then, this code makes the day optional :
(?:(\d{2})\/)?(\d{2})\/(\d{4})
We removed the day from the test string ( 01/2022 )
And it no longer appears in the match information, without breaking the match!
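Here is a minimal sketch of the optional day in Python’s re module ( the variable names are only illustrative ). When the day is missing, its group is still there, but it comes back as None:

```python
import re

# The non-capturing group (?:...)? makes "day + slash" optional as a whole.
optional_day = re.compile(r"(?:(\d{2})/)?(\d{2})/(\d{4})")

with_day = optional_day.fullmatch("27/01/2022").groups()
without_day = optional_day.fullmatch("01/2022").groups()

print(with_day)
print(without_day)
```

The month and year stay in the same positions in both results, which makes the output easy to handle whether or not the day was given.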
Extra-curriculum
Why not try to apply all of this knowledge to a short project to test your skills? In this article, I give you one problem to solve with a detailed solution in case you get stuck. You should give it a try, it’s by practicing that we retain things ;)
Conclusion
And with this, you can handle about 90% of what you’ll need regex for. For more specific scenarios, Google and the sandbox we’ve been using are your friends.
If you don’t remember some syntax or want to find something more specific, have a look at this cheat sheet.
Thanks for reading this article, I hope you found the information you were looking for. I wish you good luck with your data science or programming journey ❤