Are you proficient in Regex? If not, you may want to try this

Brice Vergnou
4 min read · Feb 17, 2022

In this article, I’ll walk you through a problem to solve with regex, and point you to resources in case you need to brush up first. Make sure to spend a good amount of time trying to solve the problem before looking at the solution; otherwise, you won’t learn anything.

Photo by Crew on Unsplash

This article is the continuation of Master regex hands-on, an article of mine where I teach regex through practice. Make sure to check it out if you need a refresher or if you get stuck during the exercise.

As a reminder, you can use Regex101 as a sandbox to practice regex; it is what we’ll be using in this project.

Project presentation

You’re working with a very large dataset that contains dates. But there is one problem: since the data is not always collected by the same person, the date format varies from one entry to another. Hence, you decide to use regex to handle these dates uniformly, regardless of their format.

Your goal is to “code” the patterns of these dates in regex so that you can capture them. Give a name to each group (day, month, year) so that, if you were doing this project for real, you could manipulate these dates more easily. To name a capturing group, just add ?<name> right after the opening parenthesis. Example:

Which captures 2010 in a group called “year”:
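As a minimal sketch of a named group, here is how capturing a year could look in Python. The pattern below is a hypothetical stand-in chosen for illustration (a four-digit year). Note that Python’s re module spells named groups (?P<name>...), whereas Regex101’s default flavor uses (?<name>...).

```python
import re

# Hypothetical illustration pattern: four digits captured in a group named "year".
# Python's re module uses (?P<year>...) where Regex101's default flavor uses (?<year>...).
year_pattern = re.compile(r"(?P<year>\d{4})")

match = year_pattern.search("2010")
year = match.group("year")  # read the capture by its name instead of its index
print(year)  # 2010
```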

Exercise

The date formats are:

  • 02/12/2003 ; 2/12/2013 ; 2/5/2006 ; 2/6/01
  • 02-12-2003 ; 2-12-2013 ; 2-5-2006 ; 2-6-01
  • 12/2010 ; 2010

The test strings are the dates listed right above. You can copy them from this Pastebin.

Solution

First, let’s handle full dates separated by slashes. Each part is made of digits (\d): one or two for the day and the month (\d{1,2}), and two to four for the year (\d{2,4}). That gives something like this:

\d{1,2}\/\d{1,2}\/\d{2,4}

(the backslash escapes the slash; without it, Regex101 treats the slash as the end of the pattern and reports an error)
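To try this first pattern outside Regex101, a small Python sketch could look like the following. The sample list and variable names are assumptions made for illustration. Python has no pattern delimiters, so the slash is written without a backslash here.

```python
import re

# Same idea as the pattern above; "/" needs no escaping in Python.
slash_date = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")

# Four slash-separated dates from the exercise, plus one dash-separated date.
samples = ["02/12/2003", "2/12/2013", "2/5/2006", "2/6/01", "02-12-2003"]

# fullmatch requires the whole string to fit the pattern.
matched = [s for s in samples if slash_date.fullmatch(s)]
print(matched)
```

Only the slash-separated dates end up in matched; the dash-separated one is left out at this stage.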

Let’s just add the named capturing groups (make sure not to capture the separators; we only want the numbers):

(?<day>\d{1,2})\/(?<month>\d{1,2})\/(?<year>\d{2,4})

It correctly captures the first format.
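For readers who prefer to check this in code, here is a short Python sketch of the same named groups. It assumes Python’s (?P<name>...) spelling in place of (?<name>...), and the variable names are made up for the example.

```python
import re

# Named groups for day, month and year; the slashes stay outside the groups.
named_slash_date = re.compile(
    r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{2,4})"
)

# groupdict() returns the captures keyed by group name.
parts = named_slash_date.fullmatch("2/12/2013").groupdict()
print(parts)  # {'day': '2', 'month': '12', 'year': '2013'}
```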

For the second format, we need to tell regex that the numbers are separated by either a slash or a dash. Square brackets (a character class) let us express this choice: they match any single character listed inside them. So we change:

\/

to:

[\/-]

which gives us:

(?<day>\d{1,2})[\/-](?<month>\d{1,2})[\/-](?<year>\d{2,4})
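The effect of the character class can be sketched in Python as follows. The sample strings and names are assumptions for illustration, and the groups use Python’s (?P<name>...) spelling. The dash is placed last inside the brackets so that it is read as a literal dash rather than as a range.

```python
import re

# [/-] matches one character that is either a slash or a dash.
date_with_separator = re.compile(
    r"(?P<day>\d{1,2})[/-](?P<month>\d{1,2})[/-](?P<year>\d{2,4})"
)

samples = ["02/12/2003", "2-6-01", "12/2010", "2010"]

# Map each sample to True/False depending on whether the whole string matches.
results = {s: bool(date_with_separator.fullmatch(s)) for s in samples}
print(results)
```

The full dates match with either separator, while 12/2010 and 2010 still fail because the day and month are mandatory at this stage. Each bracket is checked on its own, so a mixed string such as 2/6-01 would also be accepted.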

Now we need to handle the third format, where the day, or both the day and the month, are missing. In other words, these parts can appear zero or one time, which is exactly what the ? quantifier expresses. By changing

(?<day>\d{1,2})[\/-]

to

(?:(?<day>\d{1,2})[\/-])?

we wrap the day and the first separator in a non-capturing group and tell regex: “ok, we may need this, but if it doesn’t appear, it’s fine”. Similarly, we can wrap that optional group, the month, and the second separator in another optional non-capturing group, so that the month can be missing too, but only when the day is missing as well:

(?:(?:(?<day>\d{1,2})[\/-])?(?<month>\d{1,2})[\/-])?(?<year>\d{2,4})

Great, it captures everything! And if you head to the Match Information tab, you will see each date broken down into its day, month, and year groups.
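To reproduce that breakdown outside Regex101, here is a minimal Python sketch that runs a pattern with nested optional groups over the ten test strings. It assumes Python’s (?P<name>...) spelling, and the variable names are illustrative. Groups that did not take part in a match come back as None.

```python
import re

# Optional day nested inside an optional "day + month" block, then the year.
date_pattern = re.compile(
    r"(?:(?:(?P<day>\d{1,2})[/-])?(?P<month>\d{1,2})[/-])?(?P<year>\d{2,4})"
)

test_strings = [
    "02/12/2003", "2/12/2013", "2/5/2006", "2/6/01",
    "02-12-2003", "2-12-2013", "2-5-2006", "2-6-01",
    "12/2010", "2010",
]

# Parse every test string into a dict of its named groups.
parsed = {s: date_pattern.fullmatch(s).groupdict() for s in test_strings}

for text, groups in parsed.items():
    print(text, groups)
```

Using fullmatch on one date at a time is a choice made for this sketch. When scanning dates embedded in longer text, anchors or word boundaries would likely be needed to avoid matching part of a longer number.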

Extracurricular

If you want to learn more about Regex or even text mining in general, you can check the Applied Text Mining in Python course on Coursera.

You can join this course for free; you just won’t be able to get a certificate or have your assignments validated.

Conclusion

If you managed to go through the exercise without looking at the solution, congratulations! And even if you didn’t, you still learned a lot by experimenting and trying to figure out why your pattern wasn’t working. In both cases, you can try the course mentioned above if you think that text mining is important for your career.

Thanks for reading this article. Receiving messages from people I’ve helped motivates me to keep learning and making content. You can connect with me on Twitter or LinkedIn.
