Make Regular Expressions the Easy Way

Introduction and The Problem

Regular Expressions have been around for a long time and can solve problems nothing else really seems to be able to. There are a lot of great examples of where to use them, and where to not use them. Every once in a while, a new product (or version of a product) comes out which includes regex support. Some people get excited about being able to use regular expressions. They tell other people about it.

Then they realize most people seem to think Regular Expressions, or regex for short, are the hardest problem in coding.

There is a group of people who get nervous about regex, or really just do not want to work with any regular expressions. I do not blame them; it looks weird.

However, if I can become proficient at regex, so can you. The following is a set of regular expressions I created, some in part due to what I am recommending in this post.

My favorite regular expressions
Find duplicates
Find text wrapped with an asterisk in a unique way
Find the beginning and ending of sentences on a line (the wild horses regex)
Find different types of hashes for crypto types
Find valid Roman numerals

A note on what this is not

First, this is not a resource for learning Regular Expressions. It is, however, a recommendation for how to format things so regex is easier up front. I thought I had learned Regular Expressions a long time ago, but a few years back I spent time working through these websites to actually learn regex.

RexEgg
Regular-Expressions
Regex101

Regex101 is the best tester for regex around. There are multiple flavors for different languages, a quick reference, and a number of other things. You can even version control saved regexes, which encourages a very iterative approach. It’s not perfect, but it’s pretty darn good.

Getting Started

There are three key things to work through on a regex when you first start.

1) Getting your mind wrapped around the language and its many, many variants.
2) Getting your mind wrapped around writing something on a single line.
3) Remembering what the regex is doing six months from now.

And really, with the way a lot of websites and books teach it, the hard part is getting to the point where you can remember that this regex

^\d{2}\/\d{2}\/\d{4}$

would match a date of 02/02/2022.

(Yes, this will also match 45/55/9998; let’s not be pedantic quite yet.)
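To see both points in code, here is a minimal sketch in Python (my choice of language for illustration; the helper name looks_like_date is made up). It applies the compact pattern exactly as written and suggests that the regex only checks the shape of the text, not whether the date exists.

```python
import re

# The compact date pattern from the text, copied verbatim.
DATE_COMPACT = re.compile(r"^\d{2}\/\d{2}\/\d{4}$")


def looks_like_date(text: str) -> bool:
    """Return True when text has the NN/NN/NNNN shape (no calendar check)."""
    return DATE_COMPACT.match(text) is not None


print(looks_like_date("02/02/2022"))  # True
print(looks_like_date("45/55/9998"))  # True: right shape, impossible date
print(looks_like_date("2/2/2022"))    # False: each part needs its full width
```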

Trying to remember what each symbol or format of text does while also trying to learn how to apply a regex becomes problematic. I believe this is the first problem with regex: it’s not easy because it looks like a cat walked across your keyboard.

Next, it’s all on one line. If you get some kind of long regular expression going, it’s going to be hard to figure out what the regex is doing should you need to refer back to it. Even thirty minutes later. There are no comments, there is nothing here to help other than piecing something back together from the start. It’s… well it’s frustrating.

Instead of dealing with all of these issues from the start, I first teach people to write their regular expressions in this manner:

^                         # Our first anchor, beginning of the line.
[[:digit:]]{2}            # Any 2 digits in a row.
[[:punct:]]               # Any punctuation, also another anchor.
[[:digit:]]{2}            # Any 2 digits in a row.
[[:punct:]]               # Any punctuation, also another anchor.
[[:digit:]]{4}            # Any 4 digits in a row.
$                         # Our last anchor, end of the line.

A few things enable this. The comments and ignored whitespace are brought to you by /x mode. On regex101.com you can enable it by clicking at the right end of the regex line.

Next, the syntax is different. You can actually read it: digit is for numbers, punct is for punctuation. While starting to learn regex, ignore the fact that punct will also match ! or any number of other punctuation characters. The point is to start figuring out how regexes work and to make the process easier on yourself through this method. Once you do this, you can move on to this format.

^                         # Our first anchor, beginning of the line.
[[:digit:]]{2}            # Any 2 digits in a row.
\/                        # An escaped / character. Also an anchor
[[:digit:]]{2}            # Any 2 digits in a row.
\/                        # An escaped / character. Also an anchor
[[:digit:]]{4}            # Any 4 digits in a row.
$                         # Our last anchor, end of the line.

Regardless of using punct or a \/ to escape a slash, I have found this method of writing a regular expression to be a lot easier to read than what other websites teach. When learning to form a regular expression with some sample or test data, start with PCRE 1 or 2 so you can take advantage of this method.

This method makes the code so much more readable. The problem is that many flavors of regex do not support it.
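Python's built-in re module is one example of partial support: it offers comments and ignored whitespace through the re.VERBOSE flag, but it does not understand POSIX classes such as [[:digit:]]. As a rough sketch of how the commented date pattern might be carried over, assuming you are willing to write [0-9] in place of [[:digit:]] (the name DATE_VERBOSE is only for illustration):

```python
import re

# Python's re module has no POSIX bracket classes, so [0-9] stands in for
# [[:digit:]] here. re.VERBOSE plays the role of /x mode.
DATE_VERBOSE = re.compile(
    r"""
    ^           # Our first anchor, beginning of the line.
    [0-9]{2}    # Any 2 digits in a row.
    /           # A / character (no escaping needed in Python).
    [0-9]{2}    # Any 2 digits in a row.
    /           # A / character.
    [0-9]{4}    # Any 4 digits in a row.
    $           # Our last anchor, end of the line.
    """,
    re.VERBOSE,
)

print(bool(DATE_VERBOSE.match("02/02/2022")))  # True
print(bool(DATE_VERBOSE.match("02-02-2022")))  # False: separators must be slashes
```

The comments survive in the source file, which helps with the "six months from now" problem even in a flavor that lacks the readable class names.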

Look at these two examples, matching this sentence:

This is a test

Example 1:

^[a-zA-Z]+\s\w+\s[a-zA-Z]+\s\w+$

Example 2:

^                         # Our first anchor, beginning of the line.
[[:alpha:]]+              # Alphas
[[:blank:]]               # A blank and also an anchor

[[:alpha:]]+              # Alphas
[[:blank:]]               # A blank and also an anchor

[[:alpha:]]+              # Alphas
[[:blank:]]               # A blank and also an anchor

[[:alpha:]]+              # Alphas
$                         # Our last anchor, end of the line.

While both match the same line, which one would you be able to easily skim and understand if you’re new to regular expressions? I feel example 2 is the right approach to teaching this language to people who have either never seen it before, or who have but have a hard time remembering things.

When it’s time to start converting, create a cheat sheet, a script, or something else that replaces [[:alpha:]] with [A-Za-z], and lean on it until you’re comfortable writing regular expressions the way most people write them. Until then, this sort of syntax will really help you learn the mechanics of a regular expression without getting hung up on the syntax.

Here is the full list of human-readable character classes available to you (info copied from regex101).
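One possible shape for such a script, as a rough sketch in Python. The POSIX_TO_SET cheat sheet and the to_compact helper are hypothetical names, the replacements are the usual ASCII equivalents, and the comment stripping assumes the pattern contains no literal # and no literal whitespace.

```python
import re

# Hypothetical cheat sheet: POSIX bracket classes and their ASCII equivalents.
POSIX_TO_SET = {
    "[[:alnum:]]": "[A-Za-z0-9]",
    "[[:alpha:]]": "[A-Za-z]",
    "[[:blank:]]": r"[ \t]",
    "[[:digit:]]": "[0-9]",
    "[[:lower:]]": "[a-z]",
    "[[:upper:]]": "[A-Z]",
}

READABLE = r"""
^                 # Our first anchor, beginning of the line.
[[:alpha:]]+      # Alphas
[[:blank:]]       # A blank and also an anchor

[[:alpha:]]+      # Alphas
[[:blank:]]       # A blank and also an anchor

[[:alpha:]]+      # Alphas
[[:blank:]]       # A blank and also an anchor

[[:alpha:]]+      # Alphas
$                 # Our last anchor, end of the line.
"""


def to_compact(verbose_pattern: str) -> str:
    """Collapse a commented, multi-line pattern into a one-line regex.

    Assumes no literal '#' or literal whitespace in the pattern, and that
    each POSIX class stands alone in its own pair of brackets.
    """
    pieces = []
    for line in verbose_pattern.splitlines():
        code = line.split("#", 1)[0]          # drop the trailing comment
        pieces.append("".join(code.split()))  # drop the ignored whitespace
    compact = "".join(pieces)
    for posix, plain in POSIX_TO_SET.items():
        compact = compact.replace(posix, plain)
    return compact


compact = to_compact(READABLE)
print(compact)  # ^[A-Za-z]+[ \t][A-Za-z]+[ \t][A-Za-z]+[ \t][A-Za-z]+$
print(bool(re.match(compact, "This is a test")))  # True
```

Keeping the readable version as the source of truth and generating the one-liner from it means the comments never drift out of date.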

Character Class   Definition                                   Explanation
[[:alnum:]]       Letters and Digits                           An alternate way to match any letter or digit. Equivalent to [A-Za-z0-9].
[[:alpha:]]       Letters                                      An alternate way to match alphabet letters. Equivalent to [A-Za-z].
[[:ascii:]]       ASCII codes 0-127                            Matches any character in the valid ASCII range. Equivalent to [\x00-\x7F].
[[:blank:]]       Space or tab only                            Matches spaces and tabs (but not newlines). Equivalent to [ \t].
[[:cntrl:]]       Control Characters                           Matches characters that are often used to control text presentation, including newlines, null characters, tabs and the escape character. Equivalent to [\x00-\x1F\x7F].
[[:digit:]]       Decimal Digits                               Matches decimal digits. Equivalent to [0-9] or \d.
[[:graph:]]       Visible Characters (not whitespace/blanks)   Matches printable, non-whitespace, non-control characters only. Equivalent to [\x21-\x7E].
[[:lower:]]       Lowercase letters                            Matches lowercase letters. Equivalent to [a-z].
[[:print:]]       Visible characters
[[:punct:]]       Visible punctuation characters
[[:space:]]       Whitespace
[[:upper:]]       Uppercase letters
[[:word:]]        Word characters
[[:xdigit:]]      Hexadecimal digits

The double square brackets are not a typo; POSIX notation demands them.

Making Two Sets of Examples

One method, which may seem obvious to some, is to test against two sets of strings. One is the set of strings you want to match. The other is the set you do not want to match.

The non-obvious part is how you lay them out. Ideally you would use regex101.com to test with. Here is how I lay it out on regex101:

MATCH THESE
example 1
example 2
example 3


DO NOT MATCH THESE

exmple1
exmple2
exmple3

The regex for this is one of the following two options. If you were brand new to regex, which one would be easier for you to write? Read back later? Understand?

^example\ \d$

or:

^
example
[[:blank:]]
[[:digit:]]
$
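If you would rather keep the two lists in a script than on regex101, something like the following could work. It is only a sketch in Python: the check helper and the list names are made up for illustration, and the pattern is the one-line version with a plain space, since Python's re module does not support [[:blank:]] or [[:digit:]].

```python
import re

# One-line version of the pattern, written for Python's re module.
PATTERN = re.compile(r"^example \d$")

MATCH_THESE = ["example 1", "example 2", "example 3"]
DO_NOT_MATCH_THESE = ["exmple1", "exmple2", "exmple3"]


def check(pattern, should_match, should_not_match):
    """Return the strings the pattern gets wrong; an empty list means success."""
    missed = [s for s in should_match if not pattern.search(s)]
    false_hits = [s for s in should_not_match if pattern.search(s)]
    return missed + false_hits


print(check(PATTERN, MATCH_THESE, DO_NOT_MATCH_THESE))  # []
```

Every time you tweak the regex, rerun the check: anything it prints is either a string you meant to match and missed, or one you matched by accident.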

Combining these techniques, you can learn how regex works a lot more easily, and then work on memorizing the syntax as you go.

 