Learn Regex the Easy Way, Part 5: The Dot, Escaping, and Special Characters

Quick Recap #

In Part 4, we covered quantifiers: * (zero or more), + (one or more), ? (optional), and {n,m} (specific counts). We also learned that quantifiers apply to the element directly before them. Now let’s talk about the characters regex treats as special and how to handle them.

The 14 Metacharacters #

Regex has 14 characters that have special meaning. These are called metacharacters:

.  *  +  ?  ^  $  {  }  [  ]  (  )  |  \

When the engine sees any of these, it doesn’t treat them as literal characters. It interprets them as instructions. We’ve already used several: ^ and $ as anchors, * + ? as quantifiers, [] for character classes.

One note: this lesson isn’t about what each metacharacter does. Future lessons will cover their uses in more detail.

Escaping with Backslash #

What if you actually want to match a literal dot, or a literal asterisk? You put a backslash \ before it. This is called escaping. The backslash tells the engine “treat the next character as a plain, literal character.”

You Type        It Matches
\.              A literal period/dot
\*              A literal asterisk
\+              A literal plus sign
\?              A literal question mark
\^              A literal caret
\$              A literal dollar sign
\{ and \}       Literal curly braces
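
To see escaping in action, here is a minimal sketch using Python’s built-in re module. The sample strings are made up for illustration, and the patterns are written as raw strings (r"...") so that Python passes the backslash through to the regex engine untouched.

```python
import re

# Each escaped metacharacter paired with a made-up sample string.
samples = {
    r"\.": "end.",
    r"\*": "2*3",
    r"\+": "2+3",
    r"\?": "why?",
    r"\^": "x^2",
    r"\$": "$5",
    r"\{": "{a}",
    r"\}": "{a}",
}

for pattern, text in samples.items():
    found = re.search(pattern, text).group()
    print(f"{pattern} finds {found!r} in {text!r}")

# Without the backslash, + is a quantifier: "one or more 2s, then a 3".
unescaped = re.search(r"2+3", "2+3")   # the text has no "23", so no match
escaped = re.search(r"2\+3", "2+3")    # matches the literal text "2+3"
print(unescaped, escaped.group())
```

The last two lines show why escaping matters: the same three characters mean different things depending on whether the plus sign is escaped.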

Here is how a regex with an escaped dot looks:

[[:digit:]]     # A digit
\.              # A literal dot
[[:digit:]]     # Another digit

Compact: [[:digit:]]\.[[:digit:]]

MATCH THESE

  • 3.14 (matches "3.1")
  • 127.0.0.1 (matches "7.0" and "0.1")

DO NOT MATCH THESE

  • 3x14
  • 127x0x0x1
  • filetxt
  • file.txt (the dot is there, but "e" and "t" are not digits)
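
A quick way to check a pattern like this is to run it against the sample strings. The sketch below uses Python’s re module, which does not support POSIX classes such as [[:digit:]], so [0-9] stands in for it here. The name DIGIT_DOT_DIGIT is just an illustrative choice.

```python
import re

# Python's re has no [[:digit:]], so [0-9] stands in for it here.
DIGIT_DOT_DIGIT = re.compile(r"[0-9]\.[0-9]")

for text in ["3.14", "127.0.0.1", "file.txt", "3x14", "127x0x0x1", "filetxt"]:
    # findall returns every non-overlapping match, or [] when there is none.
    print(f"{text!r:12} -> {DIGIT_DOT_DIGIT.findall(text)}")
```

Because the pattern asks for exactly one digit on each side of the dot, 127.0.0.1 yields two short matches ("7.0" and "0.1") rather than one long one, and any string without a digit on both sides of a dot yields none.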

The Dot (.) as a Wildcard #

Without a backslash, the dot is a wildcard. It matches any single character except a newline (by default).

c               # Literal "c"
.               # ANY single character (wildcard)
t               # Literal "t"

Compact: c.t

MATCH THESE

  • cat
  • cot
  • cut
  • c9t
  • c!t

DO NOT MATCH THESE

  • ct (no character between c and t)
  • cart (two characters between c and t)
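
The wildcard behavior, including the newline exception, can be checked with a short sketch in Python’s re module. The name WILDCARD is illustrative, and re.DOTALL is Python’s flag for letting the dot match newlines too; other engines spell that option differently.

```python
import re

WILDCARD = re.compile(r"c.t")

for text in ["cat", "cot", "cut", "c9t", "c!t", "ct", "cart"]:
    print(f"{text!r:7} -> {bool(WILDCARD.search(text))}")

# By default the dot does not match a newline...
default_result = re.search(r"c.t", "c\nt")
# ...unless the re.DOTALL flag is set.
dotall_result = re.search(r"c.t", "c\nt", re.DOTALL)
print(default_result is None, dotall_result is not None)
```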

The dot is tempting because it matches almost anything, but that’s also its weakness. If you know you’re looking for a letter, use [[:alpha:]]. If you know it’s a digit, use [[:digit:]]. Be specific. Reaching for the dot by default is sloppy regex writing and often matches things you didn’t intend.
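
To make the difference concrete, the sketch below compares the wildcard with a more specific pattern. It is written for Python’s re module, which lacks [[:alpha:]], so [A-Za-z] is used as an ASCII-only approximation. The names loose and strict are illustrative.

```python
import re

loose = re.compile(r"c.t")          # any character in the middle
strict = re.compile(r"c[A-Za-z]t")  # ASCII letters only, approximating [[:alpha:]]

for text in ["cat", "c9t", "c!t", "c t"]:
    loose_hit = bool(loose.search(text))
    strict_hit = bool(strict.search(text))
    print(f"{text!r:6} loose={loose_hit} strict={strict_hit}")
```

The loose pattern accepts all four strings, while the strict one accepts only "cat".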

Practical Example: Matching Prices #

\$              # Literal dollar sign (escaped)
[[:digit:]]+    # One or more digits (dollars)
\.              # Literal dot (escaped)
[[:digit:]]{2}  # Exactly 2 digits (cents)

Compact: \$[[:digit:]]+\.[[:digit:]]{2}

MATCH THESE

  • $19.99
  • $5.00
  • $1234.56

DO NOT MATCH THESE

  • 19.99 (no dollar sign)
  • $19 (no cents)
  • $19.9 (only one cent digit)
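
The price pattern can be tried out with a short Python sketch. Python’s re.VERBOSE flag ignores whitespace inside the pattern and allows # comments, so the pattern can keep the same commented layout used above. As an assumption for this sketch, [0-9] replaces [[:digit:]], which Python’s re module does not support, and the name PRICE is illustrative.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
PRICE = re.compile(r"""
    \$          # Literal dollar sign (escaped)
    [0-9]+      # One or more digits (dollars); stands in for [[:digit:]]+
    \.          # Literal dot (escaped)
    [0-9]{2}    # Exactly 2 digits (cents)
""", re.VERBOSE)

for text in ["$19.99", "$5.00", "$1234.56", "19.99", "$19", "$19.9"]:
    match = PRICE.search(text)
    print(f"{text!r:11} -> {match.group() if match else None}")
```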

What to Practice #

  1. Write a regex that matches an IP address format: four groups of digits separated by literal dots (don’t worry about number ranges yet).
  2. Write a regex that matches a question mark at the end of a line. (You’ll need to escape it AND use an anchor.)
  3. Write a regex using the dot wildcard that matches any three-character string starting with “a” and ending with “z”.
  4. Rewrite the regex from exercise 3 to be more specific: match “a” followed by exactly one lowercase letter, followed by “z”.
