Learn Regex the Easy Way, Part 3: Character Classes

Quick Recap #

In Part 2, we learned about anchors and boundaries. The caret ^ matches the start of a line, $ matches the end, and \b matches word edges. They match positions, not characters. Now let’s match actual characters, but with more control than just typing them out literally.

Square Brackets Create a Character Class #

A character class is a set of characters inside square brackets: [abc]. It matches any ONE character from that set. Not all of them. Not some of them. Exactly one character from the options inside the brackets.

[abc]   # Match a single character: a, b, or c

MATCH THESE

  • a
  • b
  • c

DO NOT MATCH THESE

  • d
  • e
  • ab
  • abc

Notice that [abc] on its own does NOT match all of “ab” or “abc.” It matches exactly one character. If you search text that contains “ab,” the regex matches just the “a” (the first character from the set it finds); the “b” would be a separate, second match.
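To see the one-character rule in action, here is a minimal sketch using Python's built-in re module. The choice of Python and the sample strings are assumptions for illustration; the article itself does not target a specific language.

```python
import re

char_class = re.compile(r"[abc]")

# fullmatch succeeds only if the whole string is exactly one character from the set.
for text in ["a", "b", "c", "d", "ab", "abc"]:
    print(f"{text!r}: {bool(char_class.fullmatch(text))}")

# search stops at the first position where the class matches.
first = char_class.search("ab")
print(first.group())  # a

# findall returns every match, and each match is a single character.
print(char_class.findall("cabbage"))  # ['c', 'a', 'b', 'b', 'a']
```

The fullmatch loop mirrors the MATCH / DO NOT MATCH lists above, while search and findall show what happens when the class is applied to longer text.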

Ranges #

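You can use a dash to specify a range of characters. For example, [a-z] matches any one lowercase letter, [A-Z] any one uppercase letter, and [0-9] any one digit. Ranges can be combined in a single class, so [a-zA-Z0-9] matches any one letter or digit.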
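As a rough illustration of ranges, the sketch below uses Python's re module (an assumed choice of language) with the ranges [a-z], [0-9], and a combined hex-digit class. The sample strings are made up for the example.

```python
import re

lowercase = re.compile(r"[a-z]")         # any one lowercase ASCII letter
digit = re.compile(r"[0-9]")             # any one ASCII digit
hex_digit = re.compile(r"[0-9a-fA-F]")   # several ranges combined in one class

print(lowercase.findall("Regex 101"))    # ['e', 'g', 'e', 'x']
print(digit.findall("Regex 101"))        # ['1', '0', '1']
print(hex_digit.findall("#FF8800"))      # ['F', 'F', '8', '8', '0', '0']

# A dash placed first or last in the class is a literal dash, not a range.
dash_or_digit = re.compile(r"[0-9-]")
print(dash_or_digit.findall("555-0199")) # ['5', '5', '5', '-', '0', '1', '9', '9']
```

Each class still matches one character at a time; the range only changes which characters are in the set.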

POSIX Character Classes #

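Ranges work, but they’re not very readable. POSIX character classes are named sets that make your regex much clearer. A name like [:alpha:] is only valid inside a bracket expression, which is why each one appears with double brackets:

POSIX Class     Matches                  Equivalent Range (ASCII)
[[:alpha:]]     Any letter               [a-zA-Z]
[[:digit:]]     Any digit                [0-9]
[[:alnum:]]     Any letter or digit      [a-zA-Z0-9]
[[:lower:]]     Lowercase letters        [a-z]
[[:upper:]]     Uppercase letters        [A-Z]
[[:space:]]     Whitespace characters    [ \t\n\r\f\v]
[[:blank:]]     Space and tab only       [ \t]
[[:punct:]]     Punctuation              Various symbols
[[:xdigit:]]    Hexadecimal digits       [a-fA-F0-9]
[[:word:]]      Word characters          [a-zA-Z0-9_]

The equivalent ranges assume ASCII text; in other locales, classes such as [[:alpha:]] can also match accented letters. [[:word:]] is not part of the POSIX standard. It is an extension that some engines (PCRE and Ruby, for example) support. Not every regex engine understands POSIX classes at all: JavaScript and Python's re module do not, so use the equivalent range there.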
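Python's built-in re module does not understand POSIX bracket classes, so one way to experiment with them there is to translate each name into its ASCII range first. The sketch below does that with a small lookup table; POSIX_RANGES and expand_posix are hypothetical names invented for this example, [:punct:] is left out for brevity, and the whitespace entry includes the vertical tab (\v).

```python
import re

# ASCII ranges based on the table above. These names are the POSIX class names;
# the dictionary itself is an illustrative helper, not part of any library.
POSIX_RANGES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "space": r" \t\n\r\f\v",
    "blank": r" \t",
    "xdigit": "a-fA-F0-9",
    "word": "a-zA-Z0-9_",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] in the pattern with its ASCII range.

    Raises KeyError for a class name that is not in POSIX_RANGES.
    """
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_RANGES[m.group(1)], pattern)

print(expand_posix("[[:alpha:]]"))          # [a-zA-Z]
print(expand_posix("[[:lower:][:digit:]_]")) # [a-z0-9_]

# The expanded pattern can then be used with re as usual.
hex_color = re.compile(expand_posix("[[:xdigit:]]{6}"))
print(bool(hex_color.fullmatch("ff8800")))  # True
```

Engines that support POSIX classes natively (grep -E, for instance) do not need this step; the translation only makes the equivalence in the table concrete.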

Negation: The ^ Inside Brackets #

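Here’s something that trips people up. The caret ^ does two completely different things depending on where it is. Outside brackets, it anchors the match to the start of a line (as covered in Part 2). As the first character inside brackets, it negates the class, so the class matches any one character that is not listed: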

[^abc]   # Match any single character that is NOT a, b, or c

MATCH THESE

  • d
  • e
  • f
  • 1
  • !

DO NOT MATCH THESE

  • a
  • b
  • c
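The two roles of the caret can be checked side by side. This is a minimal sketch in Python's re module (an assumed choice of language), with sample strings made up for the example.

```python
import re

not_abc = re.compile(r"[^abc]")

# One character that is not a, b, or c matches; a, b, and c do not.
for text in ["d", "1", "!", "a", "b"]:
    print(f"{text!r}: {bool(not_abc.fullmatch(text))}")

# Only a leading caret negates. Anywhere else in the class it is a literal caret.
literal_caret = re.compile(r"[a^]")
print(literal_caret.findall("a^b"))         # ['a', '^']

# Outside brackets the caret is still the start-of-line anchor from Part 2.
starts_with_other = re.compile(r"^[^abc]")
print(bool(starts_with_other.search("dab")))  # True
print(bool(starts_with_other.search("bad")))  # False
```

Note that a negated class still has to consume a character: [^abc] does not match an empty string.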

Shorthand Character Classes #

You’ll also see shorthand characters. These are quicker to type but less readable. The uppercase version is always the negation:

Shorthand    POSIX Equivalent    Meaning
\d           [[:digit:]]         Any digit
\D           [^[:digit:]]        NOT a digit
\w           [[:word:]]          Word character (letter, digit, underscore)
\W           [^[:word:]]         NOT a word character
\s           [[:space:]]         Whitespace
\S           [^[:space:]]        NOT whitespace
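The shorthands can be tried out with a short sketch in Python's re module (an assumed choice of language; the sample text is made up). One caveat worth knowing: in Python, \d and \w match Unicode digits and letters by default, so the re.ASCII flag is used here to get the ASCII-only behavior that the table describes.

```python
import re

text = "Order 42 shipped_on 2024-01-15"

print(re.findall(r"\d+", text, flags=re.ASCII))  # ['42', '2024', '01', '15']
print(re.findall(r"\w+", text, flags=re.ASCII))  # ['Order', '42', 'shipped_on', '2024', '01', '15']
print(re.findall(r"\s", text))                   # [' ', ' ', ' ']

# The uppercase form is the negation: strip everything that is NOT a digit.
print(re.sub(r"\D", "", text))                   # 4220240115

# Without re.ASCII, \d also accepts digits from other scripts.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r"\d", arabic_three)))                  # True
print(bool(re.fullmatch(r"\d", arabic_three, flags=re.ASCII)))  # False
```

Whether a shorthand is ASCII-only or Unicode-aware varies between engines, so it is worth checking the documentation of the one you use.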

Practical Example: Validating Usernames #

Let’s match valid usernames: lowercase letters, digits, and underscores only, 3 to 20 characters long.

^                         # Start of line
[[:lower:][:digit:]_]     # Lowercase letter, digit, or underscore
{3,20}                    # Between 3 and 20 of those characters
$                         # End of line

Compact: ^[[:lower:][:digit:]_]{3,20}$

MATCH THESE

  • john_doe
  • user123
  • admin

DO NOT MATCH THESE

  • John_Doe (uppercase)
  • hi (too short)
  • user@name (invalid character)
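The username rule can be turned into a small validator. The sketch below uses Python's re module, which is an assumption on my part; because that module does not support POSIX classes, [[:lower:][:digit:]_] is written as the equivalent ASCII range [a-z0-9_]. The names USERNAME_RE and is_valid_username are hypothetical.

```python
import re

# ASCII equivalent of [[:lower:][:digit:]_]{3,20}
USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")

def is_valid_username(name: str) -> bool:
    """Return True if name is 3 to 20 lowercase letters, digits, or underscores."""
    # fullmatch requires the pattern to cover the entire string,
    # so no explicit ^ and $ anchors are needed.
    return USERNAME_RE.fullmatch(name) is not None

for candidate in ["john_doe", "user123", "admin", "John_Doe", "hi", "user@name"]:
    print(f"{candidate!r}: {is_valid_username(candidate)}")
```

Using fullmatch instead of ^ and $ is a deliberate choice here: in Python, $ also matches just before a trailing newline, so a pattern anchored with ^...$ would accept "admin\n", while fullmatch rejects it.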

What to Practice #

  1. Write a character class that matches any vowel (a, e, i, o, u). Test it on a paragraph of text.
  2. Write a regex using [[:xdigit:]] that matches a single hex digit. Then modify it to match exactly 6 hex digits (like a color code).
  3. Write a negated character class that matches anything that is NOT a digit.
  4. Write a regex that matches a single uppercase letter followed by one or more lowercase letters (like a capitalized name).
