Learn Regex the Easy Way, Part 3: Character Classes
Quick Recap #
In Part 2, we learned about anchors and boundaries. The caret ^ matches the start of a line, $ matches the end, and \b matches word edges. They match positions, not characters. Now let’s match actual characters, but with more control than just typing them out literally.
Square Brackets Create a Character Class #
A character class is a set of characters inside square brackets: [abc]. It matches any ONE character from that set. Not all of them. Not some of them. Exactly one character from the options inside the brackets.
[abc] # Match a single character: a, b, or c
MATCH THESE
a
b
c
DO NOT MATCH THESE
d
e
ab
abc
Notice that [abc] does NOT match “ab” or “abc” as a whole. It matches exactly one character. If your text contains “ab,” the first match is just the “a” (the first character from the set that the engine finds). A search for all matches then reports “b” as a second, separate match.
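To see the "exactly one character" rule in action, here is a minimal sketch using Python's built-in re module. The choice of Python is an assumption for illustration; the series itself is not tied to one language.

```python
import re

single = re.compile(r"[abc]")

# A whole-string match succeeds only for a one-character string from the set.
print(bool(single.fullmatch("a")))   # True
print(bool(single.fullmatch("ab")))  # False: the class consumes one character only
print(bool(single.fullmatch("d")))   # False: d is not in the set

# An unanchored search over "ab" finds two separate one-character matches.
print(single.search("ab").group())   # a
print(single.findall("ab"))          # ['a', 'b']
```

The difference between fullmatch and findall is the difference between "does the whole text match?" and "where in the text does the pattern match?".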
Ranges #
You can use a dash to specify a range of characters:
- [a-z] matches any lowercase letter
- [A-Z] matches any uppercase letter
- [0-9] matches any digit
- [a-zA-Z0-9] matches any letter or digit (you can combine ranges)
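The ranges above can be tried out with a short sketch like this one. It uses Python's re module, and the sample text is made up for illustration.

```python
import re

text = "Room 42B, floor 3"  # hypothetical sample text

lower = re.findall(r"[a-z]", text)
upper = re.findall(r"[A-Z]", text)
digits = re.findall(r"[0-9]", text)
alnum = re.findall(r"[a-zA-Z0-9]", text)

print(upper)           # ['R', 'B']
print(digits)          # ['4', '2', '3']
print("".join(lower))  # oomfloor
print("".join(alnum))  # Room42Bfloor3

# A dash placed first or last in the brackets is a literal dash, not a range.
print(re.findall(r"[a-z-]", "x-y"))  # ['x', '-', 'y']
```

Each call returns one list entry per matched character, which again shows that a class matches a single character at a time.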
POSIX Character Classes #
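Ranges work, but they’re not very readable. POSIX character classes are named sets that make your regex much clearer. A name like [:alpha:] is only valid inside a bracket expression, which is why the examples below have double brackets. POSIX tools such as grep and sed support these classes, as does PCRE, but some engines (JavaScript and Python's re module, for example) do not recognize them.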
| POSIX Class | Matches | Equivalent Range |
|---|---|---|
| [[:alpha:]] | Any letter | [a-zA-Z] |
| [[:digit:]] | Any digit | [0-9] |
| [[:alnum:]] | Any letter or digit | [a-zA-Z0-9] |
| [[:lower:]] | Lowercase letters | [a-z] |
| [[:upper:]] | Uppercase letters | [A-Z] |
| [[:space:]] | Whitespace characters | [ \t\n\r\f\v] |
| [[:blank:]] | Space and tab only | [ \t] |
| [[:punct:]] | Punctuation | Various symbols |
| [[:xdigit:]] | Hexadecimal digits | [a-fA-F0-9] |
| [[:word:]] | Word characters (a Perl/PCRE extension, not part of standard POSIX) | [a-zA-Z0-9_] |
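If your engine lacks POSIX classes, the "Equivalent Range" column is what you fall back on. Python's re module is one such engine, so the sketch below builds bracket expressions from the ASCII ranges in the table. The POSIX_TO_RANGE dictionary and the posix_class helper are hypothetical names made up for this example, not part of any library.

```python
import re

# Hypothetical lookup table: POSIX class name -> ASCII range usable in Python's re.
POSIX_TO_RANGE = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "blank": r" \t",
    "xdigit": "a-fA-F0-9",
}

def posix_class(*names):
    """Build a bracket expression from one or more POSIX class names."""
    return "[" + "".join(POSIX_TO_RANGE[name] for name in names) + "]"

print(posix_class("xdigit"))          # [a-fA-F0-9]
print(posix_class("lower", "digit"))  # [a-z0-9]

hex_digit = re.compile(posix_class("xdigit"))
print(bool(hex_digit.fullmatch("F")))  # True
print(bool(hex_digit.fullmatch("g")))  # False
```

Combining names in one call mirrors how [[:lower:][:digit:]] combines two POSIX classes inside a single pair of outer brackets.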
Negation: The ^ Inside Brackets #
Here’s something that trips people up. The caret ^ does two completely different things depending on where it is:
- Outside brackets: ^ is the “start of line” anchor (Part 2)
- Inside brackets at the start: [^abc] means “NOT a, b, or c”
[^abc] # Match any single character that is NOT a, b, or c
MATCH THESE
d
e
f
1
!
DO NOT MATCH THESE
a
b
c
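The two meanings of the caret are easy to compare side by side. This is a small sketch in Python's re module (an assumed choice of engine), with made-up input strings:

```python
import re

not_abc = re.compile(r"[^abc]")

# Inside brackets at the start: negation.
print(not_abc.findall("abcdef1!"))  # ['d', 'e', 'f', '1', '!']

# Outside brackets: start-of-line anchor. Here both meanings appear together.
print(re.findall(r"^[^abc]", "dab"))  # ['d']
print(re.findall(r"^[^abc]", "abd"))  # []

# A caret that is not the first character inside brackets is a literal caret.
print(re.findall(r"[a^]", "a^b"))  # ['a', '^']

# A negated class matches anything not listed, including a newline.
print(not_abc.findall("a\n"))  # ['\n']
```

The last line is worth remembering: a negated class is "everything except the set", so whitespace and newlines count as matches unless you exclude them too.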
Shorthand Character Classes #
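You’ll also see shorthand characters. These are quicker to type but less readable. They come from Perl rather than POSIX, and the equivalences below assume ASCII matching: Unicode-aware engines let \d, \w, and \s match non-ASCII characters too. The uppercase version is always the negation: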
| Shorthand | POSIX Equivalent | Meaning |
|---|---|---|
| \d | [[:digit:]] | Any digit |
| \D | [^[:digit:]] | NOT a digit |
| \w | [[:word:]] | Word character (letter, digit, underscore) |
| \W | [^[:word:]] | NOT a word character |
| \s | [[:space:]] | Whitespace |
| \S | [^[:space:]] | NOT whitespace |
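Here is a rough sketch of the shorthands at work, again assuming Python's re module. One caveat that this engine makes visible: on text strings Python treats \d, \w, and \s as Unicode-aware by default, so they only line up with the ASCII ranges in the table when the re.ASCII flag is set.

```python
import re

sample = "id_7 = 42\t(ok)"  # hypothetical sample text containing a tab

print(re.findall(r"\d", sample))     # ['7', '4', '2']
print(re.findall(r"\w+", sample))    # ['id_7', '42', 'ok']
print(re.findall(r"\s", sample))     # [' ', ' ', '\t']
print(re.findall(r"\S+", sample))    # ['id_7', '=', '42', '(ok)']
print(re.findall(r"\D+", "a1b22c"))  # ['a', 'b', 'c']

# Unicode versus ASCII matching: \u00e9 is an accented e.
accented = "caf\u00e9"
unicode_words = re.findall(r"\w+", accented)          # the whole word, accent included
ascii_words = re.findall(r"\w+", accented, re.ASCII)  # ['caf']
print(unicode_words, ascii_words)
```

If you need \w to mean exactly [a-zA-Z0-9_], either pass the ASCII flag (where the engine has one) or write the range out explicitly.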
Practical Example: Validating Usernames #
Let’s match valid usernames: lowercase letters, digits, and underscores only, 3 to 20 characters long.
^ # Start of line
[[:lower:][:digit:]_] # Lowercase letter, digit, or underscore
{3,20} # Between 3 and 20 of those characters
$ # End of line
Compact: ^[[:lower:][:digit:]_]{3,20}$
MATCH THESE
john_doe
user123
admin
DO NOT MATCH THESE
John_Doe (uppercase)
hi (too short)
user@name (invalid character)
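The username rule can be checked in code with something like the sketch below. It assumes Python's re module, which has no POSIX classes, so [[:lower:][:digit:]_] is written as the equivalent range [a-z0-9_]. The function name is_valid_username is made up for this example.

```python
import re

# Equivalent of ^[[:lower:][:digit:]_]{3,20}$ using ASCII ranges.
USERNAME = re.compile(r"[a-z0-9_]{3,20}")

def is_valid_username(name):
    """Return True if name is 3 to 20 lowercase letters, digits, or underscores."""
    # fullmatch requires the whole string to match, playing the role of ^ and $.
    return USERNAME.fullmatch(name) is not None

for candidate in ["john_doe", "user123", "admin", "John_Doe", "hi", "user@name"]:
    print(candidate, is_valid_username(candidate))
# john_doe True
# user123 True
# admin True
# John_Doe False
# hi False
# user@name False
```

Using fullmatch instead of ^ and $ is a deliberate choice here: in Python, $ also matches just before a trailing newline, so a pattern anchored with $ would accept "admin\n", which is rarely what you want when validating input.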
What to Practice #
- Write a character class that matches any vowel (a, e, i, o, u). Test it on a paragraph of text.
- Write a regex using [[:xdigit:]] that matches a single hex digit. Then modify it to match exactly 6 hex digits (like a color code).
- Write a negated character class that matches anything that is NOT a digit.
- Write a regex that matches a single uppercase letter followed by one or more lowercase letters (like a capitalized name).
Definitions #
- Anchor: A regex element that matches a position in the text, not a character.
- Character Class: A set of characters in square brackets [...] that matches any ONE character from the set.
- Line Anchor (^, $): The caret ^ matches the start of a line; the dollar sign $ matches the end.
- Metacharacter: A character with special meaning in regex instead of representing itself literally.
- Negation (caret inside brackets): Placing ^ as the first character inside square brackets, like [^abc], negates the class so it matches any character NOT in the set.
- POSIX Character Class: A named character class like [[:digit:]] or [[:alpha:]] that uses a readable name instead of a raw range.
- Position (in regex context): A point between characters in the text. Anchors and boundaries match positions.
- Range (in character classes): A dash inside brackets specifying a contiguous set: [a-z] means all lowercase letters from a through z.
- Shorthand Character Class: Abbreviated character classes like \d (digit), \w (word character), and \s (whitespace). Uppercase versions negate them.
- Word Boundary (\b): Matches the position between a word character and a non-word character.
Series Navigation #
- Part 1: Make Regular Expressions the Easy Way
- Part 2: Anchors and Boundaries
- Part 3: Character Classes (this post)