A Simple Regular Expression Tutorial

(Updated on 05/31/2017)

 

Regular expressions are one of the most efficient ways to search for patterns and manipulate text.

This tutorial is for those who struggle with their syntax, which may seem very complicated at first glance. If you can't read an RE yet, this post is for you!

 

 

What’s This Post About

Regular expressions can be a tool of great value or a source of torment for any developer. When used well, they make working with text easier, whether you are deciding how your code reads its input or rewriting whole sections of source code.

Since it all depends on how they are designed and where they are used, a certain degree of creativity is needed. The desire to solve all kinds of problems with REs is common, but this line of thinking can produce patterns that are too big and too difficult to understand. Knowing when to avoid them is just as important as knowing how to apply them.

With this in mind, this post introduces the most important metacharacters and serves as a quick guide for when you need to recall the particular syntax that some program or programming language employs.

Metacharacters Across Different Programs

Before presenting the usage of each metacharacter, the image below shows how each one is used by different programs or programming languages.

It is worth mentioning that the syntax of the following metacharacters never changes: *, [], [^], ^, $, and \.

Metacharacters

 Dot

Metacharacter: .

How It Works:

The dot matches any single character, exactly once. The only exception is the line break (\n or \r\n).

To match a literal dot, escape it with a backslash: \.

To match any character in other quantities, use the metacharacters ?, +, *, or  {}.

When To Use:

When you want to match any character in that exact position.

Example:
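
A minimal sketch using Python's re module (the choice of Python and the sample text are assumptions made for illustration; the tutorial itself is not tied to one language). It shows the dot matching any single character except a line break, and the escaped dot matching only a literal dot.

```python
import re

# Sample text made up for this illustration.
text = "cat cot c\nt c.t"

# The dot stands for exactly one character, but not for a line break.
any_char = re.findall(r"c.t", text)

# An escaped dot only matches a literal ".".
literal_dot = re.findall(r"c\.t", text)

print(any_char)     # ['cat', 'cot', 'c.t']
print(literal_dot)  # ['c.t']
```

Note that "c\nt" is skipped by the first pattern because the character between c and t is a line break.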

Character Classes

Metacharacters: [ ]

How It Works:

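It’s a list of possibilities for that position, matching any one of them exactly once. Within the list, most metacharacters lose their special meaning and are treated as literals; the exceptions (such as -, ], and ^) are covered in the remarks below.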

You can use the interval representation, as in  [0-9] (from zero to nine) or  [A-Z] (all capital letters), but keep in mind that all ranges are relative to the ASCII table!

E.g.: [0-9] is equivalent to [0123456789]

To match any element of the character class in other quantities, use the metacharacters ?, +, *, or  {}.

When To Use:

When there is more than one option that you can match in that exact position.

Example:
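
A short sketch in Python's re module (sample strings are made up for illustration). It shows a simple list, a literal hyphen placed last, a literal ] placed first, and why [A-z] is wider than it looks. The ] placement works in Python; other interpreters may require escaping it instead.

```python
import re

text = "gray grey groy gr-y gr]y"

# One position, two accepted letters.
either_vowel = re.findall(r"gr[ae]y", text)

# A hyphen placed last in the list is a literal hyphen, not a range.
with_hyphen = re.findall(r"gr[ae-]y", text)

# A closing bracket placed first in the list is a literal "]".
with_bracket = re.findall(r"gr[]ae]y", text)

# [A-z] follows the ASCII order, so it also accepts "_" and "^".
too_wide = re.findall(r"[A-z]", "a_Z^9")
letters_only = re.findall(r"[A-Za-z]", "a_Z^9")

print(either_vowel)  # ['gray', 'grey']
print(with_hyphen)   # ['gray', 'grey', 'gr-y']
print(with_bracket)  # ['gray', 'grey', 'gr]y']
print(too_wide)      # ['a', '_', 'Z', '^']
print(letters_only)  # ['a', 'Z']
```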

Important Remarks:

As stated, the range follows the order defined in the ASCII table. This means that, to match all the letters, you’d use [A-Za-z] and not  [A-z] because you’d be including, in addition to all the letters, the characters [, \, ], ^, _, and  `.

Since the character - denotes a range within the list, to match it literally it should be the last (or the first) member of the character class.

E.g.: [A-Z-]

A literal ] must be the first element of the list, or else it will mark the end of the character class.

E.g.: []a-z]

A literal ^ cannot be the first element of the list (it would result in a negated character class).

E.g.: [0-9^]

Besides that, POSIX has defined some lists, and many programs offer shorthand metacharacters, to make your life easier. Support for them varies between languages and programs.

Negated Character Classes

Metacharacters: [^ ]

How It Works:

Unlike the previous one, the negated character class lists the options that must not appear in that position; it matches any single character that is not in the list.

Its behavior is the same as the regular character class, but with inverted logic.

To match characters outside the list in other quantities, use the metacharacters ?, +, *, or {}.

When To Use:

When there is more than one option that must not match in that exact position.

Example:
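
A brief sketch in Python's re module (the sample string is made up). Each pattern matches one character that is not in the list; the second and third show the literal hyphen and the literal ] inside a negated list, as accepted by Python.

```python
import re

text = "a1-b2]c3"

# Any single character that is not a digit.
not_digits = re.findall(r"[^0-9]", text)

# A hyphen placed last is literal, so it is excluded too.
not_digits_or_hyphen = re.findall(r"[^0-9-]", text)

# A "]" placed right after "^" is literal.
not_bracket_or_lowercase = re.findall(r"[^]a-z]", text)

print(not_digits)                # ['a', '-', 'b', ']', 'c']
print(not_digits_or_hyphen)      # ['a', 'b', ']', 'c']
print(not_bracket_or_lowercase)  # ['1', '-', '2', '3']
```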

Important Remarks:

As stated, the range follows the order defined in the ASCII table. This means that, to avoid matching any letter, you’d use [^A-Za-z] and not [^A-z], because the latter would also exclude the characters [, \, ], ^, _, and `.

As the character  - within the list is used to denote a range, to match it literally it should be the last member of the negated character class.

E.g.: [^A-Z-]

A literal ] must come right after the ^, or else it will mark the end of the negated character class.

E.g.: [^]a-z]

 Grouping

Metacharacters: ( )

How It Works:

It defines groups within an RE. Within any given group, you can place characters, metacharacters and even other groups. Groups are numbered in the order of their opening parentheses.

They are quantifiable and very useful as they can be used in conjunction with other metacharacters since each group is treated as a single element inside the RE.

To match any group in other quantities, use the metacharacters ?, +, *, or  {}.

To choose between a group and another element of the RE, use the metacharacter  | after  ).

When To Use:

In addition to saving chunks of an RE, grouping lets you apply the metacharacters ?, +, *, {}, or | to a whole part of the RE.

Examples:
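
A small sketch in Python's re module (the sample strings are made up). The first pattern quantifies a whole group; the second shows how nested groups are numbered by the position of their opening parenthesis.

```python
import re

# A quantifier placed after ")" applies to the whole group.
repeated = re.fullmatch(r"(ab)+", "ababab")

# Group 1 opens first, so it is the outer group; groups 2 and 3 are nested in it.
date = re.search(r"(([0-9]{4})-([0-9]{2}))-([0-9]{2})", "Updated on 2017-05-31")

print(repeated.group(0))  # ababab
print(date.group(1))      # 2017-05
print(date.group(2))      # 2017
print(date.group(3))      # 05
print(date.group(4))      # 31
```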

Backreference

Metacharacters: \1

How It Works:

It only works together with one or more groups. It matches again the exact text that a group has already matched. The basic syntax allows at most nine references, \1 through \9.

Some interpreters allow the use of named groups or the possibility to address 10 or more groups in their references.

When To Use:

1. To save part of the RE to be reused in the same RE;
2. To save part of the match to be used afterwards (e.g.: in substitutions).

Example:
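
A sketch of both uses in Python's re module (the sample sentence is made up, and \b and \w are described later in this post). The same pattern first finds doubled words by reusing group 1 inside the RE, then reuses the saved text in a substitution.

```python
import re

text = "this is is a test of of the the parser"

# \1 must match the same text that group 1 just matched.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In the replacement string, \1 inserts the text saved by group 1.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)  # ['is', 'of', 'the']
print(fixed)    # this is a test of the parser
```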

Alternation

Metacharacter: |

How It Works:

Enables more than one option to match in that position.

Very useful inside groups since the part that will be saved can change!

When To Use:

When you need one RE or another.

Example:
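
A short sketch in Python's re module (sample strings are made up). It shows alternation between two whole REs, alternation limited by a group, and what happens when the group is left out.

```python
import re

# Either RE may match at each position.
pets = re.findall(r"cat|dog", "I have a cat, a dog and a bird.")

# Inside a group, the saved part changes with the option that matched.
saved = [re.search(r"gr(a|e)y", word).group(1) for word in ("gray", "grey")]

# Without the group, | splits the whole RE into "gra" or "ey".
loose = re.findall(r"gra|ey", "gray grey")

print(pets)   # ['cat', 'dog']
print(saved)  # ['a', 'e']
print(loose)  # ['gra', 'ey']
```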

Optional

Metacharacter: ?

Ungreedy Version: ??

How It Works:

Informs that the previous element may or may not be present.

As there is more than one possibility, the regular behavior is to match as much as possible; that is, ? prefers to match the element when it is present. To match as little as possible, that is, to prefer skipping the element, use ??.

When To Use:

When you want to match an element zero or one time.

Example:
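
A minimal sketch in Python's re module (sample strings are made up), comparing the regular and the ungreedy version of the optional metacharacter.

```python
import re

# The "u" may or may not be present.
both = re.findall(r"colou?r", "color colour")

# Greedy: the optional "u" is taken when it is there.
greedy = re.match(r"colou?", "colour").group(0)

# Ungreedy: the optional "u" is skipped when the match can succeed without it.
lazy = re.match(r"colou??", "colour").group(0)

print(both)    # ['color', 'colour']
print(greedy)  # colou
print(lazy)    # colo
```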

Plus

Metacharacter: +

Ungreedy Version: +?

How It Works:

Informs that the previous element must be present at least once and may be repeated any number of times.

As there is more than one possibility, the regular behavior is to match as much as possible, that is, + will match until the very last occurrence is found. In order to match as little as possible, use +?.

When To Use:

When you want to match an element one or more times.

Example:
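
A sketch in Python's re module (sample strings are made up). The first pattern shows that at least one occurrence is required; the other two contrast the greedy and the ungreedy version.

```python
import re

# "ggle" has no "o", so it does not match.
one_or_more = re.findall(r"go+gle", "ggle gogle google")

tags = "<b>bold</b> and <i>italic</i>"

# Greedy: runs until the very last ">".
greedy = re.findall(r"<.+>", tags)

# Ungreedy: stops at the first ">" that allows a match.
lazy = re.findall(r"<.+?>", tags)

print(one_or_more)  # ['gogle', 'google']
print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(lazy)         # ['<b>', '</b>', '<i>', '</i>']
```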

 Star

Metacharacter: *

Ungreedy Version: *?

How It Works:

Informs that the previous element may be absent or be present any number of times.

As there is more than one possibility, the regular behavior is to match as much as possible, that is, * will match until the very last occurrence is found. In order to match as little as possible, use *?.

When To Use:

When you want to match an element zero, one or more times.

Example:
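
A sketch in Python's re module (sample strings are made up). Unlike +, the star also accepts zero occurrences; the last two lines contrast the greedy and the ungreedy version.

```python
import re

# Zero, one or two "o" characters are all accepted.
zero_or_more = re.findall(r"go*gle", "ggle gogle google")

# Greedy: .* runs up to the last "b".
greedy = re.match(r"a.*b", "aXbXb").group(0)

# Ungreedy: .*? stops at the first "b".
lazy = re.match(r"a.*?b", "aXbXb").group(0)

print(zero_or_more)  # ['ggle', 'gogle', 'google']
print(greedy)        # aXbXb
print(lazy)          # aXb
```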

 Curly Brackets

Metacharacters: {min,max}

Ungreedy Version: {min,max}?

How It Works:

Informs that the element can be present from min to max times.

When the minimum value is omitted ( {,max}), the RE matches the element from zero to max times. Not every interpreter accepts this form; {0,max} is the portable way to write it.

When the maximum value is omitted ( {min,}), the RE matches the element at least min times.

When only one value is used without a comma ( {num}), the RE matches the element exactly num times.

As there is more than one possibility, the regular behavior is to match as much as possible; that is, {min,max} prefers to match up to max times. To match as little as possible, that is, to prefer matching only min times, use {min,max}?.

When To Use:

When you want to specify how many times an element can match.

Example:
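
A sketch in Python's re module (the sample strings are made up). Each pattern is tested against whole strings with re.fullmatch, so only the number of digits decides the result; the last two lines contrast the greedy and the ungreedy version.

```python
import re

samples = ["1", "12", "123", "1234", "12345"]

# {num}: exactly three digits.
exactly_three = [s for s in samples if re.fullmatch(r"[0-9]{3}", s)]

# {min,max}: from two to four digits.
two_to_four = [s for s in samples if re.fullmatch(r"[0-9]{2,4}", s)]

# {min,}: at least four digits.
at_least_four = [s for s in samples if re.fullmatch(r"[0-9]{4,}", s)]

# Greedy takes up to max; ungreedy stops at min.
greedy = re.match(r"[0-9]{2,4}", "12345").group(0)
lazy = re.match(r"[0-9]{2,4}?", "12345").group(0)

print(exactly_three)  # ['123']
print(two_to_four)    # ['12', '123', '1234']
print(at_least_four)  # ['1234', '12345']
print(greedy)         # 1234
print(lazy)           # 12
```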

Important Remarks:

{0,1} is equivalent to ?

{0,1}? is equivalent to ??

{1,} is equivalent to +

{1,}? is equivalent to +?

{0,} is equivalent to *

{0,}? is equivalent to *?

Caret

Metacharacter: ^

How It Works:

If it’s the first character of the RE, it marks the beginning of a line.

When To Use:

To inform that the RE should start matching at the very beginning of the line to be valid.

Example:
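
A sketch in Python's re module (the sample text is made up). One Python-specific detail to be aware of: by default ^ anchors at the beginning of the whole string, and the re.MULTILINE flag is what makes it anchor at the beginning of every line. Other interpreters may have different defaults.

```python
import re

text = "first line\nsecond line"

# Default in Python: ^ only matches at the start of the string.
string_start = re.findall(r"^[a-z]+", text)

# With re.MULTILINE, ^ matches at the start of each line.
line_starts = re.findall(r"^[a-z]+", text, flags=re.MULTILINE)

# "line" never starts a line, so an anchored search finds nothing.
anchored_miss = re.search(r"^line", text, flags=re.MULTILINE)

print(string_start)   # ['first']
print(line_starts)    # ['first', 'second']
print(anchored_miss)  # None
```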

Dollar Sign

Metacharacter: $

How It Works:

If it’s the last character of the RE, it marks the end of a line ( \n or  \r\n).

When To Use:

To inform that the RE must finish matching at the very end of the line to be valid.

Example:
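
A sketch in Python's re module (the sample text is made up). As a Python-specific detail, $ anchors at the end of the whole string by default, and the re.MULTILINE flag makes it anchor at the end of every line. Other interpreters may behave differently.

```python
import re

text = "one apple\ntwo pears"

# Default in Python: $ only matches at the end of the string.
string_end = re.findall(r"[a-z]+$", text)

# With re.MULTILINE, $ matches at the end of each line.
line_ends = re.findall(r"[a-z]+$", text, flags=re.MULTILINE)

print(string_end)  # ['pears']
print(line_ends)   # ['apple', 'pears']
```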

Word Boundaries

Metacharacters: \b

How It Works:

It’s used to delimit the boundary of a word.

A word character is usually one of [A-Za-z0-9_], the same set that \w matches; depending on the interpreter, the underscore may be left out.

When To Use:

When you want to make sure the RE matches the left or right boundary of a word.

Examples:
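
A sketch in Python's re module (the sample text is made up), anchoring the same word on both sides, on the left only, and on the right only.

```python
import re

text = "cat concat cats catalog"

# Boundary on both sides: the whole word only.
whole_word = re.findall(r"\bcat\b", text)

# Boundary on the left: words that start with "cat".
starts_with = re.findall(r"\bcat[a-z]*", text)

# Boundary on the right: words that end with "cat".
ends_with = re.findall(r"[a-z]*cat\b", text)

print(whole_word)   # ['cat']
print(starts_with)  # ['cat', 'cats', 'catalog']
print(ends_with)    # ['cat', 'concat']
```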

Escape

Metacharacter: \

How It Works:

Used to transform a metacharacter into a literal character.

In some programs it works the other way around: a character must be escaped to act as a metacharacter (such as a group in vi: \(...\)).

When To Use:

When you need to match the literal characters:

. : \.
* : \*
[ : \[
] : \]
( : \(
) : \)
| : \|
? : \?
+ : \+
{ : \{
} : \}
^ : \^
$ : \$
\ : \\

Example:
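
A sketch in Python's re module (the sample strings are made up). It escapes $ and . by hand, shows what an unescaped dot lets through, and uses re.escape, a function from Python's standard library that escapes a whole string for you.

```python
import re

text = "Total: $3.50 (approx.)"

# \$ and \. are literal characters here, not metacharacters.
price = re.findall(r"\$[0-9]+\.[0-9]+", text)

# An unescaped dot accepts any character in that position.
loose = re.findall(r"3.50", "3.50 3x50 3550")
strict = re.findall(r"3\.50", "3.50 3x50 3550")

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape("(approx.)")
found = re.search(escaped, text)

print(price)           # ['$3.50']
print(loose)           # ['3.50', '3x50', '3550']
print(strict)          # ['3.50']
print(found.group(0))  # (approx.)
```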

POSIX Classes

To ease the use of lists, POSIX predefined some of them. Each name goes inside the brackets of a character class, as in [[:upper:]].

All possibilities for use within a character class:

POSIX       Similar To                               Comment
[:upper:]   [A-Z]                                    Uppercase letters
[:lower:]   [a-z]                                    Lowercase letters
[:alpha:]   [A-Za-z]                                 Alphabetic characters
[:alnum:]   [A-Za-z0-9]                              Alphanumeric characters
[:word:]    [A-Za-z0-9_]                             Word characters (letters, numbers and underscores)
[:digit:]   [0-9]                                    Digits
[:xdigit:]  [0-9A-Fa-f]                              Hexadecimal digits
[:punct:]   [!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]  Punctuation and symbols
[:blank:]   [ \t]                                    Space and TAB
[:space:]   [ \t\n\r\f\v]                            All whitespace characters, including line breaks
[:cntrl:]   [\x00-\x1F\x7F]                          Control characters
[:graph:]   [\x21-\x7E]                              Visible characters (anything except spaces and control characters)
[:print:]   [\x20-\x7E]                              Printable characters (visible characters and the space)
[:ascii:]   [\x00-\x7F]                              All ASCII characters

Example:
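
Python's built-in re module does not understand POSIX class names, so the sketch below swaps in the ASCII equivalents from the "Similar To" column before compiling. The names POSIX_EQUIVALENTS and expand_posix are hypothetical helpers written for this illustration, not part of any library; programs with native POSIX support (grep, for example) accept the original pattern directly.

```python
import re

# Hypothetical lookup table: a subset of the "Similar To" column above.
POSIX_EQUIVALENTS = {
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:alpha:]": "A-Za-z",
    "[:alnum:]": "A-Za-z0-9",
    "[:digit:]": "0-9",
    "[:xdigit:]": "0-9A-Fa-f",
    "[:blank:]": r" \t",
}

def expand_posix(pattern: str) -> str:
    """Replace each POSIX class name with its ASCII equivalent."""
    for name, equivalent in POSIX_EQUIVALENTS.items():
        pattern = pattern.replace(name, equivalent)
    return pattern

# One uppercase letter followed by lowercase letters or digits.
expanded = expand_posix("[[:upper:]][[:lower:][:digit:]]+")
matches = re.findall(expanded, "Abc123 xyz Q7")

print(expanded)  # [A-Z][a-z0-9]+
print(matches)   # ['Abc123', 'Q7']
```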

All possibilities for use outside a character class:

Metacharacter  Translation    Name
\d             [[:digit:]]    Digit
\D             [^[:digit:]]   Not-Digit
\w             [[:alnum:]_]   Word
\W             [^[:alnum:]_]  Not-Word
\s             [[:space:]]    Space
\S             [^[:space:]]   Not-Space
Example:
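
A sketch of the shorthand metacharacters in Python's re module (sample strings are made up). One caveat specific to Python 3: for text patterns, \d, \w and \s also match non-ASCII digits, letters and whitespace unless the re.ASCII flag is given, so the translations in the table hold exactly only with that flag.

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"\d+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
non_digits = re.findall(r"\D+", text, flags=re.ASCII)

# \s+ treats any run of spaces, TABs or line breaks as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(non_digits)  # ['Room ', ', floor ']
print(fields)      # ['a', 'b', 'c', 'd']
```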

Miscellaneous Metacharacters

Finally, the following table presents some metacharacters that can be used outside character classes.

Since they are not standard, it’s necessary to check if the chosen interpreter implements each of them.

Metacharacter  Meaning                               Similar to
\a             Alphabetic                            [[:alpha:]]
\A             Not-Alphabetic                        [^[:alpha:]]
\h             Word Head                             [[:alpha:]_]
\H             Not-Word Head                         [^[:alpha:]_]
\l             Lowercase letters                     [[:lower:]]
\L             Not-Lowercase letters                 [^[:lower:]]
\N             Not-Line End                          [^\n]
\u             Uppercase letters                     [[:upper:]]
\U             Not-Uppercase letters                 [^[:upper:]]
\o             Octal Digit                           [0-7]
\O             Not-Octal Digit                       [^0-7]
\B             Not-Boundary
\A             Start of a buffer
\Z             End of a buffer
\l             Turn the next character to lowercase
\L             Turn to lowercase until \E
\u             Turn the next character to uppercase
\U             Turn to uppercase until \E
\Q             Escape characters until \E
\E             End of modification
\G             End of last match

Final Words

Don’t forget to leave any questions in the comments, in case you need some help, and good luck!