Regular expressions in R
Regular expressions in R are a very useful way to work with strings and the patterns found in them. For this exercise we are going to use the stringr package.
Regular expressions are expressions that describe patterns in strings. They are very useful to find general patterns instead of having to indicate every possible combination. For example, you can use regular expressions to find letters, numbers and other special characters.
Escaping special characters
When using regular expressions, you need to escape special characters. For example, special characters such as . or \ need to be escaped with a preceding \\. Thus, to look for a literal period in a string you would use the pattern \\. instead of a bare period. Other special characters, such as punctuation characters, parentheses and brackets, also need to be escaped.
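A short sketch of how escaping changes what a pattern matches. The example strings are made up for illustration; only stringr functions are used.

```r
library(stringr)

# Three made-up strings; the third one contains a single backslash (a\b)
x <- c("file.txt", "filetxt", "a\\b")

# An unescaped "." matches any character, so every non-empty string matches
str_detect(x, ".")
# TRUE TRUE TRUE

# "\\." matches only a literal period
str_detect(x, "\\.")
# TRUE FALSE FALSE

# A literal backslash needs four backslashes in the R string
str_detect(x, "\\\\")
# FALSE FALSE TRUE

# Parentheses are special too, so they are escaped to match them literally
str_extract("Price: 3.50 (approx.)", "\\(.*\\)")
# "(approx.)"
```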
Groups of characters
Regular expressions make it possible to look for groups of characters, such as letters, numbers or spaces. Such groups of characters are usually written [:group:]. Examples of these groups are:
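A few commonly used classes are [:alpha:] (letters), [:digit:] (digits), [:alnum:] (letters and digits), [:space:] (whitespace) and [:punct:] (punctuation). The sketch below uses a made-up string and writes each class in the double-bracket form [[:alpha:]], which is also the form base R expects.

```r
library(stringr)

# Made-up example string
x <- "Room 42, Floor 3!"

# First run of letters
str_extract(x, "[[:alpha:]]+")
# "Room"

# First run of digits
str_extract(x, "[[:digit:]]+")
# "42"

# Every punctuation character
str_extract_all(x, "[[:punct:]]")[[1]]
# "," "!"

# Number of whitespace characters
str_count(x, "[[:space:]]")
# 3
```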
Quantifiers
In addition to indicating groups of characters, you can indicate how many instances of a character or group of characters you are interested in finding. The quantifiers are:
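The standard quantifiers are ? (zero or one), * (zero or more), + (one or more), {n} (exactly n), {n,} (n or more) and {n,m} (between n and m). As a minimal sketch on a made-up string, each pattern below looks for a "c" followed by a different number of "a" characters:

```r
library(stringr)

# Made-up string: one "c", three "a", one "t"
x <- "caaat"

str_extract(x, "ca?")      # zero or one "a"   -> "ca"
str_extract(x, "ca*")      # zero or more "a"  -> "caaa"
str_extract(x, "ca+")      # one or more "a"   -> "caaa"
str_extract(x, "ca{2}")    # exactly two "a"   -> "caa"
str_extract(x, "ca{2,}")   # two or more "a"   -> "caaa"
str_extract(x, "ca{1,2}")  # one to two "a"    -> "caa"
```

Quantifiers are greedy by default, which is why the open-ended ones take all three "a" characters.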
Let’s do a simple example with tidyverse, which contains stringr. In this example we will use str_extract, which extracts only the first match of the indicated pattern. If you wish to extract all the matches, you can use str_extract_all and then unnest.
Resulting in the following:
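As a minimal sketch of this kind of extraction, assuming a small made-up data frame (the columns id and text are hypothetical and not taken from the original data):

```r
library(tidyverse)

# Hypothetical data: one row with two numbers, one row with none
df <- tibble(
  id   = 1:2,
  text = c("call 555 or 777", "no numbers here")
)

# str_extract keeps only the first match per row (NA when nothing matches)
first_match <- df %>%
  mutate(number = str_extract(text, "[[:digit:]]+"))
first_match$number
# "555" NA

# str_extract_all returns a list column; unnest gives one row per match
# and, by default, drops rows that have no match at all
all_matches <- df %>%
  mutate(number = str_extract_all(text, "[[:digit:]]+")) %>%
  unnest(number)
all_matches$number
# "555" "777"
```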
Position in string
Additional expressions can refer to the position of a pattern in a string, for example whether the pattern is at the start or at the end of the string.
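The two standard anchors are ^ (start of the string) and $ (end of the string). A short sketch with made-up strings:

```r
library(stringr)

# Made-up strings with "apple" in different positions
x <- c("apple pie", "pineapple", "apple")

# "apple" at the start of the string
str_detect(x, "^apple")
# TRUE FALSE TRUE

# "apple" at the end of the string
str_detect(x, "apple$")
# FALSE TRUE TRUE

# The whole string is exactly "apple"
str_detect(x, "^apple$")
# FALSE FALSE TRUE
```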
More specific groups
If you are not interested in any of the general groups of characters, you can create your own group of characters of interest. This can be done with the following expressions.
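Custom groups are written with square brackets: [abc] matches any one of the listed characters, [a-z] matches a range, and [^abc] matches anything except the listed characters. A minimal sketch on a made-up string:

```r
library(stringr)

# Made-up example string
x <- "Order #A17-b9"

# Any one of the listed characters
str_extract_all(x, "[abr]")[[1]]
# "r" "r" "b"

# A range: uppercase letters only
str_extract_all(x, "[A-Z]")[[1]]
# "O" "A"

# A range with a quantifier: first run of digits
str_extract(x, "[0-9]+")
# "17"

# Negation: anything that is not a letter, a digit or a space
str_extract_all(x, "[^A-Za-z0-9 ]")[[1]]
# "#" "-"
```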
Continuing with the example
Lookarounds
Lookarounds are used to require characters that precede or follow the pattern of interest without making them part of the match, which helps pin down the exact pattern we are interested in. There are four lookarounds:
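The four forms are the positive lookahead (?=...), the negative lookahead (?!...), the positive lookbehind (?<=...) and the negative lookbehind (?<!...). The sketch below uses made-up strings to show one case of each; note that the surrounding text is checked but never returned.

```r
library(stringr)

# Made-up example string
prices <- "USD100 and 40 units"

# Positive lookahead: digits that are followed by " units"
str_extract(prices, "[0-9]+(?= units)")
# "40"

# Positive lookbehind: digits that are preceded by "USD"
str_extract(prices, "(?<=USD)[0-9]+")
# "100"

# Negative lookahead: "ca" that is NOT followed by "t"
str_detect(c("cat", "car"), "ca(?!t)")
# FALSE TRUE

# Negative lookbehind: "happy" that is NOT preceded by "un"
str_detect(c("unhappy", "happy"), "(?<!un)happy")
# FALSE TRUE
```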
General groups used afterwards
In some cases, you are not interested just in extracting a string pattern, but want to reuse the precise string that was matched (instead of the general pattern). In these cases, you can define groups using () and then refer to each group by its order of appearance.
For example, in this case we will replace “lett” with the character captured by the first group, which is only an “e”.
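A minimal sketch of this replacement, assuming the word "letter" as input (the original input string is not shown, so this is an illustration only). Groups are referenced as \\1, \\2 and so on, both in the replacement and inside the pattern itself.

```r
library(stringr)

# "lett" is matched, and group 1 captures only the "e" inside it
str_replace("letter", "l(e)tt", "\\1")
# "eer"

# Two groups can be reordered in the replacement
str_replace("2024-05", "([0-9]{4})-([0-9]{2})", "\\2/\\1")
# "05/2024"

# A group can also be referenced inside the pattern:
# any character immediately followed by itself
str_detect(c("hello", "world"), "(.)\\1")
# TRUE FALSE
```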