Using Regular Expressions in R to Clean Data Faster

Regular expressions can speed up data cleaning because they automate the process of finding patterns within strings. This can be a huge time saver, especially with larger datasets. In this post, I’ll explain what regular expressions are and show you how they can be used to clean your own data.

How can regular expressions speed up the data cleaning process?

Why take the time to learn about regular expressions? Even if you only work with numeric data (although this seems unlikely), they can come in handy. Here are some of the ways you could use regular expressions to automate data cleaning:

  • Determine which of your columns end in the string “_total”
  • Locate all rows that start with “Over” and replace “Over” with “Above”
  • Count the number of times there is the string “fact” in any of your rows
  • Capitalize all rows within columns that start with “Country_”
  • Insert underscores into the middle of column names
  • Remove tildes found in any column names

These are only a few of the options available to you; I’m sure you’ll be able to find a way to use regular expressions to suit your specific needs 😎

If you tried to do any of the above manually using the positions of the rows or columns, it would likely take a long time and be prone to mistakes. Although some might try to do this in Excel, it would not be reproducible, so if you ever had to undo your changes or try to backtrack to a certain version of the file, it would be hard to figure out exactly what you did.

What are regular expressions?

Before we get into how to take advantage of regular expressions, let’s understand what they are. A regular expression (regex or regexp for short) is a “sequence of characters that define a search pattern.” This definition may seem confusing, but you probably apply the concept of regular expressions in your own life daily. For instance, if you wanted to find a certain podcast on Spotify by name, you wouldn’t go through a long list of every podcast available on Spotify to find it (that would be painful). You would just enter it into a search box, which would produce a list of results that match the “pattern.” Even if you only type in the last word of the podcast, Spotify will usually be able to find it for you. Essentially, regular expressions are like the words you enter into a search box to find the parts of your data that match those words.

Regular expressions can be really powerful because they are incredibly flexible. By flexible, I mean you can have fairly complicated and/or diverse patterns that you want to find in your data, and you will almost always be able to write a regular expression to find that pattern. Think of all the different ways someone could write an email address. They can contain letters, numbers, and exclamation points, among many other characters. The possibilities are nearly endless. Yet, there are ways to write a regular expression that will find almost all valid email addresses in your data (see here, here, and here for examples).

How do we use regular expressions in R?

Base R comes with several functions that work with regular expressions. Also, stringr is a package in the tidyverse that is exclusively dedicated to working with strings, and many of its functions are essentially equivalent to base R functions. Some would argue that the function names in stringr are more intuitive (which means less time spent looking up what each function does ❗), so I’ll provide examples of using regular expressions within the stringr package, but feel free to use the base R equivalent if that’s what you prefer (see here for a comparison between functions in base R and functions in stringr). Some examples of the functions that I’ve used for cleaning/analyzing data in stringr include:

  • str_which(): Where are the strings that match your regex? (output is a vector of the position for each matching string)
  • str_detect(): Do any of the strings match your regex? (output is a logical vector, TRUE for matches)
  • str_replace() or str_replace_all(): Replaces parts of strings that match your regex (output is an updated vector with replaced strings)

This list is far from exhaustive, but these are the functions I’ve found most helpful in my research. It is also worth mentioning that other functions accept regular expressions as arguments, even if they are not dedicated to working with strings/regular expressions. For instance, I’ve used pivot_longer to help with cleaning up repeated measures data through the names_pattern argument.

Regex in action: Example from my research

For a study I ran using Qualtrics, I examined how many multiplication problems subjects answered correctly relative to the amount of time they took to complete them, as a proxy for their performance on the task. To calculate average performance across all questions, I had to find all of the columns containing the amount of time each participant spent on a page before moving on to the next one, so that I could sum across those specific columns for my performance measure. Specifically, I was only interested in the columns labeled “prep_timingX_Page Submit” (note: I would avoid spaces in column names while analyzing your data in R; use - or _ instead). Unfortunately, the timing columns I was interested in were separated by other columns, for example:

Since the original raw dataset had 1268 rows and 1290 columns, searching for each of the "Page Submit" timing questions manually and selecting them by index number using either View(df) or data.frame(colnames(df)) would have been both 1) time-consuming and 2) prone to error. Also, if you later insert more columns before the ones you originally selected by number, your code will select the wrong columns.

So, I can use regular expressions and stringr (make sure to load this first) to find them! Here’s an example of how one might go about doing this:

Write the regular expression for the pattern you want to find

First, let’s start with writing the regular expression. What is the pattern that I’m hoping to find in the column names?

Well, I want the column names to start with "prep_timing". To indicate that I want a pattern to start with a certain character or multiple characters, we can use the ^ anchor. Anchors are useful when you want to match characters based on their location. Why would I want to match based on location? It can be helpful when you have many columns that have similar names, but you only want some that start with, end with, or are between certain characters. Within my own dataset, I also had columns that contained the words "prep_timing" but didn’t start with "prep_timing" (e.g., many of my columns start with the characters "extra_prep_timing"). I would want to exclude those columns because they are not part of the specific set of timing questions that I’m interested in.

Thus, the beginning of the regular expression would be ^prep_timing. The next part of the pattern should have a number and then a letter. However, there are several combinations of numbers and letters that I would want to include (e.g., I want columns starting with "prep_timing1a", "prep_timing2b", and so on). To indicate this in my regular expression, I would use square brackets as follows: ^prep_timing[0-9][a-b]. Square brackets indicate that you want a single character in that spot (e.g., I only want one letter, either "a" or "b"). However, because we want to match multiple numbers AND multiple letters, we use one set of square brackets for each type of character (in this case, one set of brackets for numbers and a second for letters, i.e., [0-9][a-b]) and indicate the range of acceptable numbers and letters using the - symbol. So far, the regular expression will match column names such as "prep_timing0a", "prep_timing9b", and "prep_timing5a". However, it wouldn’t match column names like "extra_prep_timing0a" (because of the ^ anchor) or "prep_timing10a" (because 10 is two digits, and one set of square brackets matches only a single character - see here for more info on how to include double-digit numbers in your regular expressions).

Now, the only thing we need to do is add the "_Page Submit" part of the pattern, because we don’t want any of the other timing questions included (e.g., "prep_timing2b_First Click" or "prep_timing3a_Last Click"). For this part, you just need to add that phrase verbatim, so the whole regular expression would be: ^prep_timing[0-9][a-b]_Page Submit$. You’ll note that I added a $ at the end of the regular expression. Like ^, $ is an anchor, but it means we only want to match strings that end with that pattern. This will exclude any columns that start with the same "prep_timing" pattern but have extra characters after "Page Submit" (e.g., "prep_timing6b_Page Submit40").

And that’s it! You’ve made it through your first regex (hopefully unscathed)!
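Before dropping the regex into a function, it can be worth sanity-checking it against a few hand-written names that should and shouldn’t match; here’s a quick sketch (the test names are invented to exercise each part of the pattern):

```r
library(stringr)

test_names <- c("prep_timing1a_Page Submit",       # should match
                "extra_prep_timing1a_Page Submit", # fails the ^ anchor
                "prep_timing10a_Page Submit",      # fails: two digits
                "prep_timing6b_Page Submit40")     # fails the $ anchor

str_detect(test_names, "^prep_timing[0-9][a-b]_Page Submit$")
#> [1]  TRUE FALSE FALSE FALSE
```

Only the first name matches, which confirms that each anchor and bracket is doing its job.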

Insert the regular expression into your preferred function

Now, we’ll just need to pop this regex into our function of choice. In my case, I wanted to find the index numbers of the columns so I could subset them from the dataframe, so I used str_which(). This function takes two main arguments: the first is the vector you want to search (e.g., a column of a data frame, or a dataframe’s column names), and the second is the regular expression/pattern you want to find within that vector. The output is the positions of the strings in the vector that match your desired pattern. The “vector” I was interested in was the column names of the dataframe, and we’ve already written the regular expression, so here goes:

str_which(names(raw), "^prep_timing[0-9][a-b]_Page Submit$")
##  [1]  77  87  91 101 105 115 119 129 133 143 147 157 161 171 175 185

And as you can see, I’ve found the positions of the columns that match the regex! Now, all I need to do is subset those columns using their positions, so we can just subset the dataframe using the above line of code, as follows:

raw[str_which(names(raw), "^prep_timing[0-9][a-b]_Page Submit$")]
## # A tibble: 1,268 x 16
##    `prep_timing1a_Pag~ `prep_timing1b_Pag~ `prep_timing2a_Pa~ `prep_timing2b_Pa~
##    <chr>               <chr>               <chr>              <chr>             
##  1 "Timing - Page Sub~ "Timing - Page Sub~ "Timing - Page Su~ "Timing - Page Su~
##  2 "{\"ImportId\":\"Q~ "{\"ImportId\":\"Q~ "{\"ImportId\":\"~ "{\"ImportId\":\"~
##  3  <NA>                <NA>                <NA>               <NA>             
##  4  <NA>                <NA>                <NA>               <NA>             
##  5  <NA>                <NA>                <NA>               <NA>             
##  6 "9.985"             "1.873"             "15.046"           "1.913"           
##  7 "15.555"            "4.152"             "18.289"           "7.695"           
##  8  <NA>                <NA>                <NA>               <NA>             
##  9  <NA>                <NA>                <NA>               <NA>             
## 10 "8.615"             "1.5"               "12.485"           "1.49"            
## # ... with 1,258 more rows, and 12 more variables:
## #   prep_timing3a_Page Submit <chr>, prep_timing3b_Page Submit <chr>,
## #   prep_timing4a_Page Submit <chr>, prep_timing4b_Page Submit <chr>,
## #   prep_timing5a_Page Submit <chr>, prep_timing5b_Page Submit <chr>,
## #   prep_timing6a_Page Submit <chr>, prep_timing6b_Page Submit <chr>,
## #   prep_timing7a_Page Submit <chr>, prep_timing7b_Page Submit <chr>,
## #   prep_timing8a_Page Submit <chr>, prep_timing8b_Page Submit <chr>

And we’ve successfully selected the columns from the dataset that we wanted with (relatively) little effort! Definitely beats trying to find them manually by index number.

Note that I am showing you the completely raw data for this example, before I removed the extra rows Qualtrics produces with the question text and the “Import Id” information. Those rows lead R to treat all of the numeric columns as character variables (hence the quotes around the numbers in the output above), even though they contain numeric data.

Then, I submitted the above piece of code to rowSums(), so I could find the total time spent answering questions across all of the matching columns, like so:

rowSums(raw[str_which(names(raw), "^prep_timing[0-9][a-b]_Page Submit$")], na.rm = TRUE)
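One caveat: rowSums() needs numeric input, and as noted above the raw Qualtrics export stores these columns as character. Here is a hedged sketch of how one might coerce the matched columns to numeric before summing; the toy `raw` data frame and its values are invented stand-ins for the real export:

```r
library(stringr)

# Toy stand-in for the raw Qualtrics export (values invented)
raw <- data.frame(
  `prep_timing1a_Page Submit` = c("9.985", "15.555"),
  `prep_timing1b_Page Submit` = c("1.873", "4.152"),
  other_col = c("x", "y"),
  check.names = FALSE
)

# Select the timing columns, coerce them to numeric, then sum by row
timing_cols <- raw[str_which(names(raw), "^prep_timing[0-9][a-b]_Page Submit$")]
rowSums(as.data.frame(lapply(timing_cols, as.numeric)), na.rm = TRUE)
#> [1] 11.858 19.707
```

Non-numeric entries (like the leftover Qualtrics header text) become NA during coercion, and na.rm = TRUE keeps them from derailing the sums.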

FYI, another way of doing this would be using str_detect(), which outputs a logical vector. You can then use this line to select your columns:

raw[str_detect(names(raw), "^prep_timing[0-9][a-b]_Page Submit$")]
## # A tibble: 1,268 x 16
##    `prep_timing1a_Pag~ `prep_timing1b_Pag~ `prep_timing2a_Pa~ `prep_timing2b_Pa~
##    <chr>               <chr>               <chr>              <chr>             
##  1 "Timing - Page Sub~ "Timing - Page Sub~ "Timing - Page Su~ "Timing - Page Su~
##  2 "{\"ImportId\":\"Q~ "{\"ImportId\":\"Q~ "{\"ImportId\":\"~ "{\"ImportId\":\"~
##  3  <NA>                <NA>                <NA>               <NA>             
##  4  <NA>                <NA>                <NA>               <NA>             
##  5  <NA>                <NA>                <NA>               <NA>             
##  6 "9.985"             "1.873"             "15.046"           "1.913"           
##  7 "15.555"            "4.152"             "18.289"           "7.695"           
##  8  <NA>                <NA>                <NA>               <NA>             
##  9  <NA>                <NA>                <NA>               <NA>             
## 10 "8.615"             "1.5"               "12.485"           "1.49"            
## # ... with 1,258 more rows, and 12 more variables:
## #   prep_timing3a_Page Submit <chr>, prep_timing3b_Page Submit <chr>,
## #   prep_timing4a_Page Submit <chr>, prep_timing4b_Page Submit <chr>,
## #   prep_timing5a_Page Submit <chr>, prep_timing5b_Page Submit <chr>,
## #   prep_timing6a_Page Submit <chr>, prep_timing6b_Page Submit <chr>,
## #   prep_timing7a_Page Submit <chr>, prep_timing7b_Page Submit <chr>,
## #   prep_timing8a_Page Submit <chr>, prep_timing8b_Page Submit <chr>

Identical output!

A final note: don’t worry if it takes some time to understand how to write regular expressions. Reading a regex can feel like deciphering an alien language at first, but eventually it starts to click. To help you make this transition, I’ve added some resources below that are/were helpful for me.

Other resources to learn more about using regular expressions:

Keana Richards
Doctoral researcher

Studying psychology and statistics at the University of Pennsylvania.
