tidyr::extract()

Function of the Week: tidyr::extract()

In this document, I will introduce the tidyr::extract() function and show what it’s for.

#load libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#extract() is in tidyr; this package is loaded by tidyverse
#example dataset is 'sentences' from stringr; this package is loaded by tidyverse

What is it for?

extract() uses a regular expression to match portions (“capturing groups”) of the character strings in a column, and puts each group into its own new column. If the regular expression doesn’t match, or the input is NA, the new columns will contain NA.

The basic syntax is: df %>% extract(character_column, c("Group1_name", "Group2_name"), "(regex for Group1)(regex for Group2)")

#Example 1: A simple example to see the syntax
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "e-e"))
df
##      x
## 1 <NA>
## 2  a-b
## 3  a-d
## 4  b-c
## 5  e-e
#extract the letter(s)/number(s) before and after the dash into "A" and "B" variables, respectively
#"[[:alnum:]]+" matches at least one alphanumeric character 
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
## 5    e    e
#If there is no match, the result is NA:
#note that "[a-d]+" matches at least one a, b, c, or d
#this regular expression doesn't match the last row of df ("e-e"), so that row returns NA
df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)")
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
## 5 <NA> <NA>
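
Two other arguments of extract() seem worth a quick sketch here: convert = TRUE, which runs type conversion on the new columns, and an NA in the vector of names, which drops that group from the output. The data frame df2 below is made up for illustration and is not part of the examples above.

```r
library(tidyr)

# Made-up data for illustration: a letter, a dash, then a number
df2 <- data.frame(x = c(NA, "a-1", "b-22", "c-3"))

# convert = TRUE turns the extracted digit strings into integers
converted <- extract(df2, x, c("letter", "number"),
                     "([[:alpha:]]+)-([[:digit:]]+)", convert = TRUE)
str(converted)

# An NA in the names vector drops that capturing group from the output
numbers_only <- extract(df2, x, c(NA, "number"),
                        "([[:alpha:]]+)-([[:digit:]]+)")
names(numbers_only)
```

Without convert = TRUE, every new column is character, even when it only holds digits.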
#Example 2: the Harvard Sentences (stringr::sentences) are "phonetically balanced" statements that are used to test audio systems, because they contain all the sounds heard in everyday language.
# for more information see ?sentences
length(sentences)
## [1] 720
head(sentences, 10)
##  [1] "The birch canoe slid on the smooth planks." 
##  [2] "Glue the sheet to the dark blue background."
##  [3] "It's easy to tell the depth of a well."     
##  [4] "These days a chicken leg is a rare dish."   
##  [5] "Rice is often served in round bowls."       
##  [6] "The juice of lemons makes fine punch."      
##  [7] "The box was thrown beside the parked truck."
##  [8] "The hogs were fed chopped corn and garbage."
##  [9] "Four hours of steady work faced us."        
## [10] "Large size in stockings is hard to sell."

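What if we want to try to pluck out some nouns? One way is to look for the word that follows “a” or “the”. This is going to give us a lot of adjectives too, but it’s a way to start. Note that the pattern below is case-sensitive, so a capitalized “The” at the start of a sentence is not matched.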

#make a tibble out of the sentences and then extract
tibble(sentence = sentences) %>%
   extract(sentence, c("article", "noun"), "(a|the) ([^ ]+)", remove = FALSE)
## # A tibble: 720 x 3
##    sentence                                    article noun   
##    <chr>                                       <chr>   <chr>  
##  1 The birch canoe slid on the smooth planks.  the     smooth 
##  2 Glue the sheet to the dark blue background. the     sheet  
##  3 It's easy to tell the depth of a well.      the     depth  
##  4 These days a chicken leg is a rare dish.    a       chicken
##  5 Rice is often served in round bowls.        <NA>    <NA>   
##  6 The juice of lemons makes fine punch.       <NA>    <NA>   
##  7 The box was thrown beside the parked truck. the     parked 
##  8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
##  9 Four hours of steady work faced us.         <NA>    <NA>   
## 10 Large size in stockings is hard to sell.    <NA>    <NA>   
## # ... with 710 more rows
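
One possible refinement, sketched here on a tiny and partly made-up set of sentences rather than the full data set: a word boundary (\\b) stops the pattern from matching an "a" at the end of a longer word, [Aa] and [Tt] accept either case, and [[:alpha:]]+ keeps trailing punctuation out of the second group. The names demo and articles are only for this illustration.

```r
library(tidyr)
library(tibble)

# Mini data set for illustration; "Tea" ends in "a", which a bare "(a|the) " would match
demo <- tibble(sentence = c("The birch canoe slid on the smooth planks.",
                            "Tea was served in a cup.",
                            "Rice is often served in round bowls."))

# \\b requires a word boundary before the article,
# and [[:alpha:]]+ captures letters only (no trailing period)
articles <- extract(demo, sentence, c("article", "noun"),
                    "\\b([Aa]|[Tt]he) ([[:alpha:]]+)", remove = FALSE)
articles
```

This still only picks up the word right after the article, so adjectives will keep showing up in the "noun" column.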

What if we want to find all the nouns that have color adjectives? Let’s find colors in the sentences and pluck out the color and the word after the color.

#make a list of color names and turn it into a single regular expression
colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
## [1] "red|orange|yellow|green|blue|purple"
#select sentences that contain a color
has_color <- str_subset(sentences, color_match)
length(has_color)
## [1] 57
head(has_color, 10)
##  [1] "Glue the sheet to the dark blue background."   
##  [2] "Two blue fish swam in the tank."               
##  [3] "The colt reared and threw the tall rider."     
##  [4] "The wide road shimmered in the hot sun."       
##  [5] "See the cat glaring at the scared mouse."      
##  [6] "A wisp of cloud hung in the blue air."         
##  [7] "Leaves turn brown and yellow in the fall."     
##  [8] "He ordered peach pie with ice cream."          
##  [9] "Pure bred poodles have curls."                 
## [10] "The spot on the blotter was made by green ink."
#extract the color and the word after the color.  
color_matches <- tibble(sentence = sentences) %>%
      extract(sentence, c("color", "noun"), 
              "(red|orange|yellow|green|blue|purple) ([^ ]+)", remove = FALSE)

#drop all the rows with no match.  You can see that "red" also matched inside words like "shimmered" and "scared"!
color_matches <- color_matches %>% drop_na()
head(color_matches, 10)
## # A tibble: 10 x 3
##    sentence                                       color  noun       
##    <chr>                                          <chr>  <chr>      
##  1 Glue the sheet to the dark blue background.    blue   background.
##  2 Two blue fish swam in the tank.                blue   fish       
##  3 The colt reared and threw the tall rider.      red    and        
##  4 The wide road shimmered in the hot sun.        red    in         
##  5 See the cat glaring at the scared mouse.       red    mouse.     
##  6 A wisp of cloud hung in the blue air.          blue   air.       
##  7 Leaves turn brown and yellow in the fall.      yellow in         
##  8 He ordered peach pie with ice cream.           red    peach      
##  9 Pure bred poodles have curls.                  red    poodles    
## 10 The spot on the blotter was made by green ink. green  ink.
#we can get rid of the words that accidentally contain "red" by tweaking the regex:
#now we require a space before "red" so we don't match portions of words
#(the space sits inside the capturing group, so the extracted color is " red" with a leading space)
color_matches <- tibble(sentence = sentences) %>%
      extract(sentence, c("color", "noun"),
              "([ ]red|orange|yellow|green|blue|purple) ([^ ]+)", remove = FALSE)
color_matches <- color_matches %>% drop_na()
color_matches
## # A tibble: 24 x 3
##    sentence                                       color    noun       
##    <chr>                                          <chr>    <chr>      
##  1 Glue the sheet to the dark blue background.    "blue"   background.
##  2 Two blue fish swam in the tank.                "blue"   fish       
##  3 A wisp of cloud hung in the blue air.          "blue"   air.       
##  4 Leaves turn brown and yellow in the fall.      "yellow" in         
##  5 The spot on the blotter was made by green ink. "green"  ink.       
##  6 The sofa cushion is red and of light weight.   " red"   and        
##  7 A blue crane is a tall wading bird.            "blue"   crane      
##  8 It is hard to erase blue or red ink.           "blue"   or         
##  9 The lamp shone with a steady green flame.      "green"  flame.     
## 10 The box is held by a bright red snapper.       " red"   snapper.   
## # ... with 14 more rows
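
An alternative to the leading space, shown here only as a sketch on three sentences taken from the output above: wrapping the color group in word boundaries (\\b) guards every color, not just "red", and keeps the space out of the extracted value. The names demo, pattern, and bounded are only for this illustration.

```r
library(tidyr)
library(tibble)

# Three sentences from the examples above: one with "red" inside a word, two with real colors
demo <- tibble(sentence = c("The colt reared and threw the tall rider.",
                            "The sofa cushion is red and of light weight.",
                            "Two blue fish swam in the tank."))

# Build the pattern: \\b on both sides means the color must be a whole word
color_match <- paste(c("red", "orange", "yellow", "green", "blue", "purple"),
                     collapse = "|")
pattern <- paste0("\\b(", color_match, ")\\b ([^ ]+)")

bounded <- extract(demo, sentence, c("color", "noun"), pattern, remove = FALSE)
bounded
```

With this pattern "reared" no longer counts as a match, and the color comes back as "red" rather than " red".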
#Note that extract() only returns the first match in a string! In this example, if there is a sentence with two colors, extract() will return only the first color and following word.
#example:  "It is hard to erase blue or red ink." returns only "blue" "or"
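
If every match in a string is needed rather than just the first, one option outside of extract() is stringr::str_match_all(), which returns one matrix per input string with a row for each match. This is a minimal sketch on the single sentence mentioned above, not a drop-in replacement for the pipeline.

```r
library(stringr)

sentence <- "It is hard to erase blue or red ink."
pattern <- "\\b(red|orange|yellow|green|blue|purple)\\b ([^ ]+)"

# str_match_all() returns a list with one character matrix per input string;
# column 1 is the full match, columns 2 and 3 are the two capturing groups
all_matches <- str_match_all(sentence, pattern)[[1]]
all_matches
```

Here the matrix has two rows, one for "blue" followed by "or" and one for "red" followed by "ink.".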

How could this be useful?

#what if we have some super messy data? Several values combined into one column, with no separator between some of them, so you can't use separate()?

messy_data <- tribble(~name, ~x1, ~x2,
                      "Soniedensis SO0141abcefff", 2.54, 5.784,
                      "Vcholerae VC1124kjelsls", 2.13, 6.534,
                      "Dethogenes DH09483nannnowb", 3.24, 8.74)

messy_data %>% extract(name, c("organism", "abbrev", "gene_num", "letters"), "([A-Z][a-z]*) ([A-Z]+)([0-9]*)([a-z]+)", remove = FALSE) 
## # A tibble: 3 x 7
##   name                       organism    abbrev gene_num letters     x1    x2
##   <chr>                      <chr>       <chr>  <chr>    <chr>    <dbl> <dbl>
## 1 Soniedensis SO0141abcefff  Soniedensis SO     0141     abcefff   2.54  5.78
## 2 Vcholerae VC1124kjelsls    Vcholerae   VC     1124     kjelsls   2.13  6.53
## 3 Dethogenes DH09483nannnowb Dethogenes  DH     09483    nannnowb  3.24  8.74
#another way to extract the same thing
messy_data %>% extract(name, c("organism", "abbrev", "gene_num", "letters"), "([[:upper:]][[:lower:]]*) ([[:upper:]]+)([[:digit:]]*)([[:lower:]]+)", remove = FALSE) 
## # A tibble: 3 x 7
##   name                       organism    abbrev gene_num letters     x1    x2
##   <chr>                      <chr>       <chr>  <chr>    <chr>    <dbl> <dbl>
## 1 Soniedensis SO0141abcefff  Soniedensis SO     0141     abcefff   2.54  5.78
## 2 Vcholerae VC1124kjelsls    Vcholerae   VC     1124     kjelsls   2.13  6.53
## 3 Dethogenes DH09483nannnowb Dethogenes  DH     09483    nannnowb  3.24  8.74

Is it helpful?

I think that if I needed to distribute one character variable into several (for instance, if several observations were combined into one column), I would usually prefer to use a function like separate(). Human-entered data often has spaces or symbols between observations, which makes using separate() easy. However, computer output can be a single very long character string. I can see how using extract() to pluck out relevant information (say, timestamps or ID numbers) from a column of long strings could save a lot of time!
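
To make that last idea a bit more concrete, here is a small sketch with made-up log lines. The format of the lines, the user_id= field, and the names logs and parsed are all assumptions for illustration, not real program output.

```r
library(tidyr)
library(tibble)

# Hypothetical log lines; the last one does not follow the expected format
logs <- tibble(line = c("2021-03-04T10:15:30 INFO user_id=1042 action=login",
                        "2021-03-04T10:16:02 WARN user_id=77 action=retry",
                        "malformed line"))

# Group 1: the timestamp at the start of the line
# Group 2: the digits after "user_id="
# convert = TRUE turns the ID into an integer; the timestamp stays character
parsed <- extract(logs, line, c("timestamp", "user_id"),
                  "^([0-9-]+T[0-9:]+) .*user_id=([0-9]+)", convert = TRUE)
parsed
```

The malformed line doesn't match the pattern, so both new columns are NA for that row, which makes problem rows easy to find afterwards with drop_na() or filter().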