tidyr::drop_na()

Function of the Week:

Submission Instructions

Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing

For this assignment, please submit both the .Rmd and the .html files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.

drop_na()

In this document, I will introduce the drop_na() function and show what it’s for. It is part of tidyr, which I load through tidyverse.

#load tidyverse 
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#load penguins data
library(palmerpenguins)
data(penguins)
options(tibble.width = Inf)

What is it for?

The drop_na() function drops rows that contain missing values. It accepts two arguments: the data frame, and … (tidy-select), the columns to check for missing values. If no columns are given, all columns are checked. Put another way, it keeps only complete rows, unless you restrict the check to specific columns.
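As a minimal sketch of the two arguments described above, here is a small made-up tibble (the name `toy` and its columns `id`, `x`, and `y` are hypothetical and are not part of the penguins data). It compares calling drop_na() with no columns against calling it with one or two named columns.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  id = 1:4,
  x  = c(1, NA, 3, 4),
  y  = c("a", "b", NA, "d")
)

# No columns given: keep only rows that are complete in every column
complete_rows <- drop_na(toy)

# One column given: only an NA in x causes a row to be dropped
x_present <- drop_na(toy, x)

# Several columns given: a row is dropped if x or y is NA
xy_present <- drop_na(toy, x, y)

complete_rows$id  # 1 4
x_present$id      # 1 3 4
xy_present$id     # 1 4
```

Row 3 survives `drop_na(toy, x)` because its missing value is in `y`, which was not one of the columns being checked.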

Examples

#Viewing a slice of penguins data
penguins %>%
  slice(1:15)
## # A tibble: 15 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## 11 Adelie  Torgersen           37.8          17.1               186        3300
## 12 Adelie  Torgersen           37.8          17.3               180        3700
## 13 Adelie  Torgersen           41.1          17.6               182        3200
## 14 Adelie  Torgersen           38.6          21.2               191        3800
## 15 Adelie  Torgersen           34.6          21.1               198        4400
##    sex     year
##    <fct>  <int>
##  1 male    2007
##  2 female  2007
##  3 female  2007
##  4 <NA>    2007
##  5 female  2007
##  6 male    2007
##  7 female  2007
##  8 male    2007
##  9 <NA>    2007
## 10 <NA>    2007
## 11 <NA>    2007
## 12 <NA>    2007
## 13 female  2007
## 14 male    2007
## 15 male    2007
#Dropping all NA
penguins %>%
  slice(1:15) %>% 
  drop_na()
## # A tibble: 10 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
##    sex     year
##    <fct>  <int>
##  1 male    2007
##  2 female  2007
##  3 female  2007
##  4 female  2007
##  5 male    2007
##  6 female  2007
##  7 male    2007
##  8 female  2007
##  9 male    2007
## 10 male    2007

Often, what we actually want is to drop only the rows with NA in a specific column.

#Drop NA by body mass
penguins %>%
  slice(1:15) %>% 
  drop_na(body_mass_g)
## # A tibble: 14 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           34.1          18.1               193        3475
##  9 Adelie  Torgersen           42            20.2               190        4250
## 10 Adelie  Torgersen           37.8          17.1               186        3300
## 11 Adelie  Torgersen           37.8          17.3               180        3700
## 12 Adelie  Torgersen           41.1          17.6               182        3200
## 13 Adelie  Torgersen           38.6          21.2               191        3800
## 14 Adelie  Torgersen           34.6          21.1               198        4400
##    sex     year
##    <fct>  <int>
##  1 male    2007
##  2 female  2007
##  3 female  2007
##  4 female  2007
##  5 male    2007
##  6 female  2007
##  7 male    2007
##  8 <NA>    2007
##  9 <NA>    2007
## 10 <NA>    2007
## 11 <NA>    2007
## 12 female  2007
## 13 male    2007
## 14 male    2007

Now it only dropped the rows that had a missing value for body mass.
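One way to see the difference between the two calls is to count how many rows each one removes. The sketch below reuses the same 15-row slice; the variable names `first15`, `dropped_all`, and `dropped_mass` are made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Same 15-row slice used in the examples above
first15 <- slice(penguins, 1:15)

# Rows removed when every column is checked for NA
dropped_all <- nrow(first15) - nrow(drop_na(first15))

# Rows removed when only body_mass_g is checked for NA
dropped_mass <- nrow(first15) - nrow(drop_na(first15, body_mass_g))

dropped_all   # 5, matching the 10-row tibble shown earlier
dropped_mass  # 1, matching the 14-row tibble shown earlier
```

Checking the row count before and after is a cheap habit that shows how much data a given drop_na() call is discarding.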

This matters for running statistical tests and for plotting. How you drop missing values depends on your data type and research question. For example, here is a box plot of body mass by species:

penguins %>%
  ggplot() + 
    aes(x = species, y=body_mass_g) + 
    geom_boxplot() 
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Here is the same plot with the body mass NAs dropped first:

penguins %>%
  drop_na(body_mass_g) %>% 
  ggplot() + 
    aes(x = species, y=body_mass_g) + 
    geom_boxplot()

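It looks the same, because ggplot2 had already removed the rows with missing body mass when drawing the first plot (hence the warning). Dropping them beforehand simply avoids the warning.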

What if we are working with a categorical variable such as sex?

penguins %>%
  ggplot() + 
    aes(x = species, fill = species) + 
    geom_bar() +
  facet_wrap(vars(sex)) 

Now we have an extra NA category that we might not want on the chart.

penguins %>%
  drop_na(sex) %>% 
  ggplot() + 
    aes(x = species, fill = species) + 
    geom_bar() +
  facet_wrap(vars(sex))

Is it helpful?

Discuss whether you think this function is useful for you and your work. Is it the best thing since sliced bread, or is it not really relevant to your work?

Missing data is a fact of life! It is important to remember that missing data is itself information, and we don’t want to just throw it away, especially if it is not missing at random. Depending on the analysis, we may need to run sensitivity analyses and so on, so we want to be cautious when dropping NAs. However, drop_na() can be very useful for getting calculations to run, making clean plots, and making sense of our data. Filtering out NAs explicitly (for example with filter() and is.na()) is another option that might be more appropriate depending on the analysis.
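To make the cautions above concrete, here is a rough sketch of two habits that may help: counting the missing values in each column before dropping anything, and comparing drop_na() with the filtering alternative. The names `na_counts`, `via_drop_na`, and `via_filter` are made up for this example, and it assumes dplyr 1.0 or later so that across() is available.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Count the missing values in each column before deciding what to drop
na_counts <- penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Two ways to remove the rows where sex is missing
via_drop_na <- penguins %>% drop_na(sex)
via_filter  <- penguins %>% filter(!is.na(sex))

# Both approaches should keep the same number of rows
nrow(via_drop_na) == nrow(via_filter)
```

For a single column the two approaches should give the same rows. filter() becomes the more flexible choice when the condition for keeping a row involves more than just missingness.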