tidyr::drop_na()
Function of the Week:
Laura Jacobson
February 16, 2022
Submission Instructions
Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing
For this assignment, please submit both the .Rmd and the .html files. I will add them to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.
drop_na()
In this document, I will introduce the drop_na() function and show what it’s for. It is part of tidyr, which I load through tidyverse.
#load tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#load penguins data
library(palmerpenguins)
data(penguins)
options(tibble.width = Inf)
What is it for?
The drop_na() function drops rows that contain missing values. It takes two arguments: the data frame, and … (tidy-select), the columns to check for missing values. If no columns are given, all columns are checked. Put another way, it keeps only complete rows, unless you name specific columns, in which case it keeps the rows that are complete in those columns.
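To make the two arguments concrete before turning to the penguins data, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its columns `id`, `x`, and `y` are hypothetical and exist only for illustration.

```r
library(tidyr)

# Hypothetical toy data (not part of the penguins dataset)
toy <- data.frame(
  id = 1:4,
  x  = c(1.5, NA, 3.2, 4.1),
  y  = c("a", "b", NA, "d")
)

# No columns named: a row is dropped if ANY column is NA
all_complete <- drop_na(toy)

# One column named: only NAs in x cause a row to be dropped
x_complete <- drop_na(toy, x)

# Tidy-select range: check every column from x through y
xy_complete <- drop_na(toy, x:y)

nrow(all_complete)
nrow(x_complete)
nrow(xy_complete)
```

Rows 2 and 3 each have one missing value, so naming a column changes which of them survive.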
Examples
# Viewing a slice of penguins data
penguins %>%
  slice(1:15)
## # A tibble: 15 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## 11 Adelie Torgersen 37.8 17.1 186 3300
## 12 Adelie Torgersen 37.8 17.3 180 3700
## 13 Adelie Torgersen 41.1 17.6 182 3200
## 14 Adelie Torgersen 38.6 21.2 191 3800
## 15 Adelie Torgersen 34.6 21.1 198 4400
## sex year
## <fct> <int>
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
## 6 male 2007
## 7 female 2007
## 8 male 2007
## 9 <NA> 2007
## 10 <NA> 2007
## 11 <NA> 2007
## 12 <NA> 2007
## 13 female 2007
## 14 male 2007
## 15 male 2007
# Dropping rows with an NA in any column
penguins %>%
  slice(1:15) %>%
  drop_na()
## # A tibble: 10 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## sex year
## <fct> <int>
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 female 2007
## 5 male 2007
## 6 female 2007
## 7 male 2007
## 8 female 2007
## 9 male 2007
## 10 male 2007
What we may have wanted to do is drop NA in a specific column.
# Drop rows with an NA in body mass only
penguins %>%
  slice(1:15) %>%
  drop_na(body_mass_g)
## # A tibble: 14 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 34.1 18.1 193 3475
## 9 Adelie Torgersen 42 20.2 190 4250
## 10 Adelie Torgersen 37.8 17.1 186 3300
## 11 Adelie Torgersen 37.8 17.3 180 3700
## 12 Adelie Torgersen 41.1 17.6 182 3200
## 13 Adelie Torgersen 38.6 21.2 191 3800
## 14 Adelie Torgersen 34.6 21.1 198 4400
## sex year
## <fct> <int>
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 female 2007
## 5 male 2007
## 6 female 2007
## 7 male 2007
## 8 <NA> 2007
## 9 <NA> 2007
## 10 <NA> 2007
## 11 <NA> 2007
## 12 female 2007
## 13 male 2007
## 14 male 2007
Now it only dropped the row that had a missing value for body mass. The four rows with a missing value for sex are still there.
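Before choosing between the two calls, it can help to see how many rows each one would remove from the full dataset. The sketch below is one way to check this; the object names (`na_counts`, `rows_all`, and so on) are my own and not part of either package.

```r
library(tidyr)
library(palmerpenguins)

# Number of missing values in each column of the full dataset
na_counts <- colSums(is.na(penguins))
na_counts

# Row counts before and after each kind of drop
rows_all       <- nrow(penguins)
rows_complete  <- nrow(drop_na(penguins))               # NA in any column
rows_with_mass <- nrow(drop_na(penguins, body_mass_g))  # NA in body mass only

c(all = rows_all, complete = rows_complete, with_mass = rows_with_mass)
```

Comparing the counts shows how much data an unqualified drop_na() discards compared with a column-specific one.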
This is important for running tests and for plotting. Dropping missing values looks different depending on your data type or research question. For example, here is a box plot of body mass by species:
penguins %>%
  ggplot() +
  aes(x = species, y = body_mass_g) +
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Here is the same plot with body mass NAs dropped
penguins %>%
  drop_na(body_mass_g) %>%
  ggplot() +
  aes(x = species, y = body_mass_g) +
  geom_boxplot()
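It looks the same, because ggplot2 already removes the rows with a missing body mass before drawing the boxes. That is what the warning on the first plot was telling us.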
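If the only goal is a plot without the warning, another option is to leave the data alone and pass `na.rm = TRUE` to the geom, which removes missing values silently. This is a minimal sketch of that alternative, not a replacement for drop_na(); the object name `p` is my own.

```r
library(ggplot2)
library(palmerpenguins)

# na.rm = TRUE tells geom_boxplot() to remove missing values
# without printing the "Removed ... rows" warning
p <- ggplot(penguins) +
  aes(x = species, y = body_mass_g) +
  geom_boxplot(na.rm = TRUE)

p
```

The difference is where the decision is recorded: drop_na() changes the data that every later step sees, while `na.rm = TRUE` affects only this one layer.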
What if we are working with a categorical variable such as sex?
penguins %>%
  ggplot() +
  aes(x = species, fill = species) +
  geom_bar() +
  facet_wrap(vars(sex))
Now we have an extra NA category that we might not want on the chart.
penguins %>%
  drop_na(sex) %>%
  ggplot() +
  aes(x = species, fill = species) +
  geom_bar() +
  facet_wrap(vars(sex))
Is it helpful?
Discuss whether you think this function is useful for you and your work. Is it the best thing since sliced bread, or is it not really relevant to your work?
Missing data is a fact of life! It is important to remember that missing data is information, and we don’t want to just throw it away, especially if it is not missing at random. Depending on the analysis, we may need to do sensitivity analyses, so we want to be cautious when dropping NAs. However, drop_na() can be very useful for getting calculations to run, making clean plots, and helping us make sense of our data. Filtering out NAs with filter() and is.na() is another option that might be more appropriate depending on the analysis.
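As a rough comparison of these options, the sketch below uses a small made-up data frame (`toy`, with hypothetical columns `id` and `mass`) to show that drop_na() on one column keeps the same rows as filter() with !is.na(), and that `na.rm = TRUE` can skip NAs inside a single calculation without removing any rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  id   = 1:5,
  mass = c(3750, NA, 3250, NA, 3450)
)

# Two ways to keep only rows where mass is present
via_drop_na <- drop_na(toy, mass)
via_filter  <- filter(toy, !is.na(mass))

# Both approaches keep the same rows
same_rows <- identical(via_drop_na$id, via_filter$id)
same_rows

# na.rm = TRUE ignores NAs for this one summary; toy still has all 5 rows
mean_mass <- mean(toy$mass, na.rm = TRUE)
mean_mass
```

filter() is more flexible when the condition is more than "is missing" (for example, combining it with other criteria), while drop_na() is shorter when checking several columns at once.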