dplyr::n_distinct()
Function of the Week:
Joseph W. Vera
2022-03-02
Submission Instructions
Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing
For this assignment, please submit both the .Rmd and the .html files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.
Function Name
In this document, I will introduce the n_distinct() function and show what it is used for.
# Load the tidyverse (this also attaches dplyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
What is it for?
?n_distinct
## starting httpd help server ... done
# n_distinct() does what its name suggests: it counts the number of unique values in a vector or variable. Compare a base R approach with the dplyr version below to see how n_distinct() makes the code more concise.
data <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
                   group = c("A", "A", "A",
                             "B", "B",
                             "C", "C", "C", "C"))
# Base R: count unique values of x within each group
data_count_1 <- aggregate(data = data,
                          x ~ group,
                          function(x) length(unique(x)))
# dplyr: the same count with n_distinct()
data_count_2 <- data %>%
  group_by(group) %>%
  summarise(count = n_distinct(x))
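Neither result is printed above, so a quick check that the two approaches agree may help. The following is a minimal sketch that rebuilds the same toy data; the names `base_counts` and `dplyr_counts` are illustrative only and do not appear in the original code.

```r
library(dplyr)

# Same toy data as in the example above
data <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
                   group = c("A", "A", "A", "B", "B", "C", "C", "C", "C"))

# Base R: number of unique x values per group
base_counts <- aggregate(x ~ group, data = data,
                         FUN = function(x) length(unique(x)))

# dplyr: the same count with n_distinct()
dplyr_counts <- data %>%
  group_by(group) %>%
  summarise(count = n_distinct(x))

base_counts$x       # 2 1 3 (groups A, B, C)
dplyr_counts$count  # 2 1 3 (groups A, B, C)
```

Both approaches return the same counts; the dplyr version simply reads more directly as "count the distinct values of x in each group".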
# You can also use n_distinct() together with other dplyr functions (e.g., filter() or group_by()). Let's look at what exactly we are doing and how "n" and "distinct" are combined.
library(palmerpenguins)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
penguins %>% distinct(species)
## # A tibble: 3 x 1
## species
## <fct>
## 1 Adelie
## 2 Gentoo
## 3 Chinstrap
# So we have 3 distinct species in this data set. But what if we want to know the number of different species on each island?
penguins %>%
  group_by(island) %>%
  summarise(count = n_distinct(species))
## # A tibble: 3 x 2
## island count
## <fct> <int>
## 1 Biscoe 2
## 2 Dream 2
## 3 Torgersen 1
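As a side note, `n_distinct()` also accepts more than one vector, in which case it counts unique combinations of values rather than unique values of a single variable. A minimal sketch, assuming a small made-up data frame (`toy` and `combos` are hypothetical names, not part of the penguins data):

```r
library(dplyr)

# Hypothetical toy data: a numeric value and a group label
toy <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
                  group = c("A", "A", "A", "B", "B", "C", "C", "C", "C"))

# Unique values of x alone
n_distinct(toy$x)  # 5

# Unique (x, group) pairs: (1,A) (2,A) (2,B) (3,C) (4,C) (5,C)
combos <- toy %>%
  summarise(pairs = n_distinct(x, group)) %>%
  pull(pairs)

combos  # 6
```

The value 2 appears in both group A and group B, which is why there are 6 pairs but only 5 distinct values of x.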
# Now let's say we want the number of unique values not only for species, but also for island and sex. We have to write n_distinct() three times, which is repetitive.
penguins %>%
  summarise(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
## # A tibble: 1 x 3
## distinct_species distinct_island distinct_sex
## <int> <int> <int>
## 1 3 3 3
# Isn't there a better way? Yes: across() applies n_distinct() to several columns at once.
penguins %>%
  summarise(across(c(species, island, sex),
                   n_distinct))
## # A tibble: 1 x 3
## species island sex
## <int> <int> <int>
## 1 3 3 3
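The `across()` version drops the `distinct_` prefix that the longer version used for its column names. If those names matter, the `.names` argument of `across()` can restore them. A minimal sketch, assuming a small made-up data frame in place of penguins (`toy` and `renamed` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins columns
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island  = c("Dream", "Biscoe", "Biscoe", "Dream"),
  sex     = c("male", "female", NA, "male")
)

# "{.col}" is replaced by each column's name
renamed <- toy %>%
  summarise(across(c(species, island, sex), n_distinct,
                   .names = "distinct_{.col}"))

names(renamed)
# "distinct_species" "distinct_island" "distinct_sex"
```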
# Remember that NA is counted as a distinct value by default, because the na.rm argument of n_distinct() defaults to FALSE. To leave missing values out of the count, set na.rm = TRUE.
penguins %>% summarise(count = n_distinct(sex, na.rm = TRUE))
## # A tibble: 1 x 1
## count
## <int>
## 1 2
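The effect of `na.rm` is easiest to see on a plain vector. The sketch below uses made-up values (`sex_values` and `toy` are hypothetical names) and also shows one way to pass `na.rm = TRUE` inside `across()` with a lambda:

```r
library(dplyr)

# Hypothetical vector with two missing values
sex_values <- c("male", "female", NA, "female", NA)

n_distinct(sex_values)                # 3: "male", "female", and NA
n_distinct(sex_values, na.rm = TRUE)  # 2: NA is dropped before counting

# Passing na.rm through across() with a lambda
toy <- data.frame(
  sex    = sex_values,
  island = c("Dream", "Dream", "Biscoe", NA, "Biscoe")
)

no_na_counts <- toy %>%
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

no_na_counts  # sex = 2, island = 2
```

Note that repeated NA values count only once: the two NAs in `sex_values` add a single distinct value to the default count.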
Is it helpful?
I think this function is a good way to get another glimpse of your data. Will you use it by itself? More than likely not. However, I think it can help you double-check your data: Are there “NA” values that you haven’t noticed? Did you accidentally create a unique value that you didn’t mean to? In the end, this is still just a count function.