dplyr::n_distinct()

Function of the Week:

Function Name

In this document, I will introduce the n_distinct() function and show what it’s for.

# Load the tidyverse (this also attaches dplyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

What is it for?

?n_distinct
## starting httpd help server ... done
# n_distinct() does what its name suggests: it counts the number of unique values in a vector or variable.
# It is a more concise equivalent of length(unique(x)). The example below computes the same per-group count twice, first with base R and then with n_distinct().
data <- data.frame(x= c(1,1,2,2,2,3,3,4,5),
                   group = c("A", "A", "A",
                             "B", "B",
                             "C", "C", "C", "C"))
data_count_1 <- aggregate(data = data,
                          x ~ group,
                          function(x) length(unique(x)))

data_count_2 <- data %>%
  group_by(group) %>%
  summarise(count = n_distinct(x))
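
To check that the two approaches agree, the results can be compared directly. The sketch below rebuilds the same toy data so that it runs on its own; the names `base_counts` and `dplyr_counts` are illustrative and not part of the original example.

```r
library(dplyr)

# Same toy data as above
data <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
                   group = c("A", "A", "A",
                             "B", "B",
                             "C", "C", "C", "C"))

# Base R: count unique x values within each group
base_counts <- aggregate(x ~ group, data = data,
                         FUN = function(x) length(unique(x)))

# dplyr: the same count with n_distinct()
dplyr_counts <- data %>%
  group_by(group) %>%
  summarise(count = n_distinct(x))

# Both give 2, 1, and 3 distinct values for groups A, B, and C
all(base_counts$x == dplyr_counts$count)
```

The dplyr version reads top to bottom as "group, then count distinct values", which is the main readability gain over the anonymous function passed to aggregate().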
# You can also combine n_distinct() with other dplyr verbs (e.g., filter() or group_by()).
# Let's look at what the name combines: distinct() returns the unique values themselves, and the "n" means we count them.
library(palmerpenguins)

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
penguins %>% distinct(species)
## # A tibble: 3 x 1
##   species  
##   <fct>    
## 1 Adelie   
## 2 Gentoo   
## 3 Chinstrap
# So we have 3 distinct species in this data set. But what if we want to know the number of different species on each island?

penguins %>% group_by(island) %>%
  summarise(count = n_distinct(species))
## # A tibble: 3 x 2
##   island    count
##   <fct>     <int>
## 1 Biscoe        2
## 2 Dream         2
## 3 Torgersen     1
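
The same pattern extends to filter(): rows are dropped first, and n_distinct() then counts whatever remains in each group. Here is a small sketch on made-up data; the `sightings` data frame and its columns are hypothetical and do not come from palmerpenguins.

```r
library(dplyr)

# Hypothetical records of species seen on two islands in two years
sightings <- tibble(
  island  = c("North", "North", "North", "South", "South"),
  species = c("gull", "tern", "gull", "gull", "gull"),
  year    = c(2007, 2007, 2008, 2007, 2008)
)

# Keep only the 2007 records, then count distinct species per island
sightings_2007 <- sightings %>%
  filter(year == 2007) %>%
  group_by(island) %>%
  summarise(count = n_distinct(species))

sightings_2007
```

Because the filter runs before the summary, species seen only in 2008 would not contribute to the count.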
# Now let's say we want the number of unique values not only for species, but also for island and sex. We have to write n_distinct() three times, which is inefficient!

penguins %>%
  summarise(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
## # A tibble: 1 x 3
##   distinct_species distinct_island distinct_sex
##              <int>           <int>        <int>
## 1                3               3            3
# Isn't there a better way? Yes: across() applies n_distinct() to each of the listed columns.
penguins %>%
  summarise(across(c(species, island, sex),
                   n_distinct))
## # A tibble: 1 x 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     3
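
When the columns of interest share a type, across() can also pick them with a predicate instead of by name. This is a minimal sketch on a made-up data frame; `toy` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: two factor columns and one numeric column
toy <- tibble(
  colour = factor(c("red", "blue", "red", "green")),
  size   = factor(c("S", "M", "M", "S")),
  weight = c(1.2, 3.4, 1.2, 5.6)
)

# Count distinct values in every factor column, leaving `weight` out
toy_counts <- toy %>%
  summarise(across(where(is.factor), n_distinct))

toy_counts
```

Selecting by type avoids having to update the column list by hand when more factor columns are added.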
# Remember that NA is counted as a distinct value by default, which is why sex shows 3 above (female, male, and NA). How can we avoid that?
# Besides telling n_distinct() which variable to count, you can tell it to ignore missing values with na.rm = TRUE. The default is na.rm = FALSE, so NA counts as its own value.

penguins %>% summarise(count = n_distinct(sex, na.rm = TRUE))
## # A tibble: 1 x 1
##   count
##   <int>
## 1     2
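
The effect of na.rm is easiest to see on a bare vector. The sketch below uses a made-up vector `x` (an illustrative name, not from the original example) and compares the default behaviour with na.rm = TRUE and with the base R equivalent.

```r
library(dplyr)

# A small vector with repeated values and missing values
x <- c("a", "b", "a", NA, NA)

with_na    <- n_distinct(x)                # NA counted once: "a", "b", NA
without_na <- n_distinct(x, na.rm = TRUE)  # NA dropped: "a", "b"
base_r     <- length(unique(x))            # base R match for the default

c(with_na = with_na, without_na = without_na, base_r = base_r)
```

Note that repeated NA values are still counted only once under the default, so the count grows by at most one because of missing data.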

Is it helpful?

I think this function is a good way to get another glimpse of your data. Will you use it by itself? More than likely not. However, it can help you double-check your data: Are there “NA” values that you haven’t noticed? Did you accidentally create a unique value that you didn’t mean to? In the end, it is still just a counting function.