janitor::get_dupes()

Function of the Week: get_dupes ()

Submission Instructions

Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing

For this assignment, please submit both the .Rmd and the .html files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
library(skimr)
library(readr)
library(dplyr)
#load janitor which contains get_dupes ()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
#example dataset
library(palmerpenguins)
data(penguins)

get_dupes ()

In this document, I will introduce the get_dupes () function and show what it’s for. get_dupes() is from the janitor package.

What is it for?

This function identifies duplicate row values of specified variables across a given dataset. It takes the arguments (data, variable); variables can be a mix of types (dbl, chr, etc). It returns a dataframe (or tibble if tibble is used) with duplicate rows for all variables specified. A column, “dupe_count”, is inserted containing the number of duplicates. If no variable is specified, it looks for duplicate rows across all variables in the dataset.

num_1 <- c(1:6)
col_1 <- c("red", "orange", "yellow", "green", "green", "blue")
numbers_colors_1 <- data.frame(num_1, col_1)
numbers_colors_1
##   num_1  col_1
## 1     1    red
## 2     2 orange
## 3     3 yellow
## 4     4  green
## 5     5  green
## 6     6   blue
get_dupes (numbers_colors_1)
## No variable names specified - using all columns.
## No duplicate combinations found of: num_1, col_1
## [1] num_1      col_1      dupe_count
## <0 rows> (or 0-length row.names)
num_2 <- c(1, 2, 3, 4, 4, 4)
col_2 <- c("red", "orange", "yellow", "green", "green", "green")
numbers_colors_2 <- data.frame(num_2, col_2)
numbers_colors_2
##   num_2  col_2
## 1     1    red
## 2     2 orange
## 3     3 yellow
## 4     4  green
## 5     4  green
## 6     4  green
get_dupes (numbers_colors_2)
## No variable names specified - using all columns.
##   num_2 col_2 dupe_count
## 1     4 green          3
## 2     4 green          3
## 3     4 green          3
num_3 <- c(1, 2, 3, 4, 5, 5, 6, 6, 6)
col_3 <- c("red", "orange", "yellow", "green", "blue", "blue", "blue", "blue", "pink")
numbers_colors_3 <- data.frame(num_3, col_3)
numbers_colors_3
##   num_3  col_3
## 1     1    red
## 2     2 orange
## 3     3 yellow
## 4     4  green
## 5     5   blue
## 6     5   blue
## 7     6   blue
## 8     6   blue
## 9     6   pink
get_dupes(numbers_colors_3)
## No variable names specified - using all columns.
##   num_3 col_3 dupe_count
## 1     5  blue          2
## 2     5  blue          2
## 3     6  blue          2
## 4     6  blue          2
get_dupes(numbers_colors_3, num_3)
##   num_3 dupe_count col_3
## 1     5          2  blue
## 2     5          2  blue
## 3     6          3  blue
## 4     6          3  blue
## 5     6          3  pink
get_dupes(numbers_colors_3, col_3)
##   col_3 dupe_count num_3
## 1  blue          4     5
## 2  blue          4     5
## 3  blue          4     6
## 4  blue          4     6
#orders by level for character/factor data
get_dupes (penguins)
## No variable names specified - using all columns.
## No duplicate combinations found of: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year
## # A tibble: 0 × 9
## # … with 9 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>, dupe_count <int>
get_dupes (penguins, species)
## # A tibble: 344 × 9
##    species dupe_count island    bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>        <int> <fct>              <dbl>         <dbl>             <int>
##  1 Adelie         152 Torgersen           39.1          18.7               181
##  2 Adelie         152 Torgersen           39.5          17.4               186
##  3 Adelie         152 Torgersen           40.3          18                 195
##  4 Adelie         152 Torgersen           NA            NA                  NA
##  5 Adelie         152 Torgersen           36.7          19.3               193
##  6 Adelie         152 Torgersen           39.3          20.6               190
##  7 Adelie         152 Torgersen           38.9          17.8               181
##  8 Adelie         152 Torgersen           39.2          19.6               195
##  9 Adelie         152 Torgersen           34.1          18.1               193
## 10 Adelie         152 Torgersen           42            20.2               190
## # … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## #   year <int>
get_dupes (penguins, species, island)
## # A tibble: 344 × 9
##    species island dupe_count bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>   <fct>       <int>          <dbl>         <dbl>             <int>
##  1 Adelie  Biscoe         44           37.8          18.3               174
##  2 Adelie  Biscoe         44           37.7          18.7               180
##  3 Adelie  Biscoe         44           35.9          19.2               189
##  4 Adelie  Biscoe         44           38.2          18.1               185
##  5 Adelie  Biscoe         44           38.8          17.2               180
##  6 Adelie  Biscoe         44           35.3          18.9               187
##  7 Adelie  Biscoe         44           40.6          18.6               183
##  8 Adelie  Biscoe         44           40.5          17.9               187
##  9 Adelie  Biscoe         44           37.9          18.6               172
## 10 Adelie  Biscoe         44           40.5          18.9               180
## # … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## #   year <int>
get_dupes (penguins, species, island, year)
## # A tibble: 344 × 9
##    species island  year dupe_count bill_length_mm bill_depth_mm flipper_length_…
##    <fct>   <fct>  <int>      <int>          <dbl>         <dbl>            <int>
##  1 Adelie  Biscoe  2007         10           37.8          18.3              174
##  2 Adelie  Biscoe  2007         10           37.7          18.7              180
##  3 Adelie  Biscoe  2007         10           35.9          19.2              189
##  4 Adelie  Biscoe  2007         10           38.2          18.1              185
##  5 Adelie  Biscoe  2007         10           38.8          17.2              180
##  6 Adelie  Biscoe  2007         10           35.3          18.9              187
##  7 Adelie  Biscoe  2007         10           40.6          18.6              183
##  8 Adelie  Biscoe  2007         10           40.5          17.9              187
##  9 Adelie  Biscoe  2007         10           37.9          18.6              172
## 10 Adelie  Biscoe  2007         10           40.5          18.9              180
## # … with 334 more rows, and 2 more variables: body_mass_g <int>, sex <fct>
#orders numeric data
get_dupes(penguins, body_mass_g, sex)
## # A tibble: 283 × 9
##    body_mass_g sex    dupe_count species   island   bill_length_mm bill_depth_mm
##          <int> <fct>       <int> <fct>     <fct>             <dbl>         <dbl>
##  1        2850 female          2 Adelie    Biscoe             36.5          16.6
##  2        2850 female          2 Adelie    Biscoe             36.4          17.1
##  3        2900 female          4 Adelie    Biscoe             34.5          18.1
##  4        2900 female          4 Adelie    Dream              33.1          16.1
##  5        2900 female          4 Adelie    Torgers…           38.6          17  
##  6        2900 female          4 Chinstrap Dream              43.2          16.6
##  7        3000 female          2 Adelie    Dream              37            16.9
##  8        3000 female          2 Adelie    Dream              37.3          16.8
##  9        3050 female          4 Adelie    Torgers…           35.9          16.6
## 10        3050 female          4 Adelie    Torgers…           35.2          15.9
## # … with 273 more rows, and 2 more variables: flipper_length_mm <int>,
## #   year <int>
Artists <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv")
## Rows: 15222 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): DisplayName, ArtistBio, Nationality, Gender, Wiki QID
## dbl (4): ConstituentID, BeginDate, EndDate, ULAN
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Artists %>% get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: ConstituentID, DisplayName, ArtistBio, Nationality, Gender, BeginDate, EndDate, Wiki QID, ULAN
## # A tibble: 0 × 10
## # … with 10 variables: ConstituentID <dbl>, DisplayName <chr>, ArtistBio <chr>,
## #   Nationality <chr>, Gender <chr>, BeginDate <dbl>, EndDate <dbl>,
## #   Wiki QID <chr>, ULAN <dbl>, dupe_count <int>
Artists %>% get_dupes(Nationality)
## # A tibble: 15,201 × 10
##    Nationality dupe_count ConstituentID DisplayName  ArtistBio  Gender BeginDate
##    <chr>            <int>         <dbl> <chr>        <chr>      <chr>      <dbl>
##  1 Albanian             3         29622 Anri Sala    Albanian,… Male        1974
##  2 Albanian             3         30621 Adrian Paci  Albanian,… Male        1969
##  3 Albanian             3         31324 Edi Rama     Albanian,… Male        1964
##  4 Algerian             5          1880 M. Fiorini   Algerian,… Male        1922
##  5 Algerian             5         43595 Lyès Salem   Algerian,… Male        1973
##  6 Algerian             5         43626 Djamila Sah… Algerian,… Female      1950
##  7 Algerian             5         49216 Neïl Beloufa Algerian-… Male        1985
##  8 Algerian             5         70052 El Hadi Jaz… Algerian,… Male        1970
##  9 American          5194             1 Robert Arne… American,… Male        1930
## 10 American          5194             3 Bill Arnold  American,… Male        1941
## # … with 15,191 more rows, and 3 more variables: EndDate <dbl>, Wiki QID <chr>,
## #   ULAN <dbl>
Artists %>% get_dupes(Nationality, Gender) 
## # A tibble: 15,157 × 10
##    Nationality Gender dupe_count ConstituentID DisplayName  ArtistBio  BeginDate
##    <chr>       <chr>       <int>         <dbl> <chr>        <chr>          <dbl>
##  1 Albanian    Male            3         29622 Anri Sala    Albanian,…      1974
##  2 Albanian    Male            3         30621 Adrian Paci  Albanian,…      1969
##  3 Albanian    Male            3         31324 Edi Rama     Albanian,…      1964
##  4 Algerian    Male            4          1880 M. Fiorini   Algerian,…      1922
##  5 Algerian    Male            4         43595 Lyès Salem   Algerian,…      1973
##  6 Algerian    Male            4         49216 Neïl Beloufa Algerian-…      1985
##  7 Algerian    Male            4         70052 El Hadi Jaz… Algerian,…      1970
##  8 American    Female       1097            10 Irene Arons… American,…      1918
##  9 American    Female       1097            21 Ruth Asawa   American,…      1926
## 10 American    Female       1097            31 Dana Atchley American,…      1941
## # … with 15,147 more rows, and 3 more variables: EndDate <dbl>, Wiki QID <chr>,
## #   ULAN <dbl>

Is it helpful?

I think this is a useful preliminary data exploration function. It is a way to identify overall duplicate rows that might need to be further explored. It is also useful if you have particular variables you want to assess for duplicate rows either alone or in combination. For example it would be useful to look for duplicate medical records in a large dataset or multiple observations on a study participant in a single year. Its got an easy to remember name and is useful way to do initial assessment on data, but isn’t something that I would use regularly.