janitor::get_dupes()
Function of the Week: get_dupes ()
Becky
2022-02-09
Submission Instructions
Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing
For this assignment, please submit both the .Rmd
and the .html
files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
library(skimr)
library(readr)
library(dplyr)
#load janitor which contains get_dupes ()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
#example dataset
library(palmerpenguins)
data(penguins)
get_dupes ()
In this document, I will introduce the get_dupes ()
function and show what it’s for. get_dupes() is from the janitor package.
What is it for?
This function identifies duplicate row values of specified variables across a given dataset. It takes the arguments (data, variable); variables can be a mix of types (dbl, chr, etc). It returns a dataframe (or tibble if tibble is used) with duplicate rows for all variables specified. A column, “dupe_count”, is inserted containing the number of duplicates. If no variable is specified, it looks for duplicate rows across all variables in the dataset.
num_1 <- c(1:6)
col_1 <- c("red", "orange", "yellow", "green", "green", "blue")
numbers_colors_1 <- data.frame(num_1, col_1)
numbers_colors_1
## num_1 col_1
## 1 1 red
## 2 2 orange
## 3 3 yellow
## 4 4 green
## 5 5 green
## 6 6 blue
get_dupes (numbers_colors_1)
## No variable names specified - using all columns.
## No duplicate combinations found of: num_1, col_1
## [1] num_1 col_1 dupe_count
## <0 rows> (or 0-length row.names)
num_2 <- c(1, 2, 3, 4, 4, 4)
col_2 <- c("red", "orange", "yellow", "green", "green", "green")
numbers_colors_2 <- data.frame(num_2, col_2)
numbers_colors_2
## num_2 col_2
## 1 1 red
## 2 2 orange
## 3 3 yellow
## 4 4 green
## 5 4 green
## 6 4 green
get_dupes (numbers_colors_2)
## No variable names specified - using all columns.
## num_2 col_2 dupe_count
## 1 4 green 3
## 2 4 green 3
## 3 4 green 3
num_3 <- c(1, 2, 3, 4, 5, 5, 6, 6, 6)
col_3 <- c("red", "orange", "yellow", "green", "blue", "blue", "blue", "blue", "pink")
numbers_colors_3 <- data.frame(num_3, col_3)
numbers_colors_3
## num_3 col_3
## 1 1 red
## 2 2 orange
## 3 3 yellow
## 4 4 green
## 5 5 blue
## 6 5 blue
## 7 6 blue
## 8 6 blue
## 9 6 pink
get_dupes(numbers_colors_3)
## No variable names specified - using all columns.
## num_3 col_3 dupe_count
## 1 5 blue 2
## 2 5 blue 2
## 3 6 blue 2
## 4 6 blue 2
get_dupes(numbers_colors_3, num_3)
## num_3 dupe_count col_3
## 1 5 2 blue
## 2 5 2 blue
## 3 6 3 blue
## 4 6 3 blue
## 5 6 3 pink
get_dupes(numbers_colors_3, col_3)
## col_3 dupe_count num_3
## 1 blue 4 5
## 2 blue 4 5
## 3 blue 4 6
## 4 blue 4 6
#orders by level for character/factor data
get_dupes (penguins)
## No variable names specified - using all columns.
## No duplicate combinations found of: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year
## # A tibble: 0 × 9
## # … with 9 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## # bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## # year <int>, dupe_count <int>
get_dupes (penguins, species)
## # A tibble: 344 × 9
## species dupe_count island bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <int> <fct> <dbl> <dbl> <int>
## 1 Adelie 152 Torgersen 39.1 18.7 181
## 2 Adelie 152 Torgersen 39.5 17.4 186
## 3 Adelie 152 Torgersen 40.3 18 195
## 4 Adelie 152 Torgersen NA NA NA
## 5 Adelie 152 Torgersen 36.7 19.3 193
## 6 Adelie 152 Torgersen 39.3 20.6 190
## 7 Adelie 152 Torgersen 38.9 17.8 181
## 8 Adelie 152 Torgersen 39.2 19.6 195
## 9 Adelie 152 Torgersen 34.1 18.1 193
## 10 Adelie 152 Torgersen 42 20.2 190
## # … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## # year <int>
get_dupes (penguins, species, island)
## # A tibble: 344 × 9
## species island dupe_count bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <int> <dbl> <dbl> <int>
## 1 Adelie Biscoe 44 37.8 18.3 174
## 2 Adelie Biscoe 44 37.7 18.7 180
## 3 Adelie Biscoe 44 35.9 19.2 189
## 4 Adelie Biscoe 44 38.2 18.1 185
## 5 Adelie Biscoe 44 38.8 17.2 180
## 6 Adelie Biscoe 44 35.3 18.9 187
## 7 Adelie Biscoe 44 40.6 18.6 183
## 8 Adelie Biscoe 44 40.5 17.9 187
## 9 Adelie Biscoe 44 37.9 18.6 172
## 10 Adelie Biscoe 44 40.5 18.9 180
## # … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## # year <int>
get_dupes (penguins, species, island, year)
## # A tibble: 344 × 9
## species island year dupe_count bill_length_mm bill_depth_mm flipper_length_…
## <fct> <fct> <int> <int> <dbl> <dbl> <int>
## 1 Adelie Biscoe 2007 10 37.8 18.3 174
## 2 Adelie Biscoe 2007 10 37.7 18.7 180
## 3 Adelie Biscoe 2007 10 35.9 19.2 189
## 4 Adelie Biscoe 2007 10 38.2 18.1 185
## 5 Adelie Biscoe 2007 10 38.8 17.2 180
## 6 Adelie Biscoe 2007 10 35.3 18.9 187
## 7 Adelie Biscoe 2007 10 40.6 18.6 183
## 8 Adelie Biscoe 2007 10 40.5 17.9 187
## 9 Adelie Biscoe 2007 10 37.9 18.6 172
## 10 Adelie Biscoe 2007 10 40.5 18.9 180
## # … with 334 more rows, and 2 more variables: body_mass_g <int>, sex <fct>
#orders numeric data
get_dupes(penguins, body_mass_g, sex)
## # A tibble: 283 × 9
## body_mass_g sex dupe_count species island bill_length_mm bill_depth_mm
## <int> <fct> <int> <fct> <fct> <dbl> <dbl>
## 1 2850 female 2 Adelie Biscoe 36.5 16.6
## 2 2850 female 2 Adelie Biscoe 36.4 17.1
## 3 2900 female 4 Adelie Biscoe 34.5 18.1
## 4 2900 female 4 Adelie Dream 33.1 16.1
## 5 2900 female 4 Adelie Torgers… 38.6 17
## 6 2900 female 4 Chinstrap Dream 43.2 16.6
## 7 3000 female 2 Adelie Dream 37 16.9
## 8 3000 female 2 Adelie Dream 37.3 16.8
## 9 3050 female 4 Adelie Torgers… 35.9 16.6
## 10 3050 female 4 Adelie Torgers… 35.2 15.9
## # … with 273 more rows, and 2 more variables: flipper_length_mm <int>,
## # year <int>
Artists <- read_csv("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artists.csv")
## Rows: 15222 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): DisplayName, ArtistBio, Nationality, Gender, Wiki QID
## dbl (4): ConstituentID, BeginDate, EndDate, ULAN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Artists %>% get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: ConstituentID, DisplayName, ArtistBio, Nationality, Gender, BeginDate, EndDate, Wiki QID, ULAN
## # A tibble: 0 × 10
## # … with 10 variables: ConstituentID <dbl>, DisplayName <chr>, ArtistBio <chr>,
## # Nationality <chr>, Gender <chr>, BeginDate <dbl>, EndDate <dbl>,
## # Wiki QID <chr>, ULAN <dbl>, dupe_count <int>
Artists %>% get_dupes(Nationality)
## # A tibble: 15,201 × 10
## Nationality dupe_count ConstituentID DisplayName ArtistBio Gender BeginDate
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 Albanian 3 29622 Anri Sala Albanian,… Male 1974
## 2 Albanian 3 30621 Adrian Paci Albanian,… Male 1969
## 3 Albanian 3 31324 Edi Rama Albanian,… Male 1964
## 4 Algerian 5 1880 M. Fiorini Algerian,… Male 1922
## 5 Algerian 5 43595 Lyès Salem Algerian,… Male 1973
## 6 Algerian 5 43626 Djamila Sah… Algerian,… Female 1950
## 7 Algerian 5 49216 Neïl Beloufa Algerian-… Male 1985
## 8 Algerian 5 70052 El Hadi Jaz… Algerian,… Male 1970
## 9 American 5194 1 Robert Arne… American,… Male 1930
## 10 American 5194 3 Bill Arnold American,… Male 1941
## # … with 15,191 more rows, and 3 more variables: EndDate <dbl>, Wiki QID <chr>,
## # ULAN <dbl>
Artists %>% get_dupes(Nationality, Gender)
## # A tibble: 15,157 × 10
## Nationality Gender dupe_count ConstituentID DisplayName ArtistBio BeginDate
## <chr> <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 Albanian Male 3 29622 Anri Sala Albanian,… 1974
## 2 Albanian Male 3 30621 Adrian Paci Albanian,… 1969
## 3 Albanian Male 3 31324 Edi Rama Albanian,… 1964
## 4 Algerian Male 4 1880 M. Fiorini Algerian,… 1922
## 5 Algerian Male 4 43595 Lyès Salem Algerian,… 1973
## 6 Algerian Male 4 49216 Neïl Beloufa Algerian-… 1985
## 7 Algerian Male 4 70052 El Hadi Jaz… Algerian,… 1970
## 8 American Female 1097 10 Irene Arons… American,… 1918
## 9 American Female 1097 21 Ruth Asawa American,… 1926
## 10 American Female 1097 31 Dana Atchley American,… 1941
## # … with 15,147 more rows, and 3 more variables: EndDate <dbl>, Wiki QID <chr>,
## # ULAN <dbl>
Is it helpful?
I think this is a useful preliminary data exploration function. It is a way to identify overall duplicate rows that might need to be further explored. It is also useful if you have particular variables you want to assess for duplicate rows either alone or in combination. For example it would be useful to look for duplicate medical records in a large dataset or multiple observations on a study participant in a single year. Its got an easy to remember name and is useful way to do initial assessment on data, but isn’t something that I would use regularly.