dplyr::ntile()
Function of the Week: dplyr::ntile()
Qijia Liu
2022-01-26
In this document, I will introduce the ntile() function and show what it’s for.
library(dplyr)
library(tidyverse)
library(tidytuesdayR)
tuesdata <- tt_load('2022-01-18')
ChocolateBar <- tuesdata$chocolate
What is it for?
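The ntile() function divides observations into a specified number of roughly equal-sized groups: it ranks the variable of interest in ascending order and then splits the ranked values into buckets. Tied values are ranked by their order of appearance, so they can fall into different groups. It takes two arguments as input: a vector (e.g. x) and the number of groups as an integer (e.g. 4).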
Examples
vector <- rep(c(-1,1,2), each=4)
vector
## [1] -1 -1 -1 -1 1 1 1 1 2 2 2 2
ntile(vector, 2)
## [1] 1 1 1 1 1 1 2 2 2 2 2 2
vector <- rep(c(2,-1,1), each=4)
vector
## [1] 2 2 2 2 -1 -1 -1 -1 1 1 1 1
ntile(vector, 2)
## [1] 2 2 2 2 1 1 1 1 1 1 2 2
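The two vectors above split evenly, so here is a small sketch of what happens when they do not. The vectors x and y are made up for illustration and are not part of the chocolate data. As far as I can tell, when the observations cannot be divided evenly the earlier groups get the extra observation, and missing values are left as NA instead of being assigned to a group.

```r
library(dplyr)

# Five made-up values split into two groups: the sizes must differ (3 and 2)
x <- c(10, 20, 30, 40, 50)
ntile(x, 2)
## [1] 1 1 1 2 2

# A missing value is not ranked, so it does not get a group
y <- c(3, NA, 1, 2)
ntile(y, 2)
## [1] 2 NA 1 1
```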
Chocolate Bar:
1. Divide the rating across all selected countries into four ranked groups
# How many countries?
n_distinct(ChocolateBar$company_location)
## [1] 67
# Keep only the locations that start with "A":
ChocolateBar <- ChocolateBar %>%
select(company_location, rating) %>%
filter(str_detect(company_location, "^A"))
table(ChocolateBar$company_location)
##
## Amsterdam Argentina Australia Austria
## 12 9 53 30
ChocolateBar <- ChocolateBar %>% mutate(quantile_rating = ntile(rating, 4))
table(ChocolateBar$quantile_rating)
##
## 1 2 3 4
## 26 26 26 26
dim(ChocolateBar)
## [1] 104 3
2. Divide the rating within a country into four ranked groups
by_ChocolateBar_quartile <- ChocolateBar %>%
group_by(company_location) %>%
mutate(quantile_company_location = ntile(rating, 4))
table(by_ChocolateBar_quartile$company_location, by_ChocolateBar_quartile$quantile_company_location)
##
## 1 2 3 4
## Amsterdam 3 3 3 3
## Argentina 3 2 2 2
## Australia 14 13 13 13
## Austria 8 8 7 7
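To see the difference between the ungrouped and the grouped call without downloading the data, here is a minimal sketch on a toy tibble. The locations "A" and "B", the ratings, and the column names overall_half and within_half are all made up for illustration.

```r
library(dplyr)

# Made-up ratings for two hypothetical locations, "A" and "B"
toy <- tibble(
  location = c("A", "A", "A", "A", "B", "B"),
  rating   = c(2.5, 2.75, 3.0, 3.25, 3.5, 4.0)
)

toy <- toy %>%
  mutate(overall_half = ntile(rating, 2)) %>%  # ranked across all rows
  group_by(location) %>%
  mutate(within_half = ntile(rating, 2)) %>%   # ranked inside each location
  ungroup()

toy$overall_half
## [1] 1 1 1 2 2 2
toy$within_half
## [1] 1 1 2 2 1 2
```

Without group_by(), both ratings from "B" land in the top half because they are the highest overall. With group_by(), each location gets its own lower and upper half.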
3. Filter ChocolateBar according to rating median
median(ChocolateBar$rating)
## [1] 3.25
range(ChocolateBar$rating)
## [1] 2.5 4.0
# keep only the lower half of the observations, ranked by rating (ratings up to and including the median)
ChocolateBar <- filter(ChocolateBar, ntile(rating, 2) < 2)
range(ChocolateBar$rating)
## [1] 2.50 3.25
dim(ChocolateBar)
## [1] 52 3
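Note that the maximum rating after filtering is equal to the median, not below it. The sketch below, using a made-up vector r, suggests why: ntile() splits by rank, so values tied at the median can end up on both sides of the cut, which is not the same as comparing each value with median().

```r
library(dplyr)

# Made-up ratings with a tie at the median
r <- c(1, 2, 2, 3)
median(r)
## [1] 2

ntile(r, 2)          # the two tied 2s are split between the halves
## [1] 1 1 2 2

r[ntile(r, 2) < 2]   # lower half by rank: includes one value equal to the median
## [1] 1 2

r[r < median(r)]     # strictly below the median
## [1] 1
```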
Is it helpful?
It is very useful when we need to categorize continuous predictor variables. Other functions, such as cut(), can perform data binning as well.
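The two functions bin in different ways, which the following sketch tries to show on a made-up vector v with one extreme value. Here ntile() groups by rank, so the groups have nearly the same number of observations, while cut() with a single number for breaks groups by value, using intervals of equal width.

```r
library(dplyr)

# Made-up values with one extreme observation
v <- c(1, 2, 3, 4, 100)

ntile(v, 2)                         # rank-based: groups of (nearly) equal size
## [1] 1 1 1 2 2

cut(v, breaks = 2, labels = FALSE)  # value-based: two intervals of equal width
## [1] 1 1 1 1 2
```

So ntile() may be the better fit when we want balanced groups, and cut() when the bin boundaries themselves matter.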