dplyr::ntile()

Function of the Week: dplyr::ntile()


In this document, I will introduce the ntile() function and show what it’s for.

library(dplyr)
library(tidyverse)
library(tidytuesdayR)
tuesdata <- tt_load('2022-01-18')
ChocolateBar <-  tuesdata$chocolate


What is it for?

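The ntile() function divides observations into a specified number of roughly equal-sized groups. It ranks the variable of interest in ascending order and then splits the ranked values into n consecutive groups, returning an integer group number for each observation. It takes two arguments: a vector (e.g. x) and the number of groups (e.g. 4). Ties are broken by position, so equal values can land in different groups, as the second example below shows.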


Examples

vector <- rep(c(-1,1,2), each=4)
vector
##  [1] -1 -1 -1 -1  1  1  1  1  2  2  2  2
ntile(vector, 2)
##  [1] 1 1 1 1 1 1 2 2 2 2 2 2
vector <- rep(c(2,-1,1), each=4)
vector
##  [1]  2  2  2  2 -1 -1 -1 -1  1  1  1  1
ntile(vector, 2)
##  [1] 2 2 2 2 1 1 1 1 1 1 2 2
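
The split of the tied 1s above follows from how ntile() ranks its input. A minimal sketch, assuming a made-up vector x (not part of the chocolate data) with tied values and one missing value, shows the ranks that ntile() buckets and what happens to NA:

```r
library(dplyr)

# Hypothetical toy vector: four tied values, one NA, one small value
x <- c(5, 5, 5, 5, NA, 1)

# ntile() ranks with row_number(), which breaks ties by position
row_number(x)
# 2 3 4 5 NA 1

# Ranks 1-3 go to group 1, ranks 4-5 to group 2; NA stays NA
ntile(x, 2)
# 1 1 2 2 NA 1
```

The four 5s are identical, yet two of them end up in group 1 and two in group 2, and the missing value is not assigned to any group.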

Chocolate Bar:

1. Divide the ratings across all selected countries into four ranked groups

# How many countries?
n_distinct(ChocolateBar$company_location)
## [1] 67
# Keep only company locations whose names start with "A":
ChocolateBar <- ChocolateBar %>% 
  select(company_location, rating) %>% 
  filter(str_detect(company_location, "^A"))

table(ChocolateBar$company_location)
## 
## Amsterdam Argentina Australia   Austria 
##        12         9        53        30
ChocolateBar <- ChocolateBar %>% mutate(quantile_rating = ntile(rating, 4))
table(ChocolateBar$quantile_rating)
## 
##  1  2  3  4 
## 26 26 26 26
dim(ChocolateBar)
## [1] 104   3

2. Divide the ratings within each country into four ranked groups

by_ChocolateBar_quartile  <- ChocolateBar %>% 
  group_by(company_location) %>% 
  mutate(quantile_company_location = ntile(rating, 4))

table(by_ChocolateBar_quartile$company_location, by_ChocolateBar_quartile$quantile_company_location)
##            
##              1  2  3  4
##   Amsterdam  3  3  3  3
##   Argentina  3  2  2  2
##   Australia 14 13 13 13
##   Austria    8  8  7  7
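
To compare the ungrouped and grouped results without downloading the data, here is a small sketch on a made-up tibble. The names toy, location, overall_half and within_half are hypothetical and chosen only for illustration:

```r
library(dplyr)

# Hypothetical toy data, not the TidyTuesday chocolate data
toy <- tibble(
  location = c("A", "A", "A", "A", "B", "B", "B", "B"),
  rating   = c(2.5, 3.0, 3.5, 4.0, 3.5, 3.75, 4.0, 4.0)
)

toy_ranked <- toy %>%
  mutate(overall_half = ntile(rating, 2)) %>%  # ranked over all 8 rows
  group_by(location) %>%
  mutate(within_half = ntile(rating, 2)) %>%   # ranked within each location
  ungroup()

toy_ranked$overall_half
# 1 1 1 2 1 2 2 2
toy_ranked$within_half
# 1 1 2 2 1 1 2 2
```

Without group_by(), location B lands mostly in the upper half because its ratings are higher overall. With group_by(), each location is split into its own lower and upper halves.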

3. Filter ChocolateBar according to rating median

median(ChocolateBar$rating)
## [1] 3.25
range(ChocolateBar$rating)
## [1] 2.5 4.0
# keep only the lower half of the observations by rank (ratings at or below the median)
ChocolateBar <- filter(ChocolateBar, ntile(rating, 2) < 2)
range(ChocolateBar$rating)
## [1] 2.50 3.25
dim(ChocolateBar)
## [1] 52  3
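
Filtering on ntile() is not the same as filtering on the median value when ratings are tied at the median. A minimal sketch on made-up ratings (the tibble toy and the result names are hypothetical) illustrates the difference:

```r
library(dplyr)

# Hypothetical ratings with three values tied at the median
toy <- tibble(rating = c(2.5, 3.0, 3.25, 3.25, 3.25, 4.0))

median(toy$rating)
# 3.25

# Lower half by rank: always half of the rows, ties split by position
lower_half <- filter(toy, ntile(rating, 2) < 2)

# Strictly below the median value: tied rows are all dropped
below_median <- filter(toy, rating < median(rating))

lower_half$rating
# 2.5 3.0 3.25
below_median$rating
# 2.5 3.0
```

The ntile() filter keeps exactly half of the rows and may include observations equal to the median, which matches the range of 2.50 to 3.25 seen above.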


Is it helpful?

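It is very useful for turning a continuous predictor variable into ranked categories of roughly equal size. Other functions such as cut() can also bin data, but cut() splits by value ranges rather than by rank, so its groups can differ widely in size.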
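
As a rough illustration of that difference, the sketch below bins a made-up skewed vector x (hypothetical values, not from the chocolate data) with both functions:

```r
library(dplyr)

# Hypothetical skewed values with one large outlier
x <- c(1, 2, 3, 4, 100)

# ntile(): groups by rank, so sizes are as equal as possible (3 and 2)
ntile(x, 2)
# 1 1 1 2 2

# cut(): two equal-width intervals over the range of x,
# so the outlier sits alone in the second bin
as.integer(cut(x, breaks = 2))
# 1 1 1 1 2
```

ntile() is the better fit when equal group sizes matter, while cut() is the better fit when the bin boundaries themselves should be meaningful values.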