dplyr::ntile()

Function of the Week: dplyr::ntile()

In this document, I will introduce the ntile() function and show what it’s for.

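library(dplyr)         # provides ntile()
library(tidyverse)     # also attaches stringr, used below for str_detect()
library(tidytuesdayR)  # tt_load() downloads TidyTuesday data sets

# Chocolate bar ratings from the 2022-01-18 TidyTuesday release
tuesdata <- tt_load('2022-01-18')
ChocolateBar <- tuesdata$chocolate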

What is it for?

The ntile() function divides observations into a specified number of roughly equal-sized groups. It ranks the values of the variable of interest in ascending order, breaking ties by position, and then splits the ranks into consecutive groups numbered from 1 (lowest values) upward. It takes two arguments: a vector (x) and the number of groups as an integer (n, for example 4).
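To make "roughly equal sized" concrete, here is a minimal sketch that splits ten values into three groups. The input 1:10 is an illustrative choice, not part of the chocolate data.

```r
library(dplyr)

# 10 values cannot be split evenly into 3 groups,
# so the group sizes differ by at most one
groups <- ntile(1:10, 3)
groups

# Count how many values landed in each group
table(groups)
```

When the length of the vector is a multiple of n, all groups have exactly the same size, as in the examples that follow.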

Examples

vector <- rep(c(-1,1,2), each=4)
vector

##  [1] -1 -1 -1 -1  1  1  1  1  2  2  2  2

ntile(vector, 2)

##  [1] 1 1 1 1 1 1 2 2 2 2 2 2

vector <- rep(c(2,-1,1), each=4)
vector

##  [1]  2  2  2  2 -1 -1 -1 -1  1  1  1  1

ntile(vector, 2)

##  [1] 2 2 2 2 1 1 1 1 1 1 2 2
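In the second result the four 1s end up in different groups even though they are equal. This is what one would expect if ntile() ranks values the way row_number() does, giving every element a unique rank and breaking ties by position. The sketch below rebuilds the grouping by hand under that assumption; the names v, ranks and manual are only illustrative.

```r
library(dplyr)

v <- rep(c(2, -1, 1), each = 4)

# row_number() assigns a unique rank to every element;
# tied values are ranked in the order they appear
ranks <- row_number(v)
ranks

# With 12 values and 2 groups, ranks 1-6 form group 1 and ranks 7-12 form group 2
manual <- ifelse(ranks <= 6, 1L, 2L)

# Compare the hand-built grouping with ntile()
all(manual == ntile(v, 2))
```

So group membership for tied values depends on their position in the vector, which is worth keeping in mind when the data contain many repeated values.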

Chocolate Bar:

1. Divide the rating across all selected countries into four ranked groups

# How many countries?
n_distinct(ChocolateBar$company_location)

## [1] 67

# Keep only the locations whose names start with "A":
ChocolateBar <- ChocolateBar %>% 
  select(company_location, rating) %>% 
  filter(str_detect(company_location, "^A"))

table(ChocolateBar$company_location)

## 
## Amsterdam Argentina Australia   Austria 
##        12         9        53        30

ChocolateBar <- ChocolateBar %>% mutate(quantile_rating = ntile(rating, 4))
table(ChocolateBar$quantile_rating)

## 
##  1  2  3  4 
## 26 26 26 26

dim(ChocolateBar)

## [1] 104   3

2. Divide the rating within a country into four ranked groups

by_ChocolateBar_quartile  <- ChocolateBar %>% 
  group_by(company_location) %>% 
  mutate(quantile_company_location = ntile(rating, 4))

table(by_ChocolateBar_quartile$company_location, by_ChocolateBar_quartile$quantile_company_location)

##            
##              1  2  3  4
##   Amsterdam  3  3  3  3
##   Argentina  3  2  2  2
##   Australia 14 13 13 13
##   Austria    8  8  7  7
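The difference between calling ntile() on the whole data frame and calling it after group_by() is easier to see on a small made-up data set. The tibble toy and its column names below are hypothetical and exist only for this sketch.

```r
library(dplyr)

# Hypothetical toy data: location "A" has low ratings, "B" has high ratings
toy <- tibble(
  location = rep(c("A", "B"), each = 4),
  rating   = c(2.00, 2.25, 2.50, 2.75, 3.25, 3.50, 3.75, 4.00)
)

toy <- toy %>%
  mutate(overall = ntile(rating, 2)) %>%  # ranked across all rows
  group_by(location) %>%
  mutate(within = ntile(rating, 2)) %>%   # ranked separately inside each location
  ungroup()

toy
```

In the overall column every row of "A" falls in group 1 and every row of "B" in group 2, while the within column splits each location into its own lower and upper half.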

3. Filter ChocolateBar according to rating median

median(ChocolateBar$rating)

## [1] 3.25

range(ChocolateBar$rating)

## [1] 2.5 4.0

# Keep only the lower half of the observations (ratings at or below the median)
ChocolateBar <- filter(ChocolateBar, ntile(rating, 2) < 2)
range(ChocolateBar$rating)

## [1] 2.50 3.25

dim(ChocolateBar)

## [1] 52  3
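Note that the upper end of the filtered range is still 3.25, the median itself. Because ntile() splits by rank, rows tied at the median can land on either side of the cut. The following sketch, using hypothetical ratings, contrasts the rank-based filter with a strict comparison against median().

```r
library(dplyr)

# Hypothetical ratings with three values tied at the median
toy <- tibble(rating = c(2.5, 3, 3.25, 3.25, 3.25, 4))
median(toy$rating)

# Lowest-ranked half of the rows: may include rows equal to the median
lower_half <- filter(toy, ntile(rating, 2) < 2)

# Rows strictly below the median: excludes all ties
below_median <- filter(toy, rating < median(rating))

nrow(lower_half)
nrow(below_median)
```

Which of the two is appropriate depends on whether equal-sized halves or a strict threshold matters more for the analysis.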

Is it helpful?

It is very useful when we need to categorize continuous predictor variables. Other functions, such as cut(), can also bin data.
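As a rough comparison, the sketch below bins the same skewed vector with both functions. The vector x is made up for illustration, and cut() is called with a single number of breaks, which produces equal-width intervals.

```r
library(dplyr)

# Hypothetical skewed values
x <- c(1, 2, 3, 10, 20, 100)

# cut() with a single number of breaks makes equal-width intervals over range(x);
# labels = FALSE returns integer bin codes instead of a factor
width_bins <- cut(x, breaks = 3, labels = FALSE)

# ntile() makes groups with (roughly) equal counts instead
count_bins <- ntile(x, 3)

table(width_bins)
table(count_bins)
```

With skewed data the equal-width bins from cut() can be very unbalanced, while ntile() keeps the group sizes even regardless of how the values are spread.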

Last updated on February 28, 2022

Qijia Liu

2022-01-26
