forcats::fct_lump()

Function of the Week:

fct_lump()

In this document, I will introduce the fct_lump() function and show what it’s for.

#load packages
pacman::p_load(
  tidyverse,
  readxl,
  here,         
  janitor,
  dplyr  
)

#load data for examples
smoke_complete <- read_excel("data/smoke_complete.xlsx", sheet = 1, na = "NA")

#prepare data for analysis
smoke_sample <- smoke_complete %>% 
  #create small sample
  sample_n(10) %>%
  #keep only relevant fields
  select(primary_diagnosis, tumor_stage, disease)
  
#output example data set for reference
(smoke_sample)
## # A tibble: 10 x 3
##    primary_diagnosis tumor_stage  disease
##    <chr>             <chr>        <chr>  
##  1 C34.1             stage ib     LUSC   
##  2 C53.9             not reported CESC   
##  3 C34.1             stage ib     LUSC   
##  4 C34.1             stage ia     LUSC   
##  5 C67.1             stage ii     BLCA   
##  6 C34.1             stage ia     LUSC   
##  7 C34.9             stage iiib   LUSC   
##  8 C67.9             stage iii    BLCA   
##  9 C34.1             stage iib    LUSC   
## 10 C34.9             stage ib     LUSC

What is it for?

fct_lump() is a family of functions used for bucketing (lumping) factor levels into a single catch-all level, named “Other” by default, based on their frequencies. There are four functions within the fct_lump() family.

  1. fct_lump_min()

  2. fct_lump_prop()

  3. fct_lump_n()

  4. fct_lump_lowfreq()
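
To make the differences concrete, here is a minimal sketch that applies all four functions to the same toy vector. The vector x and its counts are made up for illustration and are not part of the smoke_complete data.

```r
library(forcats)

# Hypothetical toy data: "a" appears 5 times, "b" 3 times, "c" and "d" once each
x <- c(rep("a", 5), rep("b", 3), "c", "d")

# Lump levels that appear fewer than 2 times
by_min <- fct_lump_min(x, min = 2)

# Lump levels that make up 20% or less of the data
by_prop <- fct_lump_prop(x, prop = 0.2)

# Keep only the single most common level
by_n <- fct_lump_n(x, n = 1)

# Lump the rarest levels while "Other" stays the smallest level
by_lowfreq <- fct_lump_lowfreq(x)

levels(by_min)
levels(by_prop)
levels(by_n)
levels(by_lowfreq)
```

On this toy vector, three of the calls keep a and b and lump the rest, while fct_lump_n() with n = 1 keeps only a. The functions differ in how the cut-off is specified, not in what the lumped result looks like.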

1. fct_lump_min()

This function buckets levels which appear fewer than min times, where min is the threshold you supply.

Example

#create frequency table of primary_diagnosis for reference
freq_primary_diagnosis <- smoke_sample %>%
  tabyl(primary_diagnosis)
#sort descending
freq_primary_diagnosis %>%
  arrange(desc(percent))
##  primary_diagnosis n percent
##              C34.1 5     0.5
##              C34.9 2     0.2
##              C53.9 1     0.1
##              C67.1 1     0.1
##              C67.9 1     0.1
#Set any levels which occur fewer than 2 times to "Suppressed". Keep all other entries as-is.
(fct_lump_min(smoke_sample$primary_diagnosis, 2, other_level = "Suppressed"))
##  [1] C34.1      Suppressed C34.1      C34.1      Suppressed C34.1     
##  [7] C34.9      Suppressed C34.1      C34.9     
## Levels: C34.1 C34.9 Suppressed
#Note that the other_level = "name" argument can be used to specify a bucket name aside from the default "Other".

Is it helpful?

I would use this function if I had concerns about HIPAA compliance and wanted to suppress information relating to rare traits, diseases, or procedures that might identify a patient.
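
As a rough sketch of that suppression workflow, the same call can sit inside mutate() so that a summary table never shows the rare codes. The data frame toy below is hypothetical; it only mirrors the frequencies from the primary_diagnosis table above.

```r
library(dplyr)
library(forcats)

# Hypothetical data frame rebuilt from the frequency table above
toy <- data.frame(
  primary_diagnosis = c(rep("C34.1", 5), rep("C34.9", 2), "C53.9", "C67.1", "C67.9")
)

# Replace diagnoses seen fewer than 2 times, then count what is left
suppressed <- toy %>%
  mutate(
    primary_diagnosis = fct_lump_min(primary_diagnosis, min = 2, other_level = "Suppressed")
  ) %>%
  count(primary_diagnosis)

suppressed
```

The three single-count diagnoses end up in one "Suppressed" row, so the individual codes no longer appear in the summary.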

2. fct_lump_prop()

This function buckets levels which make up a proportion of the data less than or equal to prop, the specified threshold.

Example

#create frequency table of tumor_stage for reference
freq_tumor_stage <- smoke_sample %>%
  tabyl(tumor_stage)
#sort descending
freq_tumor_stage %>%
  arrange(desc(percent))
##   tumor_stage n percent
##      stage ib 3     0.3
##      stage ia 2     0.2
##  not reported 1     0.1
##      stage ii 1     0.1
##     stage iib 1     0.1
##     stage iii 1     0.1
##    stage iiib 1     0.1
#Set any levels which make up 10% or less of the data to "Rare", and keep all other levels as-is.
(fct_lump_prop(smoke_sample$tumor_stage, 0.10, other_level = "Rare"))
##  [1] stage ib Rare     stage ib stage ia Rare     stage ia Rare     Rare    
##  [9] Rare     stage ib
## Levels: stage ia stage ib Rare
#A negative proportion reverses the rule: set any levels which make up more than 10% of the data to "Common", and keep all other levels as-is.
(fct_lump_prop(smoke_sample$tumor_stage, -0.10, other_level = "Common"))
##  [1] Common       not reported Common       Common       stage ii    
##  [6] Common       stage iiib   stage iii    stage iib    Common      
## Levels: not reported stage ii stage iib stage iii stage iiib Common

Is it helpful?

This function could be useful if you wanted to bucket, and so blind, only the rarest levels (or, with a negative proportion, only the most common ones). However, it would not be helpful if you wanted to bucket all entries of a particular field.
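
One detail worth checking by hand is what happens to a level that sits exactly on the threshold. The sketch below rebuilds the tumor_stage frequencies as a hypothetical vector called stage, works out the proportions with base R, and compares the result with fct_lump_prop().

```r
library(forcats)

# Hypothetical vector rebuilt from the tumor_stage frequency table above
stage <- c(
  rep("stage ib", 3), rep("stage ia", 2),
  "not reported", "stage ii", "stage iib", "stage iii", "stage iiib"
)

# Proportion of the data taken up by each level
props <- prop.table(table(stage))

# Levels that are strictly above 10% by hand
kept_by_hand <- names(props)[as.vector(props) > 0.10]

# Levels that fct_lump_prop() keeps at the same threshold
lumped <- fct_lump_prop(stage, prop = 0.10, other_level = "Rare")
kept_by_forcats <- setdiff(levels(lumped), "Rare")

kept_by_hand
kept_by_forcats
```

Both approaches keep only stage ia and stage ib. The levels that make up exactly 10% of the data are lumped, which matches the "10% or less" wording in the example.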

3. fct_lump_n()

This function keeps the n most common levels and buckets all of the remaining, less frequent levels.

Example

#create frequency table of tumor_stage for reference
freq_tumor_stage <- smoke_sample %>%
  tabyl(tumor_stage)
#sort descending
freq_tumor_stage %>%
  arrange(desc(percent))
##   tumor_stage n percent
##      stage ib 3     0.3
##      stage ia 2     0.2
##  not reported 1     0.1
##      stage ii 1     0.1
##     stage iib 1     0.1
##     stage iii 1     0.1
##    stage iiib 1     0.1
#Keep the most common level of tumor_stage, and set any remaining levels to "Other".
fct_lump_n(smoke_sample$tumor_stage, 1)
##  [1] stage ib Other    stage ib Other    Other    Other    Other    Other   
##  [9] Other    stage ib
## Levels: stage ib Other

Is it helpful?

I cannot think of a situation in my work where this function would be useful. I would prefer to use a combination of the tabyl() and arrange() functions (as seen at the top of the code for this example) to determine the most common level(s) and make bucketing determinations after manual review, as frequencies may change with refreshed data sets.
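
One behavior to be aware of before relying on fct_lump_n() is how it treats ties. The sketch below uses a made-up vector x where two levels tie for second place; the ties.method argument is passed through to base R's rank(), as far as I understand the forcats documentation.

```r
library(forcats)

# Hypothetical toy data: "a" appears 3 times, "b" and "c" tie at 2, "d" and "e" appear once
x <- c(rep("a", 3), rep("b", 2), rep("c", 2), "d", "e")

# Default tie handling: both tied levels are kept, so 3 levels survive even though n = 2
default_ties <- fct_lump_n(x, n = 2)

# Breaking ties by order of appearance in the levels keeps exactly 2 levels
first_ties <- fct_lump_n(x, n = 2, ties.method = "first")

levels(default_ties)
levels(first_ties)
```

With the default, a, b, and c are all kept; with ties.method = "first", only a and b are kept and c joins "Other". This is another reason to review the frequency table before settling on a value for n.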

4. fct_lump_lowfreq()

This function buckets the least frequent levels into “Other”, but only as far as it can while ensuring that “Other” remains the smallest level.

Example

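#example where nothing is lumped: combining the three single-count levels would give "Other" 3 entries, more than C34.9 (2), so "Other" would no longer be the smallest level
#create frequency table of primary_diagnosis for reference
freq_primary_diagnosis <- smoke_sample %>%
  tabyl(primary_diagnosis)
#sort descending
freq_primary_diagnosis %>%
  arrange(desc(percent))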
##  primary_diagnosis n percent
##              C34.1 5     0.5
##              C34.9 2     0.2
##              C53.9 1     0.1
##              C67.1 1     0.1
##              C67.9 1     0.1
#Bucket the levels with the lowest frequencies
fct_lump_lowfreq(smoke_sample$primary_diagnosis)
##  [1] C34.1 C53.9 C34.1 C34.1 C67.1 C34.1 C34.9 C67.9 C34.1 C34.9
## Levels: C34.1 C34.9 C53.9 C67.1 C67.9
#example where levels are lumped
#create frequency table of disease for reference
freq_disease <- smoke_sample %>%
  tabyl(disease)
#sort descending
freq_disease %>%
  arrange(desc(percent))
##  disease n percent
##     LUSC 7     0.7
##     BLCA 2     0.2
##     CESC 1     0.1
#Bucket the levels with the lowest frequencies
fct_lump_lowfreq(smoke_sample$disease)
##  [1] LUSC  Other LUSC  LUSC  Other LUSC  LUSC  Other LUSC  LUSC 
## Levels: LUSC Other
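
The two results above can be checked with a little arithmetic. This is only my reading of why the first call leaves everything alone, sketched with hypothetical vectors (diagnosis and disease) rebuilt from the two frequency tables.

```r
library(forcats)

# Hypothetical vectors rebuilt from the frequency tables above
diagnosis <- c(rep("C34.1", 5), rep("C34.9", 2), "C53.9", "C67.1", "C67.9")
disease <- c(rep("LUSC", 7), rep("BLCA", 2), "CESC")

# Counts per diagnosis as a plain named vector
tab <- table(diagnosis)
diagnosis_counts <- setNames(as.vector(tab), names(tab))

# Size "Other" would have if the three single-count diagnoses were combined
other_size <- sum(diagnosis_counts[diagnosis_counts == 1])

# "Other" would be larger than C34.9, so it could not be the smallest level
other_too_big <- other_size > diagnosis_counts[["C34.9"]]

diagnosis_levels <- levels(fct_lump_lowfreq(diagnosis))

# For disease, BLCA and CESC together (3) are still smaller than LUSC (7)
disease_levels <- levels(fct_lump_lowfreq(disease))

other_too_big
diagnosis_levels
disease_levels
```

For diagnosis, the combined bucket would hold 3 entries against 2 for C34.9, and all five levels come back unchanged. For disease, the combined bucket of 3 stays below LUSC at 7, so BLCA and CESC are lumped.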

Is it helpful?

I cannot think of a scenario in my work where I would find this helpful.