forcats::fct_lump()
Function of the Week:
Kathryn Liu
2022-02-02
fct_lump()
In this document, I will introduce the fct_lump() function and show what it’s for.
#load packages
pacman::p_load(
tidyverse,
readxl,
here,
janitor,
dplyr
)
#load data for examples
smoke_complete <- read_excel("data/smoke_complete.xlsx", sheet = 1, na = "NA")
#prepare data for analysis
smoke_sample <- smoke_complete %>%
#create small sample
sample_n(., 10) %>%
#keep only relevant fields
select(primary_diagnosis, tumor_stage, disease)
#output example data set for reference
(smoke_sample)
## # A tibble: 10 x 3
## primary_diagnosis tumor_stage disease
## <chr> <chr> <chr>
## 1 C34.1 stage ib LUSC
## 2 C53.9 not reported CESC
## 3 C34.1 stage ib LUSC
## 4 C34.1 stage ia LUSC
## 5 C67.1 stage ii BLCA
## 6 C34.1 stage ia LUSC
## 7 C34.9 stage iiib LUSC
## 8 C67.9 stage iii BLCA
## 9 C34.1 stage iib LUSC
## 10 C34.9 stage ib LUSC
What is it for?
fct_lump() is a family of functions used for bucketing levels into a single category (called “Other” by default) based on their frequencies. There are four functions within the fct_lump() family.
fct_lump_min()
fct_lump_prop()
fct_lump_n()
fct_lump_lowfreq()
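To compare the four functions side by side, here is a minimal sketch that applies each one to the same vector. The vector dx is hand-typed to match the primary_diagnosis counts in the sample above (an assumption made so the snippet runs without the Excel file), and the cutoffs are arbitrary values chosen for illustration.

```r
library(forcats)

# Hand-typed stand-in for smoke_sample$primary_diagnosis (counts 5, 2, 1, 1, 1)
dx <- c(rep("C34.1", 5), rep("C34.9", 2), "C53.9", "C67.1", "C67.9")

by_min     <- fct_lump_min(dx, min = 2)       # keep levels seen at least 2 times
by_prop    <- fct_lump_prop(dx, prop = 0.15)  # keep levels above 15% of the rows
by_n       <- fct_lump_n(dx, n = 1)           # keep the single most common level
by_lowfreq <- fct_lump_lowfreq(dx)            # lump only while "Other" stays smallest

# Compare which levels survive under each rule
levels(by_min)
levels(by_prop)
levels(by_n)
levels(by_lowfreq)
```

Each call returns a factor, and when a bucket is created it is placed last among the levels.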
1. fct_lump_min()
This function buckets levels which appear fewer than min times, where min is a threshold you specify.
Example
#create frequency table of primary_diagnosis for reference
freq_primary_diagnosis <- smoke_sample %>%
tabyl(primary_diagnosis)
#sort descending
freq_primary_diagnosis %>%
arrange(desc(percent))
## primary_diagnosis n percent
## C34.1 5 0.5
## C34.9 2 0.2
## C53.9 1 0.1
## C67.1 1 0.1
## C67.9 1 0.1
#Set any levels which occur fewer than 2 times to "Suppressed". Keep all other entries as-is.
(fct_lump_min(smoke_sample$primary_diagnosis, 2, other_level = "Suppressed"))
## [1] C34.1 Suppressed C34.1 C34.1 Suppressed C34.1
## [7] C34.9 Suppressed C34.1 C34.9
## Levels: C34.1 C34.9 Suppressed
#Note that the other_level = "name" argument can be used to specify a bucket name aside from the default "Other".
Is it helpful?
I would use this function if I had concerns about HIPAA compliance and wanted to suppress information relating to rare traits, diseases, or procedures that might identify a patient.
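As a rough sketch of that suppression workflow, the snippet below lumps rare diagnoses inside a dplyr pipeline before tabulating them. The tibble smoke_demo is a hand-typed copy of the primary_diagnosis column printed earlier, and min_cell_size is a hypothetical threshold; a real cutoff would come from your organization's disclosure policy.

```r
library(dplyr)
library(forcats)

# Hand-typed copy of primary_diagnosis from the sample printed above
smoke_demo <- tibble(
  primary_diagnosis = c("C34.1", "C53.9", "C34.1", "C34.1", "C67.1",
                        "C34.1", "C34.9", "C67.9", "C34.1", "C34.9")
)

# Hypothetical suppression threshold (an assumption for illustration)
min_cell_size <- 2

suppressed_counts <- smoke_demo %>%
  # replace any diagnosis seen fewer than min_cell_size times
  mutate(primary_diagnosis = fct_lump_min(primary_diagnosis,
                                          min = min_cell_size,
                                          other_level = "Suppressed")) %>%
  # tabulate after suppression so rare codes never appear in the output
  count(primary_diagnosis)

suppressed_counts
```

Lumping before counting means the rare codes are removed from the column itself rather than only hidden in the summary table.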
2. fct_lump_prop()
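This function buckets levels which make up no more than prop, a specified proportion, of the data. A negative prop reverses the rule and buckets the levels which make up more than that proportion.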
Example
#create frequency table of tumor_stage for reference
freq_tumor_stage <- smoke_sample %>%
tabyl(tumor_stage)
#sort descending
freq_tumor_stage %>%
arrange(desc(percent))
## tumor_stage n percent
## stage ib 3 0.3
## stage ia 2 0.2
## not reported 1 0.1
## stage ii 1 0.1
## stage iib 1 0.1
## stage iii 1 0.1
## stage iiib 1 0.1
#Set any levels which make up 10% or less of the data to "Rare", and keep all other levels as-is.
(fct_lump_prop(smoke_sample$tumor_stage, 0.10, other_level = "Rare"))
## [1] stage ib Rare stage ib stage ia Rare stage ia Rare Rare
## [9] Rare stage ib
## Levels: stage ia stage ib Rare
#Set any levels which make up more than 10% of the data to "Common", and keep all other levels as-is.
(fct_lump_prop(smoke_sample$tumor_stage, -0.10, other_level = "Common"))
## [1] Common not reported Common Common stage ii
## [6] Common stage iiib stage iii stage iib Common
## Levels: not reported stage ii stage iib stage iii stage iiib Common
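To check which side of the cutoff a level falls on, one option is to compute the shares by hand and compare them with what fct_lump_prop() keeps. This is a minimal sketch; the vector stage is a hand-typed copy of the tumor_stage column printed earlier (an assumption so the snippet runs without the data file).

```r
library(forcats)

# Hand-typed copy of tumor_stage from the sample printed above
stage <- c("stage ib", "not reported", "stage ib", "stage ia", "stage ii",
           "stage ia", "stage iiib", "stage iii", "stage iib", "stage ib")

# Share of rows per level, computed by hand
shares <- prop.table(table(stage))

# Levels strictly above the 10% cutoff
kept_by_hand <- names(shares)[shares > 0.10]

# Same cutoff applied through forcats
lumped <- fct_lump_prop(stage, prop = 0.10, other_level = "Rare")
kept_by_forcats <- setdiff(levels(lumped), "Rare")

# Both approaches should agree on which levels survive
setequal(kept_by_hand, kept_by_forcats)
```

Levels sitting exactly at the cutoff (here, 1 row out of 10) end up in the bucket, which matches the "10% or less" wording in the example above.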
Is it helpful?
This function could be useful if you wanted to bucket and blind only the rarest levels (or, with a negative proportion, only the most common ones). However, it would not be helpful if you wanted to bucket all entries of a particular field.
3. fct_lump_n()
This function keeps the n most common levels and buckets all remaining levels.
Example
#create frequency table of tumor_stage for reference
freq_tumor_stage <- smoke_sample %>%
tabyl(tumor_stage)
#sort descending
freq_tumor_stage %>%
arrange(desc(percent))
## tumor_stage n percent
## stage ib 3 0.3
## stage ia 2 0.2
## not reported 1 0.1
## stage ii 1 0.1
## stage iib 1 0.1
## stage iii 1 0.1
## stage iiib 1 0.1
#Keep the most common level of tumor_stage, and set any remaining levels to "Other".
fct_lump_n(smoke_sample$tumor_stage, 1)
## [1] stage ib Other stage ib Other Other Other Other Other
## [9] Other stage ib
## Levels: stage ib Other
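Ties can change how many levels fct_lump_n() actually keeps. The sketch below asks for the top 3 levels of tumor_stage, where five levels are tied at one row each. The vector stage is a hand-typed copy of the column printed earlier (an assumption so the snippet runs on its own); ties.method is a documented argument of fct_lump_n() whose default is "min".

```r
library(forcats)

# Hand-typed copy of tumor_stage from the sample printed above
stage <- c("stage ib", "not reported", "stage ib", "stage ia", "stage ii",
           "stage ia", "stage iiib", "stage iii", "stage iib", "stage ib")

# Default ties.method = "min": every level tied for third place counts as
# rank 3, so all of them are kept and nothing is bucketed
top3_default <- fct_lump_n(stage, n = 3)
nlevels(top3_default)

# ties.method = "first": ties are broken by level order, so exactly three
# levels are kept and the rest go to "Other"
top3_first <- fct_lump_n(stage, n = 3, ties.method = "first")
levels(top3_first)
```

With the default setting, asking for 3 levels can return more than 3, so it is worth checking the result with levels() or fct_count() whenever the data contain ties.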
Is it helpful?
I cannot think of a situation in my work where this function would be useful. I would prefer to use a combination of the tabyl() and arrange() functions (as seen at the top of the code for this example) to determine the most common level(s) and make bucketing determinations after manual review, as frequencies may change with refreshed data sets.
4. fct_lump_lowfreq()
This function buckets the least frequent levels into “Other”, while still keeping the “Other” bucket as the smallest level. If no combination of rare levels satisfies that condition, nothing is bucketed.
Example
#example where no levels are bucketed
#create frequency table of primary_diagnosis for reference
freq_primary_diagnosis <- smoke_sample %>%
tabyl(primary_diagnosis)
#sort descending
freq_primary_diagnosis %>%
arrange(desc(percent))
## primary_diagnosis n percent
## C34.1 5 0.5
## C34.9 2 0.2
## C53.9 1 0.1
## C67.1 1 0.1
## C67.9 1 0.1
#Bucket levels which make up the lowest frequency
fct_lump_lowfreq(smoke_sample$primary_diagnosis)
## [1] C34.1 C53.9 C34.1 C34.1 C67.1 C34.1 C34.9 C67.9 C34.1 C34.9
## Levels: C34.1 C34.9 C53.9 C67.1 C67.9
#example where levels are bucketed
#create frequency table of disease for reference
freq_disease <- smoke_sample %>%
tabyl(disease)
#sort descending
freq_disease %>%
arrange(desc(percent))
## disease n percent
## LUSC 7 0.7
## BLCA 2 0.2
## CESC 1 0.1
#Bucket levels which make up the lowest frequency
fct_lump_lowfreq(smoke_sample$disease)
## [1] LUSC Other LUSC LUSC Other LUSC LUSC Other LUSC LUSC
## Levels: LUSC Other
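One way to read these two results is to compare the size of the would-be bucket with the levels left outside it. The sketch below is a hand check under that reading, not a reproduction of the forcats internals. The vectors dx and disease are hand-typed copies of the columns printed earlier (an assumption so the snippet runs without the data file).

```r
library(forcats)

# Hand-typed copies of the sample columns printed above
dx <- c("C34.1", "C53.9", "C34.1", "C34.1", "C67.1",
        "C34.1", "C34.9", "C67.9", "C34.1", "C34.9")
disease <- c("LUSC", "CESC", "LUSC", "LUSC", "BLCA",
             "LUSC", "LUSC", "BLCA", "LUSC", "LUSC")

# disease: bucketing CESC and BLCA gives "Other" = 3, still smaller than LUSC = 7
disease_counts <- fct_count(fct_lump_lowfreq(disease))
disease_counts

# primary_diagnosis: bucketing the three single-row codes would give a bucket
# of 3, which is larger than C34.9 (2 rows), so "Other" would not be smallest
dx_sorted <- sort(table(dx))
singleton_bucket <- sum(dx_sorted[1:3])
next_smallest <- dx_sorted[[4]]
singleton_bucket > next_smallest

# Consistent with that, forcats leaves all five levels untouched
nlevels(fct_lump_lowfreq(dx))
```

So the first example is not an error: with these counts there is simply no bucket that would remain the smallest level.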
Is it helpful?
I cannot think of a scenario in my work where I would find this helpful.