ggplot2::geom_smooth()
Function of the Week: geom_smooth()
Mia Truman
2022-02-23
geom_smooth
In this document, I will introduce the geom_smooth() function and show what it’s for.
#load tidyverse up
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#load readxl to read the Excel files (ggplot2 is already attached by tidyverse)
library(readxl)
library(ggplot2)
#load the two datasets I will use
rats <- read_excel("/Users/miatruman/Documents/R/512/data/CH05Q09.xls")
smoking <- read_excel("smoke_1.xlsx", na = "NA")
What is it for?
The function geom_smooth() adds a smoothed trend line to a plot, which
helps the eye pick out a pattern, especially when there is overplotting.
With method = "lm" it also draws a fitted linear regression line.
If no method is specified, geom_smooth() chooses one based on the size of
the data: "loess" (local polynomial regression) is used when the largest
group has fewer than 1,000 observations; otherwise "gam" (a generalized
additive model fitted with mgcv::gam()) is used. Other common options are
"lm", "glm", and "rlm" (robust regression from the MASS package).
The following example fits a linear model. When plotting x vs y, a negative linear trend can be seen visually. geom_smooth(method = "lm") adds a regression line and a confidence band (95% by default) to illustrate the relationship between x and y more clearly. The default formula is y ~ x, but the formula argument can change this, for example to a quadratic fit with formula = y ~ poly(x, 2).
The following example uses data from Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2014), Applied Regression Analysis and Other Multivariable Methods (p. 90), Cengage Learning.
#make a sample scatterplot (na.rm is not an aesthetic, so it does not belong in aes())
rats_plot <- ggplot(rats,
                    aes(x = LOGDOSE,
                        y = LOGCONC)) +
  geom_point() +
  labs(title = "dose–response curve for vitamin K in rats",
       y = "concentration of clotting agent (log)",
       x = "dosage level (log)") +
  theme(plot.title = element_text(hjust = 0.5))
rats_plot
#now add the geom_smooth without specifying a method
rats_plot+geom_smooth()+ggtitle("with geom_smooth() (no method specified. default = loess)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#geom_smooth with method = "lm"; the formula defaults to y ~ x
rats_plot + geom_smooth(method = "lm") + ggtitle("with method = lm")
## `geom_smooth()` using formula 'y ~ x'
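To make the link to ordinary regression concrete, here is a minimal sketch that compares the line drawn by geom_smooth(method = "lm") with predictions from lm(). The rats file is not available outside this document, so the sketch assumes the built-in mtcars data instead; the object names (fit, p_lm, smooth_df, max_diff) are only illustrative.

```r
library(ggplot2)

# fit the regression directly with lm() on a built-in dataset (assumption: mtcars stands in for rats)
fit <- lm(mpg ~ wt, data = mtcars)

# the same model drawn by geom_smooth()
p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# layer 2 is the smooth; layer_data() returns the values ggplot2 computed for it
smooth_df <- layer_data(p_lm, 2)

# predict from the lm() fit at the same x positions and compare with the plotted line
pred <- predict(fit, newdata = data.frame(wt = smooth_df$x))
max_diff <- max(abs(smooth_df$y - pred))
max_diff
```

If the two agree, max_diff should be zero up to rounding error, which suggests the plotted line is the lm() fit evaluated on a grid of x values.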
#without the confidence interval
rats_plot + geom_smooth(method = "lm", se = FALSE) + ggtitle("se = FALSE to hide the confidence interval")
## `geom_smooth()` using formula 'y ~ x'
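The formula argument mentioned earlier can be sketched in the same way. This example is an illustration only: it assumes the built-in mtcars data rather than the rats data, and fits a quadratic with poly(x, 2) instead of the default y ~ x.

```r
library(ggplot2)

# quadratic fit: inside the formula, x and y refer to the mapped aesthetics, not column names
p_quad <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
  ggtitle("method = lm with formula = y ~ poly(x, 2)")

# extract the fitted curve; by default it is evaluated at 80 x values
quad_df <- layer_data(p_quad, 2)
nrow(quad_df)
```

Printing p_quad draws the curve. Whether a quadratic is appropriate depends on the data, so the straight-line and curved fits are worth comparing side by side.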
Another example: geom_smooth() can help when it is unclear whether a trend exists. In the scatterplot below, days to death and years smoked may be positively correlated, but this is not obvious from the points alone. Adding geom_smooth() makes the relationship clearer. No method is specified here, and because this dataset is large, the default is "gam".
#na.rm is not an aesthetic, so it is left out of aes(); rows with missing values are dropped with a warning
smoking_plot <- ggplot(smoking,
                       aes(x = years_smoked,
                           y = days_to_death)) +
  geom_point() +
  labs(title = "years smoked vs days to death",
       y = "days to death",
       x = "years smoked") +
  theme(plot.title = element_text(hjust = 0.5))
#plot it:
smoking_plot
## Warning: Removed 978 rows containing missing values (geom_point).
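The warning above comes from rows with missing values. One way to handle them explicitly is sketched below, assuming a small simulated data frame (toy) in place of the smoking file, which is not available here. The column names mirror the ones above, but the values are made up for illustration.

```r
library(ggplot2)

set.seed(42)
# simulated stand-in data: 50 complete rows plus 5 rows with a missing x value
toy <- data.frame(
  years_smoked  = c(runif(50, min = 0, max = 40), rep(NA, 5)),
  days_to_death = rnorm(55, mean = 500, sd = 100)
)

# option 1: drop incomplete rows before plotting
toy_complete <- toy[complete.cases(toy), ]

p_complete <- ggplot(toy_complete, aes(x = years_smoked, y = days_to_death)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# option 2: keep the data as is and pass na.rm = TRUE to each layer (not to aes())
p_na_rm <- ggplot(toy, aes(x = years_smoked, y = days_to_death)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "loess", formula = y ~ x, na.rm = TRUE)
```

Both options draw the same points. The second only silences the warning, so the first may be preferable when the number of dropped rows should stay visible in the code.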
smoking_plot+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 978 rows containing non-finite values (stat_smooth).
## Removed 978 rows containing missing values (geom_point).
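The two examples picked different defaults ("loess" for the small rats data, "gam" here). A minimal sketch of that size rule follows, using simulated data and a hypothetical helper, smooth_message(), that captures the message geom_smooth() prints when the plot is built. It assumes the mgcv package is installed, as it is with a standard R installation.

```r
library(ggplot2)

set.seed(1)
# hypothetical helper: simulated data with n rows
make_data <- function(n) data.frame(x = runif(n), y = rnorm(n))

# hypothetical helper: build the plot and return the first message geom_smooth() emits
smooth_message <- function(df) {
  p <- ggplot(df, aes(x = x, y = y)) + geom_smooth()
  tryCatch(
    {
      ggplot_build(p)
      NA_character_  # reached only if no message was emitted
    },
    message = function(m) conditionMessage(m)
  )
}

msg_small <- smooth_message(make_data(200))   # fewer than 1,000 rows
msg_large <- smooth_message(make_data(2000))  # 1,000 rows or more

msg_small
msg_large
```

The exact wording of the message differs between ggplot2 versions, but the first should name loess and the second gam. Setting method explicitly avoids relying on this rule.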
Is it helpful?
Yes. geom_smooth() is very useful for showing trends in data. It adds a fitted line to a ggplot, such as a linear regression line with method = "lm" or a flexible smoother by default. It is helpful when the type of model that fits your data is known, and also for quickly checking whether there is a relationship between two variables. It is very relevant to statistical analysis.