Part 3: ggplot2, factors, boxplots, dplyr: subsetting

Materials from class on Wednesday, January 19, 2022

R Project files

Please download the part3 folder for class materials. Use the grey “download” button to download the whole folder, please keep the file structure and folder organization exactly the same as we need this for class. Be sure to unzip if necessary. You may move the folder part3 wherever you like on your computer.

Readings

Required and suggested class readings can be found on the Readings tab by class. These readings may be done anytime before or after class, but they will supplement your understanding of the class materials and help make homework and project work easier.

Class Video

The class video is here, but I forgot to video tape the part about here::here(). If I have a chance I will re-record myself talking about it, but in the meantime, click here for Ted’s video from last year, which explains similar ideas.

View last year’s class and materials here.

Slides

Open the class introduction slides in a separate window: https://sph-r-programming-2022.netlify.app/slides/03-ggplot2-dplyr-part1s#1

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous. The first two questions will count toward your attendance part of the grade.

  • Clearest Point: What was the most clear part of the lecture?
  • Muddiest Point: What was the most unclear part of the lecture to you?
  • Anything Else: Is there something you’d like me to know?

https://forms.gle/4tVx1mL7SzQx7MCu5

Muddiest Points

I didn’t get fill to work, need to try again

Yes this was a curve ball (assuming you mean fill() from the challenge), I expected it to be trickier since I didn’t give you any idea how to use it =) Keep practicing and let me know if it still doesn’t work for you! It’s a similar idea to in Excel where you can you “fill down” (Edit -> Fill -> Down in Excel) which fills the same value down rows in the same column. You can see some examples in ?tidyr::fill and here is the solution (also in the smoke_messy.Rmd file in “part3” folder on dropbox)

smoke_messy <- read_excel("data/smoke_complete.xlsx",
                          sheet = "smoke_messy",
                          skip = 5,
                          na = c("missing", "Missing","", "NA"))

smoke_clean <- smoke_messy %>% 
  janitor::clean_names() %>% 
  janitor::remove_empty(which=c("rows", "cols")) %>% 
  tidyr::fill(tumor_stage, .direction = "down") %>% # fill the empty tumor_stage variable down
  select(-notes)

Finding all of the problems in data files

Ah yes this will take practice. In future classes I will try to show more examples of looking for problems in data files. Class 4 will use a relatively clean data file again, but I will try to make things trickier in the future, for practice.

Read excel and how to select for tabs within the file

Please see the explanation in Class 2 page for the answer to a similar question under Muddiest Points, and re-watch the loading data review from last class. I think the key here is that “tab” is called “sheet” in excel and in the read_excel() function. So read_excel("filename.xlsx", sheet = 1) is the first tab/sheet. If it’s still troublesome please set up a 1:1 with me or Colin for more help. I may be misunderstanding the question!

“here” function. I wonder how the here function to indicate the specific data folder. Since each part folder has a data folder.

here() is relative to your project folder. Since each part has it’s own Rstudio project associated with it (there is a .Rproj file in each part folder) it only looks in the “data/” folder that is inside that root folder (defined as where the .Rproj file lives). See the excerpt on here from Ted’s video, as well. I will keep using here examples as well to get more practice.

I keep getting confused on which variables are independent/dependent, which ones go on x-axis and y-axis. It’s so simple, this is not a reflection of your teaching, but in all of my classes throughout the years my brain struggles with knowing how to figure it out

This actually is not always clear, so don’t feel bad for being confused! Honestly I don’t think people take much care in thinking about this when plotting. Remember that independent/dependent variables are important in a statistical model, not always in a graph. Those terms might not make sense in the context of what we are graphing. If there is an independent and dependent variable and we want to make a scatterplot, we can technically graph it either way. Though usually if we have a dependent variable (our outcome) and an independent variable (our predictor, or covariate), we plot the outcome as Y (on Y axis) and the predictor as X (on X axis). This is because as we learned back in algebra a billion years ago,

\(Y = f(X)\)

“Y is a function of X.”

I wouldn’t worry that much about it in graphing, though. You might just be plotting two “related” variables against each other, not making assumptions about any causal pathways.

I was a little unclear about how the data wrangling and the ggplots connected to one another. Are they two separate ideas?

Excellent point, I haven’t really merged those ideas together yet. Often, data wrangling is needed first to do what you want to do in ggplot, but so far we’ve covered relatively simple actions in both and haven’t put them together. We start to put them together a bit in part 4, and when we talk about merging/joining data and reshaping data next week we will see data wrangling and ggplot fit together more.

What makes tidyselect stuff different from normal select? The ability to select names based on partial matches? (eg “ends in s”)

Yeah this is confusing for sure, and honestly I am still somewhat confused as it’s a newish concept to tidyverse. tidyselect we can think of as a language (or a backend), it is used in multiple functions, including select() but also across() as we will see in class 4, pivot_longer(), rename_with(), and many more. However, it also has to do with the tidyverse’s use of “unquoted” variable names in all the functions, and how to deal with that when making your own functions. There are some examples in this vignette that involve enquo(), expr(), etc (but I don’t recommend going down that rabbit hole just yet). For now, I would just get familiar with all of the selection helpers listed in this vignette. They do show how you would select names based on partial matches or other characteristics.

Clearest Points

Thanks everyone for answering these, it helps me see what is wokring!

Nice review of loading data

The functions such as select and filter

data manipulation using

I felt like I understand basic data wrangling in R now.

Using the pipe

the pipe!

I really like how you laid out the formatting for piping and for ggplot code structuring

here, filter & select

Everything covered in class was very clear for me.

Messages to me

Please explain the aspects of how to remove columns.

I’ll show more examples in class, but the simple way is with the - negative sign before the column name, such as:

smoke_complete %>% select(-tumor_stage)

But you can also use the more sophisticated tidyselect methods here. Say we wanted to remove all columns with column names that contain the word “day”:

smoke_complete %>% select(-contains("day"))

Class is going well!

Thank you! This class is always so interesting and useful!

Yay!!