Part 5 contd (Class 6): Data summarizing, reshaping, and wrangling with multiple tables (contd)

Materials from class on Wednesday, February 9, 2022

R Project files

In this class we finished the part5 material from this folder. Please download the folder and be sure to unzip it if necessary. Knit part5.Rmd to install any required packages.

Class Video

View last year’s class and materials here.

Slides

No slides this class.

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

  • Clearest Point: What was the most clear part of the lecture?
  • Muddiest Point: What was the most unclear part of the lecture to you?
  • Anything Else: Is there something you’d like me to know?

https://forms.gle/4tVx1mL7SzQx7MCu5

Muddiest Points

I’m still confused about what makes a list special (I know we’re going to talk about it more later). I loved the walk-through of summarize with across but I need some practice with that before it becomes completely clear – I hope it’ll be on the HW! I also have trouble visualizing facet wraps and the necessary pivoting without actually trying it and watching my code break. Maybe that just takes practice!

Yeah, sorry! I was trying to avoid talking about lists until we could cover them fully, but it turns out they are hard to avoid. We will talk more about lists in class 8, along with functions!

In class 7 (part6) we will have more examples with summarize with across, and also facet wraps and pivoting. Basically, class 7 is a perfect response to this comment, even though I read this comment after I created the materials. Glad to be on the same wavelength =)
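In the meantime, here is a minimal sketch of the pivot-then-facet pattern from the comment above. The toy_wide tibble and its values are made up for illustration (they are not the penguins data), and the column names measurement and value are arbitrary choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# hypothetical toy data in "wide" form: one column per measurement
toy_wide <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, 38.5, 47.5, 46.1),
  flipper_length_mm = c(181, 186, 217, 211)
)

# pivot to "long" form: one row per species-by-measurement value
toy_long <- toy_wide %>%
  pivot_longer(
    cols = contains("length"),
    names_to = "measurement",
    values_to = "value"
  )

# facet_wrap() needs ONE column that says which panel a row belongs to,
# which is exactly what pivot_longer() created in `measurement`
p <- ggplot(toy_long) +
  aes(x = species, y = value) +
  geom_point() +
  facet_wrap(vars(measurement), scales = "free_y")

p
```

The pivot is needed because facet_wrap() splits the plot by the values of a single column, so the measurement names have to live in a column rather than in the column headers.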

I was confused by the “.fns =” inside “summarize(across())”. I know it specifies the function, but I kept getting confused by how to code after that.

This is hard stuff. I think it will make more sense after we talk about functions in part7 (probably class 8) and how to use them with purrr, since the syntax is similar. Stay tuned for a couple more examples with summarize(across()) in part6 (class 7). I also can’t emphasize enough how much I recommend reading the reference on across here, as well as the references for any other tidyverse functions you find confusing. In the meantime, here’s a quick explanation.

One thing to remember is that when using summarize(), the function you apply must return a single value, that is, a vector of length one. Otherwise, it’s not a summary statistic. Examples include n_distinct(), length(), sum(), and min().
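As a small illustration of the length-one rule, the sketch below summarizes a made-up four-row tibble (toy and its values are hypothetical, not the penguins data). Every function used returns exactly one value, so the result is a single row.

```r
library(dplyr)

# hypothetical toy data with one missing value
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, NA, 47.5, 46.1)
)

toy_summary <- toy %>%
  summarize(
    n_species = n_distinct(species),              # one number
    n_rows    = n(),                              # one number
    n_missing = sum(is.na(bill_length_mm)),       # one number
    shortest  = min(bill_length_mm, na.rm = TRUE) # one number
  )

toy_summary
```

A function such as range(), which returns two values, would not fit this pattern.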

library(tidyverse)
library(palmerpenguins)
library(gt)

penguins %>%
  summarize(
    # all the code for the column specification AND the function goes in across()
    across(
      # use tidyselect to specify the columns
      .cols = contains("length"),
      # we can specify a list() of functions to apply
      # to add a suffix to column names of result, name the functions
      # the ~ in front specifies a custom function is next, .x is the argument
      # or use built in functions
      .fns = list(mean_cm = ~ mean(.x/10, na.rm = TRUE),
                  n_miss = ~ sum(is.na(.x)),
                  min = min,
                  max = max
                  ), # end list
      # add additional argument for min and max
      na.rm = TRUE,
      # use "." to separate the col name & the function name
      .names = "{.col}.{.fn}"
    ) # end across
  ) %>% # end summarize
  gt()
Result (one row; columns listed vertically for readability):

bill_length_mm.mean_cm       4.392193
bill_length_mm.n_miss        2
bill_length_mm.min           32.1
bill_length_mm.max           59.6
flipper_length_mm.mean_cm    20.09152
flipper_length_mm.n_miss     2
flipper_length_mm.min        172
flipper_length_mm.max        231
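To make the “how to code after .fns =” question more concrete, here is a hedged sketch of three ways to hand a function to across(), using a made-up toy tibble (the data and object names are hypothetical). The formula (~) version and the function(x) version do the same thing; the bare function name is the shortest but leaves no room for extra arguments inside the call.

```r
library(dplyr)

# hypothetical toy data with one missing bill length
toy <- tibble(
  bill_length_mm = c(39.1, 47.5, NA),
  flipper_length_mm = c(181, 217, 195)
)

# 1. bare function name: mean() is called as-is, so the NA propagates
bare <- toy %>%
  summarize(across(everything(), mean))

# 2. formula shorthand: ~ starts a custom function, .x is the column
lambda <- toy %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))

# 3. ordinary anonymous function: same result as the formula version
anon <- toy %>%
  summarize(across(everything(), function(x) mean(x, na.rm = TRUE)))

bare
lambda
anon
```

Reading ~ mean(.x, na.rm = TRUE) as shorthand for function(x) mean(x, na.rm = TRUE) is usually enough to decode most .fns arguments.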

The most difficult part was towards the end when we were working with long data and were graphing it. Are there other examples of geom_tile()?

Towards the end when going over some of the ggplot section.

I do have a couple more examples of geom_tile() in part6, but they are at the end, so I’m not confident we will get to them. We will go over ggplot with long data a lot in part6 (class 7), though, so I hope that will help.

geom_tile() works best on summarized data, for instance showing the mean of a numeric variable within groups:

penguin_means <- penguins %>%
  group_by(species, island) %>%
  summarize(mb = mean(bill_length_mm, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
penguin_means
## # A tibble: 5 × 3
## # Groups:   species [3]
##   species   island       mb
##   <fct>     <fct>     <dbl>
## 1 Adelie    Biscoe     39.0
## 2 Adelie    Dream      38.5
## 3 Adelie    Torgersen  39.0
## 4 Chinstrap Dream      48.8
## 5 Gentoo    Biscoe     47.5
ggplot(penguin_means) + 
  aes(x = island,
      y = species,
      fill = mb) + 
  geom_tile() +
  labs(fill = "Mean bill length (mm)")
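As one more geom_tile() sketch, the example below tiles counts instead of means and prints each count on its tile. The toy tibble is made up for illustration (it is not the penguins data), and the name n_penguins is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data: one row per observed penguin
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island  = c("Biscoe", "Dream", "Dream", "Biscoe", "Biscoe", "Dream")
)

# count() summarizes to one row per species-island combination
toy_counts <- toy %>%
  count(species, island, name = "n_penguins")

tile_plot <- ggplot(toy_counts) +
  aes(x = island,
      y = species,
      fill = n_penguins) +
  geom_tile() +
  # write the count on top of each tile
  geom_text(aes(label = n_penguins), color = "white") +
  labs(fill = "Number of penguins")

tile_plot
```

Combinations that never occur in the data have no row in toy_counts, so they show up as empty cells in the plot.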

Clearest Points

Lots of summarize(), join, pivot! Thanks, all!

Other Notes

The very best part of this class is the strange and useful tidbits that aren’t even on the syllabus!

Well that’s good to know! I’ll try to go on more tangents =)

I’m interested in doing more with summary tables

Yes, this is useful. My plan is to get to this when we talk about statistical modeling and summary tables of cohorts/data. I hope we get there; it will come after we talk about lists/purrr.

I think I’m getting a little turned around as functions are added, used in concert and combined with tips for advanced users. A main, base take-away for primary functions etc. would help me integrate new concepts to previous ones.

Good feedback, thank you! I will try to do this more. I’m hoping part6 will give everyone a chance to practice what we’ve learned so far and solidify these concepts before we move on to the next section on lists, functions, and purrr.