Part 5 contd (Class 6): Data summarizing, reshaping, and wrangling with multiple tables (contd)
R Project files
In this class we finished the part5 material from this folder link. Please download this folder and unzip it if necessary. Knit part5.Rmd to install any required packages.
Class Video
View last year’s class and materials here.
Slides
No slides this class.
Post-Class
Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.
- Clearest Point: What was the most clear part of the lecture?
- Muddiest Point: What was the most unclear part of the lecture to you?
- Anything Else: Is there something you’d like me to know?
Muddiest Points
I’m still confused about what makes a list special (I know we’re going to talk about it more later). I loved the walk-through of summarize with across but I need some practice with that before it becomes completely clear – I hope it’ll be on the HW! I also have trouble visualizing facet wraps and the necessary pivoting without actually trying it and watching my code break. Maybe that just takes practice!
Yeah, sorry I was trying to avoid talking about lists until we can cover them fully but it turns out they are hard to avoid! We will talk about lists more in class 8, along with functions!
In class 7 (part6) we will have more examples with summarize with across, and also facet wraps and pivoting. Basically, class 7 is a perfect response to this comment, even though I read this comment after I created the materials. Glad to be on the same wavelength =)
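In the meantime, here is a minimal sketch of how pivoting and facet wraps fit together, in case it helps to have something to run and break. The names penguins_long, measurement, value, and facet_plot are just ones I picked for illustration, and this is not necessarily how the part6 examples are written.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(palmerpenguins)

# pivot three measurement columns into long format:
# one row per penguin per measurement
penguins_long <- penguins %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  pivot_longer(
    cols = -species,            # pivot everything except species
    names_to = "measurement",   # old column names go here
    values_to = "value"         # old cell values go here
  )

# facet_wrap() makes one panel per value of the measurement column,
# which is why the data needs to be long first
facet_plot <- ggplot(penguins_long) +
  aes(x = species, y = value) +
  geom_boxplot() +
  facet_wrap(~ measurement, scales = "free_y")

facet_plot
```

The key idea is that facet_wrap() needs a single column to split on, so the column names we want as panels first have to become values in one column.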
I was confused by the “.fns =” inside “summarize(across())”. I know it specifies the function, but I kept getting confused by how to code after that.
This is hard stuff. I think it will make a bit more sense after we talk about functions in part7 (probably class 8) and how to use them with purrr, since the syntax is similar. Stay tuned for a couple more examples with summarize(across()) in part6 (class 7). I can’t emphasize enough how much I recommend reading the reference on across here, along with the references for other tidyverse functions that are confusing, but here’s a quick explanation in the meantime.
One thing to remember is that when using summarize, the function you apply must return a single value, that is, a vector of length one! Otherwise, it’s not a summary statistic. Examples include n_distinct(), length(), sum(), min(), etc.
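As a quick check of the "length one" idea, here is a small sketch. The names x and one_value are just for illustration. It compares a function that returns one value with one that returns two, and then uses the length-one function inside summarize.

```r
library(dplyr)
library(palmerpenguins)

x <- penguins$bill_length_mm

# mean() returns a single value, so it works as a summary statistic
length(mean(x, na.rm = TRUE))

# range() returns two values (the min and the max), so it is not
# a single summary statistic on its own
length(range(x, na.rm = TRUE))

# with a length-one function, summarize() gives one row per group
one_value <- penguins %>%
  group_by(species) %>%
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE))

one_value
```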
library(tidyverse)
library(palmerpenguins)
library(gt)
penguins %>%
  summarize(
    # all the code for the column specification AND the function goes in across()
    across(
      # use tidyselect to specify the columns
      .cols = contains("length"),
      # we can specify a list() of functions to apply
      # to add a suffix to column names of result, name the functions
      # the ~ in front specifies a custom function is next, .x is the argument
      # or use built in functions
      .fns = list(mean_cm = ~ mean(.x / 10, na.rm = TRUE),
                  n_miss = ~ sum(is.na(.x)),
                  min = min,
                  max = max
      ), # end list
      # add additional argument for min and max
      # (passing extra arguments this way is deprecated as of dplyr 1.1.0)
      na.rm = TRUE,
      # use "." to separate the col name & the function name
      .names = "{.col}.{.fn}"
    ) # end across
  ) %>% # end summarize
  gt()
bill_length_mm.mean_cm | bill_length_mm.n_miss | bill_length_mm.min | bill_length_mm.max | flipper_length_mm.mean_cm | flipper_length_mm.n_miss | flipper_length_mm.min | flipper_length_mm.max |
---|---|---|---|---|---|---|---|
4.392193 | 2 | 32.1 | 59.6 | 20.09152 | 2 | 172 | 231 |
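If the list of functions above is a lot to take in, it may help to start with a single function. The sketch below (lambda_style and function_style are names I made up for illustration) writes the same custom function two ways: with the ~ and .x shorthand, and as a regular anonymous function. Both should give the same one-row result.

```r
library(dplyr)
library(palmerpenguins)

# 1. shorthand: ~ starts the custom function, .x stands for the column
lambda_style <- penguins %>%
  summarize(
    across(
      .cols = contains("length"),
      .fns = ~ mean(.x, na.rm = TRUE)
    )
  )

# 2. the same thing written as a regular anonymous function
function_style <- penguins %>%
  summarize(
    across(
      .cols = contains("length"),
      .fns = function(x) mean(x, na.rm = TRUE)
    )
  )

# compare the two results
all.equal(lambda_style, function_style)
```

With only one unnamed function, the result columns keep their original names, so there is no need for .names here.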
The most difficult part was towards the end when we were working with long data and graphing it. Are there other examples of geom_tile()?
Towards the end when going over some of the ggplot section.
I do have a couple more examples of geom_tile() in part6, but they are at the end, so I’m not confident we will get to them. We will go over ggplot with long data a lot in part6 (class 7), though, so I hope that will help.
geom_tile works best on summarized data, showing for instance the mean of a numeric value within groups:
penguin_means <- penguins %>%
  group_by(species, island) %>%
  summarize(mb = mean(bill_length_mm, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
penguin_means
## # A tibble: 5 × 3
## # Groups: species [3]
## species island mb
## <fct> <fct> <dbl>
## 1 Adelie Biscoe 39.0
## 2 Adelie Dream 38.5
## 3 Adelie Torgersen 39.0
## 4 Chinstrap Dream 48.8
## 5 Gentoo Biscoe 47.5
ggplot(penguin_means) +
  aes(x = island,
      y = species,
      fill = mb) +
  geom_tile() +
  labs(fill = "Mean bill length (mm)")
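Here is one more geom_tile() sketch along the same lines, this time filling by a count instead of a mean. The names penguin_counts and count_plot are just for illustration, and this is not one of the part6 examples.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# count() summarizes to one row per species/island combination
penguin_counts <- penguins %>%
  count(species, island)

# one tile per row, filled by the count, with the count printed on top
count_plot <- ggplot(penguin_counts) +
  aes(x = island,
      y = species,
      fill = n) +
  geom_tile() +
  geom_text(aes(label = n), color = "white") +
  labs(fill = "Number of penguins")

count_plot
```

Combinations that do not appear in the data (for example, Gentoo penguins on Dream) have no row in penguin_counts, so they show up as empty spaces in the plot.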
Clearest Points
Lots of summarize(), join, pivot! Thanks, all!
Other Notes
The very best part of this class is the strange and useful tidbits that aren’t even on the syllabus!
Well that’s good to know! I’ll try to go on more tangents =)
I’m interested in doing more with summary tables
Yes, this is useful. My plan was to get to this when we talk about statistical modeling and summary tables of cohorts/data. I hope we get there; we will do this after we talk about lists/purrr.
I think I’m getting a little turned around as functions are added, used in concert and combined with tips for advanced users. A main, base take-away for primary functions etc. would help me integrate new concepts to previous ones.
Good feedback, thank you! I will try to do more of this. I’m hoping part6 will give everyone a chance to practice with what we’ve learned so far, to solidify these concepts before we move on to the next section on lists, functions, and purrr.