Please open your RStudio project from Day 1
Hint: Put the packages you need at the top of every script.
library("readr")
library("dplyr")
# Your code starts here.
read_csv()
all_cats <- read_csv("data/missing_cat_list.csv")
arrange()
# Sort from low to high
arrange(all_cats, age)
# Sort from high to low
arrange(all_cats, desc(age))
filter()
Hint: Numbers don’t need quotes.
# Filter to cats 5 years old
cats_5yo <- filter(all_cats, age == 5)
# Filter to cats from UK or Australia
cricket_cats <- filter(all_cats, country %in% c("UK", "Australia"))
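To see how == and %in% differ inside filter(), here is a small sketch on a made-up table. The toy_cats data frame and its values are invented for this illustration and are not part of the missing cat list.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  name    = c("Ada", "Bo", "Cy", "Dot"),
  country = c("UK", "Australia", "US", "UK"),
  age     = c(5, 2, 5, 9),
  stringsAsFactors = FALSE
)

# == compares each row against a single value
five_year_olds <- filter(toy_cats, age == 5)

# %in% keeps rows whose value matches any item in the vector
uk_or_aus <- filter(toy_cats, country %in% c("UK", "Australia"))

five_year_olds$name   # "Ada" "Cy"
uk_or_aus$name        # "Ada" "Bo" "Dot"
```

Use %in% whenever you want to match more than one value in the same column.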
[r] tag?
arrange()
filter(): https://jjallaire.shinyapps.io/learnr-tutorial-03a-data-manip-filter/
“Are you the best cat detective out there?”
Let’s play a game to find out.
The rules
filter() function that you can use to narrow down the potential cats.
If you have time, play another round. Best of 3?
Put your detective hat back on. Your neighbor just called and shared some bad news: his cat has gone missing. He was so worked up that the only thing you could make out in the message was a description of how great his cat’s personality was. Here are the few details you were able to jot down.
My neighbor’s great cat:
The cat’s clumsiness and its greediness sum to more than 12.
You could barely make this out, but you think the cat’s grumpiness and playfulness only sum to 6.
And it’s a female cat.
Oh and the cat is the same age as your neighbor (in human years), who is probably around 85, maybe younger.
It sounds like we’re going to need some math for this puzzle. Let’s calculate some new columns with mutate() to help find your neighbor’s cat.
mutate()
It’s often useful to edit existing columns in a data frame or add new columns that are functions of existing columns. That’s the job of mutate().
Before we go changing things in the data frame, we’ll need to get to know the column names and the table’s dimensions a bit better. These quick functions are all great ways to describe your data frame.
names(all_cats): show the column names
nrow(all_cats): number of rows
ncol(all_cats): number of columns
summary(all_cats): summary of all columns
glimpse(all_cats): column names, plus a glimpse of the first few values (requires loading the dplyr package)
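If you want to try these functions before running them on the real data, here is a minimal sketch on a tiny made-up table. The toy_cats data frame is invented for this example.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  name  = c("Ada", "Bo", "Cy"),
  age   = c(5, 2, 9),
  color = c("brown", "striped", "black"),
  stringsAsFactors = FALSE
)

names(toy_cats)    # "name" "age" "color"
nrow(toy_cats)     # 3
ncol(toy_cats)     # 3
summary(toy_cats)  # min, median, mean, and max for numeric columns
glimpse(toy_cats)  # one line per column with its type and first values
```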
In our work we often use mutate() to calculate new units for measurements. In this case, let’s estimate the cat ages in human years. We’ll use the equation below to convert cat years to human years.
Human years = Cat years * 4 + 20
Use mutate() to add a column called human_age to the table all_cats.
all_cats <- mutate(all_cats,
human_age = age * 4 + 20)
You can also add a found_by column. That way people will know where to send the reward money $$.
my_cat <- mutate(my_cat,
found_by = "Pet Detective Cooper")
You can also use mutate() to update the value of a column that is already in your data frame.
my_cat <- mutate(my_cat,
found_by = "Nevermind, it's a secret")
If you use mutate() and provide a single value for a column, such as found_by = "Pet Detective", then every row in the column you created will have that value. If you provide a vector of values, such as human_age = age * 4 + 20, then the new column will have one value for each row, matching the values in the vector you provided. If you provide a vector whose length differs from the number of rows in your data frame, you’ll get an error telling you that the number of values must either equal the number of rows or be a single value.
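The three cases described above can be sketched on a small made-up table. The toy_cats data frame and the lives_left column are invented for this illustration.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  name = c("Ada", "Bo", "Cy"),
  age  = c(1, 5, 10),
  stringsAsFactors = FALSE
)

# A single value is repeated for every row
toy_cats <- mutate(toy_cats, found_by = "Pet Detective")

# A calculation on a column gives one result per row
toy_cats <- mutate(toy_cats, human_age = age * 4 + 20)
toy_cats$human_age   # 24 40 60

# A vector of the wrong length (2 values for 3 rows) is an error.
# try() catches the error so the script keeps running.
wrong_length <- try(mutate(toy_cats, lives_left = c(9, 8)), silent = TRUE)
inherits(wrong_length, "try-error")   # TRUE
```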
Now let’s see if you can use mutate() to help match the description of the cat’s personality.
Here’s a reminder of how your neighbor described his cat.
The cat’s clumsiness and its greediness scores sum to more than 12. You could barely make this out, but you think the cat’s grumpiness and playfulness only sum to 6. It’s a female cat.
And the cat is the same age as him (in human years), which is probably around 85, maybe less.
Use mutate() to add columns to the missing cat list, and then filter() the new columns to narrow the list down to your neighbor’s cat.
Here’s a snippet to get you started.
library("dplyr")
# Add columns to all_cats
all_cats <- mutate(all_cats,
clumsy_and_greedy = clumsy + greedy,
grumpy_and_playful = grumpy + ...,
human_age = age * 4 + ...
)
# Filter for cats matching neighbor's description
filter(all_cats,
clumsy_and_greedy > ...,
grumpy_and_playful == ...,
human_age > ...
)
You can chain these two functions together and do everything in one go. For that you can use the %>% (pipe).
In a script the %>% is read as “and then”. In the code below we are telling R to mutate the data and then to filter it.
nei_cat <- mutate(all_cats,
clumsy_and_greedy = clumsy + greedy,
grumpy_and_playful = grumpy + playful,
human_age = age * 4 + 20) %>%
filter(clumsy_and_greedy > 12,
grumpy_and_playful == 6,
human_age > 80)
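To convince yourself that the pipe only changes how the code reads, not what it does, here is a small sketch comparing a nested call with a piped one. The toy_cats data frame is invented for this example.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  name = c("Ada", "Bo", "Cy"),
  age  = c(1, 5, 10),
  stringsAsFactors = FALSE
)

# Nested version: read from the inside out
nested <- filter(mutate(toy_cats, human_age = age * 4 + 20), human_age > 30)

# Piped version: read top to bottom as "mutate, and then filter"
piped <- toy_cats %>%
  mutate(human_age = age * 4 + 20) %>%
  filter(human_age > 30)

identical(nested, piped)   # TRUE
piped$name                 # "Bo" "Cy"
```

The pipe passes the result on its left as the first argument of the function on its right.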
What is the name of your neighbor’s cat?
Damon
Fluffy George
Stinkerbell
Precious Abe
Fluffy George
Breaking Meows! You rock!
Sitting down for your morning coffee you notice the front page of the paper has a story on cats!
According to Phyllis Cattleby there’s been a whole string of striped kittens gone missing. This sounds like a case for a pet detective.
Phyllis is pointing her finger at a circus that just came to town. A circus that seems to be drawing big crowds with their adorable and surprisingly tame “tiger babies”.
What evidence could we find in the cat database to support or refute the claim that the circus is stealing young striped cats?
For this level of sleuthing we’re going to need to summarize our data.
summarize()
summarize() allows you to apply a summary function like median() to a column and collapse your data down to a single row. To really dig into summarize() you’ll want to know some common summary functions, such as sum(), mean(), median(), min(), and max().
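Here is a minimal sketch of how summarize() collapses a table to one row, using several summary functions at once. The toy_cats data frame and its scores are invented for this example.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  age    = c(2, 4, 9),
  greedy = c(3, 7, 5)
)

# Each argument becomes one column in the one-row result
toy_summary <- summarize(toy_cats,
  total_greedy = sum(greedy),
  mean_age     = mean(age),
  median_age   = median(age),
  youngest     = min(age),
  oldest       = max(age)
)

nrow(toy_summary)          # 1
toy_summary$total_greedy   # 15
toy_summary$mean_age       # 5
toy_summary$median_age     # 4
```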
sum()
Use summarize() and sum() to find the total of all greedy scores.
summarize(all_cats, total_greedy = sum(greedy))
mean()
Use summarize() and mean() to calculate the mean age of all cats.
cat_summary <- summarize(all_cats,
  mean_age = mean(age, na.rm = TRUE))
Note the na.rm = TRUE in the mean() function. This tells R to ignore empty cells or missing values, which show up in R as NA. If you leave na.rm out, the mean() function will return NA when it finds a missing value in the data.
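The effect of na.rm is easy to check on a short vector. The ages vector below is made up for this illustration.

```r
# Hypothetical ages with one missing value
ages <- c(2, NA, 6)

mean(ages)                 # NA, because one value is missing
mean(ages, na.rm = TRUE)   # 4, the mean of 2 and 6

# T is a shortcut for TRUE, but T is an ordinary variable that
# can be overwritten, so the full word TRUE is the safer spelling.
mean(ages, na.rm = T)      # 4, as long as T has not been reassigned
```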
median()
Use summarize to calculate the median level of grumpiness in all cats.
summarize(all_cats, median_grumpy = median(grumpy))
max()
Use summarize to calculate the maximum playful score for all cats.
all_cats %>% summarize(max_playful = max(playful))
min()
Use summarize to calculate the minimum playful score for all cats.
all_cats %>% summarize(min_playful = min(playful))
nth()
Use summarize and nth(name, 12) to find the name of the 12th oldest cat in human years.
Hint: Use arrange() first.
arrange(all_cats, desc(human_age)) %>% summarize(cat_name_2 = nth(name, 12))
sd()
What is the standard deviation of the grumpiness scores?
summarize(all_cats, stdev_grumpy = sd(grumpy))
quantile()
Quantiles are useful for finding the upper or lower range of a column. Use the quantile() function to find the 5th and 95th percentiles of the cat ages.
summarize(all_cats,
  age_5th_pctile = quantile(age, 0.05, na.rm = T),
  age_95th_pctile = quantile(age, 0.95, na.rm = T))
Hint: add na.rm = T to quantile().
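Unlike mean(), quantile() does not return NA when values are missing. It stops with an error, which is why the hint matters. Here is a small sketch on made-up ages (the toy_cats data frame is invented for this example).

```r
library("dplyr")

# Hypothetical toy data with one missing age
toy_cats <- data.frame(age = c(1, 2, 3, 4, NA))

# Without na.rm = TRUE, quantile() stops with an error.
# try() catches the error so the script keeps running.
failed <- try(summarize(toy_cats, p95 = quantile(age, 0.95)), silent = TRUE)
inherits(failed, "try-error")   # TRUE

# With na.rm = TRUE the missing age is dropped first
age_range <- summarize(toy_cats,
  age_5th_pctile  = quantile(age, 0.05, na.rm = TRUE),
  age_95th_pctile = quantile(age, 0.95, na.rm = TRUE)
)
age_range   # 1.15 and 3.85 with R's default quantile method
```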
n()
n() stands for count. It counts the number of rows.
Use summarize and n() to count the number of "brown" cats.
Hint: Use filter() first.
filter(all_cats, color == "brown") %>% summarize(cat_count = n())
Create a cat summary using 3 of the math functions above.
Now that we’re equipped with some powerful tools, let’s use summarize() to answer a few questions about the striped cats.
Is the age of missing striped cats lower than expected?
First, find the median age for all the missing cats.
summarize(all_cats, median_age = median(age, na.rm = T))
Now, what is the median age for only striped cats?
filter(all_cats, color == "striped") %>% summarize(striped_med_age = median(age, na.rm = T))
But are striped cats the only color group that is younger than average?
Wouldn’t it be great if we could easily find the age for every color of cat?
group_by()
Enter group_by() stage left. If you thought summarize() was awesome, wait until you include group_by() with your summarize() commands.
Try using group_by() with the column color and then use summarize() to count the number of cats in each group.
group_by(all_cats, color) %>% summarize(color_count = n()) %>% ungroup()
Ending with ungroup() is good practice. This will prevent your data from staying grouped after the summarizing has been completed.
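Here is a minimal sketch of counting by group on a made-up table (the toy_cats data frame is invented for this example). With a single grouping column, summarize() already returns an ungrouped table in current dplyr versions, so the ungroup() here is a harmless safety habit rather than a strict requirement.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  color = c("brown", "striped", "brown", "black", "striped", "brown"),
  stringsAsFactors = FALSE
)

color_counts <- group_by(toy_cats, color) %>%
  summarize(color_count = n()) %>%
  ungroup()

color_counts                  # one row per color: black 1, brown 3, striped 2
is_grouped_df(color_counts)   # FALSE
```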
Well that’s interesting, but not conclusive evidence.
What about the age of the missing striped cats? Are they younger on average than all the other groups?
Let’s use group_by() with the column color again, but this time use summarize() to find the mean(age) for each cat color.
group_by(all_cats, color) %>%
summarize(mean_age = mean(age, na.rm = T)) %>% ungroup()
That’s a lot of digits!
round()
You can round the ages to a certain number of digits using the round() function. We can finish by adding the arrange() function to sort the table by our new column.
group_by(all_cats, color) %>%
summarize(mean_age = mean(age, na.rm = T),
mean_age_round = round(mean_age, digits = 1)) %>%
arrange(mean_age_round) %>% ungroup()
NOTE: The round() function in R does not always round values ending in 5 up. Instead, it rounds halves to the nearest even number (sometimes called scientific or banker’s rounding), so 2.5 rounded to the nearest whole number using round() is 2, and 3.5 rounded to the nearest whole number is 4. If you want to round all values ending in 5 up, you’ll have to use a rounding function from another package or write your own.
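You can check this behavior directly in the console. The round_half_up() function below is a hypothetical helper written for this illustration; it is not part of base R or of any package used in this course. Because of how computers store decimals, results for values like 0.15 can still surprise you, so treat it as a sketch.

```r
# Base R rounds halves to the nearest even number
round(2.5)    # 2
round(3.5)    # 4
round(-2.5)   # -2

# Hypothetical helper: round halves away from zero
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_up(2.5)    # 3
round_half_up(3.5)    # 4
round_half_up(-2.5)   # -3
```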
Why are the striped cats so much younger? Are they being catnapped and sent to the circus? Let’s put this piece of evidence in our back pocket for now. We can return to it after we learn to make some charts. Maybe then we’ll be able to put together a convincing report to send to the police chief.
Let’s save the last summary table we created to a CSV. That way we can print it to have it faxed to the police later. To save a data frame we’ll use the write_csv() function from our favorite readr package.
# First give the new data a name
ages_by_color <- group_by(all_cats, color) %>%
summarize(mean_age = mean(age, na.rm = T),
mean_age_round = round(mean_age, digits = 1)) %>%
arrange(mean_age_round) %>% ungroup()
# Write the file to your project folder
write_csv(ages_by_color, "mean_cat_ages_by_color.csv")
Warning! R will overwrite a file if the file already exists in a folder. It will not ask for confirmation. You will not collect $200.
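If overwriting worries you, one option is to check for the file first with file.exists(). This is only a sketch: the ages_demo table and the file name are invented, and tempdir() is used so the example does not touch your project folder.

```r
library("readr")

# Hypothetical summary table, invented for illustration only
ages_demo <- data.frame(
  color    = c("brown", "striped"),
  mean_age = c(6.2, 2.1)
)

# A throwaway location so nothing in your project is changed
out_file <- file.path(tempdir(), "mean_cat_ages_demo.csv")

# Only write when there is no file with that name yet
if (file.exists(out_file)) {
  message("File already exists, not overwriting: ", out_file)
} else {
  write_csv(ages_demo, out_file)
}
```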
mutate()
We can bring back mutate() to add a column based on the grouped values in a data set. For example, you may want to add a column showing the average age by country to the whole table.
When you combine group_by() and mutate(), the new column will be calculated based on the values within each group.
group_by(all_cats, country) %>% mutate(country_mean_age = mean(age, na.rm = T)) %>% ungroup()
Estimate the mean grumpiness for each group of cats with the same greediness score.
Find the median grumpiness score for each country.
group_by(all_cats, country) %>%
mutate(grumpy_by_country = median(grumpy, na.rm = T)) %>%
ungroup()
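The difference between a grouped mutate() and a grouped summarize() is easiest to see side by side. In this sketch the toy_cats data frame is invented for illustration.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_cats <- data.frame(
  country = c("UK", "UK", "US"),
  age     = c(2, 4, 9),
  stringsAsFactors = FALSE
)

# mutate() keeps every row and repeats the group value on each one
with_mean <- group_by(toy_cats, country) %>%
  mutate(country_mean_age = mean(age)) %>%
  ungroup()

# summarize() collapses each group down to a single row
only_mean <- group_by(toy_cats, country) %>%
  summarize(country_mean_age = mean(age)) %>%
  ungroup()

nrow(with_mean)              # 3
nrow(only_mean)              # 2
with_mean$country_mean_age   # 3 3 9
```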
Take 5 minutes to relax.
Install ggplot2 using install.packages("ggplot2").
library("readr")
library("dplyr")
library("ggplot2")
# Path to movie data
movie_url <- "https://raw.githubusercontent.com/MPCA-air/RCamp/master/data/movies/IMDB.csv"
# Read the IMDB movie data and save as `movies`
movies <- read_csv(movie_url)
Rename the column color to “movie_color”.
# Show column names
#names(movies)
# Rename the 'actor_1_name' column
movies <- rename(movies, superstar = actor_1_name)
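rename() takes the new name on the left and the old name on the right. Here is a minimal sketch on a made-up table; the toy_movies data frame is invented, and only the column names color and movie_color come from this page.

```r
library("dplyr")

# Hypothetical toy data, invented for illustration only
toy_movies <- data.frame(
  color      = c("Color", "Black and White"),
  title_year = c(2012, 1954),
  stringsAsFactors = FALSE
)

# Pattern: rename(data, new_name = old_name)
toy_movies <- rename(toy_movies, movie_color = color)

names(toy_movies)   # "movie_color" "title_year"
```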
ggplot(movies)
Note that when we load the package it’s library("ggplot2"), but when we use the function, it’s ggplot(movies) without the 2 following ggplot. It’s annoying, but that’s the way it is.
Aesthetics are the visual components from the data that you want to use in the chart. These also determine the dimensions of the plot.
ggplot(movies, aes(x = movie_facebook_likes, y = gross_mil))
ggplot(movies, aes(x = movie_facebook_likes, y = gross_mil)) +
geom_point()
When you add more layers using +, remember to place the + at the end of a line, not at the start of the next one.
# This will work
ggplot() +
geom_point()
# BUT this will give you a nasty error message
ggplot()
+ geom_point()
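The reason is that R treats a line as finished when it already forms a complete statement, so a + at the start of the next line has nothing to attach to. Here is a runnable sketch on made-up data; the toy_movies data frame and the likes column are invented for this example.

```r
library("ggplot2")

# Hypothetical toy data, invented for illustration only
toy_movies <- data.frame(
  likes     = c(10, 200, 3000),
  gross_mil = c(1.5, 20, 150)
)

# The + at the end of the line tells R that more is coming
toy_plot <- ggplot(toy_movies, aes(x = likes, y = gross_mil)) +
  geom_point()

# A plot saved to a name is only drawn when you print it
print(toy_plot)

length(toy_plot$layers)   # 1, the geom_point() layer
```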
Try making a scatterplot of any two columns.
Hint: Numeric variables will be more informative.
ggplot(movies, aes(x = ?column1, y = ?column2)) + geom_point()
Let’s select only recent movies using filter().
filter(movies, title_year >= 2010)
Pass the filtered data to ggplot() with the %>% pipe.
filter(movies, title_year >= 2010) %>%
ggplot(aes(x = movie_facebook_likes, y = gross_mil)) +
geom_point()
filter(movies, title_year >= 2010) %>%
ggplot(aes(x = movie_facebook_likes, y = gross_mil)) +
geom_point(alpha = 0.1)
We can keep adding layers! You can build your plot sandwich as big as you like.
Use geom_smooth() to add a regression line.
filter(movies, title_year >= 2010) %>%
ggplot(aes(x = movie_facebook_likes, y = gross_mil)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm")
Make a scatterplot of imdb_score and gross_mil with a fitted line showing the relationship.
Stop cheating! Just kidding here’s some code to help.
filter(movies, title_year >= 2010) %>%
ggplot(aes(x = imdb_score, y = gross_mil)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm")
Now let’s make some histograms showing how the total number of movies changes over time.
ggplot(movies, aes(x = title_year)) + geom_histogram()
To show the changes per decade we can break the years into groups of 10.
ggplot(movies, aes(x = title_year)) + geom_histogram(binwidth = 10)
movie_color
You can map aesthetics to variables in the data set. The example below sets the fill color to the variable movie_color. This color codes each type of film: one color for black and white movies and one for color movies.
ggplot(movies, aes(x = title_year, fill = movie_color)) +
geom_histogram(binwidth = 10)
It’s difficult to see what’s going on with the black and white films. Let’s split the colors apart using position_dodge().
ggplot(movies, aes(x = title_year, fill = movie_color)) +
geom_histogram(binwidth = 10, position = "dodge")
Maybe it would work better to use two separate charts. For that we can use facet_wrap().
ggplot(movies, aes(x = title_year, fill = movie_color)) +
geom_histogram(binwidth = 10) +
facet_wrap(~ movie_color)
That is almost good. It’s still hard to see the changes in black and white films. Let’s make the y-axis independent for each group using scales = "free_y". Type ?facet_wrap to see more options.
ggplot(movies, aes(title_year, fill = movie_color)) +
geom_histogram(binwidth = 10) +
facet_wrap(~ movie_color, scales = "free_y")
Note: When the scales are not uniform, make sure to point this out to your readers. Otherwise people might assume the scales are the same. Then they would think there are just as many black and white films as color movies.
Make a histogram of the number of movies by decade with separate fill colors for each content_rating.
Decide which is the best way to present the bars: stacked, side-by-side, or on separate charts.
# Stacked
ggplot(movies, aes(title_year, fill = content_rating)) +
geom_histogram(binwidth = 10)
# Side by side
ggplot(movies, aes(title_year, fill = content_rating)) +
geom_histogram(binwidth = 10, position = position_dodge())
# Separate charts
ggplot(movies, aes(title_year, fill = content_rating)) +
geom_histogram(binwidth = 10) +
facet_wrap( ~ content_rating)
# Free scale it
ggplot(movies, aes(title_year, fill = content_rating)) +
geom_histogram(binwidth = 10) +
facet_wrap( ~ content_rating, scales = "free_y")
runif(1) into your console.
0.2, email your charts to the rest of the class.