In general, left_join() is used as a tool to add columns to your table. Imagine you have the 2 cat tables below. One is a table of cat names with their grumpiness level, and the second is a table of cat names with their ages. Since both tables share a column in common —cat names— we can join the tables together using the name columns as the joining key.
cat_grumpy
cat_name | grumpy |
---|---|
Fluffy George | 1 |
Zorro | 3 |
SantaClaws | 7 |
cat_ages
name | age |
---|---|
Fluffy George | 2 |
Zorro | 11 |
Dinobytes | 14 |
Uh oh! There’s no “SantaClaws” in the age table and there’s no “Dinobytes” in the grumpy table. Can we still join the tables? Run the code below to see what happens.
library(dplyr)
# Create new tables
cat_grumpy <- data_frame(cat_name = c("Fluffy George", "Zorro", "SantaClaws"),
grumpy = c(1, 3, 7))
cat_ages <- data_frame(name = c("Fluffy George", "Zorro", "Dinobytes"),
age = c(2, 11, 14))
# Join the tables together by cat name
## Tell left_join which columns to use as the key with: by = c("key_left" = "key_right")
grumpy_and_age <- left_join(cat_grumpy, cat_ages, by = c("cat_name" = "name" ))
grumpy_and_age
cat_name | grumpy | age |
---|---|---|
Fluffy George | 1 | 2 |
Zorro | 3 | 11 |
SantaClaws | 7 | NA |
Did it work? Yes. When left_join added the age column to the cat_grumpy table it only entered a value for the age when the tables had a matching cat name. So when R couldn’t find “SantaClaws” in the age table, the cat’s age was recorded as NA.
Now let’s imagine that the cat_ages
table has two cats named Zorro with different ages. What will happen when you join the tables?
name | age |
---|---|
Fluffy George | 2 |
Zorro | 11 |
Zorro | 7 |
Dinobytes | 14 |
When you run the code below you’ll see that left_join is very thorough and adds each age it finds for Zorro as a new row in the joined table.
# Create new tables
cat_grumpy <- data_frame(cat_name = c("Fluffy George", "Zorro", "SantaClaws"),
grumpy = c(1, 3, 7))
cat_ages <- data_frame(name = c("Fluffy George", "Zorro", "Zorro", "Dinobytes"),
age = c(2, 11, 7, 14))
# Join the tables together by cat name
grumpy_and_age <- left_join(cat_grumpy, cat_ages, by = c("cat_name" = "name"))
# Check number of rows
nrow(grumpy_and_age)
grumpy_and_age
cat_name | grumpy | age |
---|---|---|
Fluffy George | 1 | 2 |
Zorro | 3 | 11 |
Zorro | 3 | 7 |
SantaClaws | 7 | NA |
This results in a table with one extra row than we had before. So growing table sizes are a good sign of duplicate values when you’re using left_join. In practice, when you see this you’ll want to investigate why one of your tables has duplicate entries, especially if the observation for the two rows is different (like Zorro’s age). In this situation, we’d want to find out whether we really had two cats named Zorro, or if someone made two different guesses about Zorro’s age.
Let’s say you conclude that both values are indeed valid estimates (maybe Zorro’s age is particularly difficult to estimate). One solution is to calculate a summary table that produces a single estimate for the age of each cat. This can be accomplished with our favorite duo group_by()
& summarize()
.
# Group the cats by name and take the mean of all age estimates
cat_ages <- group_by(cat_ages, name) %>% summarize(age = mean(age, na.rm = T))
cat_ages
# Now Zorro's average age is 9 years old
# Join the tables together by cat name
grumpy_and_age <- left_join(cat_grumpy, cat_ages, by = c("cat_name" = "name"))
grumpy_and_age
cat_ages
name | age |
---|---|
Dinobytes | 14 |
Fluffy George | 2 |
Zorro | 9 |
grumpy_and_age
cat_name | grumpy | age |
---|---|---|
Fluffy George | 1 | 2 |
Zorro | 3 | 9 |
SantaClaws | 7 | NA |
Voila! The average age was added for Zorro. Unfortunately, we still have no idea how old SantaClaws is.
There is an alternative solution when the two rows have the same value, or if you don’t mind simply dropping one of the rows randomly. In that case you can use the function duplicated()
to filter out any names that have already occurred in the table.
You’ll see that the filter works its way from top to bottom, meaning that the first occurrence of each cat name gets the privilege to remain in the table, while any identical names appearing below that row will be dropped.
# Create new tables
cat_grumpy <- data_frame(cat_name = c("Fluffy George", "Zorro", "SantaClaws"),
grumpy = c(1, 3, 7))
cat_ages <- data_frame(name = c("Fluffy George", "Zorro", "Zorro", "Dinobytes"),
age = c(2, 11, 7, 14))
# Find the cat names that are a duplicate of a name in a higher row
duplicated_names <- filter(cat_ages, duplicated(name))
# Keep only the rows with names that are not(!) duplicates
cat_ages <- filter(cat_ages, !duplicated(name))
cat_ages
cat_ages
name | age |
---|---|
Fluffy George | 2 |
Zorro | 11 |
Dinobytes | 14 |
There’s no longer any evidence of the 7 year old Zorro. As they say, sometimes in R it pays to be first.