library(tidyverse)
Day 01: separate_longer_*()
Learn about tidyr’s separate_longer_*()
functions which separate strings into multiple rows.
Depending on how you get your data, the “tidying” process can take different forms. Much of the focus in the tidyr (Wickham and Girlich 2022) documentation (e.g. in the Rectangling vignette) focuses on dealing with nested data1.
Today you’ll see how the new separate_longer_*()
family of functions2 (formerly separate_rows()
) can be used to unpack data that’s not nested, but, rather, has multiple strings to achieve what unnest_longer()
does for nested data.
We’re going to look at the same data in two different formats: one where strings have been nested in a list column, and the other where there are multiple strings in a character column.
Here’s a quick visual summary, below, of what we’ll do in the code that follows.
The nested version
First, let’s create a nested version.
<- tibble(
nested_df song = c("Rake it Up", "Bang Bang", "Side to Side"),
artist = list(c("Yo Gotti", "Nicki Minaj"), c("Jessie J", "Ariana Grande", "Nicki Minaj"), c("Ariana Grande", "Nicki Minaj"))
)
nested_df
# A tibble: 3 × 2
song artist
<chr> <list>
1 Rake it Up <chr [2]>
2 Bang Bang <chr [3]>
3 Side to Side <chr [2]>
As you can see, we have three songs, and an artist
column which contains the lists of artists on each song.
To tidy this up, we’ll use unnest_longer()
.
%>%
nested_df unnest_longer(artist)
# A tibble: 7 × 2
song artist
<chr> <chr>
1 Rake it Up Yo Gotti
2 Rake it Up Nicki Minaj
3 Bang Bang Jessie J
4 Bang Bang Ariana Grande
5 Bang Bang Nicki Minaj
6 Side to Side Ariana Grande
7 Side to Side Nicki Minaj
The comma-separated-string version
Now let’s look at the same data in a different shape of sorts—one in which the artists are stored as a single character string separated by commas.
<- tibble(
stringy_df song = c("Rake it Up", "Bang Bang", "Side to Side"),
artist = c("Yo Gotti, Nicki Minaj", "Jessie J, Ariana Grande, Nicki Minaj", "Ariana Grande, Nicki Minaj")
)
stringy_df
# A tibble: 3 × 2
song artist
<chr> <chr>
1 Rake it Up Yo Gotti, Nicki Minaj
2 Bang Bang Jessie J, Ariana Grande, Nicki Minaj
3 Side to Side Ariana Grande, Nicki Minaj
We can use separate_longer_delim()
along with the delim
argument to get the same result for this data frame that we did for the nested one.
%>%
stringy_df separate_longer_delim(artist, delim = ", ")
# A tibble: 7 × 2
song artist
<chr> <chr>
1 Rake it Up Yo Gotti
2 Rake it Up Nicki Minaj
3 Bang Bang Jessie J
4 Bang Bang Ariana Grande
5 Bang Bang Nicki Minaj
6 Side to Side Ariana Grande
7 Side to Side Nicki Minaj
To separate by a fixed width rather than a delimiter, you can use separate_longer_position()
.
Different inputs, same output
The diagram below illustrates how the two different data frames each produced the same result by using the relevant functions.