Day 01: separate_longer_*()

Learn about tidyr’s separate_longer_*() functions which separate strings into multiple rows.

Published

December 1, 2022

Depending on how you get your data, the “tidying” process can take different forms. Much of the focus in the tidyr (Wickham and Girlich 2022) documentation (e.g. in the Rectangling vignette) focuses on dealing with nested data1.

Today you’ll see how the new separate_longer_*() family of functions2 (formerly separate_rows()) can be used to unpack data that’s not nested, but, rather, has multiple strings to achieve what unnest_longer() does for nested data.

We’re going to look at the same data in two different formats: one where strings have been nested in a list column, and the other where there are multiple strings in a character column.

Here’s a quick visual summary, below, of what we’ll do in the code that follows.

Figure 1: A nested data frame with a list column is transformed to tidy output using unnest_longer(), a data frame with the same data as a comma-separated character column is transformed to tidy output using separate_longer_delim().

A nested data frame (left), with a song character column, and a an artist list column is transformed into a data frame with one artist and song per row by using the unnest_longer() function. A data frame (right) with two character columns, song and artist, that contains multiple artists in the artist column as a string separated by commas is transformed into a tidy data frame with one song per row and one artist per row using separate_longer_delim().

The nested version

First, let’s create a nested version.

library(tidyverse)
nested_df <- tibble(
  song = c("Rake it Up", "Bang Bang", "Side to Side"),
  artist = list(c("Yo Gotti", "Nicki Minaj"), c("Jessie J", "Ariana Grande", "Nicki Minaj"), c("Ariana Grande", "Nicki Minaj"))
)

nested_df
# A tibble: 3 × 2
  song         artist   
  <chr>        <list>   
1 Rake it Up   <chr [2]>
2 Bang Bang    <chr [3]>
3 Side to Side <chr [2]>

As you can see, we have three songs, and an artist column which contains the lists of artists on each song.

To tidy this up, we’ll use unnest_longer().

nested_df %>%
  unnest_longer(artist)
# A tibble: 7 × 2
  song         artist       
  <chr>        <chr>        
1 Rake it Up   Yo Gotti     
2 Rake it Up   Nicki Minaj  
3 Bang Bang    Jessie J     
4 Bang Bang    Ariana Grande
5 Bang Bang    Nicki Minaj  
6 Side to Side Ariana Grande
7 Side to Side Nicki Minaj  

The comma-separated-string version

Now let’s look at the same data in a different shape of sorts—one in which the artists are stored as a single character string separated by commas.

stringy_df <- tibble(
  song = c("Rake it Up", "Bang Bang", "Side to Side"),
  artist = c("Yo Gotti, Nicki Minaj", "Jessie J, Ariana Grande, Nicki Minaj", "Ariana Grande, Nicki Minaj")
)

stringy_df
# A tibble: 3 × 2
  song         artist                              
  <chr>        <chr>                               
1 Rake it Up   Yo Gotti, Nicki Minaj               
2 Bang Bang    Jessie J, Ariana Grande, Nicki Minaj
3 Side to Side Ariana Grande, Nicki Minaj          

We can use separate_longer_delim() along with the delim argument to get the same result for this data frame that we did for the nested one.

stringy_df %>%
  separate_longer_delim(artist, delim = ", ")
# A tibble: 7 × 2
  song         artist       
  <chr>        <chr>        
1 Rake it Up   Yo Gotti     
2 Rake it Up   Nicki Minaj  
3 Bang Bang    Jessie J     
4 Bang Bang    Ariana Grande
5 Bang Bang    Nicki Minaj  
6 Side to Side Ariana Grande
7 Side to Side Nicki Minaj  

To separate by a fixed width rather than a delimiter, you can use separate_longer_position().

Different inputs, same output

The diagram below illustrates how the two different data frames each produced the same result by using the relevant functions.

Figure 2: Nested data frame transformed to tidy output using unnest_longer().

The nested data frame, with a song column, and a list column with artists is transformed into a data frame with one artist and song per row by using the unnest_longer() function.

Figure 3: Data frame with comma-delimited strings in artist column transformed to tidy output using separate_longer_delim().

A data frame with two character columns, song and artist, that contains multiple artists in the artist column as a string separated by commas is transformed into a tidy data frame with one song per row and one artist per row using separate_longer_delim().

References

Wickham, Hadley, and Maximilian Girlich. 2022. tidyr: Tidy Messy Data. https://tidyr.tidyverse.org.