library(tidyverse)
library(googlesheets4)
How to use purrr::walk() to make many files
How to use purrr::walk() to write many files, featuring file-system navigation with the fs package.
Meet the map() family
purrr’s map() family of functions are tools for iteration, performing the same action on multiple inputs. If you’re new to purrr, the Iteration chapter of R for Data Science is a good place to get started.
One of the benefits of using map() is that the function has variants (e.g. map2(), pmap(), etc.), all of which work the same way. To borrow from Jennifer Thompson’s excellent Intro to purrr, the arguments can be broken into two groups: what we’re iterating over, and what we’re doing each time. The adapted figure below shows what this looks like for map(), map2(), and pmap().
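Here's a minimal sketch of that split between "what we're iterating over" and "what we're doing each time." The toy inputs and object names (doubled, sums, totals) are made up for illustration and aren't part of the original figure.

```r
library(purrr)

# One input: 1:3 is what we iterate over, the function is what we do each time
doubled <- map(1:3, \(x) x * 2)

# Two inputs, iterated over in parallel
sums <- map2(1:3, 4:6, \(x, y) x + y)

# Any number of inputs, supplied as a list; names are matched to the
# arguments of the function
totals <- pmap(list(a = 1:3, b = 4:6, c = 7:9), \(a, b, c) a + b + c)
```

In each case the last argument is the "doing" part, and everything before it is the "iterating over" part.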
In addition to handling different input arguments, the map family of functions has variants that create different outputs. The following table from the Map-variants section of Advanced R shows how the orthogonal inputs and outputs can be used to organise the variants into a matrix:
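| | List | Atomic | Same type | Nothing |
|---|---|---|---|---|
| One argument | map() | map_lgl(), ... | modify() | walk() |
| Two arguments | map2() | map2_lgl(), ... | modify2() | walk2() |
| One argument + index | imap() | imap_lgl(), ... | imodify() | iwalk() |
| N arguments | pmap() | pmap_lgl(), ... | - | pwalk() |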
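To make the output columns of the table a little more concrete, here's a small sketch using one function from each column. The inputs (x and df) are toy data made up for this example, not something taken from Advanced R.

```r
library(purrr)

x <- c(1, 4, 9)
df <- data.frame(a = c(1, 4), b = c(9, 16))  # toy data for this sketch

as_list   <- map(x, sqrt)      # List: always returns a list
as_double <- map_dbl(x, sqrt)  # Atomic: returns a double vector
same_type <- modify(df, sqrt)  # Same type: data frame in, data frame out

# Nothing: called for the printing; the input comes back invisibly
nothing <- walk(x, \(v) cat("value:", v, "\n"))
```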
What’s up with walk()?
Based on the table above, you might think that walk() isn’t very useful. Indeed, walk(), walk2(), and pwalk() all invisibly return .x. However, they come in handy when you want to call a function for its side effects rather than its return value.
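As a quick illustration of "side effects, with .x returned invisibly," here's a minimal sketch. The vector x and the printed message are made up for this example.

```r
library(purrr)

x <- c("a", "b", "c")

# The side effect (printing with cat()) happens once per element
out <- walk(x, \(letter) cat("Processing", letter, "\n"))

# walk() hands back its input unchanged, so `out` is the same as `x`
identical(out, x)
```

Because the input is returned, a walk() call can sit in the middle of a pipeline without interrupting it.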
Here, we’ll go through two common use cases: saving multiple CSVs, and multiple plots. We’ll also make use of the fs package, a cross-platform interface to file system operations, to inspect our outputs.
If you want to try this out but don’t want to save files locally, there’s a companion project on Posit Cloud where you can follow along.
Writing (and deleting) multiple CSVs
To get started, we’ll need some data. Let’s use the gapminder example Sheet built into googlesheets4. Because there are multiple worksheets (one for each continent), we’ll use map() to apply read_sheet() to each one, and get back a list of data frames.
ss <- gs4_example("gapminder") # get sheet id
sheets <- sheet_names(ss) # get the names of individual sheets
gap_dfs <- map(sheets, .f = \(x) read_sheet(ss, sheet = x))
✔ Reading from "gapminder".
✔ Range ''Africa''.
✔ Reading from "gapminder".
✔ Range ''Americas''.
✖ Request failed [429]. Retry 1 happens in 6.1 seconds ...
✖ Request failed [429]. Retry 2 happens in 3.3 seconds ...
✖ Request failed [429]. Retry 3 happens in 4.7 seconds ...
✔ Reading from "gapminder".
✔ Range ''Asia''.
✔ Reading from "gapminder".
✔ Range ''Europe''.
✔ Reading from "gapminder".
✔ Range ''Oceania''.
The backslash syntax for anonymous functions (e.g. \(x) x + 1) was introduced in base R version 4.1.0 as a shorthand for function(x) x + 1.
If you’re using an earlier version of R, you can use purrr’s shorthand: a formula (e.g. ~ .x + 1).
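To see that the three spellings are interchangeable, here's a small sketch that runs each of them through map_dbl(). The object names are made up for illustration.

```r
library(purrr)

# Full anonymous function (works in any version of R)
add_one_full <- map_dbl(1:3, function(x) x + 1)

# Backslash shorthand (R >= 4.1.0)
add_one_short <- map_dbl(1:3, \(x) x + 1)

# purrr's formula shorthand, where .x stands in for the current element
add_one_formula <- map_dbl(1:3, ~ .x + 1)
```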
Typically, you’d want to combine these data frames into one to make it easier to work with your data. To do so, we’ll use list_rbind() on gap_dfs. I’ve kept the intermediary object, since we’ll use it in a moment with walk(), but could have just as easily piped the output directly. The combination of purrr::map() and list_rbind() is a handy one that you can learn more about in R for Data Science.
gap_combined <- gap_dfs |>
  list_rbind()
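If the list of data frames is named, list_rbind() can also record which element each row came from via its names_to argument. Here's a minimal sketch with toy data frames standing in for the gapminder sheets (the names a and b and the column name source are made up for illustration).

```r
library(purrr)

# Toy stand-ins for a named list of data frames
dfs <- list(
  a = data.frame(x = 1:2),
  b = data.frame(x = 3L)
)

# Stack the rows, storing each list name in a new `source` column
combined <- list_rbind(dfs, names_to = "source")
```

The gapminder sheets already carry a continent column, so this isn't needed above, but it can be handy when the list names are the only record of where the data came from.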
Now let’s say that, for whatever reason, you’d like to save the data from these sheets as individual CSVs. This is where walk() comes into play: writing out the file with write_csv() is a “side effect.” We’ll use fs::dir_create() to create a data folder to put our files into, and build a vector of paths/file names. Since we have two arguments, the list of data frames and the paths, we’ll use walk2().
fs::dir_create("data")
paths <- str_glue("data/gapminder_{tolower(sheets)}.csv")
walk2(
  gap_dfs,
  paths,
  \(df, name) write_csv(df, name)
)
To see what we’ve done, we can use fs::dir_tree() to see the contents of the directory as a tree, or fs::dir_ls() to return the paths as a vector. These functions also take glob and regexp arguments, allowing you to filter paths by file type with globbing patterns (e.g. *.csv) or using a regular expression passed on to grep().
fs::dir_tree("data")
data
├── gapminder_africa.csv
├── gapminder_americas.csv
├── gapminder_asia.csv
├── gapminder_europe.csv
└── gapminder_oceania.csv
fs::dir_ls("data")
data/gapminder_africa.csv data/gapminder_americas.csv
data/gapminder_asia.csv data/gapminder_europe.csv
data/gapminder_oceania.csv
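The glob and regexp arguments mentioned above can be sketched as follows. To avoid touching the data folder, this example works in a throwaway temporary directory; the file names in it are made up for illustration.

```r
library(fs)

# Work in a throwaway directory so nothing in the project is touched
tmp <- dir_create(file_temp("glob-demo"))
file_create(path(tmp, c("a.csv", "b.csv", "notes.txt")))

# Filter by file type with a globbing pattern...
csv_by_glob <- dir_ls(tmp, glob = "*.csv")

# ...or with a regular expression
csv_by_regexp <- dir_ls(tmp, regexp = "[.]csv$")

# Keep just the file names, then remove the throwaway directory
csv_names <- as.character(path_file(csv_by_glob))
dir_delete(tmp)
```

Both calls should pick up the two CSVs and skip notes.txt.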
If you’re having regrets, or want to return your example project to its previous state, it’s just as easy to walk() fs::file_delete() along those same paths.
walk(paths, \(paths) fs::file_delete(paths))
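One thing to watch for: deleting a file that's already gone is an error, so re-running the clean-up step can fail. A hedged sketch of one way around that is to keep only the paths that still exist, using fs::file_exists(). The temporary directory and file names here are made up for illustration.

```r
library(fs)
library(purrr)

# Set up two throwaway files in a temporary directory
tmp <- dir_create(file_temp("delete-demo"))
demo_paths <- path(tmp, c("one.csv", "two.csv"))
file_create(demo_paths)

# Delete only the paths that are still there
walk(demo_paths[file_exists(demo_paths)], \(p) file_delete(p))

# Running the same line again is now harmless: nothing is left to walk over
walk(demo_paths[file_exists(demo_paths)], \(p) file_delete(p))

remaining <- dir_ls(tmp)
dir_delete(tmp)
```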
Saving multiple plots
Now, let’s say you want to create and save a bunch of plots. We’ll use a modified version of the conditional_bars() function from the R for Data Science chapter on writing functions, and the built-in diamonds dataset.
# modified conditional bars function from R4DS
conditional_bars <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    ggplot(aes(x = {{ var }})) +
    geom_bar() +
    ggtitle(rlang::englue("Count of diamonds by {{var}} where {{condition}}"))
}
It’s easy enough to run this for one condition, for example for the diamonds with cut == "Good".
diamonds |> conditional_bars(cut == "Good", clarity)
But what if we want to make and save a plot for each cut? Again, it’s map() and walk() to the rescue.
Because we’re using the same data (diamonds) and conditioning on the same variable (cut), we’ll only need to map() across the levels of cut, and can hard code the rest into the anonymous function.
# get the levels
cuts <- levels(diamonds$cut)

# make the plots
plots <- map(
  cuts,
  \(x) conditional_bars(
    df = diamonds,
    cut == {{ x }},
    clarity
  )
)
The plots are now saved in a list—a fine format for storing ggplots. As we did when saving our CSVs, we’ll use fs to create a directory to store them in, and make a vector of paths for file names.
# make the folder to put them in (if exists, {fs} does nothing)
fs::dir_create("plots")
# make the file names
plot_paths <- str_glue("plots/{tolower(cuts)}_clarity.png")
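Since str_glue() is vectorised, one template gives us one path per cut. Note that a level like "Very Good" keeps its space in the file name. If you'd rather avoid spaces, one option (my suggestion, not part of the original workflow) is to swap them for underscores with str_replace_all(). The cuts_demo vector below is a made-up stand-in for the real levels.

```r
library(stringr)

cuts_demo <- c("Fair", "Very Good")  # stand-in for levels(diamonds$cut)

# Same template as above: one path per element, spaces preserved
plain <- str_glue("plots/{tolower(cuts_demo)}_clarity.png")

# Variation: replace spaces with underscores before building the path
no_spaces <- str_glue(
  "plots/{str_replace_all(tolower(cuts_demo), ' ', '_')}_clarity.png"
)
```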
Now we can use the paths and plots with walk2() to pass them as arguments to ggsave().
walk2(
  plot_paths,
  plots,
  \(path, plot) ggsave(path, plot, width = 6, height = 6)
)
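As an aside, if the plots live in a named list, iwalk() (the "one argument + index" variant) can build each file name from the list name instead of a separate vector of paths. This is an alternative sketch, not what the post does above; the two toy plots of mtcars and the temporary output directory are made up for illustration.

```r
library(ggplot2)
library(purrr)
library(fs)

# Save into a throwaway directory rather than the project
out_dir <- dir_create(file_temp("plots-demo"))

# A named list of toy plots; the names become the file names
demo_plots <- list(
  cyl = ggplot(mtcars, aes(x = factor(cyl))) + geom_bar(),
  gear = ggplot(mtcars, aes(x = factor(gear))) + geom_bar()
)

# iwalk() passes each element and its name to the function
iwalk(
  demo_plots,
  \(plot, name) ggsave(path(out_dir, name, ext = "png"), plot, width = 6, height = 6)
)

saved <- sort(as.character(path_file(dir_ls(out_dir))))
dir_delete(out_dir)
```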
Again, we can use fs to see what we’ve done:
fs::dir_tree("plots")
plots
├── fair_clarity.png
├── good_clarity.png
├── ideal_clarity.png
├── premium_clarity.png
└── very good_clarity.png
And we can clean up after ourselves if we didn’t really want those plots after all.
walk(plot_paths, \(paths) fs::file_delete(paths))
Fin
Hopefully this gave you a taste for some of what walk() can do. To learn more, see Saving multiple outputs in the Iteration chapter of R for Data Science.