Day 18: brio
String encoding in R (or in any programming language, for that matter) is far from simple. Kevin Ushey wrote an excellent blog post, String Encoding and R, that "is an attempt to explore, and answer, the surprisingly difficult question: How do I write UTF-8 encoded content to a file?" (Ushey 2018)
The aim of brio (Hester and Csárdi 2022) is to make that practice easier. brio (an initialism for Basic R Input Output) provides functions that always read and write UTF-8 files (see Kevin’s blog post to understand why this is a good idea) and provide more explicit control over line endings.
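As a minimal sketch of what that looks like in practice, the snippet below writes a couple of non-ASCII strings with an explicit Windows-style line ending and reads them back. It assumes brio is installed; the temporary file and the sample strings are made up for illustration.

```r
library(brio)

# Hypothetical temporary file, used only for this illustration
path <- tempfile(fileext = ".txt")

# Non-ASCII text, written with \u escapes so the script itself stays ASCII
text <- c("caf\u00e9", "na\u00efve")

# write_lines() always writes UTF-8; `eol` sets the line ending explicitly
write_lines(text, path, eol = "\r\n")

# Check which line ending the file on disk actually uses
endings <- file_line_endings(path)

# read_lines() always reads the file as UTF-8 and drops the line endings
lines <- read_lines(path)

identical(lines, text)
```

Because the line ending is passed in explicitly, the file should come out the same regardless of the platform the script runs on.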
In addition to providing consistency and control over encoding and line endings, brio’s primary functions, read_lines() and write_lines(), happen to be faster than their base R and readr equivalents (see the benchmarks section of the README for data), which is a nice bonus.
Learn more
The brio documentation provides details on how to use its full suite of functions, including drop-in replacements for base readLines() and writeLines().
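A short sketch of the drop-in replacements, again assuming brio is installed and using a made-up temporary file. Note that, per the brio documentation, these versions only accept file paths (not connections) and always treat the content as UTF-8.

```r
# Hypothetical temporary file and sample text, for illustration only
path <- tempfile(fileext = ".txt")
text <- c("first line", "gr\u00fc\u00dfe")

# Same call shape as base writeLines(), but the output is always UTF-8
brio::writeLines(text, path)

# Same call shape as base readLines(), but the input is always read as UTF-8
lines <- brio::readLines(path)

# Like base readLines(), `n` limits how many lines are read
first <- brio::readLines(path, n = 1)

identical(lines, text)
```

Calling the functions with the `brio::` prefix, rather than attaching the package, keeps it obvious at each call site which version is in use.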
To learn more on encoding and character sets, see the two articles Kevin recommends at the end of his post on string encoding and R:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (Spolsky 2003); and
- What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text by David C. Zentgraf (Zentgraf 2015).