Reading and Writing Data with {arrow}
2024-01-19
This is part of a series of related posts on Apache Arrow. Other posts in the series are:
- Understanding the Parquet file format
- Reading and Writing Data with {arrow} (This post)
- Parquet vs the RDS Format (Coming soon)
What is (Apache) Arrow?
Apache Arrow is a cross-language development platform for in-memory data. As it’s in-memory (as opposed to data stored on disk), it provides additional speed boosts. It’s designed for efficient analytic operations, and uses a standardised language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the ‘Arrow C++’ library – an efficient package for analytic operations on modern hardware.
There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.
Getting started
The R package is installed from CRAN in the usual way
install.packages("arrow")
Then loaded using
library("arrow")
This blog post uses the NYC Taxi data. It’s pretty big – around 40GB in total. To download it locally,
data_nyc = "data/nyc-taxi"open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |> dplyr::filter(year %in% 2012:2021) |> write_dataset(data_nyc, partitioning = c("year", "month"))
Once this has completed, you can check everything has downloaded correctly by running
nrow(open_dataset(data_nyc))
## [1] 1150352666
Loading in data
Unsurprisingly, the first command we come across is open_dataset(). This opens the data and (sort of) reads it in.
library("arrow")open_dataset(data_nyc)## FileSystemDataset with 120 Parquet files## vendor_name: string## pickup_datetime: timestamp[ms]## dropoff_datetime: timestamp[ms]## passenger_count: int64## trip_distance: double## ...
Reading is a lazy action. This allows us to manipulate much larger data sets than R could typically deal with. The default print method lists the columns in the data set, with their associated type. These data types come directly from the C++ API, so they don’t always have a corresponding R type. For example, the year column is an int32 (a 32-bit integer), whereas passenger_count is an int64 (a 64-bit integer). In R, these are both integers.
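If you want to see these Arrow types for yourself, you can inspect the dataset’s schema. A minimal sketch, assuming data_nyc points at the taxi data downloaded above:

ds = open_dataset(data_nyc)
ds$schema        # every column with its Arrow type (int32, int64, string, ...)
class(ds$schema) # a Schema object backed by the Arrow C++ library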
As you might guess, there’s a corresponding function write_dataset(). Looking at the (rather good) documentation, we come across a few concepts that are worth exploring further.
File formats
The main file formats associated with {arrow} are:
- parquet: a format designed to minimise storage – see our recent blog post that delves into some of the details surrounding the format;
- arrow/feather: an in-memory format created to optimise vectorised computations;
- csv: the world runs on CSV files (and Excel).
The common workflow is to store your data as parquet files. The {arrow} library then loads the data and processes it in the arrow format.
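As a rough sketch of that workflow, reusing the data_nyc dataset from above: the query is built lazily against the parquet files, and results are only pulled into R when collect() is called.

# Lazily query the parquet files; nothing is read into R until collect()
open_dataset(data_nyc) |>
  dplyr::filter(year == 2019) |>
  dplyr::group_by(month) |>
  dplyr::summarise(mean_distance = mean(trip_distance, na.rm = TRUE)) |>
  dplyr::collect()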
Storing data in the Arrow format
The obvious thought (to me at least) was: why not store the data as arrow? Ignoring for the moment that Arrow doesn’t promise long-term archival storage using the arrow format, we can do a few tests.
Using the NYC-taxi data, we can create a quick subset
# Replace format = "arrow" with format = "parquet" # to create the correspond# parquet equivalentopen_dataset(file.path(data_path, "year=2019")) |> write_dataset("data/nyc-taxi-arrow", partitioning = "month", format = "arrow")
A very quick, but not particularly thorough, test suggests that:
- the arrow format requires around ten times more storage space. So for the entire nyc-taxi data set, parquet takes around 38GB, but arrow would take around 380GB (a rough way to check this yourself is sketched below);
- storing as arrow makes some operations quicker. For the few examples I tried, there was around a 10% increase in speed.
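A rough sketch for reproducing the storage comparison, assuming both the parquet and arrow copies of the 2019 data exist locally (as created above), is to sum the file sizes in each directory:

# Total on-disk size of a directory, in GB
dir_size_gb = function(path) {
  files = list.files(path, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^3
}
dir_size_gb("data/nyc-taxi-arrow")             # 2019 data stored as arrow
dir_size_gb(file.path(data_nyc, "year=2019"))  # the same data as parquet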
The large storage penalty was enough to convince me of the merits of storing data as parquet, but there may be some niche situations where you might switch.
Hive partitioning
Both the open_dataset() and write_dataset() functions mention “Hive partitioning” – in fact, we sneakily included a partitioning argument in the code above. The open_dataset() function guesses whether we are using Hive partitioning, whereas for write_dataset() we can specify the partition. But what actually is it?
Hive partitioning is a method used to split a table into multiple files based on partition keys. A partition key is a variable of interest in your data, for example, year or month. The files are then organised in folders. Within each folder, the value of the key is determined by the name of the folder. By partitioning the data in this way, we can make it faster to run queries on data slices.
Suppose we wanted to partition the data by year; the file structure would then be

taxi-data
  year=2018
    file1.parquet
    file2.parquet
  year=2019
    file4.parquet
    file5.parquet
Of course, we can partition by more than one variable, such as both year and month

taxi-data
  year=2018
    month=01
      file01.parquet
    month=02
      file02.parquet
      file03.parquet
    ...
  year=2019
    month=01
      ...
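To see why this speeds up queries on slices, note that a filter on the partition keys means Arrow only needs to read the matching folders. A small sketch against the partitioned taxi data from above:

# Only the files under year=2019/month=1/ need to be scanned here,
# because the filter is on the partition keys
open_dataset(data_nyc) |>
  dplyr::filter(year == 2019, month == 1) |>
  dplyr::count() |>
  dplyr::collect()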
See the excellent vignette on datasets in the {arrow} package.
Example: Partitioning
Parquet files aren’t the only files we can partition. We can also use the same concept with CSV files. For example,
tmp_dir = tempdir()
write_dataset(palmerpenguins::penguins, path = tmp_dir, partitioning = "species", format = "csv")
This looks like
list.files(tmp_dir, recursive = TRUE, pattern = "\\.csv$")
## [1] "species=Adelie/part-0.csv"    "species=Chinstrap/part-0.csv"
## [3] "species=Gentoo/part-0.csv"
You can also partition using the group_by() function from {dplyr}

palmerpenguins::penguins |>
  dplyr::group_by(species) |>
  write_dataset(path = tmp_dir, format = "csv")
In my opinion, while it makes conceptual sense to partition CSV files, in practice it’s probably not worthwhile. If you’re partitioning CSV files to gain speed benefits, you might as well use parquet instead.
Single files vs dataset APIs
When reading in data using Arrow, we can either use the single file functions (these start with read_) or use the dataset API (these start with open_).
For example, read_csv_arrow() reads the CSV file directly into memory. If the file is particularly large, then we’ll run out of memory. One thing to note is the as_data_frame argument. By default this is set to TRUE, meaning that read_csv_arrow() will return a tibble. The upside of this is that we have a familiar object. The downside is that it takes up more room than Arrow’s internal data representation (an Arrow Table).
This blog post by François Michonneau goes into far more detail, and discusses the R and Python implementations of the different APIs.
Acknowledgements
This blog was motivated by the excellent Arrow tutorial at Posit Conf 2023, run by Steph Hazlitt and Nic Crane. The NYC dataset came from that tutorial, and a number of the ideas that I explored were discussed with the tutorial leaders. I also used a number of resources found on various corners of the web. I’ve tried to provide links, but if I’ve missed any, let me know.