analogsea: Using Arrow, S3 and DigitalOcean for efficient model fitting in RStudio

2021-07-18

[This article was first published on Pachá, and kindly contributed to R-bloggers.]

Introduction

This tutorial explains how to use arrow with analogsea to take full advantage of S3 filesystems and parallel computing.

Analogsea is a community project created by and for statisticians from the R community: Scott Chamberlain, Hadley Wickham, Winston Chang, Bob Rudis, Bryce Mecum, and yours truly. This package provides an interface to DigitalOcean, a cloud provider similar to AWS, Linode, and others, for R users.

Even though DigitalOcean provides an API, computing across clusters composed of virtual servers is not straightforward, since vanilla Ubuntu servers ('droplets' in DO jargon) require additional configuration to install R itself and numerical libraries such as OpenBLAS.

The configuration problem is largely reduced with the RStudio image from the DigitalOcean Marketplace. For the last two years, the RStudio image on the Marketplace has provided a multipurpose setup with out-of-the-box Shiny Server, common R packages such as the Tidyverse, common SQL connectors, and now a configuration to use Arrow with S3 filesystems.

Just as an example, the Catholic University of Chile, the best university of Chile and one of the few in Latin America with excellent research metrics, uses Arrow and the mentioned RStudio image to offer a Covid Dashboard for Chile.

This tutorial is loosely based on a blog post by Dr. Andrew Heiss, although most of the steps are simplified here thanks to newer software developments.

NYC Taxi dataset

You can have a look at the NYC Taxi dataset, which is stored in DigitalOcean Spaces with S3-compatible features. You can copy and explore the data locally like this.

library(arrow)

space <- S3FileSystem$create(
  anonymous = TRUE,
  scheme = "https",
  endpoint_override = "sfo3.digitaloceanspaces.com"
)

# just 2009/01 for a quick exploration
try(dir.create("~/nyc-taxi/2009/01", recursive = T))
arrow::copy_files(space$path("nyc-taxi/2009/01"), "~/nyc-taxi/2009/01")

d <- open_dataset(
  "~/nyc-taxi",
  partitioning = c("year", "month")
)

d
## FileSystemDataset with 1 Parquet file
## vendor_id: string
## pickup_at: timestamp[us]
## dropoff_at: timestamp[us]
## passenger_count: int8
## trip_distance: float
## pickup_longitude: float
## pickup_latitude: float
## rate_code_id: null
## store_and_fwd_flag: string
## dropoff_longitude: float
## dropoff_latitude: float
## payment_type: string
## fare_amount: float
## extra: float
## mta_tax: float
## tip_amount: float
## tolls_amount: float
## total_amount: float
## year: int32
## month: int32
##
## See $metadata for additional Schema metadata

One interesting exercise is to explore how trip distance and total amount are related. You can fit a regression to do that! But you should explore the data first, and a good way to do so is to visualize how the observations are distributed.

library(dplyr)

dreg <- d %>%
  select(trip_distance, total_amount) %>%
  collect()

dreg
## # A tibble: 14,092,413 x 2
##    trip_distance total_amount
##            <dbl>        <dbl>
##  1         2.63          9.40
##  2         4.55         14.6
##  3        10.4          28.4
##  4         5            18.5
##  5         0.400         3.70
##  6         1.20          6.60
##  7         0.400         6.70
##  8         1.72          6.60
##  9         1.60         10
## 10         0.700         5.90
## # … with 14,092,403 more rows

As this is ~14 million rows for a single month, it will be hard to plot without at least some aggregation. There are repeated rows, so you can count to get the frequencies for the pairs (distance, amount), and even round the numbers before aggregating, since the idea is just to explore whether there's a trend.

dplot <- dreg %>%
  mutate_if(is.numeric, function(x) round(x, 0)) %>%
  group_by(trip_distance, total_amount) %>%
  count()

dplot
## # A tibble: 4,070 x 3
## # Groups:   trip_distance, total_amount [4,070]
##    trip_distance total_amount      n
##            <dbl>        <dbl>  <int>
##  1             0            2  39663
##  2             0            3 181606
##  3             0            4 421546
##  4             0            5 165490
##  5             0            6  63610
##  6             0            7  26822
##  7             0            8  20584
##  8             0            9  13889
##  9             0           10  12073
## 10             0           11   8271
## # … with 4,060 more rows

There is no obvious trend in this data, but it's still possible to obtain a trend line.

library(ggplot2)

ggplot(dplot, aes(x = trip_distance, y = total_amount, alpha = log(n))) +
  geom_jitter() +
  labs(title = "Exploring distance versus total amount")

You can fit a model with Gaussian errors, such as the following model with an intercept.

summary(lm(total_amount ~ trip_distance, data = dreg))
##
## Call:
## lm(formula = total_amount ~ trip_distance, data = dreg)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -122.736   -1.242   -0.542    0.509  228.125
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.0253576  0.0013786    2920   <2e-16 ***
## trip_distance 2.4388382  0.0003532    6905   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.909 on 14092411 degrees of freedom
## Multiple R-squared:  0.7718, Adjusted R-squared:  0.7718
## F-statistic: 4.767e+07 on 1 and 14092411 DF,  p-value: < 2.2e-16
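As a quick sanity check (not in the original post), you can store the fit and use predict() to see what the line implies for a given trip; the 5-mile value is arbitrary.

fit <- lm(total_amount ~ trip_distance, data = dreg)

# implied fare for a 5-mile trip: 4.0254 + 2.4388 * 5, about 16.2 USD
predict(fit, newdata = data.frame(trip_distance = 5))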

Fitting models in the cloud

Let's say we are interested in fitting one model per month and then averaging the obtained coefficients to smooth out potential seasonal effects (e.g. holidays, Christmas, etc.).

To do this you can create 1, 2, …, N droplets, assign different months to each droplet, read the data from S3, and then bring the regression results back to your laptop.

You can start by loading the analogsea package and creating two droplets that already include arrow. The next code assumes that you have already registered your SSH key at digitalocean.com.

library(analogsea)

s <- "c-8" # 4 dedicated CPUs + 8GB memory, run sizes()

droplet1 <- droplet_create("RDemo1", region = "sfo3", size = s,
                           image = "rstudio-20-04", wait = F)
droplet2 <- droplet_create("RDemo2", region = "sfo3", size = s,
                           image = "rstudio-20-04", wait = T)

What if you need to install additional R packages on the droplets? Well, you can control that remotely after updating the droplet status and install, for example, eflm to speed up the regressions.

droplet1 <- droplet(droplet1$id)
droplet2 <- droplet(droplet2$id)

install_r_package(droplet1, "eflm")
install_r_package(droplet2, "eflm")

Before proceeding, you need the droplets' IPs to create a cluster that will use each droplet as a CPU core from your laptop.

ip1 <- droplet(droplet1$id)$networks$v4[[2]]$ip_address
ip2 <- droplet(droplet2$id)$networks$v4[[2]]$ip_address
ips <- c(ip1, ip2)

Now you can create a cluster, and you need to specify which SSH key to use.

library(future)

ssh_private_key_file <- "~/.ssh/id_rsa"

cl <- makeClusterPSOCK(
  ips,
  user = "root",
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  dryrun = FALSE
)
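Optionally, you can confirm that both workers respond before going further. This quick check is an addition to the original workflow, relying on the fact that makeClusterPSOCK() returns a cluster the parallel package understands:

# each worker should report its own hostname
parallel::clusterEvalQ(cl, Sys.info()[["nodename"]])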

The next step is to create a working plan (i.e. 1 droplet = 1 'CPU core'), which allows you to run processes in parallel, and a function specifying which data to read and how to fit the models on the droplets.

plan(cluster, workers = cl)

fit_model <- function(y, m) {
  message(paste(y, m))

  suppressMessages(library(arrow))
  suppressMessages(library(dplyr))

  # read directly from DigitalOcean Spaces, without a local copy
  space <- S3FileSystem$create(
    anonymous = TRUE,
    scheme = "https",
    endpoint_override = "sfo3.digitaloceanspaces.com"
  )

  d <- open_dataset(
    space$path("nyc-taxi"),
    partitioning = c("year", "month")
  )

  d <- d %>%
    filter(year == y, month == m) %>%
    select(total_amount, trip_distance) %>%
    collect()

  # keep only the coefficients if the fit succeeded
  fit <- try(eflm::elm(total_amount ~ trip_distance, data = d, model = F))
  if (inherits(fit, "lm")) fit <- fit$coefficients

  rm(d); gc()

  return(fit)
}

The fit_model function is what you should iterate over months and years. The furrr package is very efficient for these tasks and works well with the working plan defined previously; the iteration can be run in a way very similar to purrr.

library(furrr)

fitted_models <- future_map2(
  c(rep(2009, 12), rep(2010, 12)), rep(1:12, 2), ~fit_model(.x, .y)
)

fitted_models[[1]]
# (Intercept) trip_distance
#    4.025358      2.438838

Even though the iteration in the previous code only covered two years, there are some clear efficiencies. On the one hand, the data is read from S3 without leaving the DigitalOcean network, so any problem with your home connection is eliminated: you only upload the text of a function to the droplets and get back the estimated coefficients, without moving gigabytes of data to your laptop. On the other hand, the computation runs in parallel, so you could create more droplets to reduce the reading and fitting times even further.
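Scaling up is then mostly a matter of creating more droplets and collecting their IPs before building the cluster. Here is a minimal sketch of how the two-droplet setup above could generalize; the droplet count and names are hypothetical:

library(analogsea)

n <- 4 # hypothetical number of droplets

droplets <- lapply(seq_len(n), function(i) {
  droplet_create(paste0("RDemo", i), region = "sfo3", size = "c-8",
                 image = "rstudio-20-04", wait = TRUE)
})

ips <- vapply(droplets, function(x) {
  droplet(x$id)$networks$v4[[2]]$ip_address
}, character(1))

From there, the makeClusterPSOCK() and plan() calls shown earlier apply unchanged.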

Using the results from the cloud

If you want to average the estimated coefficients, you need to subset and discard the unusable months.

intercept <- c()
slope <- c()

for (i in seq_along(fitted_models)) {
  intercept[i] <- ifelse(is.numeric(fitted_models[[i]]),
                         fitted_models[[i]][1], NA)
  slope[i] <- ifelse(is.numeric(fitted_models[[i]]),
                     fitted_models[[i]][2], NA)
}

avg_coefficients <- c(mean(intercept, na.rm = T), mean(slope, na.rm = T))
names(avg_coefficients) <- c("intercept", "slope")

avg_coefficients
# intercept     slope
#  4.406586  2.533299

Now you can say that, on average, if you travelled 5 miles by taxi, you'd expect to pay 4.4 + 2.5 x 5 ~= 17 USD. This is just an estimate, and the actual fare depends on traffic conditions, red lights, etc.
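In code, that back-of-the-envelope prediction uses the averaged coefficients directly:

# 4.406586 + 2.533299 * 5 = 17.07, so about 17 USD for a 5-mile trip
unname(avg_coefficients["intercept"] + avg_coefficients["slope"] * 5)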

Don’t forget to delete the droplets when you are done using them.

droplet_delete(droplet1)droplet_delete(droplet2)

Final remarks

Arrow and S3 provide a powerful combination for big data analysis. You can even create a droplet to control other droplets if you fear that your internet connection is not good enough to transfer complete regression summaries.
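As a rough sketch of that controller idea (the droplet size, script name, and paths are hypothetical), analogsea can upload a script to a small droplet and run it there, so the heavy traffic never leaves DigitalOcean:

# a small droplet that orchestrates the others from inside DO's network
controller <- droplet_create("RController", region = "sfo3",
                             size = "s-1vcpu-2gb", image = "rstudio-20-04")

# upload the script with the fitting workflow and run it remotely
droplet_upload(controller, local = "fit_models.R", remote = "~/fit_models.R")
droplet_ssh(controller, "Rscript ~/fit_models.R")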

Think about the possibility of fitting many other models: generalized linear models, random forests, non-linear regressions, posterior bootstrap, or basically any output that R understands.
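For instance, swapping the fit inside fit_model for a generalized linear model only changes one line. A hypothetical variant using eflm::eglm, which mirrors the glm() interface:

# e.g., the same regression as a GLM with Gaussian family
fit <- try(eflm::eglm(total_amount ~ trip_distance,
                      family = gaussian(), data = d))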

