MySql in a container

Win-Vector Blog 2016-12-05

I have previously written on using containerized PostgreSQL with R. This show the steps for using containerized MySQL with R.

As a consulting data scientist I often have to debug and rehearse work away from the clients actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL work environments. I have already written on how to do this for PostgreSQL, and here are our notes on how to do this for MySQL.

First make sure you have a current version of Docker installed and running on your system. Then launch a MySQL image with the standard MySQL communication port (3306) bound to your host machine’s network interface by typing the following at the command line (such as the shell in OSX, some instructions can be found here):

docker run -p 3306:3306 --name mysql -e MYSQL_ROOT_PASSWORD=passwd -d mysql/mysql-server:5.6

Now from the R side of things you can use the package RMySQL to connect to the database.

library('RMySQL')
library('dplyr')
mysql <- src_mysql('mysql','127.0.0.1',3306,'root','passwd')

And we are ready to work dplyr examples such as these:

library('nycflights13')

# dplyr/mysql connector seems to error out on NA, overwrite them for now
# submitted issue: https://github.com/hadley/dplyr/issues/2259
flts <- flights
for(ci in colnames(flts)) {
  napos <- is.na(flts[[ci]])
  if(any(napos)) {
    if(is.numeric(flts[[ci]])) {
      flts[[ci]][napos] <- NaN
    }
    if(is.character(flts[[ci]])) {
      flts[[ci]][napos] <- ''
    }
  }
}

flights_mysql <- copy_to(mysql, flts, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

flights_mysql

## Source:   query [?? x 19]
## Database: mysql 5.6.34 [root@127.0.0.1:/mysql]
## 
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
##    <int> <int> <int>    <dbl>          <int>     <dbl>    <dbl>          <int>     <dbl>   <chr>  <int>   <chr>  <chr>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR
## 2   2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA
## 3   2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR
## 7   2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK
## 10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA
## # ... with more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <chr>

Or you can work directly through the underlying database connection:

dbGetQuery(mysql$con,'SELECT * FROM flts LIMIT 10')

And we have a MySQL database running. For more on working with R, databases, and dplyr please read Nina Zumel‘s article Using PostgreSQL in R: A quick how-to.