MySql in a container
Win-Vector Blog 2016-12-05
I have previously written on using containerized PostgreSQL with R. This show the steps for using containerized MySQL with R.
As a consulting data scientist I often have to debug and rehearse work away from the clients actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL work environments. I have already written on how to do this for PostgreSQL, and here are our notes on how to do this for MySQL.
First make sure you have a current version of Docker installed and running on your system. Then launch a MySQL image with the standard MySQL communication port (3306) bound to your host machine’s network interface by typing the following at the command line (such as the shell in OSX, some instructions can be found here):
docker run -p 3306:3306 --name mysql -e MYSQL_ROOT_PASSWORD=passwd -d mysql/mysql-server:5.6
Now from the R side of things you can use the package RMySQL to connect to the database.
library('RMySQL')
library('dplyr')
mysql <- src_mysql('mysql','127.0.0.1',3306,'root','passwd')
And we are ready to work dplyr examples such as these:
library('nycflights13')
# dplyr/mysql connector seems to error out on NA, overwrite them for now
# submitted issue: https://github.com/hadley/dplyr/issues/2259
flts <- flights
for(ci in colnames(flts)) {
napos <- is.na(flts[[ci]])
if(any(napos)) {
if(is.numeric(flts[[ci]])) {
flts[[ci]][napos] <- NaN
}
if(is.character(flts[[ci]])) {
flts[[ci]][napos] <- ''
}
}
}
flights_mysql <- copy_to(mysql, flts, temporary = FALSE, indexes = list(
c("year", "month", "day"), "carrier", "tailnum"))
flights_mysql
## Source: query [?? x 19]
## Database: mysql 5.6.34 [root@127.0.0.1:/mysql]
##
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <dbl> <int> <dbl> <dbl> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # ... with more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <chr>
Or you can work directly through the underlying database connection:
dbGetQuery(mysql$con,'SELECT * FROM flts LIMIT 10')
And we have a MySQL database running. For more on working with R, databases, and dplyr please read Nina Zumel‘s article Using PostgreSQL in R: A quick how-to.