Databases in containers

Win-Vector Blog 2016-02-08

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).

Containerized DB

NewImage

The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

In data science we get distracted by shiny things. We (wrongly) end up thinking of data centers, software packages, systems, and lines of code as being direct benefits or assets. Really they are like weight on an aircraft: the price you pay for being able to transform data, but a cost to minimize.

You can get cheap access to industrial strength SQL implementations in many ways:

  • Use SQL directly on R data frames using the sqldf package
  • (something we have blogged about and included in our book Practical Data Science with R).

  • Figure out how to install and talk to a “no server” database such as SQLite or H2.
  • Use a containerized database. This is what Nina did in her article, and is the topic of this article.
  • Use a remote service like Amazon redshift. Pay commodity prices for somebody else to worry about the all so important infrastructure.

As we said, we are going to write about containerized databases. Nina pointed out a pre-made containerized version of PostgreSQL for Apple OSX: Postgres.app. What it does is trap the database in a user application (the user can stop and start at will) that then declares it is a “server.” From that point on you use the database for data services: transforming and aggregating data. When you are done you close the app and it is gone. The merit of the zero-install is this “no install” database doesn’t interfere with any other system installed databases or services. Installing a database usually triggers a bunch of dependencies and re-configurations of system services. Historically you were never sure if you had successfully uninstalled a PostgreSQL from an Apple Mac, so it is nice to run one without installing.

Basically database systems are jerks. Give them an inch and they take a mile. The strategy of containerization is: let them be lords in a pocket universe (the container) that we can close and even throw away. Let the database implementation do all of the non-negotiable important things it wants inside a container to keep your real machine clean.

We are now going to show how to use PostgreSQL in a containerized fashion on Microsoft Windows. Our strategy is to use a pre-made standard docker PostgreSQL container. This means we have to put up with the foibles of the container system (in this case Docker), but not a tick more than that. Containers and virtualization systems are also jerks, but we are going to pay our price once and be done.

Below are the steps.

  1. Install Docker on your Windows machine (instructions here). You must install and run docker toolbox to make any progress. How this works depends on your system (and note you can not run Docker inside some virtualized environments such as VirtualBox). Basically you are front-paying with installation frustration for later flexibility. I found starting Docker on non-Linux environments (Apple OSX, and Microsoft Windows 8) a bit hit-or miss (requiring a re-try).
  2. Once you have the Docker Quickstart up you should see the following:
                            ##         .                  ## ## ##        ==               ## ## ## ## ##    ===           /"""""""""""""""""\___/ ===      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~           \______ o           __/             \    \         __/              \____\_______/docker is configured to use the default machine with IP 192.168.99.100For help getting started, check out the docs at https://docs.docker.com

    Write down the IP address in the message (in our case 192.168.99.100). Then type the following (substituting the IP address you saw) into the Docker Quickstart shell:

    docker run -p 5432:5432 --name pg -e POSTGRES_PASSWORD=pg -d postgres

  3. Now in your R or RStudio type (substituting the IP address you saw):
    library('RPostgreSQL')pg <- dbDriver("PostgreSQL")con <- dbConnect(pg, user="postgres", password="pg",                host="192.168.99.100", port=5432)

    (As always running install.packages('RPostgreSQL') if RPostgreSQL is not already installed.)

And you are good to go. You are running against a PostgreSQL database trapped in a docker image (itself implemented in a VirtualBox virtual machine). Treat this database as a transient resource used to run complex SQL and dplyr queries efficiently (and not as a data store of record).

For anything more we suggest a good Docker reference (and would in fact like recommendations!).