Using PostgreSQL in R: A quick how-to

Win-Vector Blog 2016-02-01

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data manipulation to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

You have your PostgresSQL database up and running. Now you want to work with the data in that database in R. First, let’s create a data frame that we want to insert into the database.

# An example data frame to play withiris = as.data.frame(iris)summary(iris)##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  ##        Species  ##  setosa    :50  ##  versicolor:50  ##  virginica :50  

The column names of this data frame are problematic for databases (and especially PostgreSQL) for a few reasons: the “.”s in the names can be an issue, and PostgreSQL expects column names to be all lowercase. Here’s a function to make the column names db safe:

# make names db safe: no '.' or other illegal characters,# all lower case and uniquedbSafeNames = function(names) {  names = gsub('[^a-z0-9]+','_',tolower(names))  names = make.names(names, unique=TRUE, allow_=TRUE)  names = gsub('.','_',names, fixed=TRUE)  names}colnames(iris) = dbSafeNames(colnames(iris))summary(iris)##   sepal_length    sepal_width     petal_length    petal_width   ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  ##        species  ##  setosa    :50  ##  versicolor:50  ##  virginica :50

Now let’s open up a database connection and insert the table.

# Create a connection to the databaselibrary('RPostgreSQL')## Loading required package: DBIpg = dbDriver("PostgreSQL")# Local Postgres.app database; no password by default# Of course, you fill in your own database information here.con = dbConnect(pg, user="ninazumel", password="",                 host="localhost", port=5432, dbname="ninazumel")# write the table into the database.# use row.names=FALSE to prevent the query # from adding the column 'row.names' to the table # in the dbdbWriteTable(con,'iris',iris, row.names=FALSE)## [1] TRUE

The function dbWriteTable() returns TRUE if the table was successfully written. Note this call will fail if iris already exists in the database. Use overwrite=TRUE to force overwriting of an existing table, and append=TRUE to append to an existing table.

Now you can read the table back out.

# read back the full table: method 1dtab = dbGetQuery(con, "select * from iris")summary(dtab)##   sepal_length    sepal_width     petal_length    petal_width   ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  ##    species         ##  Length:150        ##  Class :character  ##  Mode  :character  ##                    # read back the full table: method 2rm(dtab)dtab = dbReadTable(con, "iris")summary(dtab)##   sepal_length    sepal_width     petal_length    petal_width   ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  ##    species         ##  Length:150        ##  Class :character  ##  Mode  :character  

Of course, the point of using a database is to extract subsets or transformations of your data, using SQL.

# get part of the tablerm(dtab)dtab = dbGetQuery(con, "select sepal_length, species from iris")summary(dtab)##   sepal_length     species         ##  Min.   :4.300   Length:150        ##  1st Qu.:5.100   Class :character  ##  Median :5.800   Mode  :character  ##  Mean   :5.843                     ##  3rd Qu.:6.400                     ##  Max.   :7.900

You can use dbSendQuery for sending queries that don’t return a data-frame-like result.

# remove table from databasedbSendQuery(con, "drop table iris")# commit the changedbCommit(con)

When you are done, disconnect.

# disconnect from the databasedbDisconnect(con)

And that’s it!

Extra: RPostgreSQL and sqldf

If you are accustomed to manipulating data with SQL, you may prefer SQL notation to the sometimes convoluted calling conventions of the analogous R operations like aggregate(), or the functions in the dplyr package. In this case you have probably already discovered the sqldf package, which allows you to manipulate data frames using SQL. If you are a sqldf user, there is an additional subtlety if you are also using RPostgreSQL or other R packages for talking to databases: sqldf uses its own internal (and ephemeral) database to perform its operations, but if RPostgreSQL is loaded, sqldf will pick up your PostgreSQL driver by default. This is probably not what you want.

options(gsubfn.engine = "R")library(sqldf)  ## Loading required package: gsubfn## Loading required package: proto## Loading required package: RSQLite## sqldf will default to using PostgreSQL <=== NOTE THIS!# Now try to use sqldf to aggregate sepal_length by speciesquery = "select avg(sepal_length) avg_sepal_length,                 species          from dtab          group by species"sqldf(query)# Error in postgresqlNewConnection(drv, ...) : #   RS-DBI driver: (could not connect postgres@localhost on dbname "test"# )# Error in !dbPreExists : invalid argument type

To use sqldf on local data frames, you must specify the driver and dbname explicitly.

sqldf(query, drv="SQLite", dbname=":memory:")##   avg_sepal_length    species## 1            5.006     setosa## 2            5.936 versicolor## 3            6.588  virginica

Extra: PostgreSQL and dplyr

If you do use dplyr, the good news is that you can connect to a PostgreSQL database directly through the dplyr function src_postgres().

library('dplyr')# Connect to local PostgreSQL via dplyrlocaldb <- src_postgres(dbname = '',                           host = 'localhost',                           port = 5432,                           user = 'ninazumel',                           password = '')# cheat and access the db connection directly# assume we have made the colnames db safedbWriteTable(localdb$con,'iris',iris, row.names=FALSE)## [1] TRUE

The tbl() command lets you access tables in the database remotely, and sql() lets you send queries.

# this is not a data frame; it's a dplyr PostgreSQL handle into the databased = tbl(localdb, "iris") d## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel]## From: iris [150 x 5]## ##    sepal_length sepal_width petal_length petal_width species## 1           5.1         3.5          1.4         0.2  setosa## 2           4.9         3.0          1.4         0.2  setosa## 3           4.7         3.2          1.3         0.2  setosa## 4           4.6         3.1          1.5         0.2  setosa## 5           5.0         3.6          1.4         0.2  setosa## 6           5.4         3.9          1.7         0.4  setosa## 7           4.6         3.4          1.4         0.3  setosa## 8           5.0         3.4          1.5         0.2  setosa## 9           4.4         2.9          1.4         0.2  setosa## 10          4.9         3.1          1.5         0.1  setosa## ..          ...         ...          ...         ...     ...# this is a data framedtab = as.data.frame(d)# send a query through dplyrquery = "select avg(sepal_length) avg_sepal_length,                 species          from iris         group by species"dsub = tbl(localdb, sql(query))dsub## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel]## From:  [?? x 2]## ##    avg_sepal_length    species## 1             5.936 versicolor## 2             6.588  virginica## 3             5.006     setosa## ..              ...        ...# make it localdsub = as.data.frame(dsub)summary(dsub)##  avg_sepal_length   species         ##  Min.   :5.006    Length:3          ##  1st Qu.:5.471    Class :character  ##  Median :5.936    Mode  :character  ##  Mean   :5.843                      ##  3rd Qu.:6.262                      ##  Max.   :6.588# shuts down databaserm(list=c('d','localdb')); gc()