Defining a new transformation for ggplot2/scales

ggplot2 2013-03-15

Summary:

Inspired by writing an answer to this question on StackOverflow, I decided to write up a more detailed description of creating a new transformation using the scales package (and also to make sure that I understood all the details about how to really do it).

Background

To start with, it helps to understand the philosophy behind the scales package. From the description of the scales package:

Scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Within the realm of scales, a transformation allows for a manipulation of the data space prior to its mapping to an aesthetic. In particular, it is responsible for

  • The mapping, in both directions, between the data space and an intermediate representation space
  • Providing a mechanism for determining “nice” breaks in the data space
  • Providing a mechanism for formatting the labels in the data space
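These three responsibilities can be seen by poking at one of the transformations that ships with scales. The following is only a sketch: it assumes the scales package is installed and that a transformation object exposes its pieces as the list elements transform, inverse, breaks and format, which is how trans_new assembles them in the versions I am aware of.

```r
library(scales)

# log10_trans() is one of the transformations provided by scales.
tr <- log10_trans()

# Mapping in both directions between the data space and the
# intermediate representation space.
tr$transform(c(1, 10, 100))
tr$inverse(c(0, 1, 2))

# Break finding and label formatting both work in the data space:
# the range passed in is in original units, not log10 units.
brks <- tr$breaks(c(1, 1000))
tr$format(brks)
```

The exact breaks and label text returned may vary between versions of scales, so no output is shown here.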

There are two main use cases for a transformation:

  • Taking an existing continuous scale and performing a functional transformation of it prior to mapping. For example, taking the logarithm, exponential, square root, reciprocal, inverse, etc. of a variable.
  • Providing a way of handling a variable of a type which represents a continuous quantity, but has specific structure and/or formatting conventions, typically represented with a class. Prototypical examples of this are dates and datetimes.

These variable transformations take place before any stats are performed on the data. In fact, in terms of their effect on the data, they are equivalent to using the transformed expression as the variable itself (though the axis breaks and labels are different). Quoting from ggplot2: Elegant Graphics for Data Analysis (page 100):

Of course, you can also perform the transformation yourself. For example, instead of using scale_x_log(), you could plot log10(x). That produces an identical result inside the plotting region, but the axis and tick labels won’t be the same. If you use a transformed scale, the axes will be labelled in the original data space. In both cases, the transformation occurs before the statistical summary.

I reproduce Figure 6.4 using the current version of the code because it is different than what was published.

qplot(log10(carat), log10(price), data=diamonds)

qplot(carat, price, data=diamonds) +
  scale_x_log10() + scale_y_log10()

Building blocks

The pieces that are needed to create a transformation are described on the help page for trans_new, but I’ll go through them in more detail.

transform and inverse

These are the workhorses of the transformation and define the functions that map from the original data space to the intermediate data space (transform) and back again (inverse). These can be specified as a function (an anonymous function or a function object) or as a character string which will cause a function of that name to be used (as determined by match.fun).
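As a minimal sketch of the two ways of specifying these functions (assuming the scales package is installed; the names "reciprocal_demo" and "exp_demo" are made up for this illustration and are not transformations provided by scales):

```r
library(scales)

# transform and inverse given as anonymous functions.
# The reciprocal happens to be its own inverse.
recip_demo <- trans_new(
  name      = "reciprocal_demo",   # hypothetical name
  transform = function(x) 1 / x,
  inverse   = function(x) 1 / x
)

# The same arguments given as character strings, which are looked up
# with match.fun(); "exp" and "log" are the base R functions.
exp_demo <- trans_new(
  name      = "exp_demo",          # hypothetical name
  transform = "exp",
  inverse   = "log"
)

recip_demo$transform(c(1, 2, 4))
exp_demo$inverse(exp_demo$transform(2))
```

In both cases what ends up stored in the transformation object is a function, so the string form is purely a convenience.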

Each of these functions should take a vector of values and return a vector of values of the same length. Calling inverse on the results of transform should result in the original vector (to within any error introduced by floating point arithmetic). That is, all.equal(inverse(transform(x)), x) should be TRUE for any x (for which transform is defined; see domain below).

Both of these functions are required.
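A quick way to convince yourself that a candidate pair of functions satisfies these requirements is to run them on a few values before handing them to trans_new. The pair below (square root and squaring) is just a stand-in chosen for illustration, and the test values are arbitrary points inside the region where the square root is defined.

```r
# Hypothetical transform/inverse pair used only for this check.
transform <- function(x) sqrt(x)
inverse   <- function(x) x^2

# Arbitrary test values for which transform is defined (x >= 0).
x <- c(0, 0.1, 2, 1e6)

# Same length out as in.
same_length <- length(transform(x)) == length(x)

# Round trip recovers the input, up to floating point error.
round_trip <- all.equal(inverse(transform(x)), x)

same_length
round_trip
```

Note that all.equal is used rather than identical, since the round trip is only expected to hold to within floating point tolerance.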

breaks

breaks is a function which takes a vector of length 2 which represents the range of the data, expressed in the original data space, that is to be represented. This will include any requested expansion, in addition to the actual data values. breaks should return a vector of whatever length it deems appropriate such that each break is represented by one element of the vector. Optionally, the vector can be a named vector. If it is, the default formatter will use the names as the displayed version of the values.
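To make the expected shape of such a function concrete, here is a small sketch. demo_breaks and "identity_demo" are hypothetical names invented for this example, and base::pretty is used only as a convenient way to get some breaks; whether the names are actually picked up as labels depends on the formatter in use, as described above.

```r
library(scales)

# Hypothetical break function: receives the range to be represented
# (a vector of length 2, in the original data space) and returns one
# element per break. Here the result is also a named vector.
demo_breaks <- function(rng) {
  b <- pretty(rng, n = 5)
  names(b) <- paste0(b, " units")
  b
}

# Calling it directly with a range.
demo_breaks(c(0.3, 9.7))

# Supplying it when building a transformation.
identity_demo <- trans_new(
  name      = "identity_demo",   # hypothetical name
  transform = identity,
  inverse   = identity,
  breaks    = demo_breaks
)
identity_demo$breaks(c(0.3, 9.7))
```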

In general, this is a hard problem, primarily because breaks should look “nice”, which is difficult for an algorithm to determine. Luckily, others have spent time working on the problem, and often much of what they have learned and implemented can be used without having to do much yourself. In particular, there are existing break determination algorithms in scales such as pretty_breaks (which is based on base::pretty) which find breaks for a simple numeric scale, extended_breaks which is ba

Link:

http://blog.ggplot2.org/post/25938265813

Tags:

ggplot2 scales

Authors:

brian-diggs

Date tagged:

03/15/2013, 20:07

Date published:

06/26/2012, 14:01