data_algebra/rquery as a Category Over Table Descriptions

Win-Vector Blog 2019-12-14

Introduction

I would like to talk about some of the design principles underlying the data_algebra package (and also in its sibling rquery package).

The data_algebra package is a query generator that can act on either Pandas data frames or on SQL tables. This is discussed on the project site and the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that when they are done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.

(Note: if there are any blog-rendering problems the original source-workbook used to create this article can be found here.)

Background

We will introduce a few ideas before trying to synthesize our thesis.

The data_algebra

An introduction to the data_algebra package can be found here. In this note we will discuss some of the inspirations for the package: Codd’s relational algebra, experience working with dplyr at scale, sklearn Pipeline, and category theory.

sklearn Pipeline

A major influence on the data_algebra design is sklearn.pipeline.Pipeline. sklearn.pipeline.Pipeline itself presumably become public with Edouard Duchesnay’s Jul 27, 2010 commit: “add pipeline”.

sklearn.pipeline.Pipeline maintains a list of steps to be applied to data. What is interesting is the steps are not functions. Steps are instead objects that implement both a .transform() and a .fit() method.

.transform() typically accepts a data-frame type structure and returns a modified version. Typical operations include adding a new derived column, selecting columns, selecting rows, and so on.

From a transform-alone point of view the steps compose like functions. For list [s, t] transform(x) is defined to as:

   transform([s, t], x) := 
      t.transform(s.transform(x))

(the glyph “:=” denoting “defined as”). The fit-perspective is where things get interesting. obj.fit(x) changes the internal state of obj based on the value x and returns a reference to obj. I.e. it learns from the data and stores this result as a side-effect in the object itself. In sklearn it is common to assume a composite method called .fit_transform() often defined as:

   obj.fit_transform(x) := obj.fit(x).transform(x)

(though for our own vtreat package, this is deliberately not the case). Using .fit_transform() we can explain that in a sklearn Pipeline .fit() is naturally thought of as:

   fit([s, t], x]) := 
      t.fit(s.fit_transform(x))

My point is: sklearn.pipeline.Pipeline generalizes function composition to something more powerful: the ability to both fit and to transform. sklearn.pipeline.Pipeline is a natural way to store a sequence of parameterized or data-dependent data transform steps (such as centering, scaling, missing value imputation, and much more).

This gives as a concrete example of where rigid mindset where “function composition is the only form of composition” would not miss design opportunities.

Fist class citizens

We are going to try to design our tools to be “first class citizens” in the sense of Strachey:

First and second class objects. In ALGOL, a real number may appear in an expression or be assigned to a variable, and either of them may appear as an actual parameter in a procedure call. A procedure, on the other hand, may only appear in another procedure call either as the operator (the most common case) or as one of the actual parameters. There are no other expressions involving procedures or whose results are procedures. Thus in a sense procedures in ALGOL are second class citizens—they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter)…

Quoted in Wikipedia. Christopher Strachey, “Fundamental Concepts in Programming Languages” in Higher-Order and Symbolic Computation 13:11 (2000); though published in 2000, these are notes from lectures Strachey delivered in August, 1967

What we will draw out: is if our data transform steps are “first class citizens” we should expect to be able to store them in variables, compose them, examine them, and many other steps. A function that we can only use or even re-use is not giving us as much as we expect from other types. Or alternately, if functions don't give us everything we want, we may not want to use them as our only type or abstraction of data processing steps.

Composability

Most people first encounter the mathematical concept of “composability” in terms of functions. This can give the false impression that to work with composable design principles, one must shoe-horn the object of interest to be functions or some sort of augmented functions.

This Procrustean view loses a lot of design opportunities.

In mathematics composability is directly studied by the field called “Category Theory.” So it makes sense to see if category theory may have ideas, notations, tools, and results that may be of use.

Category Theory

A lot of the benefit of category theory is lost if every time we try to apply category theory (or even just use some of the notation conventions) we attempt to explain all of category theory as a first step. So I will try to resist that urge here. I will introduce the bits I am going to use here.

Category theory routinely studies what are called “arrows.” When treated abstractly an arrow has two associated objects called the “domain” and “co-domain.” The names are meant to be evocative of the “domain” (space of inputs) and “co-domains” (space of outputs) from the theory of functions.

Functions are commonly defined as having:

  • A domain, which is a set of values the function can be applied to.
  • A co-domain, which is a set of values the function evaluations are contained in (or range).
  • A fixed composition rule that if domain(g) is contained in co-domain(f) then: g . f is defined as the function such that (g . f)(x) = g(f(x)) for all values in the domain of f. Mathematical functions are usually thought of as a specialization of binary relation, and considered to be uniquely determined by their evaluations (by the axiom of extensionality).

Packages that use function composition typically collect functions in lists and define operator composition either through lambda-abstraction or through list concatenation.

Category theory differs from function theory in that category theory talks about arrows instead of functions. The theory is careful to keep separate the following two concepts: what arrows are and how arrows are composed.

When using arrows to model a system we expect to be able to specify, with some extra degrees of freedom in specifying:

  • How to choose domains and co-domains, for categories these do not have to be sets.
  • How arrows compose. For arrows a and b with co-domain(b) = domain(a) then: a . b denotes the composition in the category, and is itself a new arrow in the same category. Composition is not allowed (or defined) when co-domain(b) != domain(a).
  • How arrows act on items. Arrows can have multiple actions, and arrows can act on items that are not elements of their domains.

An action is a mapping from arrows and items to items. I.e. action(arrow, item) = new_item. For categories the items may or may not be related to the domain and co-domain. Not all categories have actions, but when they do have actions the action must be compatible with arrow composition.

Good general references on category theory, include:

  • Steve Awodey, Category Theory, 2nd Edition, Oxford University Press; 2010.
  • Emily Riehl, Category Theory in Context, Dover, 2016.
  • Saunders Mac Lane, Categories for the Working Mathematician, 2nd Edition, Springer, 1978.

Functions have a very ready category theory interpretation as arrows. Given a function f with domain A and co-domain B, we can think of any triple (f, A', B') as an arrow in a category of functions if A' contained in A and B contained in B'. In this formulation we define the arrow composition of (f, A', B') and (g, C', D') as (f . g, C', B') where f . gis defined to be the function such that for all x in domain x we have:

   (f . g)(x) := f(g(x)) 

We will call the application of a function to a value as an example of an “action.” A function f() “acts on its domain” and f(x) is the action of f on x. For functions we can define the action “apply” as:

   apply(f, x) := f(x)

The extra generalization power we get from moving away from functions to arbitrary arrows (that might not correspond to functions) comes from the following:

  • Arrow composition does not have to be function composition.
  • Arrows can have multiple actions, and act on items that are not elements of their domain.
  • Arrows have a notion of equality, but this notion can differ from having identical actions.

To be a category a few conditions must be met, including: the composition must be associative and we must have some identity arrows. By “associative composition” we mean, it must be the case that for arrows a, b, and c (with appropriate domains and co-domains):

   (a . b) . c = a . (b . c) 

Our action must also associate with arrow composition. That is we must have for values x we must have for co-variant actions:

   apply(a . b, x) = apply(a, apply(b, x))

Or for contra-variant actions:

   apply(a . b, x) = apply(b, apply(a, x))

The idea is: the arrow a . b must have an action equal to the actions of a and b composed as functions. That is: arrow composition and actions can differ from function composition and function application, but they must be at least somewhat similar in that they remain associative.

We now have the background to see that category theory arrows differ from functions in that arrows are more general (we can pick more of their properties) and require a bit more explicit bookkeeping.

Back to sklearn.pipeline.Pipeline

We now have enough notation to attempt a crude category theory description of sklearn.pipeline.Pipeline.

Define our sklearn.pipeline.Pipeline category P as follows:

  • We have only one object called 0. All arrows will have domain and co-domain equal to 0, i.e.: we are not doing any interesting pre-condition checking in this category. This sort of category is called a “monoid.”
  • The arrows of our category are lists of steps. Steps are again Python objects that define .transform(), .fit(), and .fit_transform() methods.
  • Composition a1 . a2 is defined as the list concatenation: a2 + a1. “+” being Python's list concatenate in this case, and the order set to match sklearn.pipeline.Pipeline list order convention.
  • We define an action called “transform_action” defined as:

       transform_action([step1, step2, ..., stepk], x) := 
          stepk.transform(... step2.transform(step1.transform(x)) )

To see this is a category (and a category compatible action) we must check associativity of the composition (which in this case is list concatenation) and associativity of the action with respect to list concatenation.

We can also try to model the .fit_transform() methods. We will not try to model the side-effect that .fit_transform() changes state of the arrows (to have the fit information in each step). But we can at least define an action (with side effects) as follows:

  • We define an action called “fit_transform” defined as:

       fit_transform_action([step1, step2, ..., stepk], x) := 
          stepk.fit_transform(... step2.fit_transform(step1.fit_transform(x)) )

To confirm this is an action (ignoring the side-effects), we would want check is if the following equality holds or not:

  fit_transform_action(a . b, x) =
      fit_transform_action(b, fit_transform_action(a, x))
   

The above should follow by brute pushing notation around (assuming we have defined fit_transform_action correctly, and sufficiently characterized .fit_transform()).

Notice we didn't directly define a “fit_action” action, as it isn't obvious that has a not obvious that has a nice associative realization. This an opportunity for theory to drive design; notation considerations hint that fit_transform() may be more fundamental than, and thus preferred over, fit().

The category theory concepts didn't so-much design sklearn.pipeline.Pipeline, but give us a set of criteria to evaluate sklearn.pipeline.Pipeline design. We trust the category theory point of view is useful as it emphasizes associativity (which is a great propriety to have), and is routinely found to be a set of choices that work in complicated systems. The feeling being: the design points category theory seems to suggest, turn out to be the ones you want down the round.

The data_algebra

Now that we have some terminology, lets get back to the data_algebra

What is the data_algebra?

  • data_algebra is a package for building up complex data manipulation queries data_algebra queries are first class citizens in the Strachey sense (can be: passed as an argument, returned from a function, modified, assigned to a variable, printed, inspected, and traversed as a data structure).
  • The operators are essentially those of the Codd relational algebra (select rows/columns, join, union-all, extend, project, and window functions).
  • Composition is left to right using method chaining.
  • Queries can be realized by transforming to SQL (targeting PostgeSQL, Spark, and other implementations), or as acting on Pandas data (we are hoping to extend this to modin, RAPIDS, and others).
  • The data_algebra has an R sibling package group (rquery/rqdatatable) similar to dplyr.

An introduction to the data_algebra can be found here.

We now have the terminology to concisely state a data_algebra design principle: use general concepts (such as category theory notation) to try and ensure data_algebra transforms are first class citizens (i.e. we can do a lot with them and to them).

The naive functional view

If we were to again take a mere functional view of the data_algebra we would say the data_algebra is a set of functions that operate on data. They translate data frames to new data frames using Codd-inspired operations. We could think of the data_algebra as acting on data on the right, and acting on data_algebra operators on the left.

However, this is not the right abstraction. data_algebra methods primarily map data transforms to data transforms. However even this is a “too functional view”. It makes sense to think of data_algebra operators as arrows, and the whole point of arrows is composition.

The categorical view

The data_algebra can be mapped to a nice category. The idea being something that can be easily mapped to an orderly system, is it self likely an somewhat orderly system.

Good references on the application of category theory to concrete systems (including databases) include:

  • David I. Spivak, Category Theory for the Sciences; The MIT Press, 2014.
  • Brendan Fong, David I. Spivak, An Invitation to Applied Category Theory: Seven Sketches in Compositionality; Cambridge University Press, 2019.

Our data_algebra category D is defined as follows.

  • The objects of our category are single table schemas. By “single table schema” mean mean only the list of column names (and optionally column types) for a set of named tables. We are not modeling invariants, or cross-table relations.
  • The arrows of our category are data_algebra operator chains.
  • Composition of arrows in our category is a very general query composition. We will demonstrate query composition in a bit, but as a hint it is not function composition or list concatination.

Some notes on the category theory interpretation of the data_algebra package can be found here.

Let's demonstrate the above with Python code. The data_algebra allows for the specification of data transforms as first class objects.

First we import some modules and create some example data.

from data_algebra.data_ops import *
import pandas

d = pandas.DataFrame({
    'x': [1, 2, 3],
    'y': [3, 4, 4],
})

d
x y 0 1 3 1 2 4 2 3 4

To specify adding a new derived column z we would write code such as the following.

td = describe_table(d)

a = td.extend(
    { 'z': 'x.mean()' },
    partition_by=['y']
)

a

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) . extend({ 'z': 'x.mean()'}, partition_by=['y'])

We can let this transform act on data.

a.transform(d)
x y z 0 1 3 1.0 1 2 4 2.5 2 3 4 2.5

We can compose this transform with more operations to create a composite transform.

b = a.extend({
    'ratio': 'y / x'
})

b

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) . extend({ 'z': 'x.mean()'}, partition_by=['y']) . extend({ 'ratio': 'y / x'})

As a bonus we can also map the above transform to a SQL query representing the same action in databases.

from data_algebra.SQLite import SQLiteModel

print(
    b.to_sql(db_model=SQLiteModel(), pretty=True)
)

SELECT "x",
       "y",
       "z",
       "y" / "x" AS "ratio"
FROM
  (SELECT "x",
          "y",
          avg("x") OVER (PARTITION BY "y") AS "z"
   FROM ("data_frame") "SQ_0") "SQ_1"

All of this is the convenient interface we expect users will want. However, if we asked that all operators specified their expected input schema (or their domain) we have the category D. We don't expect users to do such, but we have code supporting this style of notation to show that the data_algebra is in fact related to a nice category over schemas.

Lets re-write the above queries as formal category arrows.

from data_algebra.arrow import *

a1 = DataOpArrow(a)

print(str(a1))

[ 'data_frame': [ x: <class 'numpy.int64'>, y: <class 'numpy.int64'> ] -> [ x, y, z ] ]

The above is rendering the arrow as just its domain and co-domain. The domain and co-domains are just single-table schemas: lists of column names (possibly with column types).

We can get a more detailed representation of the arrow as follows.

print(a1.__repr__())

DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) . extend({ 'z': 'x.mean()'}, partition_by=['y']), free_table_key='data_frame')

Or we can examine the domain and co-domain directly. Here we are using a common category theory trick: associating the object with the identity arrow of the object. So what we are showing as domain and co-domains are actually identity arrows instead of objects.

a1.dom()

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']), free_table_key='')

a1.cod()

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']), free_table_key='')

Now we can write our second transform step as an arrow as follows.

a2 = DataOpArrow(a1.cod_as_table().extend({
    'ratio': 'y / x'
}))

a2

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']) . extend({ 'ratio': 'y / x'}), free_table_key='')

We took extra steps, that most users will not want to take, to wrap the second-stage (a2) operations as an arrow. Being an arrow means that we have a domain and co-domain that can be used to check if operations are composable.

A typical user would not work with arrow, but instead work with the data algebra which itself is a shorthand for the arrows. That is: the users may want the power of a category, but they don't want to be the one handling the extra bookkeeping. To add an extra operation a user would work directly with a and just write the following.

a.extend({
    'ratio': 'y / x'
})

TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) . extend({ 'z': 'x.mean()'}, partition_by=['y']) . extend({ 'ratio': 'y / x'})

The above has substantial pre-condition checking and optimizations (as it is merely user facing shorthand for the arrows).

The more cumbersome arrow notation (that requires the specification of pre-conditions) has a payoff: managed arrow composition. That is: complex operator pipelines can be directly combined. We are not limited to extending one operation at a time.

If the co-domain of arrow matches the domain of another arrow we can compose them left to right as follows.

a1.cod() == a2.dom()

True

composite = a1 >> a2

composite

DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) . extend({ 'z': 'x.mean()'}, partition_by=['y']) . extend({ 'ratio': 'y / x'}), free_table_key='data_frame')

And when this isn't the case, composition is not allowed. This is exactly what we want as this means the preconditions (exactly which columns are present) for the second arrow are not supplied by the first arrow.

a2.cod() == a1.dom()

False

try:
    a2 >> a1
except ValueError as e:
    print("Caught: " + str(e))

Caught: extra incoming columns: {'ratio', 'z'}

An important point is: for this arrow notation composition is not mere list concatenation or function composition. Here is an example that makes this clear.

b1 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'x': 'x + 1',
    'y': 7
}))

b1

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) . extend({ 'x': 'x + 1', 'y': '7'}), free_table_key='')

b2 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'y': 9
}))

Now watch what happens when we use “>>” to compose the arrow b1 and b2.

b1 >> b2

DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) . extend({ 'x': 'x + 1', 'y': '9'}), free_table_key='')

Notice in this special case the composition of b1 and b2 is a single extend node combining the operations and eliminating the dead-value 7. The idea is: the package has some freedom to define composition as long as it is associative. In this case we have an optimization at the compose step so the composition is not list concatenation or function composition.

As we have said, a typical user will not take the time to establish pre-conditions on steps. So they are not so much working with arrows but with operators that can be specialized to arrows. An actual user might build up the above pipeline as follows.


TableDescription(column_names=['x', 'y'], table_name=None). \
   extend({
    'x': 'x + 1',
    'y': 7
    }). \
   extend({
    'y': 9
    })
    

TableDescription( table_name='', column_names=[ 'x', 'y']) . extend({ 'x': 'x + 1', 'y': '9'})

We recently demonstrated this sort of optimization in the R rquery package.

In the above example the user still benefits from the category theory design. As they composed left to right the system was able to add in the pre-conditions for them. The user only needs to set pre-conditions for non-trivial right-hand side pipelines.

Conclusion

The advantage the data_algebra package gets from category theory is: it lets us design the package action (how the package works on data) somewhat independently from operator composition. This gives us a lot more design room and power than a strict function composition or list concatenation theory would give us.