Advanced Data Reshaping in Python and R

Win-Vector Blog 2019-09-04

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

An Example

First we import some modules and packages, and import some notional data.

In [1]:
# https://pandas.pydata.org
import pandas

# PyYAML from https://pyyaml.org
import yaml

# pip install https://github.com/WinVector/data_algebra/raw/master/dist/data_algebra-0.1.3.tar.gz
import data_algebra.cdata
import data_algebra.cdata_impl
import data_algebra.data_ops
import data_algebra.yaml
import data_algebra.SQLite

# ask YAML to write simpler structures
data_algebra.yaml.fix_ordered_dict_yaml_rep()

# read our example data
iris = pandas.read_csv('iris_small.csv')
# print it out
iris
Out[1]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id 0 5.1 3.5 1.4 0.2 setosa 0 1 4.9 3.0 1.4 0.2 setosa 1 2 4.7 3.2 1.3 0.2 setosa 2
Our goal is to move from this de-normalized or wide-form (or “model matrix”/”data matrix” form, where each record is exactly one row) into a tall form where records may span more than one row.

Or, more concretely, we want our data to look like the following.

In [2]:
answer = pandas.read_csv("answer.csv")
answer
Out[2]:
id Species Part Measure Value 0 0 setosa Petal Length 1.4 1 0 setosa Petal Width 0.2 2 0 setosa Sepal Length 5.1 3 0 setosa Sepal Width 3.5 4 1 setosa Petal Length 1.4 5 1 setosa Petal Width 0.2 6 1 setosa Sepal Length 4.9 7 1 setosa Sepal Width 3.0 8 2 setosa Petal Length 1.3 9 2 setosa Petal Width 0.2 10 2 setosa Sepal Length 4.7 11 2 setosa Sepal Width 3.2
Notice each row of the original data set is now four rows of the derived one. This “tall form” is often useful for plotting.

This sort of conversion can be called an anti-pivot. In Python these sorts of transforms are specified with pandas.DataFrame.pivot, pandas.pivot_table, pandas.melt, and others. Many R packages supply related operators: reshape, reshape2, data.table, tidyr, cdata, and more. We feel the data_algebra and cdata methods (which only perform data-reshaping, and not aggregation) offer a number of significant advantages (some of which we have already mentioned, and a few more of which we will see demonstrated in this note).

Back to our example.

To specify our desired transform, we build a structure describing what a data record looks like. We can crib this specification from the answer as follows.

In [3]:
control_table = answer.loc[answer.id == 0, ['Part', 'Measure']]
control_table = control_table.reset_index(inplace=False, drop=True)
control_table["Value"] = [control_table['Part'][i] + '.' + control_table['Measure'][i] for 
                            i in range(control_table.shape[0])]
control_table
Out[3]:
Part Measure Value 0 Petal Length Petal.Length 1 Petal Width Petal.Width 2 Sepal Length Sepal.Length 3 Sepal Width Sepal.Width
We can derive the control table from the answer, as we did here, or just type one in directly. The idea is: we can use any method we want to derive the prototype record shape, we are not limited to a sub-query language or methodology from any one package.

For each record we take care to identify what keys identify records (the record_keys) and want parts identify rows within the record (the control_table_keys). We suppress the record_key when writing the control table, as these are exactly the values that do not vary inside each record.

We combine all of these specification into a RecordSpecification as follows:

In [4]:
record_spec = data_algebra.cdata.RecordSpecification(
    control_table,
    control_table_keys = ['Part', 'Measure'],
    record_keys = ['id', 'Species']
    )
record_spec
Out[4]:
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width
In this notation any cells of the control table from columns that are not control_table_keys are “value stand ins”. During data transfrom these cells will be replaced by values coming from the columns named by these cells.

Now we can transform our original row-record oriented data into general block records. To do this we specify a RecordMap using our record specification to describe the outgoing record structure. The incoming record structure is implicitly assumed to be single-row records, unless we specify otherwise (using the blocks_in argument).

In [5]:
mp_to_blocks = data_algebra.cdata_impl.RecordMap(blocks_out=record_spec)
print(str(mp_to_blocks))
Transform row records of the form:
  record_keys: ['id', 'Species']
 ['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']
to block records of structure:
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width

And we are ready to apply our specified transform.

In [6]:
arranged_blocks = mp_to_blocks.transform(iris)
arranged_blocks
Out[6]:
id Species Part Measure Value 0 0 setosa Petal Length 1.4 1 0 setosa Petal Width 0.2 2 0 setosa Sepal Length 5.1 3 0 setosa Sepal Width 3.5 4 1 setosa Petal Length 1.4 5 1 setosa Petal Width 0.2 6 1 setosa Sepal Length 4.9 7 1 setosa Sepal Width 3.0 8 2 setosa Petal Length 1.3 9 2 setosa Petal Width 0.2 10 2 setosa Sepal Length 4.7 11 2 setosa Sepal Width 3.2
We see the operation has been performed for us. Notice we specify the transform declaratively with data structures carrying deceptions of what we want, instead of having to build a sequence of verbs that realize the transformation.

Inverting the transform

An inverse transform is simply expressed by reversing the roles of the blocks_out and blocks_in arguments. In this case the output is row-records, as we didn’t specify an outgoing block structure with blocks_out.

In [7]:
mp_to_rows = data_algebra.cdata_impl.RecordMap(blocks_in=record_spec)
print(str(mp_to_rows))
Transform block records of structure:
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width
to row records of the form:
  record_keys: ['id', 'Species']
 ['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']

In [8]:
arranged_rows = mp_to_rows.transform(arranged_blocks)
arranged_rows
Out[8]:
id Species Petal.Length Petal.Width Sepal.Length Sepal.Width 0 0 setosa 1.4 0.2 5.1 3.5 1 1 setosa 1.4 0.2 4.9 3.0 2 2 setosa 1.3 0.2 4.7 3.2

Arbitrary transforms

Arbitrary record to record transforms can be specified by setting both blocks_in (to describe incoming structure) and blocks_out (to describe outgoing structure) at the same time.

Transforms in databases

data_algebra also implements all the transform steps in databases using SQL (via row_recs_to_blocks_query() and blocks_to_row_recs_query()).

These queries can be seen below.

In [9]:
db_model = data_algebra.SQLite.SQLiteModel()
In [10]:
print(db_model.row_recs_to_blocks_query(
    source_view=data_algebra.data_ops.describe_pandas_table(
        iris, 'iris'),
    record_spec=record_spec,
    record_view=data_algebra.data_ops.describe_pandas_table(
        record_spec.control_table, "control_table")
))
SELECT
 a."id" AS "id",
 a."Species" AS "Species",
 b."Part" AS "Part",
 b."Measure" AS "Measure",
 CASE
  WHEN b."Value" = 'Petal.Length' THEN a."Petal.Length"
  WHEN b."Value" = 'Petal.Width' THEN a."Petal.Width"
  WHEN b."Value" = 'Sepal.Length' THEN a."Sepal.Length"
  WHEN b."Value" = 'Sepal.Width' THEN a."Sepal.Width"
  ELSE NULL END AS "Value"
FROM (
  "iris" ) a
CROSS JOIN (
  "control_table" ) b
 ORDER BY a."id", a."Species", b."Part", b."Measure"
In [11]:
print(db_model.blocks_to_row_recs_query(
    source_view=data_algebra.data_ops.describe_pandas_table(
        iris, 'iris'),
    record_spec=record_spec
))
SELECT
 "id" AS "id",
 "Species" AS "Species",
 MAX(CASE WHEN  ( "Part" = 'Petal' )  AND  ( "Measure" = 'Length' )  THEN "Value" ELSE NULL END) AS "Petal.Length",
 MAX(CASE WHEN  ( "Part" = 'Petal' )  AND  ( "Measure" = 'Width' )  THEN "Value" ELSE NULL END) AS "Petal.Width",
 MAX(CASE WHEN  ( "Part" = 'Sepal' )  AND  ( "Measure" = 'Length' )  THEN "Value" ELSE NULL END) AS "Sepal.Length",
 MAX(CASE WHEN  ( "Part" = 'Sepal' )  AND  ( "Measure" = 'Width' )  THEN "Value" ELSE NULL END) AS "Sepal.Width"
FROM (
  "iris"
 )
 GROUP BY "id", "Species"
 ORDER BY "id", "Species"
The use case for SQL queries is: the queries can be used to with “CREATE TABLE table_name AS” to materialize transform results in a database, without round-tripping the data in and out of the database.

As complicated as the queries look, they actually expose some deep truths:

  • The row_recs_to_blocks_query() is essentially a cross-join of the data to the record description. Each combination of data row and record description row builds a new result row.
  • The blocks_to_row_recs_query() is an aggregation. Each set of rows corresponding to a given data record is aggregated into a single result row.
  • Just about any arbitrary record shape to arbitrary record shape can be written as a transform from the first record shape to row-records (record sets that have exactly one row per record), followed by a transform from the row-records to the new format. This transform can preserve column types as in the intermediate form each different record entry has its own column. This is an advantage of using a “thin” intermediate form such as RDF triples.

This leads us to believe that transforming to and from single-row records are in fact fundemental operations, and not just implementation details.

The R cdata version

The data_algebra had been previously implemented in R in the cdata, rquery, and rqdatatable packages. This packages support both in-memory operations and translation of the transforms to SQL.

We would perform the above transforms in R as follows.

First we write out a copy of the transform.

In [12]:
# convert objects to a YAML string
xform_rep = yaml.dump(mp_to_blocks.to_simple_obj())
# write to file
with open("xform_yaml.txt", "wt") as f:
    print(xform_rep, file=f)
Now we we start up our R environment and read in the transform specification and data.

In [13]:
%load_ext rpy2.ipython
In [14]:
%%R

# install.packages("cdata")
library(cdata)
library(yaml)
# https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata_yaml.R
source("cdata_yaml.R")

iris <- read.csv('iris_small.csv')
print(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
1          5.1         3.5          1.4         0.2  setosa  0
2          4.9         3.0          1.4         0.2  setosa  1
3          4.7         3.2          1.3         0.2  setosa  2
In [15]:
%%R

r_yaml <- yaml.load_file("xform_yaml.txt")
transform <- convert_yaml_to_cdata_spec(r_yaml)

print(transform)
{
 row_record <- wrapr::qchar_frame(
   "id"  , "Species", "Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width" |
     .   , .        , Petal.Length  , Petal.Width  , Sepal.Length  , Sepal.Width   )
 row_keys <- c('id', 'Species')

 # becomes

 block_record <- wrapr::qchar_frame(
   "id"  , "Species", "Part" , "Measure", "Value"      |
     .   , .        , "Petal", "Length" , Petal.Length |
     .   , .        , "Petal", "Width"  , Petal.Width  |
     .   , .        , "Sepal", "Length" , Sepal.Length |
     .   , .        , "Sepal", "Width"  , Sepal.Width  )
 block_keys <- c('id', 'Species', 'Part', 'Measure')

 # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
}

Now that we have recovered the transform, we can use it in R.

In [16]:
%%R

iris %.>% transform
   id Species  Part Measure Value
1   0  setosa Petal  Length   1.4
2   0  setosa Petal   Width   0.2
3   0  setosa Sepal  Length   5.1
4   0  setosa Sepal   Width   3.5
5   1  setosa Petal  Length   1.4
6   1  setosa Petal   Width   0.2
7   1  setosa Sepal  Length   4.9
8   1  setosa Sepal   Width   3.0
9   2  setosa Petal  Length   1.3
10  2  setosa Petal   Width   0.2
11  2  setosa Sepal  Length   4.7
12  2  setosa Sepal   Width   3.2

Cross-language work

As the record transform specifications, both in Python data_algebra and R cata are simple data structures (just the control table, and a few lists of key column names), they can be moved from one language to another by YAML (as we also demonstrated in the logistic scoring example.

data_algebra supplies a write method, so cross-language interoperation is just a matter of adding additional read/write methods.

In [17]:
print(yaml.dump(mp_to_blocks.to_simple_obj()))
type: data_algebra.cdata_impl.RecordMap
blocks_out:
  type: data_algebra.cdata.RecordSpecification
  record_keys:
  - id
  - Species
  control_table_keys:
  - Part
  - Measure
  control_table:
    Part:
    - Petal
    - Petal
    - Sepal
    - Sepal
    Measure:
    - Length
    - Width
    - Length
    - Width
    Value:
    - Petal.Length
    - Petal.Width
    - Sepal.Length
    - Sepal.Width

Conclusion

The cdata and data_algebra systems yield powerful implementations, and deep understanding of the nature of record transformations. They allow one to reshape data quickly and conveniently either in R. Python/Pandas, or even SQL.