Win-Vector LLC announces new “big data in R” tools
Win-Vector Blog 2017-11-29
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):
- partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
- if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.
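To give a feel for what "sequencing mutate steps" means, here is a minimal base-R sketch of the idea. It is not seplyr's implementation or API: the helper names partition_assignments() and apply_blocks() are hypothetical, and the greedy strategy shown is an assumption chosen for brevity. The sketch splits an ordered set of assignments into blocks so that no expression reads (or overwrites) a value assigned earlier in the same block, then applies the blocks one at a time.

```r
# Toy planner (hypothetical helper, not part of seplyr):
# split ordered assignments (name = expression string) into consecutive
# blocks so that no step uses or re-assigns a name set in the same block.
partition_assignments <- function(exprs) {
  blocks <- list()
  current <- character(0)
  for (nm in names(exprs)) {
    used <- all.vars(parse(text = exprs[[nm]]))
    # Close the current block if this step depends on, or overwrites,
    # something the current block has already assigned.
    if (any(c(used, nm) %in% names(current))) {
      blocks[[length(blocks) + 1]] <- current
      current <- character(0)
    }
    current[nm] <- exprs[[nm]]
  }
  if (length(current) > 0) {
    blocks[[length(blocks) + 1]] <- current
  }
  blocks
}

# Apply each block in turn (hypothetical helper). Within a block every
# expression is evaluated against the data as it stood at block start,
# mimicking a backend that cannot see same-step assignments.
apply_blocks <- function(d, blocks) {
  for (block in blocks) {
    new_cols <- lapply(block, function(e) eval(parse(text = e), d))
    for (nm in names(new_cols)) {
      d[[nm]] <- new_cols[[nm]]
    }
  }
  d
}

steps <- c(
  a1 = "x + 1",
  a2 = "x * 2",
  b1 = "a1 + a2",
  b2 = "b1 * 2"
)
plan <- partition_assignments(steps)
res <- apply_blocks(data.frame(x = 1:3), plan)
print(plan)
print(res)
```

With these inputs the sketch produces three blocks: a1 and a2 together, then b1, then b2. A real planner can do better than this greedy pass (for example by reordering independent steps to reduce the number of blocks); see the seplyr documentation for what partition_mutate_se() and partition_mutate_qt() actually do.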
![Blacksmith_working.jpg Blacksmith working](https://i2.wp.com/www.win-vector.com/blog/wp-content/uploads/2017/11/Blacksmith_working.jpg?resize=598%2C398)
Image by Jeff Kubina from Columbia, Maryland, CC BY-SA 2.0
For “big data in R” users these two function families (plus the included support functions and examples) are simple, yet game-changing. These tools were developed by Win-Vector LLC to fill gaps identified by Win-Vector and our partners when standing up production-scale R plus Apache Spark projects.
We are happy to share these tools as open source, and very interested in consulting with your teams on developing R/Spark solutions (including porting existing SAS code). For more information please reach out to Win-Vector.
To help teams get started we are supplying the following initial documentation, discussion, and examples: