Avoiding embarrassment by testing data assumptions with expectdata

Numbers Rule Your World 2019-10-21

[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Expectdata is an R package that makes it easy to test assumptions about a data frame before conducting analyses. Below is a concise tour of some of the data assumptions expectdata can test for you. For example,

Note: assertr is an ropensci project that aims to have similar functionality. Pros and cons haven’t been evaluated yet, but ropensci is a big pro for assertR.

Check for unexpected duplication

library(expectdata)expect_no_duplicates(mtcars, "cyl")#> [1] "top duplicates..."#> # A tibble: 3 x 2#> # Groups:   cyl [3]#>     cyl     n#>    #> 1     8    14#> 2     4    11#> 3     6     7#> Error: Duplicates detected in column: cyl

The default return_df == TRUE option allows for using these function as part of a dplyr piped expression that is stopped when data assumptions are not kept.

library(dplyr, warn.conflicts = FALSE)library(ggplot2)mtcars %>%   filter(cyl == 4) %>%   expect_no_duplicates("wt", return_df = TRUE) %>%   ggplot(aes(x = wt, y = hp, color = mpg, size = mpg)) +  geom_point()#> [1] "no wt duplicates...OK"

If there are no expectations violated, an “OK” message is printed.

After joining two data sets you may want to verify that no unintended duplication occurred. Expectdata allows comparing pre- and post- processing to ensure they have the same number of rows before continuing.

expect_same_number_of_rows(mtcars, mtcars, return_df = FALSE)#> [1] "Same number of rows...OK"expect_same_number_of_rows(mtcars, iris, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE)#> Warning: Different number of rows: 32 vs: 150# can also compare to no df2 to check is zero rowsexpect_same_number_of_rows(mtcars, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE) #> Warning: Different number of rows: 32 vs: 0

Can see how the stop_if_fail = FALSE option will turn failed expectations into warnings instead of errors.

Check for existance of problematic rows

Comparing a data frame to an empty, zero-length data frame can also be done more explicitly. If the expectations fail, cases can be shown to begin the next step of exploring why these showed up.

expect_zero_rows(mtcars[mtcars$cyl == 0, ], return_df = TRUE)#> [1] "No rows found as expected...OK"#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb#> <0 rows> (or 0-length row.names)expect_zero_rows(mtcars$cyl[mtcars$cyl == 0])#> [1] "No rows found as expected...OK"#> numeric(0)expect_zero_rows(mtcars, show_fails = TRUE)#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1#> Error: Different number of rows: 32 vs: 0

This works well at the end of a pipeline that starts with a data frame, runs some logic to filter to cases that should not exist, then runs expect_zero_rows() to check no cases exist.

# verify no cars have zero cylindarsmtcars %>%   filter(cyl == 0) %>%   expect_zero_rows(return_df = FALSE)#> [1] "No rows found as expected...OK"

Can also check for NAs in a vector, specific columns of a data frame, or a whole data frame.

expect_no_nas(mtcars, "cyl", return_df = FALSE)#> [1] "Detected 0 NAs...OK"expect_no_nas(mtcars, return_df = FALSE)#> [1] "Detected 0 NAs...OK"expect_no_nas(c(0, 3, 4, 5))#> [1] "Detected 0 NAs...OK"#> [1] 0 3 4 5expect_no_nas(c(0, 3, NA, 5))#> Error: Detected 1 NAs

Several in one dplyr pipe expression:

mtcars %>%   expect_no_nas(return_df = TRUE) %>%   expect_no_duplicates("wt", stop_if_fail = FALSE) %>%   filter(cyl == 4) %>%   expect_zero_rows(show_fails = TRUE)#> [1] "Detected 0 NAs...OK"#> [1] "top duplicates..."#> # A tibble: 2 x 2#> # Groups:   wt [2]#>      wt     n#>    #> 1  3.44     3#> 2  3.57     2#> Warning: Duplicates detected in column: wt#>    mpg cyl  disp hp drat    wt  qsec vs am gear carb#> 1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1#> 2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2#> 3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2#> 4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1#> 5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2#> 6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1#> Error: Different number of rows: 11 vs: 0

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) {var s = d.createElement(t); s.type = 'text/javascript'; s.async = true;s.src = '//cdn.viglink.com/api/vglnk.js';var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.