R Tip: Use stringsAsFactors = FALSE

Win-Vector Blog 2018-03-17

R tip: use stringsAsFactors = FALSE.

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.

Sigmund Freud, it is often claimed, said: “Sometimes a cigar is just a cigar.”

To avoid problems, delay the re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames.

Example:

    d <- data.frame(label = rep("tbd", 5))
    d$label[[2]] <- "north"
    #> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
    #> invalid factor level, NA generated
    print(d)
    #>   label
    #> 1   tbd
    #> 2  <NA>
    #> 3   tbd
    #> 4   tbd
    #> 5   tbd

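Notice our new value was not copied in! The label column was converted to a factor whose only level is "tbd", so "north" is not a valid value and R stores NA instead.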
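To see why the assignment fails, it can help to inspect the column directly. The following is a minimal sketch, not part of the original example: it passes stringsAsFactors = TRUE explicitly so that it behaves the same way on R 4.0.0 and later, where the data.frame() default changed to FALSE.

```r
# Ask for factor conversion explicitly (this was the default before R 4.0.0).
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)

class(d$label)
#> [1] "factor"

# The set of permitted values was fixed when the data.frame was built.
levels(d$label)
#> [1] "tbd"

# "north" is not one of the levels, so assigning it can only produce NA.
"north" %in% levels(d$label)
#> [1] FALSE
```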

The fix is easy: use stringsAsFactors = FALSE.

    d <- data.frame(label = rep("tbd", 5),
                    stringsAsFactors = FALSE)
    d$label[[2]] <- "north"
    print(d)
    #>   label
    #> 1   tbd
    #> 2 north
    #> 3   tbd
    #> 4   tbd
    #> 5   tbd

As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than commonly preached.

Note: the above pattern of pre-building a data.frame and filling in values by addressing row/column index sets is a very effective (and underappreciated) way to build up data (often easier and quicker than binding rows or columns).
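A rough sketch of that pre-build-and-fill pattern might look like the following. The id and score columns, the size n, and the "south" value are made up for illustration; only the label column comes from the example above.

```r
n <- 3  # hypothetical number of rows, known up front

# Pre-allocate every column at its final length and type.
d <- data.frame(id = seq_len(n),
                label = rep("tbd", n),
                score = rep(NA_real_, n),
                stringsAsFactors = FALSE)

# Fill cells by row/column address instead of growing the frame with rbind().
d$label[[2]] <- "north"
d[3, "label"] <- "south"

# A logical index set fills several cells at once.
d$score[d$label != "tbd"] <- c(0.5, 1.5)

print(d)
```

Because label stays a plain character column, any string can be written into it, and the placeholder values ("tbd", NA) make unfilled cells easy to spot afterwards.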