What’s new in R 4.4.0?

R-bloggers 2024-04-25

[This article was first published on The Jumping Rivers Blog, and kindly contributed to R-bloggers].

R 4.4.0 (“Puppy Cup”) was released on the 24th April 2024 and it is a beauty. In time-honoured tradition, here we summarise some of the changes that caught our eyes. R 4.4.0 introduces some cool features (one of which is experimental) and makes one of our favourite {rlang} operators available in base R. There are a few things you might need to be aware of regarding handling NULL and complex values.

The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.

A tail-recursive tale

Years ago, before I’d caused my first stack overflow, my Grandad used to tell me a daft tale:

It was on a dark and stormy night,
And the skipper of the yacht said to Antonio,
"Antonio, tell us a tale",
So Antonio started as follows...
It was on a dark and stormy night,
And the skipper of the yacht .... [ad infinitum]

The tale carried on in this way forever. Or at least it would until you were finally asleep.

At around the same age, I was toying with BASIC programming and could knock out classics such as

>10 PRINT "Ali stinks!"
>20 GOTO 10

Burn! Infinite burn!

Those were two examples of processes that demonstrate recursion. Antonio’s tale quotes itself recursively, and my older brother will be repeatedly mocked unless someone intervenes.

Recursion is an elegant approach to many programming problems – this usually takes the form of a function that can call itself. You would use it when you know how to get closer to a solution, but not necessarily how to get directly to that solution. And unlike the un-ending examples above, when we write recursive solutions to computational problems, we include a rule for stopping.

An example from mathematics would be finding zeros for a continuous function. The sine function provides a typical example:

Graph of the sine function between 0 and 2*pi

We can see that when x = π, there is a zero for sin(x), but the computer doesn’t know that.

One recursive solution to finding the zeros of a function, f(), is the bisection method, which iteratively narrows a range until it finds a point where f(x) is close enough to zero. Here’s a quick implementation of that algorithm. If you need to perform root-finding in R, please don’t use the following function. stats::uniroot() is much more robust…

bisect = function(f, interval, tolerance, iteration = 1, verbose = FALSE) {
  if (verbose) {
    msg = glue::glue(
      "Iteration {iteration}: Interval [{interval[1]}, {interval[2]}]"
    )
    message(msg)
  }
  # Evaluate 'f' at either end of the interval and return
  # any endpoint where f() is close enough to zero
  lhs = interval[1]; rhs = interval[2]
  f_left = f(lhs); f_right = f(rhs)
  if (abs(f_left) <= tolerance) {
    return(lhs)
  }
  if (abs(f_right) <= tolerance) {
    return(rhs)
  }
  stopifnot(sign(f_left) != sign(f_right))
  # Bisect the interval and rerun the algorithm
  # on the half-interval where y=0 is crossed
  midpoint = (lhs + rhs) / 2
  f_mid = f(midpoint)
  new_interval = if (sign(f_mid) == sign(f_left)) {
    c(midpoint, rhs)
  } else {
    c(lhs, midpoint)
  }
  bisect(f, new_interval, tolerance, iteration + 1, verbose)
}

We know that π is somewhere between 3 and 4, so we can find the zero of sin(x) as follows:

bisect(sin, interval = c(3, 4), tolerance = 1e-4, verbose = TRUE)
#> Iteration 1: Interval [3, 4]
#> Iteration 2: Interval [3, 3.5]
#> Iteration 3: Interval [3, 3.25]
#> Iteration 4: Interval [3.125, 3.25]
#> Iteration 5: Interval [3.125, 3.1875]
#> Iteration 6: Interval [3.125, 3.15625]
#> Iteration 7: Interval [3.140625, 3.15625]
#> Iteration 8: Interval [3.140625, 3.1484375]
#> Iteration 9: Interval [3.140625, 3.14453125]
#> Iteration 10: Interval [3.140625, 3.142578125]
#> Iteration 11: Interval [3.140625, 3.1416015625]
#> [1] 3.141602

It takes 11 iterations to get to a point where sin(x) is within 10⁻⁴ of zero. If we tightened the tolerance, had a more complicated function, or had a less precise starting range, it might take many more iterations to approximate a zero.
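
For comparison, here is a minimal sketch of the same search using stats::uniroot(), the more robust base-R root finder mentioned above. The tolerance used here is an illustrative assumption rather than a recommendation.

```r
# Find the zero of sin(x) between 3 and 4 using base R's root finder
result = uniroot(sin, interval = c(3, 4), tol = 1e-8)

# uniroot() returns a list; the estimated zero is stored in `root`
result$root

# The number of iterations that were needed is reported in `iter`
result$iter
```

Unlike bisect(), uniroot() also reports the value of the function at the estimated root (f.root) and an estimate of the precision (estim.prec).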

Importantly, this is a recursive algorithm - in the last statement of the bisect() function body, we call bisect() again. The initial call to bisect() (with interval = c(3, 4)) has to wait until the second call to bisect() (interval = c(3, 3.5)) completes before it can return (which in turn has to wait for the third call to return). So we have to wait for 11 calls to bisect() to complete before we get our result.

Those function calls get placed on a computational object named the call stack. For each function call, this stores details about how the function was called and where from. While waiting for the first call to bisect() to complete, the call stack grows to include the details about 11 calls to bisect().
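
The growth of the call stack can be observed directly. The following is a small sketch rather than part of the bisection example: count_frames() is a hypothetical helper that recurses a fixed number of times and then reports how deep the stack is, using base R's sys.nframe().

```r
# Hypothetical helper: recurse until `n` reaches `max_iter`, then
# report the number of the current frame on the call stack
count_frames = function(n, max_iter) {
  if (n >= max_iter) {
    return(sys.nframe())
  }
  count_frames(n + 1, max_iter)
}

# Stack depth after a single call versus after 11 nested calls
shallow = count_frames(1, 1)
deep = count_frames(1, 11)

# Each extra recursive call adds one frame to the stack
deep - shallow
#> [1] 10
```

The difference is taken between two runs so that the result does not depend on how many frames were already on the stack when the code was started.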

Imagine our algorithm didn’t just take 11 function calls to complete, but thousands, or millions. The call stack would get really full and this would lead to a “stack overflow” error.

We can demonstrate a stack-overflow in R quite easily:

blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  blow_up(n + 1, max_iter)
}

The recursive function behaves nicely when we only use a small number of iterations:

blow_up(1, max_iter = 100)
#> [1] "Finished!"

But the call-stack gets too large and the function fails when we attempt to use too many iterations. Note that R raises an error about the size of the call-stack before we actually reach its limit, so the R process can continue after exploding the call-stack.

blow_up(1, max_iter = 1000000)
# Error: C stack usage 7969652 is too close to the limit

In R 4.4, we are getting (experimental) support for tail-call recursion. This allows us (in many situations) to write recursive functions that won’t explode the size of the call stack.

How can that work? In our bisect() example, we still need to make 11 calls to bisect() to get a result that is close enough to zero, and those 11 calls will still need to be put on the call-stack.

Remember the first call to bisect()? It called bisect() as the very last statement in its function body. So the value returned by the second call to bisect() was returned to the user without modification by the first call. So we could return the second call’s value directly to the user, instead of returning it via the first bisect() call; indeed, we could remove the first call to bisect() from the call stack and put the second call in its place. This would prevent the call stack from expanding with recursive calls.

The key to this (in R) is to use the new Tailcall() function. That tells R “you can remove me from the call stack, and put this cat on instead”. Our final line in bisect() should look like this:

bisect = function(...) {
  ... snip ...
  Tailcall(bisect, f, new_interval, tolerance, iteration + 1, verbose)
}

Note that you are passing the name of the recursively-called function into Tailcall(), rather than a call to that function (bisect rather than bisect(...)).

To illustrate that the stack no longer blows up when tail-call recursion is used, let’s rewrite our blow_up() function:

# R 4.4.0
blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  Tailcall(blow_up, n + 1, max_iter)
}

We can still successfully use a small number of iterations:

blow_up(1, 100)
#> [1] "Finished!"

But now, even a million iterations of the recursive function can be performed:

blow_up(1, 1000000)
#> [1] "Finished!"

Note that the tail-call optimisation only works here because the recursive call was made as the very last step in the function body. If your function needs to modify the value after the recursive call, you may not be able to use Tailcall().
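
One common way around this restriction is to carry the partial result along as an extra argument, so that the recursive call becomes the final step. The sketch below illustrates the idea with a hypothetical pair of functions, sum_to() and sum_to_tail(), which are our own invention for illustration. It assumes R >= 4.4 for Tailcall().

```r
# R 4.4.0
# Not tail-recursive: the addition happens *after* the recursive
# call returns, so every call has to stay on the stack
sum_to = function(n) {
  if (n == 0) {
    return(0)
  }
  n + sum_to(n - 1)
}

# Tail-recursive rewrite: the running total is passed along in a
# hypothetical accumulator argument, `acc`
sum_to_tail = function(n, acc = 0) {
  # Evaluate the accumulator now, so that lazily-evaluated
  # arguments don't pile up until the final call
  force(acc)
  if (n == 0) {
    return(acc)
  }
  Tailcall(sum_to_tail, n - 1, acc + n)
}

sum_to(100)
#> [1] 5050
sum_to_tail(100)
#> [1] 5050

# A depth that would be risky for the non-tail-recursive version
sum_to_tail(100000)
```

The force(acc) line is a design choice rather than a requirement of Tailcall(): because R evaluates arguments lazily, an accumulator that is never inspected would otherwise build up a long chain of unevaluated expressions.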

Rejecting the NULL

Missing values are everywhere.

In a typical dataset you might have missing values encoded as NA (if you’re lucky) and invalid numbers encoded as NaN, you might have implicitly missing rows (for example, a specific date missing from a time series) or factor levels that aren’t present in your table. You might even have empty vectors, or data-frames with no rows, to contend with. When writing functions and data-science workflows where the input data may change over time, programming defensively and handling these kinds of edge-cases means your code will throw up fewer surprises in the long run. You don’t want a critical report to fail because a mathematical function you wrote couldn’t handle a missing value.

When programming defensively with R, there is another important form of missingness to be cautious of …

The NULL object.

NULL is an actual object. You can assign it to a variable, combine it with other values, index into it, pass it into (and return it from) a function. You can also test whether a value is NULL.

# Assignment
my_null = NULL
my_null
#> NULL
# Use in functions
my_null[1]
#> NULL
c(NULL, 123)
#> [1] 123
c(NULL, NULL)
#> NULL
toupper(NULL)
#> character(0)
# Testing NULL-ness
is.null(my_null)
#> [1] TRUE
is.null(1)
#> [1] FALSE
identical(my_null, NULL)
#> [1] TRUE
# Note that the equality operator shouldn't be used to
# test NULL-ness:
NULL == NULL
#> logical(0)

R functions that are solely called for their side-effects (write.csv() or message(), for example) often return a NULL value. Other functions may return NULL as a valid value - one intended for subsequent use. For example, list-indexing (which is a function call, under the surface) will return NULL if you attempt to access an undefined value:

config = list(user = "Russ")
# When the index is present, the associated value is returned
config$user
#> [1] "Russ"
# But when the index is absent, a `NULL` is returned
config$url
#> NULL

Similarly, you can end up with a NULL output from an incomplete stack of if / else clauses:

language = "Polish"
greeting = if (language == "English") {
  "Hello"
} else if (language == "Hawaiian") {
  "Aloha"
}
greeting
#> NULL
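
One defensive option is to end the chain with an explicit else branch, so that an unhandled input fails loudly instead of silently producing NULL. A minimal sketch, using a hypothetical greeting_for() helper:

```r
# Hypothetical helper: every input either yields a greeting or an error
greeting_for = function(language) {
  if (language == "English") {
    "Hello"
  } else if (language == "Hawaiian") {
    "Aloha"
  } else {
    # Fail explicitly rather than fall through to an implicit NULL
    stop("No greeting defined for language: ", language)
  }
}

greeting_for("Hawaiian")
#> [1] "Aloha"
```

With this version, greeting_for("Polish") stops with an error message naming the unsupported language, which is usually easier to debug than a NULL that surfaces several steps later.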

A common use for NULL is as a default argument in a function signature. A NULL default is often used for parameters that aren’t critical to function evaluation. For example, the function signature for matrix() is as follows:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

The dimnames parameter isn’t really needed to create a matrix, but when a non-NULL value for dimnames is provided, the values are used to label the rows and columns of the created matrix.

matrix(1:4, nrow = 2)
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
matrix(1:4, nrow = 2, dimnames = list(c("2023", "2024"), c("Jan", "Feb")))
#>      Jan Feb
#> 2023   1   3
#> 2024   2   4
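
The same pattern is easy to follow in your own functions. The following sketch uses a hypothetical label_values() function (not from any package) with an optional argument that defaults to NULL and is only used when the caller supplies it.

```r
# Hypothetical function: `labels` is optional and defaults to NULL
label_values = function(x, labels = NULL) {
  # Only attach names when the caller provided some
  if (!is.null(labels)) {
    names(x) = labels
  }
  x
}

# Without labels, the input is returned unchanged
label_values(1:2)
#> [1] 1 2

# With labels, the values are named
names(label_values(1:2, labels = c("a", "b")))
#> [1] "a" "b"
```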

R 4.4 introduces the %||% operator to help when handling variables that are potentially NULL. When working with variables that could be NULL, you might have written code like this:

# Remember there is no 'url' field in our `config` list
# Set a default value for the 'url' if one isn't defined in
# the config
my_url = if (is.null(config$url)) {
  "https://www.jumpingrivers.com/blog/"
} else {
  config$url
}
my_url
#> [1] "https://www.jumpingrivers.com/blog/"

Assuming config is a list:

- when the url entry is absent from config (or is itself NULL), then config$url will be NULL and the variable my_url will be set to the default value;
- but when the url entry is found within config (and isn’t NULL) then that value will be stored in my_url.

That code can now be rewritten as follows:

# R 4.4.0
my_url = config$url %||% "https://www.jumpingrivers.com/blog"
my_url
#> [1] "https://www.jumpingrivers.com/blog"

Note that the left-hand value must evaluate to NULL for the right-hand side to be evaluated, and that empty vectors aren’t NULL:

# R 4.4.0
NULL %||% 1
#> [1] 1
c() %||% 1
#> [1] 1
numeric(0) %||% 1
#> numeric(0)

This operator has been available in the {rlang} package for eight years and is implemented in exactly the same way. So if you have been using %||% in your code already, the base-R version of this operator should work without any problems, though you may want to wait until you are certain all your users are using R >= 4.4 before switching from {rlang} to the base-R version of %||%.
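
If your code has to run on older versions of R as well, one option is to define the operator yourself when it is missing. This is a sketch based on the behaviour described above (return the left-hand side unless it is NULL), not a copy of any package's source:

```r
# Define `%||%` only when running on a version of R that lacks it
if (getRversion() < "4.4.0") {
  `%||%` = function(x, y) {
    if (is.null(x)) y else x
  }
}

config = list(user = "Russ")
config$url %||% "https://www.jumpingrivers.com/blog"
#> [1] "https://www.jumpingrivers.com/blog"
config$user %||% "anonymous"
#> [1] "Russ"
```

Because R evaluates function arguments lazily, the right-hand side of this fallback definition is only evaluated when the left-hand side is NULL.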

Any other business

A shorthand hexadecimal format (common in web-programming) for specifying RGB colours has been introduced. So, rather than writing the 6-digit hexcode for a colour “#112233”, you can use “#123”. This only works for those 6-digit hexcodes where the digits are repeated in pairs.
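
The expansion rule is simple: each shorthand digit is doubled. The sketch below spells that rule out with a hypothetical expand_hex() helper, so it does not depend on R 4.4's built-in support.

```r
# Hypothetical helper: expand "#RGB" shorthand to the "#RRGGBB" form
# by doubling each hexadecimal digit
expand_hex = function(x) {
  hex = "([0-9A-Fa-f])"
  pattern = paste0("^#", hex, hex, hex, "$")
  sub(pattern, "#\\1\\1\\2\\2\\3\\3", x)
}

expand_hex("#123")
#> [1] "#112233"

# Six-digit codes don't match the shorthand pattern and are left untouched
expand_hex("#A1B2C3")
#> [1] "#A1B2C3"
```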

Parsing and formatting of complex numbers has been improved. For example, as.complex("1i") now returns the complex number 0 + 1i; previously it returned NA.

There are a few other changes related to handling NULL that have been introduced in R 4.4. The changes highlight that NULL is quite different from an empty vector. Empty vectors contain nothing, whereas NULL represents nothing. For example, whereas an empty numeric vector is considered to be an atomic (unnestable) data structure, NULL is no longer atomic. Also, NCOL(NULL) (the number of columns in a matrix formed from NULL) is now 0, whereas it was formerly 1.

sort_by() is a new function for sorting objects based on values in a separate object. This can be used to sort a data.frame based on its columns (they should be specified as a formula):

mtcars |> sort_by(~ list(cyl, mpg)) |> head()
##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Volvo 142E    21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230      22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D     24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
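
On versions of R before 4.4, a similar ordering can be produced with order(). This is a sketch of a rough equivalent for this particular example, not a description of how sort_by() is implemented:

```r
# Sort by number of cylinders, then by miles-per-gallon within each group
sorted = mtcars[order(mtcars$cyl, mtcars$mpg), ]

# Show just the two sorting columns for the first six rows
head(sorted[, c("mpg", "cyl")])
#>                mpg cyl
#> Volvo 142E    21.4   4
#> Toyota Corona 21.5   4
#> Datsun 710    22.8   4
#> Merc 230      22.8   4
#> Merc 240D     24.4   4
#> Porsche 914-2 26.0   4
```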

Try the latest version out for yourself

To take away the pain of installing the latest development version of R, you can use docker. To use the devel version of R, you can use the following commands:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.4 is the released version of R and the r-docker repository has been updated, you should use the following commands to test out R 4.4.

docker pull rstudio/r-base:4.4-jammy
docker run --rm -it rstudio/r-base:4.4-jammy