readtextgrid now uses C++ (and ChatGPT helped)
R-bloggers 2025-11-15
In this post, I announce the release of version 0.2.0 of the readtextgrid R package, describe the problem that the package solves, and share some thoughts on LLM-assisted programming.
Textgrids are a way to annotate audio data
Praat is a program for speech and acoustic analysis that has been around for over 30 years. It includes a scripting language for manipulating and analyzing data and for creating annotation workflows. Users can annotate intervals or points of time in a sound file using a textgrid object. Here is a screenshot of a textgrid in Praat:
Screenshot of a Praat editor window. There are three rows in the image, all three of them sharing the same x axis (time).
- Amplitude waveform, showing intensity over time
- Spectrogram, showing how the intensity (color) at frequencies (y) changes over time. Red dots mark estimated formants (resonances) in the speech signal.
- Textgrid of text annotations for the recording
A user can edit the textgrid by adding or adjusting boundaries and adding annotations, and Praat will save this data to a .TextGrid file.
Other programs can produce .TextGrid files: the textgrid pictured here is the result of forced alignment, specifically by the Montreal Forced Aligner. I told the program I said “library tidy verse library b r m s”, and it looked up the pronunciations of those words and used an acoustic model to estimate the time intervals of each word and each speech sound. The aligner produced a .TextGrid file for this alignment.
These textgrids are the bread and butter of some of the research that we do. For example, our article on speaking/articulation rate in children involved over 30,000 single-sentence .wav files and .TextGrid files. We used the alignments to determine the duration of time spent speaking, the number of vowels in each utterance and hence the speaking rate in syllables per second.
Reading these .TextGrid files into R was cumbersome, so I wrote and released readtextgrid, an R package built around one simple function:
library(tidyverse)
library(readtextgrid)

path_tg <- "_R/data/mfa-out/library-tidyverse-library-brms.TextGrid"
data_tg <- read_textgrid(path_tg)
data_tg
#> # A tibble: 43 × 10
#>    file       tier_num tier_name tier_type tier_xmin tier_xmax  xmin  xmax text 
#>    <chr>         <int> <chr>     <chr>         <dbl>     <dbl> <dbl> <dbl> <chr>
#>  1 library-t…        1 words     Interval…         0      3.60  0     0.08 ""   
#>  2 library-t…        1 words     Interval…         0      3.60  0.08  0.74 "lib…
#>  3 library-t…        1 words     Interval…         0      3.60  0.74  1.12 "tid…
#>  4 library-t…        1 words     Interval…         0      3.60  1.12  1.58 "ver…
#>  5 library-t…        1 words     Interval…         0      3.60  1.58  1.74 ""   
#>  6 library-t…        1 words     Interval…         0      3.60  1.74  2.46 "lib…
#>  7 library-t…        1 words     Interval…         0      3.60  2.46  2.72 "b"  
#>  8 library-t…        1 words     Interval…         0      3.60  2.72  2.9  "r"  
#>  9 library-t…        1 words     Interval…         0      3.60  2.9   3.04 "m"  
#> 10 library-t…        1 words     Interval…         0      3.60  3.04  3.46 "s"  
#> # ℹ 33 more rows
#> # ℹ 1 more variable: annotation_num <int>
The function returns a tidy tibble with one row per annotation. The file name is stored as a column too so that we can lapply() over a directory of files. Annotations are numbered so that we can group_by(text, annotation_num) and have repeated words handled separately.
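To give a sense of the directory workflow, here is a minimal sketch (the folder path is hypothetical, not one from this post) of reading a batch of files and stacking them:

# Read every .TextGrid file in a (hypothetical) folder and stack the results
paths <- list.files("data/textgrids", pattern = "\\.TextGrid$", full.names = TRUE)
data_all <- paths |>
  lapply(read_textgrid) |>
  bind_rows()

# one row per annotation, with the source file kept in the `file` column
data_all |>
  count(file, tier_name)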
With this textgrid in R, I can measure speaking rate, for example:
data_tg |>
  filter(tier_name == "phones", text != "") |>
  summarise(
    speaking_time = sum(xmax - xmin),
    # vowels have numbers to indicate degree of stress
    num_vowels = sum(str_detect(text, "\\d"))
  ) |>
  mutate(
    syllables_per_sec = num_vowels / speaking_time
  )
#> # A tibble: 1 × 3
#>   speaking_time num_vowels syllables_per_sec
#>           <dbl>      <int>             <dbl>
#> 1          3.22         13              4.04
Or annotate a spectrogram:
library(tidyverse)
library(ggplot2)

path_spectrogram <- "_R/data/mfa/library-tidyverse-library-brms.csv"
data_spectrogram <- readr::read_csv(path_spectrogram)
#> Rows: 249366 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (6): y, x, power, time, frequency, db
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data_spectrogram |>
  mutate(
    # reserve more of the color variation for intensities above 15 dB
    db = ifelse(db < 15, 15, db)
  ) |>
  ggplot() +
  aes(x = time, y = frequency) +
  geom_raster(aes(fill = db)) +
  geom_text(
    aes(label = text, x = (xmin + xmax) / 2),
    data = data_tg |> filter(tier_name == "words"),
    y = 6500,
    vjust = 0
  ) +
  geom_text(
    aes(label = text, x = (xmin + xmax) / 2),
    data = data_tg |> filter(tier_name == "phones"),
    y = 6100,
    vjust = 0,
    size = 2
  ) +
  ylim(c(NA, 6600)) +
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "black") +
  guides(fill = "none") +
  labs(x = "time [s]", y = "frequency [Hz]")

Spectrogram of me saying ‘library tidyverse library brms’

I released the first version of the package in 2020. This package, notably for me, contains the first hex badge I ever made.
My original .TextGrid parser and its problem
Here is what the contents of the .TextGrid file look like. It’s not the whole file but enough to give a sense of the structure:
path_tg |>
  readLines() |>
  head(26) |>
  c("[... TRUNCATED ... ]") |>
  writeLines()
#> File type = "ooTextFile"
#> Object class = "TextGrid"
#> 
#> xmin = 0 
#> xmax = 3.596009 
#> tiers? <exists> 
#> size = 2 
#> item []: 
#>     item [1]:
#>         class = "IntervalTier" 
#>         name = "words" 
#>         xmin = 0 
#>         xmax = 3.596009 
#>         intervals: size = 11 
#>         intervals [1]:
#>             xmin = 0.0 
#>             xmax = 0.08 
#>             text = "" 
#>         intervals [2]:
#>             xmin = 0.08 
#>             xmax = 0.74 
#>             text = "library" 
#>         intervals [3]:
#>             xmin = 0.74 
#>             xmax = 1.12 
#>             text = "tidy" 
#> [... TRUNCATED ... ]

The first 7 lines provide some metadata about the time range of the audio and the number of tiers (size = 2). The file then writes out each tier (item [n] lines) by first giving the class, name, time duration and number of marks or intervals. Each mark or interval is enumerated with time values xmin, xmax and text values.
Because nearly everything here follows a key = value syntax and because sections are split from each other very neatly with item [n]: or interval [n]: lines, I was able to write a simple parser using regular expressions: Split the file into item [n] sections, split those into interval [n] sections, and extract key-value pairs.
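To give a flavor of the key = value step, here is a rough sketch (illustrative only, not the package’s actual parser) of pulling a key and value out of one long-format line:

# Illustrative only: extract a key and value from one long-format line
parse_kv <- function(line) {
  m <- regmatches(line, regexec("^\\s*(\\w+)\\s*=\\s*(.*)$", line))[[1]]
  setNames(list(trimws(m[3])), m[2])
}

parse_kv("            xmax = 0.74 ")
#> $xmax
#> [1] "0.74"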
This easy approach came with limitations. First, the TextGrid specification was much more flexible. For example, Praat also provides much less verbose “short” format textgrids which are like a stream of time and text annotations:
path_tg_short <- "_R/data/mfa-out/library-tidyverse-library-brms-short.TextGrid"
path_tg_short |>
  readLines() |>
  head(26) |>
  c("[... TRUNCATED ... ]") |>
  writeLines()
#> File type = "ooTextFile"
#> Object class = "TextGrid"
#> 
#> 0
#> 3.596009
#> <exists>
#> 2
#> "IntervalTier"
#> "words"
#> 0
#> 3.596009
#> 11
#> 0
#> 0.08
#> ""
#> 0.08
#> 0.74
#> "library"
#> 0.74
#> 1.12
#> "tidy"
#> 1.12
#> 1.58
#> "verse"
#> 1.58
#> 1.74
#> [... TRUNCATED ... ]

Everything is in the same order, but the annotations are gone. It turns out that all of the helpful labels from before were actually comments that get ignored. Everything that isn’t a number or a string in double-quotes (or a <flag>) is a comment.
There are also other quirks (" escapement, ! comments, deviations between the Praat description of the format and the behavior of praat.exe). I have them documented as a kind of unofficial specification in an article on the package website.
But my original regular-expression-based parser could only handle the verbose long-format textgrids. I knew this. I put this in a GitHub issue in 2020. And this compatibility oversight was never a problem for me until I tried a new phonetics tool that defaulted to saving the textgrids in the short format. Now, readtextgrid could not in fact “read textgrid”.
The new R-based tokenizer
Josef Fruehwald, a linguist with lots of acoustics/phonetics software, submitted a pull request to implement a proper parser that I eventually rewrote to handle various edge cases and undocumented behavior in the .TextGrid specification. I made an adversarial .TextGrid file 😈 that could still be opened by praat.exe but was meant to be difficult to parse. This was a fun development loop: Make the file harder, update the parser to handle the new feature, repeat.
Because the essential data in the file are just string tokens and number tokens, I needed to make a tokenizer: a piece of software that reads in characters, groups them into tokens, and figures out what kind of data the token represents. The initial R-based version of the tokenizer did the following:
- Read the file character by character
- Gather the characters for the current token and keep them when they form a valid string or number
- Shift between three states (in_string, in_strong_comment for ! comments, in_escaped_quote)
These three states determine how we interpret spaces, newlines, and " characters. For example, a newline ends a ! comment but a newline can appear in a string so it doesn’t end a string. Moreover, in a comment, " is ignored, but in a string, it might be the end of the string or an escaped quote (doubled double-quotes are used for " characters: the string """a""" has the text "a").
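To make the escape rule concrete, here is a tiny illustration (not the parser itself) of how doubled quotes inside a captured string collapse to a literal quote:

# The inner text of the token """a""" arrives as ""a""
raw_token <- '""a""'
gsub('""', '"', raw_token, fixed = TRUE)
#> [1] "\"a\""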
But at a high level, the code was simple:
for (i in seq_along(all_char)) {
  # { ... examine current character ... }
  # { ... handle comment state ... }
  # { ... collect token if we see whitespace and are not in a string ... }
  # { ... handle string and escaped quote state ... }
}

The new character-by-character parser worked 🎉. It had conquered the adversarial example file, but there was still one more problem. It was slower than the original regular-expression parser!
tg_lines <- readLines(path_tg)

bench::mark(
  legacy = readtextgrid:::legacy_read_textgrid_lines(tg_lines),
  new_r = readtextgrid:::r_read_textgrid_lines(tg_lines)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 legacy       75.1ms   76.1ms      13.1    6.81MB     19.7
#> 2 new_r        70.7ms   72.3ms      13.7  590.88KB     10.3
At this point, I asked ChatGPT for tips on speeding up the tokenizer.
Some thoughts about LLMs
the thing about (the current) chatgpt is that it writes like a fucking idiot with excellent grammar
— sarah jeong (@sarahjeong.bsky.social) July 6, 2025 at 7:20 PM
Now, let’s talk about large language models (LLMs). There’s a lot I could say about them.1 As a language scientist, I’ll start here: They know syntax. They know which words go together and can generate very plausible sequences of words. They do not know semantics however. They don’t have any firsthand knowledge or experience about what those sequences express. They can’t introspect about that knowledge or experience to see whether things “make sense”.2 They don’t care about the truth or falsity of statements. They just make plausible sequences of words.
Now, it turns out that if you learn how to make sequences of words from an Internet-sized corpus of text, then a lot of the plausible sequences you make will turn out to be true. If you read 10,000 cookbooks, you could probably provide a very classic recipe for scrambled eggs. But because you don’t know about sarcasm or can’t draw on your own experience of trying to not ingest non-food chemicals, you might suggest putting glue on a pizza.
So, as we use an LLM, we need to ask ourselves how much we care about the truth or care about knowing or understanding things. That may sound like a glib or weird statement: Shouldn’t we always care about the truth? Well, sometimes we don’t. We just want some syntax; we want boilerplate or templates to fill out.3 For example, I can ask an LLM to “write some unit tests for a function round_to(xs, unit) that rounds a vector of values to an arbitrary unit” and receive:
test_that("round_to() rounds to nearest multiple of unit", { expect_equal(round_to(5, 2), 6) expect_equal(round_to(4.9, 2), 4) expect_equal(round_to(5.1, 2), 6) expect_equal(round_to(c(1, 2, 3, 4), 2), c(2, 2, 4, 4))})These tests are not useful until I plug in the correct values for the expected output.
In other cases, we don’t quite care about truth or comprehension because we can get external corroboration.4 When I ask ChatGPT for an obfuscated R script to make Pac-Man in ggplot2, I can run the code to see if it works without trying to decipher its syntax:
library(ggplot2)
ggplot()+geom_polygon(aes(x,y),data=within(data.frame(t(sapply(seq(a<-pi/9,2*pi-a,l<-4e2),function(t)c(cos(t),sin(t))))),{rbind(.,0,0,cos(a),sin(a))->df;x=df[,1];y=df[,2]}),fill="#FF0",col=1)+annotate("point",x=.35,y=.5,size=3)+annotate("point",x=c(1.4,2,2.6),y=0,size=3)+coord_equal(xlim=c(-1.2,3),ylim=c(-1.2,1.2))+theme_void()
#> Error in eval(substitute(expr), e): object '.' not found

(Strangely, this is the case where a dot kills Pac-Man.)
Vibes are semantic vapor
When we abandon caring about truth or understanding things and just rely on external corroboration, we are in the realm of vibe coding. I like this term because of its insouciant honesty: Truth? Comprehension? We’re just going off the vibes. It would be a great help if we used the word more liberally. A YouTube video called “A vibe history of NES videogames”? No thanks.5
If we lean into vibes, we need to get better at external corroboration and know our programming languages even better. R is a flexible programming language and it does some things that “help” the user that can lead to silent bugs. Famously, function arguments and $ will match partial names.
# Look at the "Call:" in the outputlm(f = hp ~ cyl, d = mtcars)#> #> Call:#> lm(formula = hp ~ cyl, data = mtcars)#> #> Coefficients:#> (Intercept) cyl #> -51.05 31.96# There is no `m` columnall(mtcars$m == mtcars$mpg)#> [1] TRUE
A student I work with was trying to compute sensitivity and specificity on weighted data. The LLM suggested the following:
# Make some weighted data using frequencies
data <- pROC::aSAH |>
  count(outcome, age, name = "weight")

# What the LLM did:
pROC::roc(data, "outcome", "age", weights = data$weight)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.data.frame(data = data, response = "outcome", predictor = "age", weights = data$weight)
#> 
#> Data: age in 44 controls (outcome Good) < 30 cases (outcome Poor).
#> Area under the curve: 0.5947
This code runs without any problems. It’s wrong, but it runs. The problem is that pROC::roc(...) supports variadic arguments (...):
# Note the dots
pROC:::roc |> formals() |> str()
#> Dotted pair list of 1
#>  $ ...: symbol

pROC:::roc.data.frame |> formals() |> str()
#> Dotted pair list of 5
#>  $ data     : symbol 
#>  $ response : symbol 
#>  $ predictor: symbol 
#>  $ ret      : language c("roc", "coords", "all_coords")
#>  $ ...      : symbol

Those ... are for forwarding arguments to other functions that roc() might call internally. Unfortunately, functions by default don’t check the contents of the ... to see if they have unsupported arguments. Thus, bad arguments are ignored silently:
# method and weights are not real arguments
pROC::roc(data, "outcome", "age", method = fake, weights = fake)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.data.frame(data = data, response = "outcome", predictor = "age", method = fake, weights = fake)
#> 
#> Data: age in 44 controls (outcome Good) < 30 cases (outcome Poor).
#> Area under the curve: 0.5947
The LLM hallucinated a weights argument, which is a plausible argument,6 and the ... syntax behavior swallowed it up like Pac-Man. It always comes back to Pac-Man. I ended up writing a function that could compute sens and spec on weighted data.
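That function is more involved than this, but the core idea is simple enough to sketch (column names here are hypothetical, and this is not the function from the linked post):

# A minimal sketch of weighted sensitivity/specificity, assuming 0/1 columns
# named by `truth` and `estimate` and a frequency column named by `weight`
weighted_sens_spec <- function(data, truth, estimate, weight) {
  tp <- sum(data[[weight]][data[[truth]] == 1 & data[[estimate]] == 1])
  fn <- sum(data[[weight]][data[[truth]] == 1 & data[[estimate]] == 0])
  tn <- sum(data[[weight]][data[[truth]] == 0 & data[[estimate]] == 0])
  fp <- sum(data[[weight]][data[[truth]] == 0 & data[[estimate]] == 1])
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}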
Unfortunately the space of LLM code errors and the space of human errors are not the same, making hard-won code review instincts misfire
— Eugene Vinitsky 🍒 (@eugenevinitsky.bsky.social) November 13, 2025 at 6:21 PM
As users, we can guard against the first two silent problems with options(warnPartialMatchArgs, warnPartialMatchDollar), and as developers, we can prevent the dots problem with rlang::check_dots_used() and friends. But like I said at the outset, external corroboration requires us to know even more about the language in order to vibe safely.
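Concretely, the guards look something like this (a sketch; rlang::check_dots_used() signals a condition when arguments passed through ... are never used):

# As a user: make partial matching loud for this session
options(
  warnPartialMatchArgs = TRUE,
  warnPartialMatchDollar = TRUE
)

# As a developer: complain when arguments passed through ... go unused
my_mean <- function(x, ...) {
  rlang::check_dots_used()
  mean(x)
}
# my_mean(1:10, weights = 2) would now flag the unused `weights` argument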
Syntax and semantics, again
In this mini-position statement on LLM assistance, the two principles I am trying to develop are:
- LLMs know text distributions very well. Use them to generate starter syntax.
- LLMs don’t understand anything. It’s all bullshit and vibes.
If we think of LLMs as syntax generators, we can imagine some pretty good use cases:
- Write unit tests for a function that does…
- Set up Roxygen docs for this function
- Create a function to simulate data for a model of rt ~ group + (1 | id)
- Write a Stan program to fit this model. (Mind your priors.)
- Spoiler alert: Convert this R loop into C++ code
Still, we need to be mindful of the semantic limitations and skeptical of the output. We should audit the results and make sure we comprehend them, or admit upfront that this code is running on vibes. In either case, we also need to be vigilant about bugs that could happen silently or bugs that a machine might make but a human wouldn’t (hallucinations).
One thing I worry about with LLM reliance is skill atrophy. If I keep using this bot as a crutch, then some of my skills will get weaker. Sam Mehr has a take I quite like that puts this concern upfront. LLMs are fine for code we don’t feel bothered to learn:
re AI, a PhD student mentioned sheepishly that they used chatgpt for advice on coding up an unusual element in javascript. Almost apologized I'm like no no no you're a psych PhD, not CS, this is exactly what LLMs are for! Doing a so-so job at things you just need done & don't care about learning!
— samuel mehr (@mehr.nz) May 13, 2025 at 10:32 PM
I quite like programming and want to learn. I like to read the release notes, dig into the documentation and experiment with new modeling features. At the same time, sometimes I just want a bash script to unzip all .zip files in a directory. Time was, we would find something from Stack Overflow to adapt for that problem. Now, we ask ChatGPT for the code, look it over quick, test it and move on. That seems fine. A metacognitive awareness about what is worth learning and what problems are worth solving in a slower, methodical way is very useful for an LLM user.
Finally, to be clear—I can’t believe I need to make this disclaimer—we should always care about truth and accuracy when we write prose and publish it and put our name on it. Vibes are not scientific or scholarly. When I see emails or code documentation with immaculate formatting and perfect language, my bullshit sensor goes off and I worry that I need to read extra carefully because a smooth-talking robot is trying to pull a fast one on me. I don’t use LLMs for writing except for proofreading or requests for nitpicking. I have an instruction in ChatGPT that says not to revise anything I write unless it sneaks Magic: The Gathering card names into the output. (Alas, it generally ignores that diabolic edict of mine.)
AI assistance in readtextgrid
Because the old parser was outperforming the newer, more robust parser, I asked ChatGPT for ways to make my textgrid parsing faster. For example, one version of the loop collected characters in a vector and then paste0()-ed them together. ChatGPT suggested that because we are iterating over character indices we instead use substring() to extract tokens from the text. That worked, and it ran faster, until it failed a unit test on a character wearing a diacritic. After a few rounds of trying to improve the loop, I asked quite bluntly: “How can we move the tokenize loop into Rcpp or cpp11 with the viewest [sic] headaches possible”.
And it provided some very legible cpp11 code. I had never used C++ with R before. To get started, I had to call on usethis::use_cpp11() to make the necessary boilerplate—you just need syntax sometimes—and I had to troubleshoot the first couple versions of the function because of errors. The cpp11 documentation is small in a good way. It has examples of converting R code into C++ equivalents, which is precisely the activity that I was up to.
What I liked about the ChatGPT output is how clear the translation was. In the R version, part of the character processing loop is to peek ahead to the next character to see whether " is an escaped quote "" or the end of a string:
# ... in the character processing loop

  # Start or close string mode if we see "
  if (c_starts_string) {
    # Check for "" escapes
    peek_c <- all_char[i + 1]
    if (peek_c == "\"" & in_string) {
      in_escaped_quote <- TRUE
    } else {
      in_string <- !in_string
    }
  }

# ...

And here is the C++ version of the peek ahead code:
// ... helper functions ...

  // Is this a UTF-8 continuation byte? (10xxxxxx)
  auto is_cont = [](unsigned char b) -> bool {
    // Are the first two bits 10?
    return (b & 0xC0) == 0x80;
  };

// ... in the character processing loop ...

    if (b == 0x22) { // '"'
      // peek ahead to see if we have a double "" escapement
      size_t j = i + 1;
      // We need the next character, not just the next byte, so we skip
      // continuation characters.
      while (j < nbytes && is_cont(static_cast<unsigned char>(src[j]))) ++j;
      // Use `0x00` dummy character if we are at the end of the string
      unsigned char nextb = (j < nbytes) ? static_cast<unsigned char>(src[j]) : 0x00;
      if (in_string && nextb == 0x22) {
        esc_next = true; // consume next '"' once
      } else {
        in_string = !in_string;
      }
    }

// ...

There is a logical correspondence between the lines that I wrote myself in R and the lines that the LLM provided for C++. The C++ version works at the level of bytes instead of characters, and that matters:
"é" |> nchar(type = "chars")#> [1] 1"é" |> nchar(type = "bytes")#> [1] 2
But the C++ code makes sense to me. It looks plausible, right? Still, plausible isn’t enough. I asked the LLM a lot of follow-up questions: what does auto do, what is size_t doing, and so on. And I annotated the C++ code with comments for my own understanding.
During my auditing, I went down a particular rabbithole to make sure I understood how Unicode codepoints get packed into UTF-8 byte sequences. I learned how the character é for example has the codepoint (character number) U+00E9 in Unicode, so it falls in the range of codepoints that need to be split into two bytes. The scheme for two-byte encoding is
character number -> character encoding
codepoint -> 00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx -> UTF-8 bytes
00E9      -> 00000000 11101001 -> 11000011 10101001 -> c3 a9
Which we can check by hand:
bitchar_to_raw <- function(xs) {
  xs |>
    strsplit("") |>
    lapply(function(x) as.integer(x) |> rev() |> packBits()) |>
    unlist()
}

bitchar_to_raw(c("11000011", "10101001"))
#> [1] c3 a9
charToRaw("é")
#> [1] c3 a9

In the UTF-8 scheme, bytes that start with 10 are only the second, third and fourth bytes in a character’s encoding—that is, only the continuation bytes. Now, at this point, we can comprehend why the C++ is checking for continuation characters and why the check for continuation characters involves checking the first two bits.
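We can even replay the C++ continuation-byte test in R (a small check that mirrors the is_cont helper above):

# A byte is a continuation byte when its top two bits are 10
is_cont <- function(bytes) bitwAnd(as.integer(bytes), 0xC0) == 0x80
is_cont(charToRaw("é"))
#> [1] FALSE  TRUE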
Another rabbithole involved how to parse numbers. At first, the LLM suggested I use one of R’s own C functions to handle it. That idea seems really powerful to me—wait, now I can tap into R’s own routines?!—but R’s parser was a bit stricter than what I needed to match praat.exe.
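The exact mismatch isn’t important here, but as a made-up illustration of what “stricter” can mean: R’s numeric parser rejects a string with trailing junk outright, whereas a C-style strtod() would read the leading number and stop.

as.numeric("0.74")
#> [1] 0.74
as.numeric("0.74abc")  # R refuses; strtod() would return 0.74 and stop at "a"
#> Warning: NAs introduced by coercion
#> [1] NA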
This new C++ based tokenizer yielded a huge performance gain:
bench::mark(
  legacy = readtextgrid:::legacy_read_textgrid_lines(tg_lines),
  new_r = readtextgrid:::r_read_textgrid_lines(tg_lines),
  new_cpp = readtextgrid::read_textgrid_lines(tg_lines)
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 legacy      65.22ms  68.77ms      13.8    6.49MB     4.59
#> 2 new_r       72.67ms   88.5ms      11.5  363.33KB     5.74
#> 3 new_cpp      3.12ms   3.64ms     272.    96.77KB     4.11
That’s an improvement of 10–15x! Now, I find myself wondering: What else could use a cpp11 speed boost?
One downside of adopting cpp11 is that the package needs to compile code. As a result, I can’t just tell people to try the developer version of the package with remotes::install_github(). CRAN compiles packages so end users don’t face this issue when installing the official released version of packages.
One workaround I adopted was relying on R Universe, which will provide compiled versions of packages hosted on GitHub. Then we change the installation instructions to:
install.packages( "readtextgrid", repos = c("https://tjmahr.r-universe.dev", "https://cloud.r-project.org"))You might have seen this pattern elsewhere.cmdstanr skips CRAN entirely and onlyuses R Universe.
Parting thoughts
An LLM helped me translate pokey R code into fast C++ code. The code is live now on CRAN, released in readtextgrid 0.2.0. I’m maybe kind of a C++ developer now? (Nah.)
This kind of code translation strikes me as an easy win for R developers: “I have my version that works right now, but I think it can go faster. Help me convert this to C++.” I took care to make sure I understood the output. The syntax came easy, but the semantics (comprehension and validation) took more time.
If I ask myself, could I have done this translation to C++ without an LLM? The answer is no, not in a reasonable timeframe, certainly not as fast as the two days it took me in this case. That’s a pretty undeniable boost.
1. Things I won’t talk about: Plagiarism, safety, energy use, hype, undercooked AI features making things slower and dumber, stupid people emboldened by how trivial AI makes everything seem—we won’t need programmers or doctors or historians or whatever is what someone with no expertise in programming, medicine, history, etc. would say—dumdums tearing down fences, creativity versus productivity, aesthetic homogenization or how I keep seeing the same comic style in YouTube thumbnails, nobody asked for slop, oh they did ask for slop, etc.
2. There is something introspective about reasoning models which will break a prompt into steps and work through them. But still, I’m thinking about what the ground truth is in this reasoning. The statistical regularities of word patterns?
3. I think there is a great “tradition”—not sure of the right word here—in learning programming and other tools where we start from a starter template or maybe a small sample project and we experimentally tweak the code and iterate until it turns into the thing we want. It’s like scaffolding but at a less metaphorical level: Code that sets a foundation for self-directed learning.
4. I asked ChatGPT for help making a shopping list for a small woodworking project, and it offered a cutting plan for the lumber. Sure, why not? It messed up the math with a plan that involved cutting off 74 inches of wood from a 6-foot piece of lumber. My external corroboration in this case was a scrap of wood.
5. I am still immensely annoyed about a YouTube video that tried to tell me Abadox was a “controversial” NES game. Get out of here. Nobody talked about that game. Show me a newspaper clipping or something.
6. Let’s count functions with weights arguments in some base R packages:

get_funcs_with_weights <- function(pkg) {
  ns <- asNamespace(pkg)
  ls(ns) |>
    lapply(get, envir = ns) |>
    setNames(ls(ns)) |>
    Filter(f = is.function) |>
    lapply(formals) |>
    Filter(f = function(x) "weights" %in% names(x)) |>
    names()
}

get_funcs_with_weights("stats")
#>  [1] "density.default" "glm"             "glm.fit"         "lm"             
#>  [5] "loess"           "nls"             "ppr.default"     "ppr.formula"    
#>  [9] "predict.lm"      "predLoess"       "simpleLoess"    

get_funcs_with_weights("mgcv")
#>  [1] "bam"             "bfgs"            "deriv.check"     "deriv.check5"   
#>  [5] "efsud"           "efsudr"          "find.null.dev"   "gam"            
#>  [9] "gam.fit3"        "gam.fit4"        "gam.fit5"        "gamm"           
#> [13] "gammPQL"         "initial.spg"     "jagam"           "mgcv.find.theta"
#> [17] "mgcv.get.scale"  "newton"          "scasm"           "score.transect" 
#> [21] "simplyFit"      

get_funcs_with_weights("MASS")
#> [1] "glm.nb"      "glmmPQL"     "polr"        "rlm.default" "rlm.formula"
#> [6] "theta.md"    "theta.ml"    "theta.mm"   

get_funcs_with_weights("nlme")
#>  [1] "gls"                "gnls"               "lme"               
#>  [4] "lme.formula"        "lme.groupedData"    "lme.lmList"        
#>  [7] "nlme"               "nlme.formula"       "nlme.nlsList"      
#> [10] "plot.simulate.lme" 