Social media 1, ChatGPT 0

Pharyngula 2024-05-01

Way back in February, I made a harsh comment about ChatGPT on Mastodon.

I teach my writing class today. I’m supposed to talk about ChatGPT. Here’s what I will say. NEVER USE CHATGPT. YOU ARE HERE TO LEARN HOW TO WRITE ABOUT SCIENCE, YOU WILL NOT ACCOMPLISH THAT BY USING A GODDAMNED CRUTCH THAT WILL JUST MAKE SHIT UP TO FILL THE SPACE. WRITE. WRITE WITH YOUR BRAIN AND YOUR HANDS. DON’T ASK A DUMB CYBERMONKEY TO DO IT FOR YOU. I have strong opinions on this matter.

Nothing has changed. I still feel that way. Especially in a class that’s supposed to instruct students in writing science papers, ChatGPT is a distraction. I’m not there to help students learn how to write prompts for an AI.

But then some people just noticed my tirade here in April, and I got some belated rebuttals. Here, for instance, kjetiljd defends ChatGPT.

Wow, intense feelings. Have you ever written something, crafted a proper prompt to ask ChatGPT-4 to critique your text? Or asked it to come up with counter-arguments to your point of view? Or asked it to analyze a text in terms of eg. thesis/antithesis/synthesis? Or suggest improvements in readability? You know … done … (semi-)scientific … experiments with it? With carefully crafted prompts my hypothesis is that it can be used to improve both writing and thinking…

Maybe? The flaw in that argument is that ChatGPT will happily make stuff up, so the foundation of its output is on shaky ground. So I said I preferred good sources. I didn’t mention that part of this class was teaching students how to do research using the scientific literature, which makes ChatGPT a cheat to get around learning how to use a library.

I prefer to look up counter-arguments in the scientific literature, rather than consulting a demonstrable bullshit artist, no matter how much it is dressed up in technology.

kjetiljd’s reply is to tell me I should change the focus of my class to be about how to use large language models.

And if I were a student I would probably prefer advice on the use of LLMs from a scientific writing teacher who seemed to have some experience in the field, or at least seemed to … how should I say this … have looked up counter-arguments from the scientific literature …?

I guess I’m just ignorant then. Unfortunately, this class is taught by a group of faculty here, and I had a pile of sources about using ChatGPT as a writing aid, that were included in course’s Canvas page. I didn’t find them convincing.

Sure, I’ve looked at the counter-arguments. They all seem rather self-serving, or more commonly, non-existent.

So kjetiljd hands me some more sources. Ugh.

Here are a few more or less random papers on the topic – they exist, are they all self-serving? https://www.semanticscholar.org/paper/ChatGPT-4-and-Human-Researchers-Are-Equal-in-A-Sikander-Baker/66dcd18c0f48a14815edca1d715fa8be8909cca6 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10164801/ https://www.semanticscholar.org/paper/Chat

I read the first one, and was unimpressed. They trained ChatGPT on a small set of review articles and then asked it to write a similar review, and then had some people judge on whether it was similar in content and style. Is ChatGPT a dumb cybermonkey? This article says yes.

I was about done at this point, so I just snidely pointed out that scientists scorn papers written by AIs.

Don’t get caught!

https://retractionwatch.com/papers-and-peer-reviews-with-evidence-of-chatgpt-writing/

I was done, but others weren’t. Chaucerburnt analyzed the three articles kjetiljd suggested. They did not fare well.

The first paper describes a trial where researchers took 18 recent human-written articles, got GPT-4 to write alternate introductions to them, and then got eight reviewers to read and rate these introductions.

Some obvious points:

– 18 pairs of articles is not a lot. With only a small number of trials, there’s a significant risk that an inferior method will win a “best of 18” over a superior method by pure luck. – 8 reviewers, likewise, is not a very large number. Important here is that the reviewers were recruited “by convenience sampling in our research network” – that is, not a random sample, but people who were already contacts of the authors. This risks getting a biased set of reviewers whose preferences are likely to coincide with the researchers’. – The samples were reviewed on dimensions of “publishability” (roughly, whether the findings reported are important and novel), “readability”, and “content quality” (here apparently meaning whether they had too much detail, not enough, or just right.)

What’s missing here?

None of the assessment criteria have anything to do with *accuracy*. There’s no fact-checking to evaluate whether the introduction has any connection to reality.

Under the criteria used here, GPT could probably get excellent “publishability” scores by claiming to have a cure for cancer. It could improve “readability” by replacing complex truths with over-simple falsehoods.

And it could improve “content quality” by inventing false details or deleting important true ones in order to get just the right amount of detail, since apparently “quality” doesn’t depend on whether the details are *true*, only on how many there are.

The reviewers weren’t even asked to read the rest of the article and evaluate whether the introduction accurately represented the content.

I daresay the human authors could’ve scored a lot higher on these metrics if they weren’t constrained by the expectation that their content should be truthful – something which this comparison doesn’t reward.

They also note “We removed references from the original articles as GPT-4’s output does not automatically include references, and also since this was beyond the scope of this study.” Because, again, truthfulness is not part of the assessment here.

(FWIW, when I tried similar experiments with an earlier version of GPT, I found it was very happy to include references – I merely had to put something like “including references” in the prompt. The problem was that these references were almost invariably false, citing papers that never existed or which didn’t say what GPT claimed they said.)

I concur, and that was my impression, too. The AI written version was not assessed for originality or accuracy, but only on superficial criteria of plausibility. AI is very good at generating plausible sounding text.

Chaucerburnt went on to look over the other two articles, which I hadn’t bothered to read.

The second article linked – which feels very much like it was itself written by GPT – makes a great many assertions about the ways in which GPT “can help” scientists in writing papers, but is very light on evidence to support that it’s good at these things, or that the time it saves in some areas is greater than the time required to fact-check.

It acknowledges plagiarism as a risk, and then offers suggestions on how to mitigate this: “When using AI-generated text, scientists should properly attribute any sources used in the text. This includes properly citing any direct quotations or paraphrased information”… – this seems more like general advice for human authors than relevant to AI-generated text, where the big problem is *not knowing* when the LLM is quoting/paraphrasing somebody else’s work.

It promotes the use of AI to improve grammar and structure – but the article itself has major structural issues. For instance, it has a subsection on “the risk of plagiarism” followed by “how to avoid the risk of plagiarism”.

But most of the content in “the risk of plagiarism” is in fact stuff that belongs in the “how to avoid” section.

Some of it is repeated between sections – e.g. each of those sections has a paragraph advising authors to use plagiarism-detection software, and another on citing sources.

On the grammatical side, it has a bunch of errors, e.g.:

“AI tools like ChatGPT is capable of…”

“The risk of plagiarism when use AI to write review articles”

“Use ChatGPT to write review article need human oversight”

“Conclusion remarks”

“Are you tired of being criticized by the reviewers and editors on your English writings for not using the standard English, and suggest you to ask a native English speaker to help proofreading or even use the service from a professional English editor?”

(Later on, it contradicts that by noting that “AI-generated text usually requires further editing and formatting…Human oversight is necessary to ensure that the final product meets the necessary requirements and standards.”)

If that paper is indeed written by GPT, it’s a good example of why not to use GPT to write papers.

The third aricle gets the same treatment.

The last of the three papers you linked is a review of other people’s publications about ChatGPT. It’s more of a summary of what other people are saying for and against GPT’s use than an assessment of which of these perspectives are well-informed.

(Of 60 documents included in the study, only 4 are categorised as “research articles”. The most common categories are non-peer-reviewed preprints and editorials/letters to the editor.)

It does note that 58 out of 60 documents expressed concerns about GPT, and states that despite its perceived benefits, “the embrace of this AI chatbot should be conducted with extreme caution considering its potential limitations.”

Not exactly an enthusiastic recommendation for GPT adoption.

Going a step further, Chaucerburnt reassures me that my role in the class is unchallenged.

I’ve seen people use AI for critique, and my impression is that it does more harm than good there.

If a human reviewer tells me that my sentences are too long and complex, there’s a very high probability that they’re saying this because it’s true, at least for them.

If an AI “reviewer” tells me that my sentences are too long and complex, it’s saying it because this is something it’s seen people say in response to critique requests and it’s trying to sound like a human would. Is it actually true, even at the level that a human reviewer’s subjective opinion is true? No way to know.

Beyond that, a lot of it comes down to Barnum statements: https://medium.com/@herbert.roitblat/this-way-to-the-egress-barnum-effect-or-language-understanding-in-gpt-type-models-597c27094f35

Many authors can benefit from generic advice like “consider your target audience”, but we don’t need to waste CPU cycles to give them that.

This term I had a couple of student papers here at the end that would not have benefited from ChatGPT at all. Once a student gets on a roll, you’ll sometimes get sections that go on at length — they’re trying to summarize a concept, and the answer is to keep writing until every possible angle is covered. The role of the editor is to say, “Enough. Cut, cut, cut — try to be more succinct!” I’ve got one term paper that is an ugly mess at 30 pages, but has good content that would make it an “A” paper at 20 pages. ChatGPT doesn’t do that. It can’t do that, because its mission is to generate glurge that mimics other papers, and there’s nobody behind it that understands the content.

Anyway, sometimes social media comes through and you get a bunch of humans writing interesting stuff on both sides of an argument. I’d hate to see how ugly social media could get if AIs were chatting, instead.