THE

Antarctica Starts Here. 2014-08-14

Email yesterday from Bill Benzon:

Here's a blog post about a little bit of linguistic detail in a VERY interesting book: Matthew Jockers, Macroanalysis: Digital Methods & Literary History.

Do you have any thoughts on that detail?

The post in question is "Reading Macroanalysis 4: On the matter of 'the'", New Savanna 8/13/2014, and the "detail" in question is a cited difference in the frequency of the word the between a collection of 19th-century British novels and a comparable collection of 19th-century American novels:

Chapter 7, “Nationality” is pretty straightforward. I don’t have much to say about it except for a puzzle that Jockers presents at the beginning. He points out that, because British and American writers have different practices concerning the word the, that word is about 5 percent of the word tokens in his corpus of 19th Century British novels, while it is about 6 percent of the tokens in the American novels.

My first thought is to point Bill towards the work of Jamie Pennebaker and other social psychologists, who have consistently found surprisingly large effects of many factors on rates of function-word use.

Thus Michael Cohn, Matthias Mehl, and James Pennebaker, "Linguistic Markers of Psychological Change Surrounding September 11, 2001", Psychological Science 2004:

When people are writing with high psychological distance (compared with low psychological distance), they use longer words and more articles, and avoid present tense and first-person singular.

Or Matthias Mehl & James Pennebaker, "The Sounds of Social Life: A Psychometric Analysis of Students’ Daily Social Environments and Natural Conversations", Journal of Personality and Social Psychology 2003:

The natural conversations and social environments of 52 undergraduates were tracked across two 2-day periods separated by 4 weeks using a computerized tape recorder (the Electronically Activated Recorder [EAR]). The EAR was programmed to record 30-s snippets of ambient sounds approximately every 12 min during participants’ waking hours. Students’ social environments and use of language in their natural conversations were mapped in terms of base rates and temporal stability.

[...]

Consistent with previous research (Pennebaker & King, 1999), men used significantly more big words (words more than six letters long; Mmale = 9.4% vs. Mfemale = 8.3%, p < .05), more articles (Mmale = 4.4% vs. Mfemale = 3.5%, p < .01), fewer first-person singular pronouns (Mmale = 6.2% vs. Mfemale = 7.5%, p < .01), and fewer discrepancy words (Mmale = 2.0% vs. Mfemale = 2.5%, p < .05) than women.

In a similar vein, there's e.g. Duyen Nguyen & Susan Fussell, "Lexical Cues of Interaction Involvement in Dyadic Instant Messaging Conversations", Discourse Processes 2014:

In Study 1, an experiment with 60 participants, we manipulated level of involvement in a conversation with a distraction task. We examined how participants' uses of verbal cues such as pronouns were associated with their involvement in text-only IM conversations. We found that use of personal pronouns, assent words, cognitive words, and definite articles were significant indicators of a participant's involvement.

Specifically, the proportion of definite articles (from their Table 4):

                      High Involvement Condition    Low Involvement Condition
Mean                  6.02                          4.70
Standard deviation    4.87                          3.55
95% C.I.              [5.14, 6.90]                  [4.05, 5.34]

We see similarly lawful patterns (though with different base rates) if we look at the influence of sex and age on the average rate of "the" usage in the Fisher conversational telephone speech transcripts:

So we might ask whether Jockers' collections of 19th-century British and American novels are balanced for the authors' sex and age, or for the percentage of dialogue, or for the amount of real-time narration vs. discussion of less immediate things (like the extended passages of natural history in Moby Dick). But then again, maybe there's just some geographical variation, along with everything else.

Note that the reasons for variation in THE frequency are surely various: use of definite descriptions as opposed to pronouns or names ("the parson" vs. "he" vs. "Mr. Samuels"); presence or absence of modifiers ("the door" vs. "the tavern door"); general phrasal choice ("the place she came from" vs. "her home town").
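To make the effect of phrasal choice concrete, here is a minimal Python sketch that computes the share of word tokens equal to "the". The regex tokenizer is a crude assumption of mine, not the tokenization used by Jockers or in the Fisher counts, and the two test sentences are hypothetical, stitched together from the alternatives just listed.

```python
import re

def the_rate(text):
    """Return the fraction of word tokens equal to 'the' (case-insensitive)."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count("the") / len(tokens)

# Hypothetical sentences saying roughly the same thing in two ways.
variants = {
    "definite descriptions": "the parson walked to the place she came from",
    "pronoun and possessive": "he walked to her home town",
}

for label, sentence in variants.items():
    print(f"{label}: {the_rate(sentence):.1%}")
```

Two toy sentences prove nothing about novels, of course, but they show how a steady preference for one kind of phrasing over another could shift the overall rate by a percentage point without any change in what is being said.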

Thus part of the reason for the somewhat more frequent use of THE by male speakers in the Fisher transcripts is probably the somewhat more frequent use by female speakers of e.g. possessive pronouns:

         Males     Females
my       0.461%    0.650%
your     0.211%    0.215%
her      0.062%    0.113%
his      0.058%    0.070%
our      0.079%    0.105%
their    0.132%    0.145%
TOTAL    1.00%     1.30%
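
As a quick arithmetic check on those figures, the following Python sketch sums the six possessive-pronoun rates for each sex and reports the gap between the totals. The numbers are copied from the figures above; the variable names are mine.

```python
# Possessive-pronoun rates (percent of word tokens) from the figures above,
# stored as (males, females).
rates = {
    "my":    (0.461, 0.650),
    "your":  (0.211, 0.215),
    "her":   (0.062, 0.113),
    "his":   (0.058, 0.070),
    "our":   (0.079, 0.105),
    "their": (0.132, 0.145),
}

male_total = sum(m for m, _ in rates.values())
female_total = sum(f for _, f in rates.values())
gap = female_total - male_total  # in percentage points

print(f"males {male_total:.2f}%  females {female_total:.2f}%  gap {gap:.3f} points")
```

The column sums round to the reported totals, and the female-minus-male gap comes to about 0.3 percentage points, which is the amount this particular set of alternatives could contribute to a male-female difference in THE frequency.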