Obama's favored (and disfavored) SOTU words
Lingua Franca 2014-01-29
Lane asked: "It would be great if someone had time to find some truly Obama signature phrases, doing the math properly. I'd be curious to know what words he actually does use unusually often."
I have two classes to prepare for today, and a student study break to get ready for (bread and cheese, fruits and nuts, chips and dips, cakes and candies etc., but mostly cleaning up the living room…). So I don't have time to work on the "truly signature phrases" problem — that's a hard problem to solve on the basis of a sample as small as a few years of SOTU messages, anyhow. But there's one thing that I do have time for: calculating the words (or rather, the lexical tokens) that are characteristic of Obama's SOTU messages in contrast to the other post-war SOTUs, against the background of all SOTUs since 1790.
To do this, I used the "weighted log-odds-ratio, informative Dirichlet prior" algorithm described on pp. 387-8 of Monroe, Colaresi & Quinn "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict", Political Analysis 2009. (Tip of the hat to Dan Jurafsky, who told me about this algorithm a couple of years ago.)
The basic idea here is that we have two "lexical histograms" (i.e. word-count lists), taken from two sources X and Y whose patterns of usage we want to contrast. If we just compare naively estimated rates of usage, we're going to end up with a bunch of unreliable comparisons between small counts, say comparing a word that X uses once and Y doesn't use at all, or vice versa. We want to take account of the likely sampling error in our counts, discounting differences that are probably just an accident, and enhancing differences that are genuinely unexpected given the null hypothesis that both X and Y are making random selections from the same vocabulary.
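To make the problem concrete, here is a minimal Python sketch of the naive comparison. The function names are made up for illustration. The X total (41,508) is the Obama token count given in this post; the Y total (roughly 379,389) is not stated anywhere and is back-computed from the per-million figures in the tables, so treat it as an estimate.

```python
import math

def per_million(count, total):
    """Naively estimated usage rate, in occurrences per million tokens."""
    return 1e6 * count / total

def naive_log_ratio(x_w, n_x, y_w, n_y):
    """Log of the ratio of raw usage rates for one word, X versus Y.

    With no smoothing, a zero count on either side gives an infinite score.
    """
    if x_w == 0 and y_w == 0:
        return 0.0
    if y_w == 0:
        return math.inf
    if x_w == 0:
        return -math.inf
    return math.log((x_w / n_x) / (y_w / n_y))

N_X = 41_508    # Obama SOTU tokens (from the post)
N_Y = 379_389   # other post-war SOTU tokens (estimate, see above)

# A word that each source uses exactly once (hypothetical counts) ...
rare = naive_log_ratio(1, N_X, 1, N_Y)
# ... compared with "jobs": 151 uses by X, 289 by Y.
jobs = naive_log_ratio(151, N_X, 289, N_Y)
# A word X uses once and Y never uses (hypothetical counts).
once = naive_log_ratio(1, N_X, 0, N_Y)

print(rare, jobs, once)
```

Under these assumptions the word used once by each source gets a larger naive score than "jobs", simply because X is the smaller sample, and the word that Y never uses gets an infinite score. Neither result deserves much trust, which is the motivation for taking sampling error into account.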
There are many different ways of approaching this problem. Monroe et al. survey several different methods, and the one that I've used is (in my opinion as well as Dan's) a nice balance between effectiveness and ease of application.
In my implementation of the algorithm, you prime the pump with three lexical histograms: source X, source Y, and some relevant background source Z. Then if you give the program a word, it determines a score (that "weighted log-odds ratio"), where positive values mean that the word is favored by source X, 0 means that the word is neutral between X and Y, and negative values mean that the word is favored by source Y.
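The scoring step can be sketched roughly as follows. This is not my actual program, just a minimal Python version of the Monroe et al. z-score: the difference in log odds between X and Y, with Dirichlet pseudo-counts proportional to the background Z, divided by its approximate standard deviation. The function name, the `prior_total` parameter and its default of 1000 are assumptions for illustration, and the Y and Z totals are back-computed from the per-million figures in the tables rather than stated in the post.

```python
import math

def weighted_log_odds(x_w, n_x, y_w, n_y, z_w, n_z, prior_total=1000.0):
    """z-score of the log-odds ratio for one word, X versus Y.

    x_w, y_w, z_w: counts of the word in X, Y and the background Z.
    n_x, n_y, n_z: total token counts of X, Y and Z.
    prior_total:   total Dirichlet pseudo-count (an assumed value),
                   spread over words in proportion to their rate in Z.
    """
    if z_w <= 0:
        raise ValueError("word must occur in the background source Z")
    a_w = prior_total * z_w / n_z   # pseudo-count for this word
    a_0 = prior_total               # pseudo-count summed over all words
    log_odds_x = math.log((x_w + a_w) / (n_x + a_0 - x_w - a_w))
    log_odds_y = math.log((y_w + a_w) / (n_y + a_0 - y_w - a_w))
    delta = log_odds_x - log_odds_y
    variance = 1.0 / (x_w + a_w) + 1.0 / (y_w + a_w)
    return delta / math.sqrt(variance)

# Totals: N_X is from the post; N_Y and N_Z are estimates (see above).
N_X, N_Y, N_Z = 41_508, 379_389, 1_578_070

# (X count, Y count, Z count) for a few words from the tables.
rows = {
    "jobs":  (151, 289, 439),
    "why":   (85, 75, 222),
    "must":  (53, 1583, 2843),
    "peace": (8, 670, 1736),
}

scores = {w: weighted_log_odds(x, N_X, y, N_Y, z, N_Z)
          for w, (x, y, z) in rows.items()}

# Positive scores favor X (Obama), negative scores favor Y.
for w in sorted(scores, key=scores.get, reverse=True):
    print(f"{w:8s} {scores[w]:8.3f}")
```

The signs should agree with the lists in this post, but the exact values will not, since they depend on the tokenization and on how heavily the prior is weighted, and neither detail is spelled out here. A larger `prior_total` pulls both sources toward the background rates and shrinks the scores for rare words.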
In this case, I took X as Barack Obama's five SOTU addresses so far (41,508 "words", as per my tokenization), Y as the SOTU addresses of all other presidents since WWII (Truman through George W. Bush), and Z as all SOTU addresses since 1790. I then fed in all the words in SOTU addresses since Truman, Obama included, and sorted the results according to the weighted log-odds ratio. Here's the positive end of the list (i.e. tokens favored by Obama). Each line presents:
WORD XCount (XPerMillion) YCount (YPerMillion) ZCount (ZPerMillion) SCORE
's 471 (11347.2) 1573 (4146.14) 2529 (1602.59) 16.939
jobs 151 (3637.85) 289 (761.751) 439 (278.188) 14.242
why 85 (2047.8) 75 (197.686) 222 (140.678) 13.623
businesses 70 (1686.42) 75 (197.686) 140 (88.7161) 12.093
that 912 (21971.7) 4629 (12201.2) 19589 (12413.3) 11.936
get 98 (2360.99) 171 (450.725) 300 (190.106) 11.873
i'm 61 (1469.6) 79 (208.23) 136 (86.1813) 10.653
don't 56 (1349.14) 82 (216.137) 137 (86.815) 9.752
can't 44 (1060.04) 54 (142.334) 102 (64.636) 9.161
like 83 (1999.61) 199 (524.528) 572 (362.469) 8.541
we'll 42 (1011.85) 61 (160.785) 98 (62.1013) 8.498
innovation 24 (578.202) 13 (34.2656) 40 (25.3475) 7.982
republicans 27 (650.477) 24 (63.2596) 49 (31.0506) 7.862
kids 28 (674.569) 29 (76.4387) 50 (31.6843) 7.761
college 41 (987.761) 70 (184.507) 128 (81.1118) 7.730
because 114 (2746.46) 399 (1051.69) 870 (551.307) 7.634
what 128 (3083.74) 462 (1217.75) 1282 (812.386) 7.547
companies 33 (795.027) 37 (97.5252) 167 (105.826) 7.471
we've 64 (1541.87) 175 (461.268) 251 (159.055) 7.438
democrats 25 (602.294) 24 (63.2596) 47 (29.7833) 7.434
Here's the other end of the list, i.e. the words that Obama has tended to use significantly less than other postwar presidents (according to this algorithm):
the 1840 (44328.8) 23017 (60668.6) 133003 (84282.2) -8.955
of 1022 (24621.8) 13939 (36740.7) 86737 (54964) -8.133
must 53 (1276.86) 1583 (4172.5) 2843 (1801.57) -7.308
in 651 (15683.7) 8433 (22227.8) 33474 (21212) -6.237
peace 8 (192.734) 670 (1766) 1736 (1100.08) -5.699
program 16 (385.468) 618 (1628.93) 730 (462.591) -5.244
federal 22 (530.018) 737 (1942.6) 1315 (833.297) -5.216
freedom 8 (192.734) 472 (1244.11) 691 (437.877) -4.922
which 18 (433.651) 1072 (2825.6) 11029 (6988.93) -4.732
economic 21 (505.927) 614 (1618.39) 852 (539.901) -4.655
billion 9 (216.826) 425 (1120.22) 462 (292.763) -4.603
nations 16 (385.468) 601 (1584.13) 1651 (1046.22) -4.560
world 82 (1975.52) 1369 (3608.43) 2239 (1418.82) -4.508
free 17 (409.56) 554 (1460.24) 1166 (738.878) -4.366
national 17 (409.56) 566 (1491.87) 1837 (1164.08) -4.104
programs 16 (385.468) 440 (1159.76) 470 (297.833) -3.937
hope 7 (168.642) 324 (854.005) 769 (487.305) -3.638
be 175 (4216.05) 2499 (6586.91) 16149 (10233.4) -3.585
war 29 (698.66) 652 (1718.55) 2567 (1626.67) -3.449
provide 8 (192.734) 295 (777.566) 641 (406.193) -3.308
policy 8 (192.734) 333 (877.727) 1199 (759.79) -3.305
The whole list is here.
I don't have time to discuss the results further right now, but it's clear that we're looking at a mixture of effects that are stylistic ('s vs. of, the ongoing decline of which and the, contractions vs. uncontracted forms, …), effects that are rhetorical (why vs. must, …), and effects that are topical (jobs vs. peace, …).