Not-so-conservative science on ‘ultraconservative’ words

Fully (sic) 2013-05-09

The Fully (sic) team writes…

We have a pretty good idea of what humans were up to 15,000 years ago. We know that Homo sapiens was the last surviving branch of the human family tree after the extinction of Homo floresiensis. The crazy idea of agriculture was about to take off, and a place that would come to be known as ‘America’ was possibly being settled. But while we know about the migrations and lives of our ancestors 15 millennia ago, we don’t know anything about the language they spoke – beyond guessing that they had one (there were likely very many already). Spoken language is a fragile thing, changing and evolving across the generations.

A paper published this week by an international collaboration of scientists aims to peer into the murky depths of linguistic prehistory. They look at 7 language families that range from the western edges of Europe to the Inuit languages of North America. They took a list of 200 basic words, on the rationale that these are less likely to change over time than less basic or less frequently used words. They then took existing reconstructed etymologies of what these words might have looked like at the earliest points for each of these language families (which is, itself, controversial), and tested whether the resulting cognates could be used to work out the words at even greater time-depth.

The process of using known language data to work out earlier language is a methodology that has been around since the 1780s, when William Jones noticed that Greek, Latin and Sanskrit were all too similar for it to be coincidence. This is how we have come to understand the relationships between various languages in Europe, and to figure out where groups have moved over time. For example, we can tell from language similarities that Romany (the language of European Gypsies) originated in India. Most historical linguists don’t believe reconstructions of this type are reliable beyond about 6,000 years of time depth. This is because words change fast enough that after 6,000-10,000 years there are too few words in common to reliably distinguish similarities due to shared ancestry from chance resemblances.
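To get a feel for why the time-depth limit bites, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that every word on a 200-word list has the same constant chance of surviving each millennium in each language; the figure of 0.8 is our own placeholder, not a number from the paper, and real words certainly don't all change at one rate (the paper's whole argument is that some change much more slowly than others).

```python
def expected_shared_cognates(list_size, retention_per_millennium, millennia):
    """Expected number of meanings where two sister languages BOTH still
    use the word inherited from their common ancestor."""
    # Each language keeps a given word for t millennia with probability r**t.
    # The two lineages change independently, so both keep it with r**(2*t).
    both_keep = retention_per_millennium ** (2 * millennia)
    return list_size * both_keep


if __name__ == "__main__":
    # 0.8 is an illustrative assumption, not a measured retention rate.
    for t in (2, 6, 10, 15):
        n = expected_shared_cognates(200, 0.8, t)
        print(f"{t:>2} thousand years apart: about {n:.1f} of 200 words still shared")
```

Under these assumptions the count of genuinely shared words falls away quickly with time, which is the point at which a handful of real cognates becomes hard to tell apart from a handful of lucky look-alikes.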

So how have they gone back so far in this paper? Although they looked at 7 language families, many of the words only appeared in four or fewer of them; in fact, each cognate was found in an average of just 2.3 families. So this is only a partial picture, at best. Also, they chose to only look at language families that were next door to each other. It would have been good to throw in another language family that was more distant (such as the Trans-New Guinea family), to see whether similar links could be found between language families that are known to be entirely unrelated – i.e., whether the links they are finding could be coincidental. This would have provided some basis for evaluating the degree of support for their tree. And while they do make some effort to control for this, the underlying etymologies of the words are so questionable that the whole analysis already rests on a shaky basis.
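The logic of that kind of control can be sketched in a few lines. This is not the paper's method, and it swaps the "add an unrelated family" idea for a simpler stand-in, a shuffle test: scramble which word goes with which meaning, so that any real pairing is destroyed, and see how many matches turn up anyway. The word forms are invented for illustration and are not real reconstructions, and the "same first letter" test is deliberately crude.

```python
import random


def count_matches(words_a, words_b):
    """Count meanings whose two forms start with the same letter
    (a deliberately crude stand-in for a real cognate judgement)."""
    return sum(a[0] == b[0] for a, b in zip(words_a, words_b))


def chance_baseline(words_a, words_b, trials=1000, seed=0):
    """Average match count after shuffling list B against list A,
    which breaks any genuine meaning-by-meaning pairing."""
    rng = random.Random(seed)
    shuffled = list(words_b)
    total = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += count_matches(words_a, shuffled)
    return total / trials


# Invented forms for illustration only; NOT real reconstructions.
family_a = ["mana", "tiku", "pale", "kuna", "sira", "nobu"]
family_b = ["mego", "tasi", "puro", "lemi", "kato", "nuri"]

observed = count_matches(family_a, family_b)
baseline = chance_baseline(family_a, family_b)
print(f"observed matches: {observed}, chance baseline: {baseline:.2f}")
```

If the observed count sits well above the shuffled baseline, coincidence is a less likely explanation; if the two are close, the "links" may be nothing more than chance. Either way, the test is only as good as the word forms fed into it.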

There is a much more in-depth discussion from Sally Thomason over on Language Log about the limitations of the data, and the problems of this kind of analysis, and readers who aren’t daunted by some of the more technical aspects of the discussion are encouraged to read it.

This isn’t the first time that a group of scientists has tried applying statistical methods to historical language data, and it probably won’t be the last. While linguists acknowledge that we can’t reliably reconstruct anything over about 6,000 years old, the seduction of trying to find the ‘ur-language’ – the postulated first human language – will always bring new challengers, and an eager media.