Does big data equal big problems?

Fully (sic) 2016-05-31

We have all been there: the busy after-work pub drinks or the mega-loud nightclub where you can’t hear yourself think, let alone talk. It feels like no language other than sign language could ever have evolved in a place like that. Thankfully, pubs and clubs are not the only places we find ourselves in. But what if our everyday environment were different? Different enough to consistently impose certain acoustic pressures on how we hear sounds? Ian Maddieson and Christophe Coupé presented intriguing results at the Acoustical Society of America conference last week showing a statistical link between ecology (places with denser vegetation or higher temperatures) and sound systems (lower-frequency, that is, more “sonorous”, sounds). These results are sure to cause controversy in the field of linguistics, but controversy in itself need not be a bad thing.

The inference drawn from this study is that languages spoken in areas whose environmental features block the accurate transmission of higher-frequency sounds may adapt to use lower-frequency sounds instead, making communication easier. The data was carefully combed and put together by a team of well-respected researchers, but the real question is: can we really speak of a causal relationship here, or is this a mere statistical coincidence?

Remember, kids! Correlation ≠ causation! / Image credit: Tyler Vigen

“Big data” collections such as the inventory of phonemes used in this study are an amazingly rich resource, but like many good tools out there, they also bring a host of risks. In particular, there is the risk of finding coincidental correlations (the “inverse sample size” problem), and with it the temptation to see relationships where there are none (see this collection for a fun but useful illustration). Before attempting novel and daring analyses, researchers need to already know a lot about what they are testing, namely, which factors to include and which to exclude. This is not easy or straightforward, but it is vital. For example, a nice illustration of how fumbling in the dark leads to silly results can be found here: looking at the numbers alone, it turns out that languages spoken by people who take afternoon naps (siestas) tend to have simpler verb forms. Well, errrr, actually, if you control for the fact that some of these languages are genetically related (i.e., that they are descended from a common ancestor language), then the correlation disappears. Bam!
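For the curious, here is a minimal sketch of what “controlling for relatedness” can look like in practice. Everything in it is an assumption made for illustration: the two families, the siesta values and the verb-complexity scores are invented toy numbers, not data from the study mentioned above, and subtracting family averages is only the simplest possible stand-in for the more sophisticated methods (such as mixed-effects models) that real analyses tend to use.

```python
from collections import defaultdict
from math import sqrt

# Invented toy data (NOT real linguistic data):
# (language family, takes siesta: 1/0, verb complexity score)
LANGUAGES = [
    ("A", 1, 2), ("A", 1, 3), ("A", 1, 4), ("A", 0, 3),
    ("B", 0, 7), ("B", 0, 8), ("B", 0, 9), ("B", 1, 8),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def centre_within_family(rows):
    """Subtract each family's average from its members' values,
    so only differences *within* a family remain."""
    families = defaultdict(list)
    for family, siesta, complexity in rows:
        families[family].append((siesta, complexity))
    xs, ys = [], []
    for members in families.values():
        mean_s = sum(s for s, _ in members) / len(members)
        mean_c = sum(c for _, c in members) / len(members)
        for s, c in members:
            xs.append(s - mean_s)
            ys.append(c - mean_c)
    return xs, ys


# Treating every language as an independent data point.
raw_r = pearson([s for _, s, _ in LANGUAGES], [c for _, _, c in LANGUAGES])

# Comparing languages only with their own relatives.
controlled_r = pearson(*centre_within_family(LANGUAGES))

print(f"ignoring families:       r = {raw_r:.2f}")
print(f"controlling for family:  r = {controlled_r:.2f}")
```

In this toy set the siesta languages look like they have simpler verbs only because family A happens to be both nap-friendly and verb-light. Once each language is compared with its own relatives, nothing is left of the pattern.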

But not every language study using “big data” is necessarily a house of cards. Those that hold up can steer future research in valuable ways and bring about new leaps of thought. For example, previous work has demonstrated insightful statistical links between population size and morphological complexity (languages spoken by smaller groups tend to make use of more complex morphology, that is, words with more complicated internal structure, more prefixes, suffixes, etc.), between genes and tonal languages (speakers of tonal languages share certain genetic traits), and also between tonal languages and climate (tonal languages are not found in arid climates).

So with summer coming up, how might your own sounds change in different climates? (See here for a fun account of how Larry King’s might.)

Andreea S. Calude lectures in linguistics at the University of Waikato (New Zealand), loves grammar and studying languages through real-life use (corpus linguistics), and thinks “no word is an island”.

The post Does big data equal big problems? appeared first on Fully (sic).