Move Over, BabelFish: Computer Program Reconstructs Lost Tongues

What if a computer could produce never-before-seen lost languages from their modern descendants in a fraction of the time it takes linguistic experts?

Juan Naharro Gimenez / Getty Images

A replica of the Rosetta Stone is displayed as part of the 'Treasures of the World's Cultures' exhibition at Centro Exposiciones Arte Canal on January 12, 2010 in Madrid, Spain.

You say puh-tay-tow, I say puh-tah-tow, but how did our ancestors use language millennia ago? Typically you’d ask a linguist, but manually reconstructing protolanguages — hypothetical early languages from which extant ones evolved — can be a lengthy, arduous process. What if you could get reasonably close in a fraction of the time by using a computer?

Researchers in Canada and California have done exactly that by designing software that takes rules about how the sounds of a language change over time and essentially reverse-engineers the process to recreate the rudiments of lost root languages, a sort of linguistic time machine-meets-Rosetta stone.


The idea that language changes over time is obvious enough on contemporary time-scales — just look at dialects. Today some people say “axed” (phonetically) instead of “asked” while others say “howdy” instead of “hello.” When I lived in Denton, Texas, folks said “y’all” instead of “you all,” and the colloquialism “ain’t” (instead of “am not” or “is not”) — the bane of English language formalists everywhere — was widespread by the 18th century.

(Conversely, the historian J.M. Roberts writes in The History of the World that a word like “alcohol” survives more or less in its original form from Sumerian, a language spoken in southern Mesopotamia since the 4th millennium B.C.; so, says Roberts, does the world’s first recipe for beer.)

But tracing back the origins of dialect changes is kid’s stuff compared to constructing entire progenitor languages that existed prior to the earliest extant ones. All we have are the descendant languages and ideas about how sounds change over time: kind of like playing Clue with half the game board and only some of the cards. Compare the features of two or more languages with common ancestors — a process known as the comparative method — and scholars would argue you can get close, but it can be a painstaking process.

Imagine instead taking over 600 existing languages spoken in Asia and the Pacific — precisely what these researchers did — and feeding them to a computer that quickly and accurately reconstructed likely protolanguages from which the modern cognates evolved. In this case, the computer program scanned a database of over 140,000 words, from which it managed to construct a protolanguage the researchers believe may have been spoken around 7,000 years ago. How accurate are we talking? According to the researchers: “Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages.”
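To make that "within one character" yardstick concrete, here is a minimal sketch of how such a score could be computed. It assumes "one character" means a single insertion, deletion or substitution (the standard Levenshtein edit distance), which the quote does not spell out. The function names and the word pairs are invented placeholders for illustration, not the researchers' code or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def share_within_one(pairs):
    """Fraction of (automatic, manual) pairs at edit distance 0 or 1."""
    hits = sum(1 for auto, manual in pairs if edit_distance(auto, manual) <= 1)
    return hits / len(pairs)


# Hypothetical (automatic, manual) reconstruction pairs, made up for this sketch.
pairs = [("kato", "kato"), ("pilu", "pelu"), ("sana", "sanak"), ("tomi", "ramit")]
print(share_within_one(pairs))  # 0.75: three of the four pairs differ by at most one character
```

On this toy list, three of the four pairs land within one edit of each other, so the score is 75 percent; the researchers report a figure above 85 percent on their own data.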

That’s kind of remarkable, even if Alex Bouchard-Côté, one of the researchers and co-author of the related paper “Automated reconstruction of ancient languages using probabilistic models of sound change” published in the journal Proceedings of the National Academy of Sciences, admits the algorithm is still “doing a basic job right now” (via TechNewsDaily).

Outside geeky academic circles and obscure scholarly journals populated with articles no one generally reads, what’s the practical purpose of reconstituting a dead language?

For starters, knowing how languages changed over time can help us better organize history, say the order in which key events happened. According to Bouchard-Côté, for instance, we might be able to refine our understanding of how Europe was settled: “If you can figure out if the language of the settling population had a word for wheel, then you can get some idea of the order in which things occurred, because you would have some records that show you when the wheel was invented.”

“It’s very time consuming for humans to look at all the data,” says fellow researcher and paper author Dan Klein (via BBC). “There are thousands of languages in the world, with thousands of words each, not to mention all of those languages’ ancestors. It would take hundreds of lifetimes to pore over all those languages, cross-referencing all the different changes that happened across such an expanse of space – and of time. But this is where computers shine.”

But okay, here’s the question we’re all burning to ask: What happens if you take Vulcans and Romulans (two Star Trek races with their own constructed languages and common ancestry) and feed their languages to this program?



One of the examples above is in error. In no way is "axed" the phonetic equivalent of "asked". "Axed" is just bad English!

It is as bad as pronouncing "mischievous" as "miss chee vee us", when it should be pronounced "miss chiv us".


That's not correct. "Axed" is bad "formal English". Evolutions like this involving transpositions of consonants or replacements of one consonant by another are NORMAL in the history of language. And indeed that is the whole point of the software project described in this article. Would you call Spanish and Italian just "bad Latin!"? Teaching the norms of "formal English" is exceedingly important, and no student should be taught that it's "okay" to speak only in dialect (including using such dialect forms as "axed"). This is not because "formal English" is intrinsically superior as a language, but rather because it is so universal globally that no one should be crippled by being unable to speak it, read it, and write it properly. But formal English does evolve, albeit more slowly than dialects of English, and it will continue to evolve. 


1) I would love to see a comparison between this program and the PIE language postulates after Bopp

2) Errrm.
This is very basic, but it's VULCANS and Romulans who share an ancestry. Klingons are not in their line. That's an embarrassing error. You should not use exemplary language when you are not familiar with the examples used.

mattpeckham (moderator)

@JeffGedney Yep, brain in park when I typed that. Fixed!



Thank you, I am sorry to have been severe about it. I just realized I probably was maladroit about it, and came off a tad unfriendly.
Otherwise a fun article and a concept by which I have long been fascinated. Thanks.


How far back can they push this process? Do they think they might be able to construct a proto-proto-language from several other proto-languages?


Linguists can do this work far more accurately than this software. The software's advantage is that it can process the data from hundreds of obscure languages. The software's value is not quality, but quantity (also consistency, which is maybe more interesting). There are somewhere between 5000 and 8000 extant languages depending on how you count them. That's a lot of work for the small body of historical linguists in the world. But the derived proto-languages number around 100 to 400. Comparing the proto-languages is obviously a much more manageable task. And yes, people do this, but it's a signal-to-noise problem. Eventually, 10,000 or 20,000 years of linguistic evolution will erase all elements of ancestral languages. There may be a time horizon in the history of human language beyond which we cannot see...  



I wonder how that would compare against Mel Brooks's reconstruction of Ur-Languages in his 2000 Year Old Man sketches...