Researchers from SOAS University of London and the Max Planck Institute for the Science of Human History have published a new paper in the renowned international journal for historical linguistics, Diachronica. Their paper describes an experiment that illustrates how the classical method for the reconstruction of unattested languages can also be used to predict hitherto undocumented words in poorly described and endangered languages of India.

For a long time, historical linguists have used the comparative method to reconstruct earlier states of languages that are not attested in written sources. The method consists of the detailed comparison of words in related descendant languages and allows linguists to infer, in great detail, the ancient pronunciation of words that were never recorded in any form. It has long been known that the method can also be used to infer how an undocumented word in a given language would sound, provided that at least some information on that language and on related languages is available, but this had never been explicitly tested.

In the article, the two researchers describe the results of an experiment in which they applied the traditional comparative method to explicitly predict the pronunciation of words in eight Western Kho-Bwa linguistic varieties spoken in India. These varieties belong to the Trans-Himalayan family (also known as the Sino-Tibetan or Tibeto-Burman language family). They have not yet been described in much detail, and many of their words had not yet been documented in fieldwork. The scholars started their experiment with an existing etymological dataset of Western Kho-Bwa varieties that was collected during fieldwork in the Indian state of Arunachal Pradesh between 2012 and 2017. Within the dataset, the authors observed multiple gaps in which the word forms for certain concepts were missing [1].

The researchers set up a computer-assisted workflow to predict the missing word forms. The classical methods are traditionally applied manually, but the new computational solutions helped the scholars to increase the efficiency and reliability of the process. All results were later checked and refined by hand. To increase the transparency and validity of the experiment, they then registered their predictions online [2].
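The basic idea behind predicting a missing word form from sound correspondences can be illustrated with a minimal sketch. This is not the authors' workflow: the variety names A, B and C, the word forms, and the functions `learn_correspondences` and `predict_reflex` are all invented for illustration, and the simple vote-counting scheme is an assumption chosen for brevity. The sketch counts which sound in the target variety lines up with each sound in the related varieties, then uses those counts to fill in a form that is missing from one cognate set.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: aligned cognate sets, one segment list per variety.
# Variety names and word forms are invented for illustration only.
COGNATE_SETS = [
    {"A": ["p", "a", "n"], "B": ["b", "a", "n"], "C": ["p", "o", "n"]},
    {"A": ["p", "i", "t"], "B": ["b", "i", "t"], "C": ["p", "e", "t"]},
    {"A": ["t", "a", "k"], "B": ["d", "a", "k"], "C": ["t", "o", "k"]},
    {"A": ["k", "i", "n"], "B": ["g", "i", "n"]},  # the form in C is missing
]


def learn_correspondences(cognate_sets, target):
    """Count how often sound s in variety v aligns with sound t in the target."""
    counts = defaultdict(Counter)
    for cogset in cognate_sets:
        if target not in cogset:
            continue  # nothing to learn from sets where the target is missing
        for variety, segments in cogset.items():
            if variety == target:
                continue
            for s, t in zip(segments, cogset[target]):
                counts[(variety, s)][t] += 1
    return counts


def predict_reflex(cogset, target, counts):
    """Predict the target form position by position by pooling votes."""
    attested = {v: segs for v, segs in cogset.items() if v != target}
    length = len(next(iter(attested.values())))
    predicted = []
    for i in range(length):
        votes = Counter()
        for variety, segments in attested.items():
            # Unseen (variety, sound) pairs simply contribute no votes.
            votes.update(counts.get((variety, segments[i]), {}))
        predicted.append(votes.most_common(1)[0][0] if votes else "?")
    return predicted


counts = learn_correspondences(COGNATE_SETS, "C")
print(predict_reflex(COGNATE_SETS[3], "C", counts))  # ['k', 'e', 'n']
```

In this toy example the sound "g" in variety B has never been seen next to a form in C, so the prediction for the first position rests on variety A alone. A position for which no variety offers any evidence is marked with "?" rather than guessed.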


  • 1. “When conducting fieldwork, it is inevitable that you miss out on some words. It’s kind of annoying when you observe that afterwards, but in this case, we realized that this was the perfect opportunity to test how well the methods for linguistic reconstruction actually work,” says Tim Bodt, first author of the study.
  • 2. “Registration is incredibly important in many scientific fields because it ensures that researchers adhere to good scientific practice, but as far as we know it has never been done in historical linguistics,” says Johann-Mattis List, who carried out the computational analyses of the study. “By registering our predictions online, we made sure we could no longer modify our predictions in light of the results we obtained during our subsequent verification process,” Bodt adds.

Timotheus A. Bodt, Johann-Mattis List. Reflex prediction. Diachronica, 2021; DOI: 10.1075/dia.20009.bod

While analysing lexical data of Western Kho-Bwa languages of the Sino-Tibetan or Trans-Himalayan family with the help of a computer-assisted approach for historical language comparison, we observed gaps in the data where one or more varieties lacked forms for certain concepts. We employed a new workflow, combining manual and automated steps, to predict the most likely phonetic realisations of the missing forms in our data, by making systematic use of the information on sound correspondences in words that were potentially cognate with the missing forms. This procedure yielded a list of hypothetical reflexes of previously identified cognate sets, which we first preregistered as an experiment on the prediction of unattested word forms and then compared with actual word forms elicited during secondary fieldwork. In this study we first describe the workflow which we used to predict hypothetical reflexes and the process of elicitation of actual word forms during fieldwork. We then present the results of our reflex prediction experiment. Based on this experiment, we identify four general benefits of reflex prediction in historical language comparison. These comprise (1) an increased transparency of linguistic research, (2) an increased efficiency of field and source work, (3) an educational aspect which offers teachers and learners a wide plethora of linguistic phenomena, including the regularity of sound change, and (4) the possibility of kindling speakers’ interest in their own linguistic heritage.

Keywords: preregistered research, reflex prediction, Western Kho-Bwa, prediction, word prediction, regularity of sound change, comparative method, computer-assisted language comparison