A mathematically constructed tree of languages

20 July 2004

Yes, as far as I understand, it rests on the same approach as glottochronology, namely comparing vocabularies, but the mathematics is far more serious: a set of powerful, well-tested methods developed in evolutionary biology. The analysis covers 87 languages with 2449 lexical items each. All of this makes it possible to get around many of glottochronology's bottlenecks. I am not a specialist, so I can only speak about what is in the paper itself.

Recent advances in computational phylogenetic methods, however, provide possible solutions to the four main problems faced by glottochronology. First, the problem of information loss that comes from converting discrete characters into distances can be overcome by analysing the discrete characters themselves to find the optimal tree(s). Second, the accuracy of tree topology and branch-length estimation can be improved by using models of evolution. Maximum-likelihood methods generally outperform distance and parsimony approaches in situations where there are unequal rates of change [14]. Moreover, uncertainty in the estimation of tree topology, branch lengths and parameters of the evolutionary model can be estimated using bayesian Markov chain Monte Carlo [16] (MCMC) methods in which the frequency distribution of the sample approximates the posterior probability distribution of the trees [17]. All subsequent analyses can then incorporate this uncertainty. Third, lexical items that are obvious borrowings can be removed from the analysis, and computational methods such as split decomposition [18], which do not force the data to fit a tree model, can be used to check for non-tree-like signals in the data. Finally, the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model rate variation across the tree. The penalized-likelihood [19] model allows rate variation between lineages while incorporating a 'roughness penalty' that penalizes changes in rate from branch to branch. This smoothes inferred rate variation across the tree so that the age of any node can be estimated even under conditions of rate heterogeneity.
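To make the 'roughness penalty' from the quoted passage a bit more concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a simplified penalty, the sum of squared rate differences between each branch and its parent branch, and it treats the log-likelihood as a number supplied from elsewhere. The toy tree, the branch names, the rates and the `smoothing` weight are all made up for illustration.

```python
# Simplified illustration of a penalized-likelihood objective.
# Assumption: the penalty is the sum of squared differences between the rate
# on a branch and the rate on its parent branch. Real implementations differ
# in detail (for example, in how branches at the root are handled).

# Hypothetical toy tree, described branch by branch:
# each branch maps to its parent branch (None = attached directly to the root).
PARENT = {
    "AB": None,  # branch leading to the ancestor of A and B
    "A": "AB",
    "B": "AB",
    "C": None,
}


def roughness_penalty(parent, rate):
    """Sum of squared rate changes between each branch and its parent branch."""
    total = 0.0
    for branch, par in parent.items():
        if par is None:
            continue  # branches at the root have no parent branch to compare with
        total += (rate[branch] - rate[par]) ** 2
    return total


def penalized_objective(log_likelihood, parent, rate, smoothing):
    """Log-likelihood minus the weighted roughness penalty (to be maximized)."""
    return log_likelihood - smoothing * roughness_penalty(parent, rate)


# Made-up rates: one clock-like assignment, one that jumps between branches.
clock_like = {"AB": 1.0, "A": 1.0, "B": 1.0, "C": 1.0}
jumpy = {"AB": 1.0, "A": 2.0, "B": 0.5, "C": 1.0}

print(roughness_penalty(PARENT, clock_like))
print(roughness_penalty(PARENT, jumpy))
print(penalized_objective(-10.0, PARENT, jumpy, smoothing=2.0))
```

With a large `smoothing` weight, rate assignments that jump between neighbouring branches are penalized heavily and the fit is pushed toward a strict clock. With a small weight, each branch can take nearly its own rate.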

