Comparative Analysis of Phylogenic Algorithms V. Bayrasheva, R. Faskhutdinov, V. Solovyev Kazan University, Russia. - презентация
Презентация была опубликована
Презентация на тему: " Comparative Analysis of Phylogenic Algorithms V. Bayrasheva, R. Faskhutdinov, V. Solovyev Kazan University, Russia." — Транскрипт:
Comparative Analysis of Phylogenic Algorithms V. Bayrasheva, R. Faskhutdinov, V. Solovyev Kazan University, Russia
Phylogenetic Algorithms Applications in linguistics: trees of language evolution Main algorithms: NJ, UPGMA. They construct evolution trees on the base of matrices of the distances between languages Today NJ is the most popular algorithm Which algorithm is better? Which features are better for using?
Problems - I To build a realistic model of language evolution tree To explore the dependence of the result from the tree structure
Comparing UPGMA and NJ for Sumbanese (Downey et al. 2008)
Model of language evolution tree I consider one of the most complete described trees - tree of language evolution for Turkic family. The ages of all languages were defined by glottochronology method. [Dybo, 2002]. Lengths of all the edges in the tree are calculated and ordered from minimum to maximum.
Lengths of edges of the evolution tree for Turkic languages Y-axis – lengths of the edges
Lengths of edges of the evolution tree for Turkic languages The lengths of most edges – from the second to the 66th – lie on the direct line on the diagram, i.e. edges lengths are random values uniformly distributed on the interval (90, 650). The lengths of these edges are from 90 to 650 years. The average value of the edge length (without 10 longest ones) is 370 years. The declination of the edge length from the average value does not exceed 280 years that equals p = 0.75 of the average length. Similar results are obtained for language families from completely different regions. For Mande – 0.81, for Mayan – 0.73.
Tree structure (typology) balanced unbalanced (distorted)
Measure of distortion The root takes the 0 level and the level of an ancestor is bigger than the level of a descendant by 1. The sum of nodes levels is called a measure of distortion for the tree. It is obvious that the closer a tree is to a complete tree, the smaller is the measure of its distortion.
Algorithms for NJ and UPGMA comparison 1.A random binary tree T with a given number of leaves r is generated, and all the edges are of equal length 1. 2.The measure of distortion for the tree T is calculated. 3.In the generated tree the length of each edge is changed by a random number from interval [– p, + p], where p is a constant smaller or equal to 1. 4.The matrix of distances between leaves is constructed by the data. 5.Trees T-UPGMA and T-NJ are constructed by the distance matrix by methods UPGMA and NJ. 6.The Robinson-Foulds distance between the obtained trees and tree T is calculated.
Experiment 1000 random trees with 15 leaves have been generated and the results have been averaged All the trees were divided on 4 groups by the measure of their distortion with the values of distortion: 31-36, 37-40, 41-45, For each group the averaged Robinson- Foulds distances were calculated
NJ and UPGMA comparison: data Measure of distortion R-F-distance UPGMA R-F-distance NJ ,315, ,415, ,116, ,047,43
Summary for Problem -I For trees with a small measure of distortion (which are close to a complete tree) better results are provided by UPGMA algorithm. NJ is better for unbalanced (distorted) trees Both real examples and modeling by generation of random trees shows that UPGMA is preferable in a number of cases. NJ algorithm is not undoubtedly the best one. In the different situations it is necessary to use the different algorithms
Problem -II To explore the dependence of the result from the choice of features To explore how feature stability influents the result
Data Grammar database Jazyki mira (Languages of the World): 317 languages, 3821 features 503 most informative features (that are found at least in 25 languages but no more than in 300 languages) were selected
Measure of stability 4 measures: Maslovas (2004), Nicholss (1995), Wichmann & Holmans (2009), Solovyevs (2009) Solovyevs measure is the number of feature changing during language group evolution. It can be obtained by Maximal Parsimony algorithm
Experiment For every measure the features were divided into four approximately equal by number of features groups, from the maximum to the minimum degree of stability. For every feature group we constructed evolution trees by NJ algorithm. The Robinson-Foulds distances were calculated between consensus tree and the trees constructed by NJ for all stability groups
Results Stability measure/ Group Number Group 1Group 2Group 3Group 4 Maslovas measure Nicholss measure Wichmanns measure Solovyevs measure Robinson-Foulds distances
Summary for Problem - II Best trees are constructed in the fourth group (with the lowest degree of stability) for all stability measures The obtained result can be explained by the fact that the features from the first three groups are less informative