Comparative Analysis of Phylogenic Algorithms V. Bayrasheva, R. Faskhutdinov, V. Solovyev Kazan University, Russia.

Презентация:



Advertisements
Похожие презентации
Convolutional Codes Mohammad Hanaysheh Mahdi Barhoush.
Advertisements

Business Statistics 1-1 Chapter Two Describing Data: Frequency Distributions and Graphic Presentation GOALS When you have completed this chapter, you will.
Time-Series Analysis and Forecasting – Part IV To read at home.
How can we measure distances in open space. Distances in open space.
Time-Series Analysis and Forecasting Lecture on the 5 th of October.
A new interface model for the Jazyki Mira typological database Oleg Belyaev The research is supported by RFBR grant ( а.
Combination. In mathematics a combination is a way of selecting several things out of a larger group, where (unlike permutations) order does not matter.
Education in the United Kingdom. The education system in the UK is divided into four main parts, primary education, secondary education, further education.
In mathematics, the notion of permutation is used with several slightly different meanings, all related to the act of permuting (rearranging) objects.
SPLAY TREE The basic idea of the splay tree is that every time a node is accessed, it is pushed to the root by a series of tree rotations. This series.
Minimum spanning trees. Minimum Connector Algorithms Kruskals algorithm 1.Select the shortest edge in a network 2.Select the next shortest edge which.
Chap 11-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 11 Hypothesis Testing II Statistics for Business and Economics.
AVL-Trees COMP171 Fall AVL Trees / Slide 2 Balanced binary tree * The disadvantage of a binary search tree is that its height can be as large as.
Here are multiplication tables written in a code. The tables are not in the correct order. Find the digit, represented by each letter.
Ideal Family Were prepared by Iryna Molokova and Ilona Synytsia.
Schrodingers Equation for Three Dimensions. QM in Three Dimensions The one dimensional case was good for illustrating basic features such as quantization.
Chap 9-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 9 Estimation: Additional Topics Statistics for Business and Economics.
Chap 15-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 15 Nonparametric Statistics Statistics for Business and Economics.
Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chap 1-1 Chapter 1 Why Study Statistics? Statistics for Business and Economics.
Chap 7-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 7 Sampling and Sampling Distributions Statistics for Business.
Транксрипт:

Comparative Analysis of Phylogenic Algorithms V. Bayrasheva, R. Faskhutdinov, V. Solovyev Kazan University, Russia

Phylogenetic Algorithms Applications in linguistics: trees of language evolution Main algorithms: NJ, UPGMA. They construct evolution trees on the base of matrices of the distances between languages Today NJ is the most popular algorithm Which algorithm is better? Which features are better for using?

NJ (Saitou, Nei, 1987 ) a = 0.01, c = 0.07

Problems - I To build a realistic model of language evolution tree To explore the dependence of the result from the tree structure

Comparing UPGMA and NJ for Sumbanese (Downey et al. 2008)

Model of language evolution tree I consider one of the most complete described trees - tree of language evolution for Turkic family. The ages of all languages were defined by glottochronology method. [Dybo, 2002]. Lengths of all the edges in the tree are calculated and ordered from minimum to maximum.

Lengths of edges of the evolution tree for Turkic languages Y-axis – lengths of the edges

Lengths of edges of the evolution tree for Turkic languages The lengths of most edges – from the second to the 66th – lie on the direct line on the diagram, i.e. edges lengths are random values uniformly distributed on the interval (90, 650). The lengths of these edges are from 90 to 650 years. The average value of the edge length (without 10 longest ones) is 370 years. The declination of the edge length from the average value does not exceed 280 years that equals p = 0.75 of the average length. Similar results are obtained for language families from completely different regions. For Mande – 0.81, for Mayan – 0.73.

Tree structure (typology) balanced unbalanced (distorted)

Measure of distortion The root takes the 0 level and the level of an ancestor is bigger than the level of a descendant by 1. The sum of nodes levels is called a measure of distortion for the tree. It is obvious that the closer a tree is to a complete tree, the smaller is the measure of its distortion.

Algorithms for NJ and UPGMA comparison 1.A random binary tree T with a given number of leaves r is generated, and all the edges are of equal length 1. 2.The measure of distortion for the tree T is calculated. 3.In the generated tree the length of each edge is changed by a random number from interval [– p, + p], where p is a constant smaller or equal to 1. 4.The matrix of distances between leaves is constructed by the data. 5.Trees T-UPGMA and T-NJ are constructed by the distance matrix by methods UPGMA and NJ. 6.The Robinson-Foulds distance between the obtained trees and tree T is calculated.

Experiment 1000 random trees with 15 leaves have been generated and the results have been averaged All the trees were divided on 4 groups by the measure of their distortion with the values of distortion: 31-36, 37-40, 41-45, For each group the averaged Robinson- Foulds distances were calculated

NJ and UPGMA comparison: data Measure of distortion R-F-distance UPGMA R-F-distance NJ ,315, ,415, ,116, ,047,43

Summary for Problem -I For trees with a small measure of distortion (which are close to a complete tree) better results are provided by UPGMA algorithm. NJ is better for unbalanced (distorted) trees Both real examples and modeling by generation of random trees shows that UPGMA is preferable in a number of cases. NJ algorithm is not undoubtedly the best one. In the different situations it is necessary to use the different algorithms

Problem -II To explore the dependence of the result from the choice of features To explore how feature stability influents the result

Data Grammar database Jazyki mira (Languages of the World): 317 languages, 3821 features 503 most informative features (that are found at least in 25 languages but no more than in 300 languages) were selected

Measure of stability 4 measures: Maslovas (2004), Nicholss (1995), Wichmann & Holmans (2009), Solovyevs (2009) Solovyevs measure is the number of feature changing during language group evolution. It can be obtained by Maximal Parsimony algorithm

Maximal Parsimony

Experiment For every measure the features were divided into four approximately equal by number of features groups, from the maximum to the minimum degree of stability. For every feature group we constructed evolution trees by NJ algorithm. The Robinson-Foulds distances were calculated between consensus tree and the trees constructed by NJ for all stability groups

Results Stability measure/ Group Number Group 1Group 2Group 3Group 4 Maslovas measure Nicholss measure Wichmanns measure Solovyevs measure Robinson-Foulds distances

Summary for Problem - II Best trees are constructed in the fourth group (with the lowest degree of stability) for all stability measures The obtained result can be explained by the fact that the features from the first three groups are less informative

Thank you for attention!