Gaussian Processes in Machine Learning
Dmitry P. Vetrov
Dorodnicyn Computing Centre of the Russian Academy of Sciences

Our research group
My colleague: Dmitry Kropotov, PhD student of MSU
Students: Nikita Ptashko, Oleg Vasiliev, Pavel Tolpegin, Igor Tolstov, Oleg Kurchin

Overview
- Kernel selection problem
- Introduction to random processes
- GP regression
- Covariance function selection
- Classification tasks
- Open problems

Kernel selection
Kernel methods are the state-of-the-art methodology for data-mining tasks. The best known of them are:
- SVM (Support Vector Machines)
- Logistic regression
- RVM (Relevance Vector Machines)
- Kernel Fisher discriminant
- RBF neural networks
The performance of all such methods depends significantly on the choice of a proper kernel. The problem of kernel selection is one of the most intriguing problems in machine learning. To tell the truth, nobody knows how to select good kernels…

Random processes I
A random process is a set of random values indexed by a one-dimensional parameter (or a multi-dimensional one in the case of random fields):
$f(t), \quad t \in \mathbb{R} \ (\text{or } t \in \mathbb{R}^d).$

Random processes II
If we fix the index $t$, we get a single random value $f(t)$. If we fix the random component $\omega$, we get a single function of the index variable, $f(\cdot\,;\omega)$. Random processes thus have a dual nature: they may be treated as random functions, and they have both functional and probabilistic features.

Random processes III
The main characteristics of a random process are:
- Mean function: $m(t) = \mathbb{E}\,f(t)$
- Covariance function: $K(t, s) = \mathbb{E}\big[(f(t) - m(t))(f(s) - m(s))\big]$
The covariance function is symmetric and non-negatively defined. Looks like a kernel function in SVM, doesn't it?
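As an illustration (not from the original slides; the RBF form and all variable names here are assumptions), a minimal Python sketch of a covariance function and a numerical check of the two properties just mentioned:

```python
import numpy as np

# A hypothetical RBF covariance function K(t, s) = exp(-(t - s)^2 / (2 ell^2)).
def rbf_cov(t, s, ell=1.0):
    return np.exp(-(t - s) ** 2 / (2.0 * ell ** 2))

t = np.linspace(0.0, 5.0, 50)
K = rbf_cov(t[:, None], t[None, :])        # 50 x 50 Gram matrix

# The two properties from the slide: symmetry and non-negative definiteness
# (all eigenvalues >= 0 up to numerical round-off).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10
```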

Stationary processes
Stationary processes have constant characteristics which do not depend on the concrete value of the index variable: the mean is constant and the covariance depends only on the shift, $K(t, s) = K(t - s)$. In this case we may assume that the process has zero mean. Most of random-process theory is developed for stationary processes.

Gaussian processes
A Gaussian process (GP) is a process all of whose marginal distributions are Gaussian: for any points $t_1, \dots, t_N$ the vector $(f(t_1), \dots, f(t_N))$ is jointly normal. A GP is uniquely defined by its mean and covariance functions. Without loss of generality we may assume that the GP is centered at zero, $m(t) \equiv 0$.

Examples of GPs
[Figure: sample paths of GPs with RBF covariance functions of different widths]
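The figure itself is not preserved in the transcript. The following sketch (the width values and all names are assumptions) draws samples of the same kind by factorizing the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 200)

# Draw zero-mean GP sample paths with RBF covariances of different widths ell.
for ell in (0.1, 0.5, 2.0):
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))   # jitter for stability
    path = L @ rng.standard_normal(len(t))              # one draw from N(0, K)
    print(f"ell = {ell}: std of increments = {np.diff(path).std():.3f}")
```

Smaller widths give rough, rapidly varying paths; larger widths give smooth, slowly varying ones.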

Observation of GPs
What can we tell about the value of a GP at a point $t$? In the absence of any additional information, only this: $f(t) \sim \mathcal{N}(0, K(t, t))$. But what if we knew the values of the GP at some points $t_1, \dots, t_N$?

Conditional distribution of GP
We may use the known values $f = (f(t_1), \dots, f(t_N))^{\top}$ as regression inputs and predict the most likely value of $f(t)$ at a new point $t$. We may even find its marginal distribution: since the joint distribution is Gaussian,
$f(t) \mid f \;\sim\; \mathcal{N}\big(k^{\top} K^{-1} f,\; K(t, t) - k^{\top} K^{-1} k\big),$
where $K_{ij} = K(t_i, t_j)$ and $k_i = K(t, t_i)$.

Parameters of prediction
Note that the prediction expression is explicit and does not require any optimization. We may rewrite the prediction in the following way:
$\hat{f}(t) = k^{\top} K^{-1} f = \sum_{i=1}^{N} w_i\, K(t, t_i), \qquad w = K^{-1} f,$
i.e. as a linear combination of covariance functions centered at the training points.
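A minimal numeric sketch of these formulas (the toy data, the RBF kernel, and all names are assumptions, not from the slides):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * ell ** 2))

t_train = np.array([0.0, 1.0, 2.5, 4.0])    # observed index points
f_train = np.sin(t_train)                   # observed values of the process
t_test = np.linspace(0.0, 5.0, 100)

K = rbf(t_train, t_train) + 1e-10 * np.eye(len(t_train))  # jitter
k_star = rbf(t_test, t_train)                             # k_i = K(t, t_i)

w = np.linalg.solve(K, f_train)   # explicit weights w = K^{-1} f, no optimization
mean = k_star @ w                 # posterior mean: sum_i w_i K(t, t_i)
# Posterior variance: K(t, t) - k^T K^{-1} k  (K(t, t) = 1 for this RBF kernel).
var = 1.0 - np.einsum('ij,ji->i', k_star, np.linalg.solve(K, k_star.T))
```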

Covariance function selection
In the absence of any prior preferences we may use the traditional maximum likelihood principle for adjusting a proper covariance function, e.g. for a parameterized RBF family $K_{\theta}(t, s) = \theta_1 \exp\!\big(-(t - s)^2 / \theta_2\big)$ the parameters can be selected as
$\hat{\theta} = \arg\max_{\theta}\, p(f \mid \theta) = \arg\max_{\theta}\, \mathcal{N}(f \mid 0, K_{\theta}).$
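A sketch of this maximization in Python (the log-parameterization, the optimizer choice, and the toy data are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

t = np.array([0.0, 1.0, 2.5, 4.0])
f = np.sin(t)

def neg_log_likelihood(log_theta):
    amp, ell2 = np.exp(log_theta)           # log-parameterization keeps theta > 0
    K = amp * np.exp(-(t[:, None] - t[None, :]) ** 2 / ell2)
    K += 1e-8 * np.eye(len(t))              # jitter
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))
    # -log N(f | 0, K) up to an additive constant
    return 0.5 * f @ alpha + np.log(np.diag(L)).sum()

res = minimize(neg_log_likelihood, x0=np.zeros(2), method='L-BFGS-B')
theta_hat = np.exp(res.x)                   # maximum-likelihood parameters
```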

Illustration to CF selection

Connection to Bayesian inference
Assume we are given noised measurements $y = (y_1, \dots, y_N)$ of the process to be predicted; its true values are $f = (f(t_1), \dots, f(t_N))$. The larger the difference between $y$ and $f$, the less likely the measurements are, e.g. $p(y \mid f) = \mathcal{N}(y \mid f, \sigma^2 I)$. We want to select a covariance function for which the total weight of likely process realizations is the largest:
$p(y \mid \theta) = \int p(y \mid f)\, p(f \mid \theta)\, df.$
On one hand, this expression can be treated as the evidence of the model, which is a popular means of model selection among Bayesians. On the other hand, it can be shown that it is exactly the likelihood of the covariance function corresponding to the GP $f$.
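Under the Gaussian noise model this integral has a closed form; the standard derivation step (not spelled out in the transcript) gives:

```latex
% Evidence of the model / likelihood of the covariance function:
p(y \mid \theta)
  = \int \mathcal{N}(y \mid f, \sigma^2 I)\,
         \mathcal{N}(f \mid 0, K_\theta)\, df
  = \mathcal{N}\!\left(y \mid 0,\; K_\theta + \sigma^2 I\right),

\log p(y \mid \theta)
  = -\tfrac{1}{2}\, y^{\top} (K_\theta + \sigma^2 I)^{-1} y
    \;-\; \tfrac{1}{2} \log\det\!\left(K_\theta + \sigma^2 I\right)
    \;-\; \tfrac{N}{2} \log 2\pi .
```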

Classification problem I
It seems that GPs can't be applied to classification tasks because the outputs $y$ should be discrete, either $+1$ or $-1$. Even if we decide to switch to the value of the discriminant function rather than to its sign, it is still unclear how to train the GP, as we do not know the values of the discriminant function even for the training sample – only its sign.

Classification problem II
A latent GP $f(t)$ plays the role of the discriminant function, and the labels are linked to it probabilistically, e.g. through a sigmoid: $p(y = +1 \mid f(t)) = 1 / (1 + e^{-f(t)})$.

Classification problem III
There are no explicit equations for GP classifiers' predictions. We may, however, calculate the values of the GP at the points of the training sample… …and with their use estimate the kernel likelihood.
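For illustration only (the slides do not prescribe an implementation): scikit-learn's GaussianProcessClassifier follows this same recipe, placing a latent GP over the discriminant function, approximating its posterior with the Laplace approximation, and tuning the kernel by the (approximate) marginal likelihood:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.where(np.sin(X[:, 0]) > 0, 1, -1)      # discrete labels, +1 or -1

# Latent GP + logistic link; kernel parameters are selected by maximizing
# the approximate marginal likelihood during fit().
clf = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
clf.fit(X, y)
print(clf.predict_proba(np.array([[0.5]])))   # class probabilities at a new point
```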

Interesting observation
[Diagram: polynomial RVM, stationary RBF GP, logistic regression, RBF RVM, and other kernel approaches are all contained in the family of non-stationary GPs with rich parameterized covariance functions]

Overfitting
Maximization of the kernel likelihood (or model evidence, in the alternative Bayesian notation) does not fit the right answers directly but operates in more abstract terms, such as the adequacy of the GP to the observed data. If the data are noisy, then the best GP will be close to a white-noise process. It may seem that such an approach avoids overfitting, but…

Underfitting I
If there are many parameters in the covariance function to be adjusted, their choice via maximum kernel likelihood leads to underfitting, when significant regularities in the data are treated as noise. Example: RVM, which is a non-stationary GP with covariance function
$K(x, x') = \sum_{i=1}^{M} \frac{1}{\alpha_i}\, \phi_i(x)\, \phi_i(x'),$
with the weight precisions $\alpha_i$ and the noise variance $\sigma^2$ to be adjusted.
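A sketch of how this covariance arises (the Gaussian basis functions and their placement at the data points are assumptions, following the standard RVM setup): a weight prior $w \sim \mathcal{N}(0, \operatorname{diag}(1/\alpha))$ on $f(x) = \sum_i w_i \phi_i(x)$ induces exactly the covariance above.

```python
import numpy as np

x = np.linspace(0.0, 5.0, 30)
centers = x.copy()                      # one basis function per data point
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / 2.0)   # basis matrix

alpha = np.ones(len(centers))           # weight precisions (to be adjusted)
# Covariance induced by integrating out the weights: Phi diag(1/alpha) Phi^T.
K = Phi @ np.diag(1.0 / alpha) @ Phi.T

# Non-stationary: the variance K(x, x) depends on x (smaller near the edges).
print(np.diag(K)[[0, len(x) // 2]])
```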

Underfitting II
[Figure: Relevance Vector Regression example]

Thank you!