Extracting Information from High-Dimensional Data: Probabilistic Modeling, Inference and Evaluation
Speaker: Güngör Polatkan
Series: Final Public Orals
Location: Engineering Quadrangle B327
Date/Time: Monday, August 20, 2012, 1:00 p.m. - 2:30 p.m.
Abstract:
Data science is an emerging field at the interface of computer science, statistics, mathematics and signal processing. The field is undergoing explosive growth, mainly due to the widespread use of tools, such as the internet and mobile devices, that lead to the massive accumulation of data from different sources. The sheer size of these data sets requires large-scale computational (rather than human-powered) data analysis and decision making, and advances in computing resources are a driving force in this growth. However, the scale and high dimensionality of the data are such that even powerful present-day computing resources can only partially address the complexity of the problems; they need to be paired with advanced techniques and algorithms.
A typical data analysis project consists of several stages: initial exploratory analysis, model building, derivation of inference, visualization, and evaluation. In modern data science, one important problem of the model-building phase is how to incorporate data-specific properties into the models. Early machine learning techniques were designed to work on generic data sets, using parameters specified a priori. However, as the diversity and complexity of the data sets grew, more advanced approaches were needed, tailored to the particular properties of the type of application under study. Such tailoring can take many different forms. For instance, it may be necessary to learn the model parameters from the data (instead of specifying them from the start); one can incorporate prior information (such as sparsity with respect to special representations, which themselves have to be learned); or it may be beneficial to make use of relational structure within the data, which can be available in many guises: segmented image patches, citation networks of documents, social networks of friends.
In this talk, we shall visit all these approaches, each time within a probabilistic model built so as to incorporate prior information. More precisely, we shall derive, in a variety of settings and for different applications, efficient posterior inference algorithms that handle large data sets, and use side information to derive superior inference techniques. We demonstrate the efficiency and accuracy of those models and algorithms in the different applications (e.g., image super-resolution, recommendation systems, time series analysis), on both real and synthetic data sets. We evaluate the quality of the results with both quantitative and human evaluation experiments.

