
Seminars on 6 June: Gene Clustering + Text Mining



We announce that the following seminars will be held this coming Friday, 6 June, at the Department of Mathematics and Statistics of the Università di Napoli Federico II, as part of the activities of the PhD Programme in Statistics. Attendance is open to anyone interested.

10:00 - 12:00
Prof. Rebecka Jornsten
Department of Statistics, Rutgers University, NJ, USA

Title: Simultaneous Gene Clustering and Subset Selection for Sample Classification via MDL

Microarray technology allows the simultaneous monitoring of thousands of genes for each sample. The resulting high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples and so form a gene clustering. These clusters may be indicative of genetic pathways. Parallel to gene clustering is the important task of sample classification based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of the two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification model based on cluster centroids or the genes closest to the centroids. At the same time, these models give test error rates competitive with the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has small variability and the centroids of the clusters are well correlated (or consistent) across different cross-validation samples.

**********************************

15:00 - 17:00
Prof. Regina Liu
Department of Statistics, Rutgers University, NJ, USA

Title: Statistical Mining of Massive Text Data

Abstract:
Recent advances in computing and data acquisition technologies have made the collection of massive amounts of data a routine practice in many fields. Besides their sheer volume, the data are often of non-standard types. Among the several non-standard types of data, we deal with textual data and focus on text classification. Text classification plays an important role in information retrieval and machine learning. It has become an indispensable tool for mining massive streaming textual data. Given the probabilistic nature of many text classification techniques, the general area of text analysis has become a fertile research ground for statisticians. We discuss some statistical text classification methods and evaluate their performance in terms of misclassification rates. We also break down the misclassification rates further, together with other related findings, to suggest approaches to the task of feature selection. This task arises frequently, since statisticians are routinely asked to perform data analysis within a sea of unstructured data. In this context, we develop a systematic data mining procedure for exploring large free-style text databases by automatic means, with the purpose of discovering useful features and constructing tracking statistics for measuring performance or risk. The procedure includes text analysis, risk analysis, and nonparametric inference. We use some aviation safety report repositories from the FAA and the NTSB to demonstrate problem statements and applications of our procedure to general risk management and decision-support systems. Some specific constructions of tracking statistics are discussed.