Seminars, 6 June: Gene Clustering + Text Mining
The Department of Mathematics and Statistics of the University of Naples
Federico II announces that, as part of the activities of the PhD
programme in Statistics, the following seminars will be held on Friday,
6 June. Attendance is open to all interested parties.
10:00 - 12:00
Prof. Rebecka Jornsten
Department of Statistics, Rutgers University, NJ, USA
Title: Simultaneous Gene Clustering and Subset Selection for Sample
Classification via MDL
Abstract:
Microarray technology allows for the simultaneous monitoring of
thousands of genes for each sample. The high-dimensional gene expression
data can be used to study similarities of gene expression profiles across
different samples and so to cluster the genes. These clusters may be
indicative of genetic pathways.
Parallel to gene clustering is the important task of sample
classification based on all or selected gene expressions.
Gene clustering and sample classification are often undertaken
separately, or in a directional manner (one as an aid for the other).
However, such separation of the two tasks may obscure
informative structure in the data. Here we present an algorithm for the
simultaneous clustering of genes and subset selection of gene clusters
for sample classification. We develop a new model selection criterion
based on Rissanen's MDL (minimum description length) principle.
For the first time, an MDL code length is given for both explanatory
variables (genes) and response variables (sample class labels). The final
output of the proposed algorithm is a sparse and interpretable
classification model based on cluster centroids or on the genes closest
to the centroids. At the same time, these models give test error rates
competitive with those of the best reported methods. Compared with
classification models based on single gene selections, our rules are
stable in the sense that the number of clusters has a small variability
and the centroids of the clusters are well correlated (or consistent)
across different cross-validation samples.
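
As a rough illustration of what a classification model "based on cluster
centroids" can look like, here is a minimal Python sketch. It is not the
MDL algorithm described in the abstract: the gene clusters are simply
given by hand, no code length is computed, and the classifier is a plain
nearest-class-mean rule. All names (cluster_centroids, class_means,
classify) and the toy expression values are invented for the example.

```python
from statistics import mean

def cluster_centroids(expr, clusters):
    """expr maps gene name -> list of expression values (one per sample).
    clusters is a list of lists of gene names (assumed given here).
    Returns one centroid profile (average over genes) per cluster."""
    n_samples = len(next(iter(expr.values())))
    return [[mean(expr[g][s] for g in genes) for s in range(n_samples)]
            for genes in clusters]

def sample_features(centroids):
    # Transpose: one feature vector per sample, one entry per cluster.
    return [list(col) for col in zip(*centroids)]

def class_means(features, labels):
    # Average feature vector of the samples in each class.
    means = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        means[label] = [mean(col) for col in zip(*rows)]
    return means

def classify(x, means):
    # Assign x to the class whose mean is closest in squared distance.
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda label: sqdist(x, means[label]))

# Toy data (invented): 4 genes measured on 4 samples, two classes.
expr = {
    "g1": [1.0, 1.2, 3.0, 3.1],
    "g2": [0.9, 1.1, 2.9, 3.2],
    "g3": [2.0, 2.1, 2.0, 1.9],
    "g4": [2.1, 1.9, 2.1, 2.0],
}
clusters = [["g1", "g2"], ["g3", "g4"]]
labels = ["A", "A", "B", "B"]

centroids = cluster_centroids(expr, clusters)
means = class_means(sample_features(centroids), labels)
prediction = classify([1.0, 2.0], means)
```

The 4 genes are reduced to 2 centroid features per sample, which is the
sense in which such a model is sparse; the talk's contribution, choosing
the clusters and the subset of clusters by an MDL criterion, is not
reproduced here.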
**********************************
15:00 - 17:00
Prof. Regina Liu
Department of Statistics, Rutgers University, NJ, USA
Title: Statistical Mining of Massive Text Data
Abstract:
Recent advances in computing and data acquisition technologies have
made the collection of massive amounts of data a routine practice in many
fields. Besides being voluminous, the data are also often of non-standard
types. Among the several non-standard types of data, we deal with
textual data and focus on text classification. Text classification plays
an important role in information retrieval and machine learning. It has
become an indispensable tool for mining massive streaming textual data.
Given the probabilistic nature of many text classification techniques,
the general area of text analysis has become a fertile research ground
for statisticians. We discuss some statistical textual classification
methods, and evaluate their performance in terms of misclassification
rates. We also break down the misclassification rates and other related
findings further to suggest approaches to the task of feature selection.
This task arises frequently since statisticians are routinely asked to
perform data analysis within a sea of unstructured data. In this context,
we develop a systematic data mining procedure for exploring large
free-style text databases by automatic means, with the purpose of
discovering useful features and constructing tracking statistics for
measuring performance or risk. The procedure includes text analysis, risk
analysis, and nonparametric inference. We use some aviation safety report
repositories from the FAA and the NTSB to demonstrate problem statements
and applications of our procedure to general risk management and
decision-support systems. Some specific constructions of tracking
statistics are discussed.
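
To make "probabilistic text classification" and "misclassification
rate" concrete, here is a minimal Python sketch of a multinomial naive
Bayes classifier with add-one smoothing. Naive Bayes is chosen only as a
familiar example; the abstract does not say which methods the talk
covers. The function names and the short report texts are invented and
are not taken from the FAA or NTSB repositories.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (text, label) pairs.
    Returns per-class document counts, per-class word counts, vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def predict(text, model):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        # Log prior plus smoothed log likelihood of each known word.
        score = math.log(class_docs[label] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def misclassification_rate(docs, model):
    # Fraction of documents whose predicted label differs from the true one.
    errors = sum(predict(text, model) != label for text, label in docs)
    return errors / len(docs)

# Invented toy reports, for illustration only.
train_docs = [
    ("engine failure during climb", "mechanical"),
    ("hydraulic leak found after landing", "mechanical"),
    ("runway incursion by taxiing aircraft", "operational"),
    ("pilot deviated from assigned altitude", "operational"),
]
test_docs = [
    ("engine leak during landing", "mechanical"),
    ("aircraft deviated from runway", "operational"),
]

model = train(train_docs)
rate = misclassification_rate(test_docs, model)
```

Breaking the error count down by true class and predicted class, rather
than reporting the single overall rate, is one simple way to see which
words or categories drive the mistakes.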