[Forum SIS] Post Doc Fellowship - ORANGE Lab - Lannion - France

Mon 5 Dec 2016 15:22:47 CET

On behalf of my colleague Fabrice Clerot of ORANGE Lab, I am forwarding the following post-doc position
in the statistical information processing group at
Orange Labs Lannion

Post-Doc : Co-clustering for large scale data

ref : 0014645
Apply before : 30 Jan 2017

2 avenue Pierre Marzin
22300 LANNION - France

contract : Post Doc

Apply online at :
https://orange.jobs/jobs/offer.do?do=fiche&id=57600

About the role

Scientific objectives - expected results

Khiops co-clustering is an exploratory data analysis tool for analysing the correlation between two or more categorical or numerical variables. The tool implements a co-clustering method based on a model selection approach named MODL (Boullé, 2006, 2001). This tool (available at www.khiops.com) is widely used at Orange, with applications to a variety of problems:
• Marketing, basket analysis: customers with the list of their purchased products (customer x product).
• Web Mining: web log analysis to identify navigation profiles (cookies x web pages).
• Telecommunications: mobile network dimensioning based on call detail records (CDR) analysis (source antenna x target antenna), e.g. exploratory analysis of CDRs at a country scale (Guigourès, 2013).
• Text mining: (co)clustering of texts (texts x words).
• Graph mining: tri-clustering of temporal graphs (source nodes x target nodes x time), e.g. analysis of the London cycle dataset (Guigourès et al., 2012).
• Functional data clustering (numerical or categorical time series): TimeSeriesId x time x value or TimeSeriesId x time x event, e.g. curve clustering (Boullé, 2012).
Khiops co-clustering can handle large data sets, with millions of instances and tens of thousands of values per categorical variable, with sub-quadratic complexity w.r.t. the number of instances. However, the tool can hardly be used on very large scale data, with up to billions of instances and variables having millions of values. For example, this limit is reached in the analysis of CDRs at a country scale, when the studied granularity goes from antenna level (application to network dimensioning) to individual customers (marketing application with identification of fine-grained communities and customer experience personalization). In this use case, the graph to summarize is too large to be handled by current co-clustering algorithms.

The objective of the post-doc is to extend the co-clustering optimization algorithms to large scale data, given the MODL co-clustering criterion (Boullé, 2011). Among the potential approaches, one might exploit or extend hierarchical partitioning algorithms such as H-metis (Karypis et al, 2000 ; Selvakkumaran et al, 2006), which are well suited to very large scale graphs. This would make it possible to start from an initial « raw » partition, which could then be refined by applying the standard co-clustering algorithms to sub-parts of the graph. The set of partial co-clusterings could then be merged to obtain a global co-clustering. This kind of three-pass algorithm (initial co-clustering, partial co-clusterings, merge) can be generalized to several levels of hierarchy and allows the algorithms to be parallelized.

Instead of using H-metis for the first pass, one could consider using approximate Singular Value Decomposition (SVD) for the clustering of large scale graphs. More simply, the standard co-clustering algorithm could be applied to a small data sample, and the rest of the data could then be projected onto the obtained co-clusters before applying the following passes.
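The "sample, then project" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Khiops or MODL algorithm: the function name project_rows, the (row, column) edge-list format and the Laplace-smoothed likelihood rule are all hypothetical choices made for the example. It assumes a co-clustering of a small sample is already available as two dictionaries, and assigns each remaining row value to the row cluster whose column-cluster profile makes its observations most likely.

```python
import math
from collections import Counter, defaultdict


def project_rows(edges, row_cluster, col_cluster, n_row_clusters, n_col_clusters):
    """Assign every row value absent from row_cluster to an existing row cluster.

    Hypothetical helper: a likelihood-based stand-in for the projection step,
    not the MODL criterion.
    """
    # Block counts (row cluster x column cluster) from rows clustered on the sample.
    block = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for r, c in edges:
        if r in row_cluster and c in col_cluster:
            block[row_cluster[r]][col_cluster[c]] += 1

    # Laplace-smoothed log-probabilities of each column cluster, per row cluster.
    log_profiles = []
    for counts in block:
        total = sum(counts) + n_col_clusters
        log_profiles.append([math.log((n + 1) / total) for n in counts])

    # Column-cluster counts observed for each row outside the sample.
    new_counts = defaultdict(Counter)
    for r, c in edges:
        if r not in row_cluster and c in col_cluster:
            new_counts[r][col_cluster[c]] += 1

    # Keep the row cluster with the highest log-likelihood for each new row.
    assigned = {}
    for r, counts in new_counts.items():
        scores = [sum(n * profile[j] for j, n in counts.items())
                  for profile in log_profiles]
        assigned[r] = max(range(n_row_clusters), key=lambda k: scores[k])
    return assigned


# Toy data: rows a*, b* were co-clustered on the sample; n1, n2 are projected.
edges = [("a1", "x1"), ("a1", "x2"), ("a2", "x1"), ("a2", "x2"),
         ("b1", "y1"), ("b1", "y2"), ("b2", "y1"), ("b2", "y2"),
         ("n1", "x1"), ("n1", "x2"), ("n2", "y1")]
row_cluster = {"a1": 0, "a2": 0, "b1": 1, "b2": 1}
col_cluster = {"x1": 0, "x2": 0, "y1": 1, "y2": 1}
projected = project_rows(edges, row_cluster, col_cluster, 2, 2)
print(projected)
```

The projection is a single pass over the data, which is what makes it attractive as a cheap first step; the refinement and merge passes described above would still be needed afterwards.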

Context

Proposed by Hartigan (1975) as an extension of standard clustering, co-clustering is a data mining technique that aims at identifying the underlying structure between the rows and columns of a data matrix in the form of homogeneous blocks. Whereas standard clustering groups similar individuals (observations) with respect to a set of features, co-clustering simultaneously groups similar individuals with respect to variables and similar variables with respect to observations, thus extracting the correspondence structure between the objects and the features. Another advantage of co-clustering over standard clustering techniques is its matrix reduction capacity: a large data table can be reduced to a significantly smaller one that has the same structure as the original matrix. Several approaches have been proposed to extract underlying structures by means of co-clustering techniques (Bock, 1979; Cheng et al, 2000; Dhillon et al, 2003; Xu et al, 2010). These methods differ mainly in the type of data analyzed (categorical or numerical), the underlying hypotheses, the extraction method and the expected results. Several families of approaches have been proposed to perform cross-classification:
• Matrix reconstruction based methods, which state the problem as one of matrix approximation (Seung et al, 2001 ; Yoo et al, 2010 ; Xu et al, 2010), as well as CROEUC, CROBIN and CROKI2 for continuous, binary and contingency data (Govaert, 1983).
• Probabilistic models: use of latent variables in mixture models to specify each co-clustering block (Govaert et al, 2003; Govaert et al, 2013).
• Co-clustering methods based on the MODL approach (Boullé, 2011), which exploit probabilistic models for two or more variables of any type (numerical or categorical), are user-parameter-free and benefit from algorithms with sub-quadratic complexity, making it possible to deal with large datasets.
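
The matrix reduction capacity mentioned in this Context section can be illustrated with a short sketch. It assumes the row and column cluster labels are already known (how they are obtained is the job of whichever co-clustering method is used); the function name reduce_matrix and the toy data are made up for the example.

```python
def reduce_matrix(matrix, row_labels, col_labels):
    """Sum the cells of a count matrix into (row cluster x column cluster) blocks."""
    n_row_clusters = max(row_labels) + 1
    n_col_clusters = max(col_labels) + 1
    blocks = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            blocks[row_labels[i]][col_labels[j]] += value
    return blocks


# Toy 4 x 4 count matrix with two row clusters and two column clusters.
matrix = [
    [5, 4, 0, 0],
    [6, 5, 0, 1],
    [0, 0, 3, 4],
    [1, 0, 4, 3],
]
row_labels = [0, 0, 1, 1]  # cluster index of each row
col_labels = [0, 0, 1, 1]  # cluster index of each column
reduced = reduce_matrix(matrix, row_labels, col_labels)
print(reduced)  # [[20, 1], [1, 14]]
```

The 2 x 2 result keeps the block-diagonal structure of the 4 x 4 table, which is the sense in which the reduced table has the same structure as the original one.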