Skip to content

Scala implementation of the Multiple Coclustering model, a Bayesian Non-Parametric method for block clustering of multivariate continuous observations

License

Notifications You must be signed in to change notification settings

EtienneGof/MultiCoclustering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MultiCoclustering

This code implements three model based block clustering methods. These models are based on the Dirichlet Process Mixture Model (DPMM, used for univariate dataset clustering) and extends it to multivariate datasets.

What's inside

The first approach, named "Clustering", performs the inference of a Dirichet Process Mixture Model on a univariate or multivariate dataset. Optionnally, it enables the user to specify a partition of the variables, thus allowing the creation of "blocks", i.e., groups of variables and simulations that independently follow the same distribution. In the "Clustering" case, this partition is not updated.

The second approach, named "Coclustering", estimates both the row-partition (as in the Clustering), but also estimates the variable partition. In this case, the dataset elements that belong to the same row and column cluster (i.e., the same block) follow independently the same distribution.

The last approach, named "Multi-Coclustering", assumes the presence of several coclustering structures. This approach first estimates a partition of variable and, in each dataset corresponding to one variable cluster, estimates a coclustering structure, such that elements in the same coclustering structure and, inside of it, in the same block, follow the same distribution independently.

The three methods and their inter-connection is illustrated below

Fo each methods, the inference is performed with a Collapsed Gibbs Sampler (cf. [1]).

Also, in each case, the user can either choose to provide the Dirichlet Process concentration parameter value or a Gamma hyperprior for this parameter. In the latter case, alpha's value is updated following [2]. Using an hyper-prior is mandatory in the multi-coclustering case.

Quick Setup

The script build.sh is provided to build the scala sources.

See src/pom.xml file for Scala dependencies.

The scripts clusteringExample.scala, coclusteringExample.scala and multiCoclusteringExample.scala illustrate the use of each methods, based on toy datasets.

[1] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of computational and graphical statistics, 9(2), 249-265.

[2] West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. ISDS Discussion Paper# 92-A03: Duke University.

About

Scala implementation of the Multiple Coclustering model, a Bayesian Non-Parametric method for block clustering of multivariate continuous observations

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published