Mixture models for counts, model selection, regularisation and estimation.

Séminaire UTC,
20 March 2018

Etienne Côme, COSYS/GRETTIA/IFSTTAR

Mixture of unigrams

$$ \begin{eqnarray} Z&\sim&\mathcal{M}(1,\mathbf{\nu})\\ X_i|Z_{ik}=1&\sim&\mathcal{M}(L^i,\mathbf{\mu}_k)\\ \end{eqnarray} $$
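As a point of reference, here is a minimal NumPy sketch of this generative process (the dimensions and the names `nu`, `mu`, `doc_lengths` are illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 20, 100                      # clusters, vocabulary size, documents
nu = rng.dirichlet(np.ones(K))            # mixing proportions nu
mu = rng.dirichlet(np.ones(V), size=K)    # one word distribution mu_k per cluster
doc_lengths = rng.poisson(50, size=N)     # document lengths L^i, drawn here at random

Z = rng.choice(K, size=N, p=nu)           # cluster of each document
X = np.stack([rng.multinomial(L, mu[z])   # word counts X_i ~ M(L^i, mu_k)
              for L, z in zip(doc_lengths, Z)])
# X is the (N, V) document-term count matrix used in the rest of the slides
```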

Mixture of unigrams, estimation

Frequentist vs. Bayesian: frequentist side

$BIC(\color{green}{\hat{\theta}},\color{red}{D_k})$, asymptotic approximation: $$ -2\ln(p_k(x,\color{green}{\hat{\theta}}))+\color{red}{D_k}\ln(\color{blue}{N})$$ $\Rightarrow$ Drawbacks: several runs of EM, asymptotic approximation

Mixture of unigrams, estimation

Frequentist vs. Bayesian: frequentist side

$AIC(\color{green}{\hat{\theta}},\color{red}{D_k})$, asymptotic approximation: $$ -2\ln(p_k(x,\color{green}{\hat{\theta}}))+2\color{red}{D_k}$$ $\Rightarrow$ Drawbacks: several runs of EM, asymptotic approximation

Mixture of unigrams, estimation

Frequentist vs. Bayesian: frequentist side

$ICL(\color{green}{\hat{\theta}},\color{purple}{\hat{z}},\color{red}{D_k})$, asymptotic approximation: $$ -2 \ln(p_k(x,\color{purple}{\hat{z}},\color{green}{\hat{\theta}}))+\color{red}{D_k}\ln(\color{blue}{N})$$ $\Rightarrow$ Drawbacks: several runs of EM, asymptotic approximation
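The three criteria share the same shape, a fit term plus a penalty; a hedged sketch (assuming `loglik` and `loglik_completed` come from an EM fit, which is not shown here):

```python
import numpy as np

def bic(loglik, D_k, N):
    # -2 ln p_k(x, theta_hat) + D_k ln N
    return -2.0 * loglik + D_k * np.log(N)

def aic(loglik, D_k):
    # -2 ln p_k(x, theta_hat) + 2 D_k
    return -2.0 * loglik + 2.0 * D_k

def icl(loglik_completed, D_k, N):
    # same penalty as BIC, but on the completed likelihood p_k(x, z_hat, theta_hat)
    return -2.0 * loglik_completed + D_k * np.log(N)
```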

Mixture of unigrams, estimation

Frequentist vs. Bayesian: frequentist side

$\Rightarrow$ Drawbacks: several runs of EM, ...

Mixture of unigrams, estimation

Frequentist vs. Bayesian: Bayesian side

Dirichlet distributions are conjugate priors for multinomials: $$ \begin{eqnarray} \mathbf{\nu}&\sim&\mathcal{D}(\mathbf{\alpha})\\ \mathbf{\mu}_k&\sim&\mathcal{D}(\mathbf{\beta}_k)\\ \end{eqnarray} $$

Mixture of unigrams, estimation

Frequentist vs. Bayesian: Bayesian side

Gibbs sampling [Rigouste et al. 2006]

$$\begin{eqnarray} Z &\sim& p(Z|\nu, \mu,X)\\ \nu &\sim& p(\nu|Z,X)\\ \mu &\sim& p(\mu|Z,X)\\ \end{eqnarray}$$ ! Mixing, label switching, ...
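A hedged NumPy sketch of one sweep of this sampler under the Dirichlet priors above (the function name and the log-space normalisation are implementation choices, not the authors' code):

```python
import numpy as np
from scipy.special import logsumexp

def gibbs_sweep(X, Z, nu, mu, alpha, beta, rng):
    """One sweep in the order of the slide: Z | nu, mu, X then nu | Z then mu | Z, X."""
    K = len(alpha)
    # Z_i | nu, mu, X_i  proportional to  nu_k * prod_v mu_kv^{X_iv}  (log space)
    logp = np.log(nu) + X @ np.log(mu).T              # shape (N, K)
    logp -= logsumexp(logp, axis=1, keepdims=True)
    Z = np.array([rng.choice(K, p=np.exp(lp)) for lp in logp])
    # nu | Z ~ Dirichlet(alpha + cluster sizes)
    nu = rng.dirichlet(alpha + np.bincount(Z, minlength=K))
    # mu_k | Z, X ~ Dirichlet(beta_k + word counts of cluster k)
    mu = np.stack([rng.dirichlet(beta[k] + X[Z == k].sum(axis=0)) for k in range(K)])
    return Z, nu, mu
```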

Mixture of unigrams, estimation

Frequentist vs. Bayesian: Bayesian side

Collapsed Gibbs sampling [Rigouste et al. 2006]

$$\begin{eqnarray} Z &\sim& p(Z|X)\\ \end{eqnarray}$$ ! Mixing, label switching, ...

Mixture of unigrams, estimation

Frequentist vs. Bayesian: Bayesian side, ICL [Biernacki et al. 2012]

Mixture of unigrams, estimation

Frequentist vs. Bayesian: Bayesian side, ICL

Parameter marginalization


Conjugacy $\Rightarrow$ analytical integration of the parameters: \begin{eqnarray} \log(p(\mathbf{x},\mathbf{z}))&=&\color{RoyalBlue}{\log(p(\mathbf{x}|\mathbf{z}))}+\color{SeaGreen}{\log(p(\mathbf{z}))}\\ &=&\color{RoyalBlue}{\sum_{k=1}^{K}\log\left(\frac{C(\mathbf{\epsilon}_k)}{C(\mathbf{\beta}_k)}\right)}+\color{SeaGreen}{\log\left(\frac{C(\mathbf{n})}{C(\mathbf{\alpha})}\right)}\\ \end{eqnarray} with $\left\{\tiny{\begin{eqnarray} \color{RoyalBlue}{\mathbf{\epsilon}_k}&=&(\beta_{k1}+\sum_{i=1}^NZ_{ik}X_{i1},...,\beta_{kV}+\sum_{i=1}^NZ_{ik}X_{iV})\\ \color{SeaGreen}{\mathbf{n}}&=&(\alpha_{1}+\sum_{i=1}^NZ_{i1},...,\alpha_{K}+\sum_{i=1}^NZ_{iK}) \end{eqnarray}}\right\}$ the pseudo-count vectors
and $C$ the normalisation term of the Dirichlet distribution $C(\mathbf{a})=\frac{\prod_{d=1}^D\Gamma(a_d)}{\Gamma(\sum_{d=1}^Da_d)}$
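This closed form is easy to evaluate numerically; a hedged sketch using only `gammaln` (assuming `X`, `alpha`, `beta` as above, with `Z` stored as integer labels):

```python
import numpy as np
from scipy.special import gammaln

def log_C(a):
    # normalisation term of the Dirichlet: prod_d Gamma(a_d) / Gamma(sum_d a_d)
    return gammaln(a).sum() - gammaln(a.sum())

def log_p_xz(X, Z, alpha, beta):
    """log p(x, z) with the parameters integrated out (formula above)."""
    K = len(alpha)
    eps = [beta[k] + X[Z == k].sum(axis=0) for k in range(K)]   # pseudo-counts eps_k
    n = alpha + np.bincount(Z, minlength=K)                     # pseudo-counts n
    return (sum(log_C(eps[k]) - log_C(beta[k]) for k in range(K))
            + log_C(n) - log_C(alpha))
```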

Models

Graph clustering, Stochastic Block Model

$$ \begin{eqnarray} Z&\sim&\mathcal{M}(1,\mathbf{\nu})\\ X_{ij}|Z_{ik}Z_{jl}=1&\sim&\mathcal{B}(\mathbf{\Pi}_{kl})\\ \end{eqnarray} $$

Models

Degree-corrected Stochastic Block Model

$$ \begin{eqnarray} Z&\sim&\mathcal{M}(1,\mathbf{\alpha})\\ X_{ij}|Z_{ik}Z_{jl}=1&\sim&\mathcal{P}(\beta_j\beta_i\Pi_{kl})\\ \end{eqnarray} $$
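A minimal sketch of simulation from this degree-corrected model (the block rates `Pi`, the degree parameters `beta` and the proportions are illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 4
props = np.full(K, 1.0 / K)                     # block proportions
Pi = np.full((K, K), 0.02) + 0.3 * np.eye(K)    # assortative block rates
beta = rng.gamma(2.0, 0.5, size=N)              # per-node degree parameters beta_i

Z = rng.choice(K, size=N, p=props)              # block of each node
rate = np.outer(beta, beta) * Pi[np.ix_(Z, Z)]  # beta_i * beta_j * Pi_{z_i z_j}
X = rng.poisson(rate)                           # (N, N) weighted adjacency counts
# (self-loops and symmetry are ignored here for brevity)
```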

SBM estimation

Same algorithms and methods, except for some additional difficulties:


Greedy optimization of ICL

$$\hat{Z}=\arg\max_{Z}ICL_{ex}(X,Z)$$

Algorithm, greedy swap
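A hedged sketch of the swap loop (here `icl_ex` is assumed to be a function returning the exact ICL of a partition, e.g. built from the pseudo-count formulas above; a real implementation would update the criterion incrementally instead of recomputing it at every move):

```python
import numpy as np

def greedy_swaps(X, Z, K, icl_ex, rng):
    """Move single nodes between clusters as long as some move increases ICL."""
    best = icl_ex(X, Z)
    improved = True
    while improved:
        improved = False
        for i in rng.permutation(len(Z)):       # visit nodes in random order
            current = Z[i]
            for k in range(K):                  # try every other cluster for node i
                if k == current:
                    continue
                Z[i] = k
                score = icl_ex(X, Z)
                if score > best:                # keep the best move found so far
                    best, current, improved = score, k, True
            Z[i] = current                      # settle node i in its best cluster
    return Z, best
```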


Remarks:

$\Rightarrow$ No need for multiple runs for each $K$
$\Rightarrow$ ICL depends only on the pseudo-count matrix and vector
$\Rightarrow$ Competitive complexity, $\mathcal{O}(NK^2+L)$ vs $\mathcal{O}(K^3L)$, and running time
$\Rightarrow$ Local optimum! sensitive to initialization
$\Rightarrow$ Combine with greedy hierarchical moves

Simulation results (small-scale problems)


Simulation results (medium-scale problems)


Communities, 500 nodes, K=10

Simulation results (large-scale problem)


Complex structure, 10 000 nodes, K=50


Real data (drawings blogs)

Remarks

Hierarchical, regularisation

Main idea

$\log(\Gamma(x))$ around $0$

$$x\rightarrow 0 : \quad \log(\Gamma(x))= \log(\Gamma(x+1)) - \log(x) \approx - \log(x)$$
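A quick numerical check of this approximation with SciPy's `gammaln`:

```python
import numpy as np
from scipy.special import gammaln

for x in [1e-1, 1e-3, 1e-6]:
    print(f"x={x:g}  log(Gamma(x))={gammaln(x):.4f}  -log(x)={-np.log(x):.4f}")
# the two values agree better and better as x goes to 0
```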

ICL as a function of $\log(\alpha)$


$$ICL(Z,\alpha)=D(Z)+(K-1)\log(\alpha)$$ ! simple form, linear shape

Merge Cost with respect to $\alpha$:


$$ \begin{eqnarray} \Delta_{gh}(\alpha)&>&0\\ \Leftrightarrow \log(\alpha)&<&\log\left(\frac{K}{K-1}\right)+\\ &&D_{gh}(Z)+\log(\Gamma(n_h+n_g))-\log(\Gamma(n_h))-\log(\Gamma(n_g)) \end{eqnarray}$$
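In code, this bound gives the largest $\log(\alpha)$ for which merging clusters $g$ and $h$ still increases ICL; a hedged sketch (with `D_gh`, `n_g`, `n_h` and `K` as in the formula above):

```python
import numpy as np
from scipy.special import gammaln

def log_alpha_merge_threshold(D_gh, n_g, n_h, K):
    """Merging g and h increases ICL whenever log(alpha) is below this value."""
    return (np.log(K / (K - 1.0)) + D_gh
            + gammaln(n_g + n_h) - gammaln(n_g) - gammaln(n_h))
```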

ICL as a function of $\log(\alpha)$

Algorithm

Hierarchical, greedy merging

Remarks

$\Rightarrow$ complexity $\mathcal{O}(K^3)$
$\Rightarrow$ uses only the pseudo-count matrix and vector
$\Rightarrow$ heuristic to follow the Pareto front over (ICL, $\alpha$)

Dendrogram from greedy merge tree

ICL $\alpha=1$, drawings blogs


Dendrogram, drawings blogs


Dendrogram, drawings blogs


$\mathbf{\Pi}$, drawings blogs


$\mathbf{\Pi}$, drawings blogs


$\mathbf{\Pi}$, drawings blogs


$\mathbf{\Pi}$, drawings blogs


Overfitting, large-scale simulation


! Almost exact recovery with sufficient regularization.

Bibliography:

Baudry, Maugis, Michel 2010: Slope Heuristics: Overview and Implementation
Rigouste, Cappé, Yvon 2006: Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
Biernacki, Celeux, Govaert 2011: Exact and Monte Carlo Calculations of Integrated Likelihoods for the Latent Class Model
Latouche, Birmelé, Ambroise 2012: Variational Bayesian inference and complexity control for stochastic block models
McDaid, Murphy, Hurley 2013: Improved Bayesian inference for the stochastic block model with application to large networks
Côme, Latouche 2015: Model selection and clustering in stochastic block models with the exact integrated complete data likelihood
Newman, Reinert 2016: Estimating the number of communities in a network
Ghasemian, Hosseinmardi, Clauset 2018: Evaluating Overfit and Underfit in Models of Network Community Structure

Work in progress:

Côme, Jouvin, Latouche, Bouveyron : Hierarchical clustering in stochastic block models with the integrated classification likelihood

Thanks for your attention!

Questions?









@comeetie