Model-based clustering of large networks

Duy Q. Vu, David Russell Hunter, Michael Schweinberger

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Last, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
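The variational EM idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the simplest special case (an undirected Bernoulli stochastic block model), uses a plain fixed-point update in the E-step rather than the MM-augmented E-step the abstract describes, and runs in pure Python on a toy network. The names `simulate_sbm`, `variational_em`, `tau`, `alpha`, and `pi` are hypothetical and chosen only for this illustration.

```python
import math
import random


def simulate_sbm(sizes, p_in, p_out, rng):
    """Simulate an undirected Bernoulli block model; return (adjacency, labels)."""
    labels = [k for k, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj, labels


def variational_em(adj, n_blocks, n_iter, rng, eps=1e-6):
    """Mean-field variational EM (illustrative assumption, not the paper's algorithm).

    tau[i][k] approximates the posterior probability that node i is in block k,
    alpha[k] is the mixing proportion of block k, and pi[k][l] is the edge
    probability between blocks k and l.
    """
    n = len(adj)
    # Random soft assignments; each row of tau sums to 1.
    tau = []
    for _ in range(n):
        row = [rng.random() + 0.5 for _ in range(n_blocks)]
        total = sum(row)
        tau.append([v / total for v in row])
    alpha = [1.0 / n_blocks] * n_blocks
    pi = [[0.5] * n_blocks for _ in range(n_blocks)]
    for _ in range(n_iter):
        # M-step: closed-form updates given the current soft assignments.
        alpha = [sum(tau[i][k] for i in range(n)) / n for k in range(n_blocks)]
        for k in range(n_blocks):
            for l in range(n_blocks):
                num = den = 0.0
                for i in range(n):
                    for j in range(n):
                        if i != j:
                            w = tau[i][k] * tau[j][l]
                            num += w * adj[i][j]
                            den += w
                ratio = num / den if den > 0 else 0.5
                pi[k][l] = min(max(ratio, eps), 1 - eps)
        # E-step: one fixed-point sweep over all nodes (swapped in for MM).
        new_tau = []
        for i in range(n):
            log_w = []
            for k in range(n_blocks):
                s = math.log(max(alpha[k], eps))
                for j in range(n):
                    if j == i:
                        continue
                    for l in range(n_blocks):
                        p = pi[k][l]
                        y = adj[i][j]
                        s += tau[j][l] * (y * math.log(p) + (1 - y) * math.log(1 - p))
                log_w.append(s)
            # Normalize in log space to avoid underflow.
            m = max(log_w)
            w = [math.exp(v - m) for v in log_w]
            total = sum(w)
            new_tau.append([v / total for v in w])
        tau = new_tau
    return tau, alpha, pi


if __name__ == "__main__":
    rng = random.Random(1)
    adj, labels = simulate_sbm([15, 15], p_in=0.7, p_out=0.05, rng=rng)
    tau, alpha, pi = variational_em(adj, n_blocks=2, n_iter=20, rng=rng)
    # Hard assignment: the block with the largest variational probability.
    print([max(range(2), key=lambda k: row[k]) for row in tau])
```

Each sweep here costs on the order of n² · K² operations, so this sketch is only suitable for toy networks; scaling to the network sizes named in the abstract relies on the model structure and algorithmic improvements described in the paper itself. Block labels are identifiable only up to permutation, and the result can depend on the random initialization.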

Original language: English (US)
Pages (from-to): 1010-1039
Number of pages: 30
Journal: Annals of Applied Statistics
Volume: 7
Issue number: 2
DOI: 10.1214/12-AOAS617
State: Published - Jun 1 2013

Fingerprint

Model-based Clustering
Standard error
Variational Approximation
Finite Mixture Models
Parametric Bootstrap
Parsimony
Network Simulation
Exponential Family
Error Estimation
EM Algorithm
Vertex of a graph
Model structures
Estimation Algorithms
Parameterization
Large Data Sets
Error analysis
Approximation Algorithms
Error Estimates
Monte Carlo Simulation
Clustering

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Modeling and Simulation
  • Statistics, Probability and Uncertainty

Cite this

Vu, Duy Q. ; Hunter, David Russell ; Schweinberger, Michael. / Model-based clustering of large networks. In: Annals of Applied Statistics. 2013 ; Vol. 7, No. 2. pp. 1010-1039.
@article{e17b35ca74b64a56acdb38898a6c0ff0,
title = "Model-based clustering of large networks",
author = "Vu, {Duy Q.} and Hunter, {David Russell} and Schweinberger, Michael",
year = "2013",
month = "6",
day = "1",
doi = "10.1214/12-AOAS617",
language = "English (US)",
volume = "7",
pages = "1010--1039",
journal = "Annals of Applied Statistics",
issn = "1932-6157",
publisher = "Institute of Mathematical Statistics",
number = "2",

}

TY - JOUR

T1 - Model-based clustering of large networks

AU - Vu, Duy Q.

AU - Hunter, David Russell

AU - Schweinberger, Michael

PY - 2013/6/1

Y1 - 2013/6/1

AB - We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Last, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.

UR - http://www.scopus.com/inward/record.url?scp=84879529119&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84879529119&partnerID=8YFLogxK

U2 - 10.1214/12-AOAS617

DO - 10.1214/12-AOAS617

M3 - Article

AN - SCOPUS:84879529119

VL - 7

SP - 1010

EP - 1039

JO - Annals of Applied Statistics

JF - Annals of Applied Statistics

SN - 1932-6157

IS - 2

ER -