Leveraging history for faster sampling of online social networks

Zhuojie Zhou, Nan Zhang, Gautam Das

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

6 Citations (Scopus)

Abstract

With a vast amount of data available on online social networks, enabling efficient analytics over such data has become an increasingly important research problem. Given the sheer size of such social networks, many existing studies resort to sampling techniques that draw random nodes from an online social network through its restrictive web/API interface. While these studies differ widely in the analytics tasks supported and in algorithmic design, almost all of them use the same underlying technique, the random walk: a Markov Chain Monte Carlo based method which iteratively transits from one node to a random neighbor. Random walks fit this problem naturally because, for most online social networks, the only query we can issue through the interface is to retrieve the neighbors of a given node (i.e., there is no access to the full graph topology). A problem with random walks, however, is the "burn-in" period, which requires a large number of transitions/queries before the sampling distribution converges to a stationary distribution that enables the drawing of samples in a statistically valid manner. In this paper, we consider the novel problem of speeding up the fundamental design of random walks (i.e., reducing the number of queries they require) without changing the stationary distribution they achieve, thereby enabling a more efficient "drop-in" replacement for existing sampling-based analytics techniques over online social networks. Technically, our main idea is to leverage the history of random walks to construct a higher-order Markov chain. We develop two algorithms, Circulated Neighbors Random Walk (CNRW) and Groupby Neighbors Random Walk (GNRW), and rigorously prove that, no matter what the social network topology is, CNRW and GNRW offer better efficiency than baseline random walks while achieving the same stationary distribution. We demonstrate through extensive experiments on real-world social networks and synthetic graphs the superiority of our techniques over the existing ones.
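To make the setting in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the only available operation is a hypothetical `neighbors(node)` callable standing in for the restrictive web/API interface. `simple_random_walk` is the baseline described above. `history_aware_walk` is one possible reading of "leveraging history": for each incoming edge (previous node, current node), it avoids repeating an outgoing choice until every neighbor has been used once. The abstract does not spell out the details of CNRW or GNRW, so the function names, the toy graph, and the without-replacement rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def simple_random_walk(neighbors, start, steps, rng):
    """Baseline walk: at each step, move to a uniformly random neighbor."""
    path = [start]
    for _ in range(steps):
        # One neighbor query per transition, as with a restrictive API.
        path.append(rng.choice(neighbors(path[-1])))
    return path


def history_aware_walk(neighbors, start, steps, rng):
    """Illustrative history-based walk (an assumption, not the paper's code).

    For each incoming edge (prev, cur), remember which neighbors of cur were
    already chosen as the next hop, and pick only among the unused ones.
    Once all neighbors have been used for that edge, the memory is reset.
    """
    used = defaultdict(set)  # (prev, cur) -> neighbors of cur already taken
    path = [start]
    prev = None
    for _ in range(steps):
        cur = path[-1]
        nbrs = neighbors(cur)
        key = (prev, cur)
        remaining = [n for n in nbrs if n not in used[key]]
        if not remaining:
            # Every neighbor was used once for this edge: start a new round.
            used[key].clear()
            remaining = list(nbrs)
        nxt = rng.choice(remaining)
        used[key].add(nxt)
        path.append(nxt)
        prev = cur
    return path


# Hypothetical toy graph; graph.__getitem__ plays the role of the API call.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
rng = random.Random(0)
baseline = simple_random_walk(graph.__getitem__, 0, 200, rng)
circulated = history_aware_walk(graph.__getitem__, 0, 200, rng)
```

Both functions issue one neighbor query per transition, so the second can be swapped in wherever the first is used. The sketch does not check the abstract's claim that the stationary distribution is unchanged; that is the subject of the paper's proofs.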

Original language: English (US)
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 1034-1045
Number of pages: 12
Edition: 10
DOIs: https://doi.org/10.14778/2794367.2794373
State: Published - Jan 1 2015
Event: 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006 - Seoul, Korea, Republic of
Duration: Sep 11 2006 to Sep 11 2006

Publication series

Name: Proceedings of the VLDB Endowment
Number: 10
Volume: 8
ISSN (Electronic): 2150-8097

Other

Other: 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006
Country: Korea, Republic of
City: Seoul
Period: 9/11/06 to 9/11/06

Fingerprint

Sampling
Markov processes
Topology
Application programming interfaces (API)
Interfaces (computer)
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Zhou, Z., Zhang, N., & Das, G. (2015). Leveraging history for faster sampling of online social networks. In Proceedings of the VLDB Endowment (10 ed., pp. 1034-1045). (Proceedings of the VLDB Endowment; Vol. 8, No. 10). Association for Computing Machinery. https://doi.org/10.14778/2794367.2794373


