A cloud-based framework for large-scale log mining through Apache Spark and Elasticsearch

Yun Li, Yongyao Jiang, Juan Gu, Mingyue Lu, Manzhu Yu, Edward M. Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Greguska Frank, Chaowei Yang

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. A growing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-consuming. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partitioning paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are used to validate the performance of the proposed parallel log-mining framework.
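The two ideas named in the abstract, session reconstruction and load-balanced partitioning, can be illustrated with a minimal sketch. The abstract does not say how sessions are delimited or how logPartitioner assigns data, so everything below is an assumption made for illustration: the 30-minute inactivity gap, the (user, timestamp) record shape, the function names, and the greedy heaviest-first assignment are hypothetical and are not the authors' method. Plain Python is used in place of Spark so the example stays self-contained.

```python
import heapq
from collections import defaultdict

# Assumed inactivity threshold in seconds; the abstract does not give one.
SESSION_GAP_SECONDS = 30 * 60


def reconstruct_sessions(records, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) records into per-user sessions.

    A new session starts whenever two consecutive requests from the same
    user are more than `gap` seconds apart.
    """
    by_user = defaultdict(list)
    for user, ts in records:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])  # gap exceeded: start a new session
            else:
                user_sessions[-1].append(ts)
        sessions[user] = user_sessions
    return sessions


def balanced_partitions(counts, num_partitions):
    """Assign each user to a partition so that record counts stay even.

    Greedy heuristic: take users from heaviest to lightest and place each
    one on the partition that currently holds the fewest records.
    """
    heap = [(0, p) for p in range(num_partitions)]  # (current load, partition id)
    heapq.heapify(heap)
    assignment = {}
    for user, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[user] = p
        heapq.heappush(heap, (load + n, p))
    return assignment
```

Hash partitioning by user can leave one partition holding most of the records when a few users dominate the logs. Assigning by record count instead, as in this sketch, is one simple way to even out the load. In a Spark job, a mapping like the one returned by balanced_partitions could back a custom partitioner.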

Original language: English (US)
Article number: 1114
Journal: Applied Sciences (Switzerland)
Volume: 9
Issue number: 6
DOI: https://doi.org/10.3390/app9061114
State: Published - Jan 1 2019

Fingerprint

sparks
Electric sparks
metadata
ranking
data simulation
Metadata
recommendations
linkages
suggestion
partitions
resources
trends

All Science Journal Classification (ASJC) codes

  • Materials Science(all)
  • Instrumentation
  • Engineering(all)
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes

Cite this

Li, Yun ; Jiang, Yongyao ; Gu, Juan ; Lu, Mingyue ; Yu, Manzhu ; Armstrong, Edward M. ; Huang, Thomas ; Moroni, David ; McGibbney, Lewis J. ; Frank, Greguska ; Yang, Chaowei. / A cloud-based framework for large-scale log mining through Apache Spark and Elasticsearch. In: Applied Sciences (Switzerland). 2019 ; Vol. 9, No. 6.
@article{2fc43f6e79f741cab329b01eba06cbba,
title = "A cloud-based framework for large-scale log mining through Apache Spark and Elasticsearch",
author = "Yun Li and Yongyao Jiang and Juan Gu and Mingyue Lu and Manzhu Yu and Armstrong, {Edward M.} and Thomas Huang and David Moroni and McGibbney, {Lewis J.} and Greguska Frank and Chaowei Yang",
year = "2019",
month = "1",
day = "1",
doi = "10.3390/app9061114",
language = "English (US)",
volume = "9",
journal = "Applied Sciences (Switzerland)",
issn = "2076-3417",
publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",
number = "6",

}

Li, Y, Jiang, Y, Gu, J, Lu, M, Yu, M, Armstrong, EM, Huang, T, Moroni, D, McGibbney, LJ, Frank, G & Yang, C 2019, 'A cloud-based framework for large-scale log mining through Apache Spark and Elasticsearch', Applied Sciences (Switzerland), vol. 9, no. 6, 1114. https://doi.org/10.3390/app9061114

TY - JOUR

T1 - A cloud-based framework for large-scale log mining through Apache Spark and Elasticsearch

AU - Li, Yun

AU - Jiang, Yongyao

AU - Gu, Juan

AU - Lu, Mingyue

AU - Yu, Manzhu

AU - Armstrong, Edward M.

AU - Huang, Thomas

AU - Moroni, David

AU - McGibbney, Lewis J.

AU - Frank, Greguska

AU - Yang, Chaowei

PY - 2019/1/1

Y1 - 2019/1/1

N2 - The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. A growing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-consuming. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partitioning paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are used to validate the performance of the proposed parallel log-mining framework.

UR - http://www.scopus.com/inward/record.url?scp=85063752517&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063752517&partnerID=8YFLogxK

U2 - 10.3390/app9061114

DO - 10.3390/app9061114

M3 - Article

AN - SCOPUS:85063752517

VL - 9

JO - Applied Sciences (Switzerland)

JF - Applied Sciences (Switzerland)

SN - 2076-3417

IS - 6

M1 - 1114

ER -