TY - GEN
T1 - Collaborative Cloud Computing Framework for Health Data with Open Source Technologies
AU - Rouzbeh, Fatemeh
AU - Grama, Ananth
AU - Griffin, Paul
AU - Adibuzzaman, Mohammad
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/9/21
Y1 - 2020/9/21
N2 - The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, the design of the system architecture to achieve high performance in terms of parallelization, query processing time, aggregation of heterogeneous data types (e.g., time series, images, structured data, among others), and difficulty in reproducing scientific research remain a major challenge. This is specifically true for health sciences research, where the systems must be i) easy to use with the flexibility to manipulate data at the most granular level, ii) agnostic of programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature for such big data systems for scientific research in health sciences and identify the gaps of the current system landscape. We propose a novel architecture for software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
AB - The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, the design of the system architecture to achieve high performance in terms of parallelization, query processing time, aggregation of heterogeneous data types (e.g., time series, images, structured data, among others), and difficulty in reproducing scientific research remain a major challenge. This is specifically true for health sciences research, where the systems must be i) easy to use with the flexibility to manipulate data at the most granular level, ii) agnostic of programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature for such big data systems for scientific research in health sciences and identify the gaps of the current system landscape. We propose a novel architecture for software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
UR - http://www.scopus.com/inward/record.url?scp=85096974197&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096974197&partnerID=8YFLogxK
U2 - 10.1145/3388440.3412460
DO - 10.1145/3388440.3412460
M3 - Conference contribution
AN - SCOPUS:85096974197
T3 - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
BT - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
PB - Association for Computing Machinery, Inc
T2 - 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Y2 - 21 September 2020 through 24 September 2020
ER -