Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse

Patricia A. Soranno, Edward G. Bissell, Kendra S. Cheruvelil, Samuel T. Christel, Sarah M. Collins, C. Emi Fergus, Christopher T. Filstrup, Jean Francois Lapierre, Noah R. Lottig, Samantha K. Oliver, Caren E. Scott, Nicole J. Smith, Scott Stopyak, Shuai Yuan, Mary Tate Bremigan, John A. Downing, Corinna Gries, Emily N. Henry, Nick K. Skaff, Emily H. Stanley, Craig A. Stow, Pang Ning Tan, Tyler Wagner, Katherine E. Webster

Research output: Contribution to journal (Review article)

49 Citations (Scopus)

Abstract

Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
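As an illustration of the kind of integration step the abstract describes (mapping heterogeneous source formats onto one schema while recording provenance), here is a minimal Python sketch. It is not the LAGOS implementation: the source names (`agency_a`, `agency_b`), column names, unit conversions, and the `Observation` record are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical per-source mapping (an assumption, not from LAGOS):
# source column -> (common variable name, factor converting to ug/L)
SOURCE_MAPPINGS = {
    "agency_a": {"TP_mgL": ("total_phosphorus_ugl", 1000.0)},
    "agency_b": {"tp_ugl": ("total_phosphorus_ugl", 1.0)},
}


@dataclass(frozen=True)
class Observation:
    lake_id: str
    variable: str       # name in the common schema
    value: float        # value in the common unit
    source: str         # provenance: originating dataset
    source_column: str  # provenance: original column name


def harmonize(source, rows):
    """Convert rows from one source dataset into common-schema observations."""
    mapping = SOURCE_MAPPINGS[source]
    observations = []
    for row in rows:
        for column, (variable, factor) in mapping.items():
            raw = row.get(column)
            if raw is None or raw == "":
                continue  # skip missing values rather than inventing them
            observations.append(
                Observation(row["lake_id"], variable, float(raw) * factor, source, column)
            )
    return observations


# Two sources reporting the same variable under different names and units.
integrated = (
    harmonize("agency_a", [{"lake_id": "L1", "TP_mgL": "0.025"}])
    + harmonize("agency_b", [{"lake_id": "L2", "tp_ugl": "18"}])
)
```

Keeping the source and original column on every record is one simple way to make each integrated value traceable to its origin; a real effort of this size would also need the expert-reviewed metadata and quality control the abstract mentions.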

Original language: English (US)
Article number: 28
Journal: GigaScience
Volume: 4
Issue number: 1
DOI: https://doi.org/10.1186/s13742-015-0067-4
State: Published - Jan 1 2015

Fingerprint

Foster Home Care
Information Storage and Retrieval
Ecology
Lakes
Databases
Ecosystems
Ecosystem
Metadata
Geology
Hydrology
Data integration
Water Quality
Datasets
Climate
Research
Land use
Documentation
Topography
Water quality
Binding Sites

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Health Informatics

Cite this

Soranno, P. A., Bissell, E. G., Cheruvelil, K. S., Christel, S. T., Collins, S. M., Emi Fergus, C., ... Webster, K. E. (2015). Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse. GigaScience, 4(1), [28]. https://doi.org/10.1186/s13742-015-0067-4
@article{3a302f0283284efcbb99db1ddc0b1e2d,
title = "Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse",
author = "Soranno, {Patricia A.} and Bissell, {Edward G.} and Cheruvelil, {Kendra S.} and Christel, {Samuel T.} and Collins, {Sarah M.} and {Emi Fergus}, C. and Filstrup, {Christopher T.} and Lapierre, {Jean Francois} and Lottig, {Noah R.} and Oliver, {Samantha K.} and Scott, {Caren E.} and Smith, {Nicole J.} and Scott Stopyak and Shuai Yuan and Bremigan, {Mary Tate} and Downing, {John A.} and Corinna Gries and Henry, {Emily N.} and Skaff, {Nick K.} and Stanley, {Emily H.} and Stow, {Craig A.} and Tan, {Pang Ning} and Tyler Wagner and Webster, {Katherine E.}",
year = "2015",
month = "1",
day = "1",
doi = "10.1186/s13742-015-0067-4",
language = "English (US)",
volume = "4",
journal = "GigaScience",
issn = "2047-217X",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Building a multi-scaled geospatial temporal ecology database from disparate data sources

T2 - Fostering open science and data reuse

AU - Soranno, Patricia A.

AU - Bissell, Edward G.

AU - Cheruvelil, Kendra S.

AU - Christel, Samuel T.

AU - Collins, Sarah M.

AU - Emi Fergus, C.

AU - Filstrup, Christopher T.

AU - Lapierre, Jean Francois

AU - Lottig, Noah R.

AU - Oliver, Samantha K.

AU - Scott, Caren E.

AU - Smith, Nicole J.

AU - Stopyak, Scott

AU - Yuan, Shuai

AU - Bremigan, Mary Tate

AU - Downing, John A.

AU - Gries, Corinna

AU - Henry, Emily N.

AU - Skaff, Nick K.

AU - Stanley, Emily H.

AU - Stow, Craig A.

AU - Tan, Pang Ning

AU - Wagner, Tyler

AU - Webster, Katherine E.

PY - 2015/1/1

Y1 - 2015/1/1

UR - http://www.scopus.com/inward/record.url?scp=84979581516&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979581516&partnerID=8YFLogxK

U2 - 10.1186/s13742-015-0067-4

DO - 10.1186/s13742-015-0067-4

M3 - Review article

C2 - 26140212

AN - SCOPUS:84979581516

VL - 4

JO - GigaScience

JF - GigaScience

SN - 2047-217X

IS - 1

M1 - 28

ER -