COVIDSeer: Extending the CORD-19 Dataset

Shaurya Rohatgi, Zeba Karishma, Jason Chhay, Sai Raghav Reddy Keesara, Jian Wu, Cornelia Caragea, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools from the SeerSuite project are used to exploit information in the original articles that is not directly provided in the CORD-19 dataset. We add 728 new abstracts, 70,102 figures, and 31,446 tables with captions that are not provided in the current data release. We also built a vertical search engine, COVIDSeer, on the new dataset we created. COVIDSeer has a relatively simple architecture with features such as keyword filtering and similar-paper recommendation. The goal was to provide a system and dataset that can help scientists better navigate the literature concerning COVID-19. The enriched dataset can serve as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, is intended to help biomedical and life science researchers, medical students, and the general public explore coronavirus-related literature more effectively. The entire dataset and the system will be made open source.

Original language: English (US)
Title of host publication: Proceedings of the ACM Symposium on Document Engineering, DocEng 2020
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450380003
DOIs
State: Published - Sep 29 2020
Event: 20th ACM Symposium on Document Engineering, DocEng 2020 - Virtual, Online, United States
Duration: Sep 29 2020 → Oct 1 2020

Publication series

Name: Proceedings of the ACM Symposium on Document Engineering, DocEng 2020

Conference

Conference: 20th ACM Symposium on Document Engineering, DocEng 2020
Country: United States
City: Virtual, Online
Period: 9/29/20 → 10/1/20

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
