News and information surrounding the COVID-19 pandemic are constantly evolving and accumulating. Given the global relevance and importance of the topic, it is critical to be able to parse the available information efficiently and reliably in order to gauge scientific progress and understanding of COVID-19. In this research, abstracts from a corpus of scientific articles are evaluated using different Natural Language Processing (NLP) techniques, including Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), Bidirectional Encoder Representations from Transformers (BERT), and sentiment analysis, to better understand the breadth of the extant literature. Results from the analyses show that in the very large corpus datasets, a large group of documents encompasses the overall or dominant general theme, whereas the smaller clusters of documents reveal very precise, niche themes. Generalized COVID-19 is the dominant theme in the largest clusters. Smaller clusters include more specific terms (e.g., popular drugs, popular terms, key features/impacts related to COVID-19). Sentiment analysis run on the resulting clusters showed slight fluctuations over time that varied by cluster, with a relatively neutral sentiment overall. Overall, the precision of the BERT clusters distinguishes niche topics within the large corpus of literature and enables interesting and meaningful text analytics.
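As a rough illustration of the first technique named above, the following is a minimal TF-IDF sketch in plain Python. It is not the pipeline used in this research: the three toy "abstracts", the whitespace tokenizer, the unsmoothed `tf * log(N / df)` weighting, and the helper names `tfidf` and `top_terms` are all assumptions made for the example. It only shows why a term shared by every document (here, "covid") carries no weight, while rarer terms stand out as candidates for niche themes.

```python
import math
from collections import Counter

# Toy stand-ins for article abstracts (invented for illustration only).
abstracts = [
    "covid vaccine trial shows immune response",
    "covid lockdown affects mental health",
    "covid vaccine distribution and supply",
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def top_terms(weight_map, k=3):
    """Highest-weighted terms, with ties broken alphabetically."""
    return sorted(weight_map, key=lambda t: (-weight_map[t], t))[:k]

weights = tfidf(abstracts)
for doc, w in zip(abstracts, weights):
    print(top_terms(w), "<-", doc)
```

In this toy setting, "covid" appears in every document and so receives a weight of zero, which loosely mirrors the observation that the general COVID-19 theme dominates the corpus while more specific terms characterize the smaller clusters. A real analysis would typically add stop-word removal, smoothing, and a proper tokenizer.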
All Science Journal Classification (ASJC) codes
- Computer Science(all)