Since the start of the COVID-19 pandemic, scientific and medical journals have published over 100,000 studies on SARS-CoV-2. But according to data scientists who created a machine-learning tool to analyze the deluge of publications, basic lab-based studies on the microbiology of the virus, including research on its pathogenesis and mechanisms of viral transmission, are lacking. Their analysis appears September 16 in the journal Patterns.
“In a crisis like this pandemic, we would expect research outside the lab to happen at a faster pace than lab research,” says first author Anhvinh Doanvo, a volunteer data scientist with the COVID-19 Dispersed Volunteer Research Network. “Nevertheless, the relative lack of lab-based studies seems to be unique to SARS-CoV-2, compared to other human coronaviruses. This shortage of lab-based research means that the scientific community may miss key aspects of the virus that could impact our ability to contain this pandemic and to counter future ones.”
The investigators used research abstracts obtained from CORD-19 (the COVID-19 Open Research Dataset). CORD-19 is updated daily and includes peer-reviewed studies from PubMed Central, as well as preprints from bioRxiv and medRxiv. When they conducted their first analysis at the end of May, the dataset included more than 137,000 studies. The analysis was later updated with data through July 31.
The team used two computational methods to analyze the data. The first was dimensionality reduction, which helps to find big patterns across many documents, such as abstracts from scientific studies, and to identify trends based on those patterns. The second method, topic modeling, allowed them to group the documents into different topics and to compare research on SARS-CoV-2 to research on other coronaviruses. Unlike previous studies that have focused only on keywords, both of these tools enabled them to review the full text