Discovering the secrets of viral sequences in COVID-19

From the SARS-CoV-2 genome (a), sequences of nucleotides and amino acids are extracted (b); the sequences are then deposited in worldwide open repositories, GenBank, GISAID, and COG-UK (c), and imported into the centralized database at Politecnico, which is queried through the ViruSurf search engine (d). Credit: Politecnico di Milano

Since the beginning of 2020, labs around the world have been sequencing viral material from positive COVID-19 tests and depositing the resulting sequences mainly in three repositories: GenBank, COG-UK, and GISAID. Rapid exploration of this large body of data is important for understanding how the genome of the virus is changing. To enable fast ‘surfing’ over the data, the research group at Politecnico di Milano led by Prof. Stefano Ceri has developed ViruSurf, a search engine that runs on top of a centralized database hosted at Politecnico. The database is periodically reloaded from the three sources and, as of today, contains 200,516 sequences of SARS-CoV-2, the virus that causes COVID-19, and 33,256 sequences of other viral species also associated with epidemics affecting humans, such as SARS, MERS, Ebola, and Dengue.

Every sequence is described from four perspectives: the biological features of the virus and the host, the sequencing technology, the project that produced the original data, and the mutations found both in the whole nucleotide sequence and in the amino acid sequences of individual genes. The main advantage of ViruSurf is that a single algorithm, run on cloud computing resources, computes viral mutations in the same way for every source, so results from different repositories can be compared directly. The database is also optimized to return quick responses to its users.
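To make the idea of a sequence described from several perspectives more concrete, here is a minimal sketch in Python. It is not ViruSurf's actual schema or mutation algorithm, which the article does not detail: the names `SequenceRecord` and `point_substitutions` are hypothetical, the example values are made up, and the comparison assumes the sample has already been aligned to the reference at equal length, so it reports only single-base substitutions and ignores insertions and deletions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# One substitution: (1-based position, reference base, observed base).
Substitution = Tuple[int, str, str]


def point_substitutions(reference: str, sample: str) -> List[Substitution]:
    """List single-base differences between two pre-aligned sequences.

    Assumption: both strings have the same length (already aligned).
    Positions where the sample holds the ambiguity code 'N' are skipped,
    because an unknown base is not evidence of a mutation.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(pairs, start=1)
        if ref != alt and alt != "N"
    ]


@dataclass
class SequenceRecord:
    """Hypothetical record mirroring the four perspectives in the text."""
    virus: str        # biological perspective: viral species
    host: str         # biological perspective: host organism
    technology: str   # sequencing technology
    project: str      # project or lab that produced the data
    nucleotide_changes: List[Substitution] = field(default_factory=list)


# Toy sequences, far shorter than a real viral genome.
reference = "ATGGTTCAAG"
sample = "ATGATTCANG"

record = SequenceRecord(
    virus="SARS-CoV-2",
    host="Homo sapiens",
    technology="Illumina",      # illustrative value only
    project="example-lab",      # illustrative value only
    nucleotide_changes=point_substitutions(reference, sample),
)
print(record.nucleotide_changes)
```

Applying the same function to every record, whatever repository it came from, is what makes the reported mutations comparable across sources; a real pipeline would also need an alignment step and translation to amino acids, both omitted here.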

Among the planned developments of ViruSurf, the most important is a bioinformatics service for ingesting new viral sequences, funded through a six-month EIT Digital project. The service will highlight the presence of viral mutations associated with enhanced or reduced severity and virulence as they are discovered.