Since December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has affected almost all countries. The unprecedented spreading of this virus has led to the insurgence of many variants that impact protein sequence and structure that need continuous monitoring and analysis of the sequences to understand the genetic evolution and to prevent possible dangerous outcomes. Some variants causing the modification of the structure of the proteins, such as the Spike protein S, need to be monitored. Protein contact networks (PCNs) have been recently proposed as a modelling framework for protein structures. In such a framework, the protein structure is represented as an unweighted graph whose nodes are the central atoms of the backbones (C-α), and edges connect two atoms falling in the spatial distance between 4 and 7 Å. PCN may also be a data-rich representation since we may add to each node/atom biological and topological information. Such formalism enables the possibility of using algorithms from graph theory to analyze the graph. In particular, we refer to graph embedding methods enabling the analysis of such graphs with deep learning methods. In this work, we explore the possibility of embedding PCN using Graph Neural Networks and then analyze in the embedded space each residue to distinguish mutated residues from non-mutated ones. In particular, we analyzed the structure of the Spike protein of the coronavirus. First, we obtained the PCNs of the Spike protein for the wild-type, α, β, and δ variants. Then we used the GraphSage embedding algorithm to obtain an unsupervised embedding. Then we analyzed the point of mutation in the embedded space. Results show the characteristics of the mutation point in the embedding space.
Structural analysis of SARS-CoV-2 Spike protein variants through graph embedding
Guzzi P. H.;Lomoio U.;Puccio B.;Veltri P.
2023-01-01
Abstract
Since December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has affected almost all countries. The unprecedented spreading of this virus has led to the insurgence of many variants that impact protein sequence and structure that need continuous monitoring and analysis of the sequences to understand the genetic evolution and to prevent possible dangerous outcomes. Some variants causing the modification of the structure of the proteins, such as the Spike protein S, need to be monitored. Protein contact networks (PCNs) have been recently proposed as a modelling framework for protein structures. In such a framework, the protein structure is represented as an unweighted graph whose nodes are the central atoms of the backbones (C-α), and edges connect two atoms falling in the spatial distance between 4 and 7 Å. PCN may also be a data-rich representation since we may add to each node/atom biological and topological information. Such formalism enables the possibility of using algorithms from graph theory to analyze the graph. In particular, we refer to graph embedding methods enabling the analysis of such graphs with deep learning methods. In this work, we explore the possibility of embedding PCN using Graph Neural Networks and then analyze in the embedded space each residue to distinguish mutated residues from non-mutated ones. In particular, we analyzed the structure of the Spike protein of the coronavirus. First, we obtained the PCNs of the Spike protein for the wild-type, α, β, and δ variants. Then we used the GraphSage embedding algorithm to obtain an unsupervised embedding. Then we analyzed the point of mutation in the embedded space. Results show the characteristics of the mutation point in the embedding space.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.