Using ontologies and semantic similarity measures for prioritization of gene regulatory networks

Milano, M; Guzzi, Ph; Cannataro, M

doi:10.7287/peerj.preprints.2259v1

Omics sciences are widely used to analyze diseases at a molecular level. The results of omics experiments are usually sets of candidate genes potentially involved in different diseases. Interpreting these results and filtering the candidate genes or proteins selected in an experiment can be challenging. The problem is particularly evident in clinical environments, where researchers are interested in the behavior of a few molecules related to a specific disease while the results may contain thousands of entries. Filtering requires domain-specific knowledge, which is usually encoded in ontologies. Consequently, different approaches for selecting genes have been introduced to filter out false-positive genes. Such approaches are often referred to as gene prioritization methods. They use computational methods to identify, within a larger set of candidate genes, those most related to a disease. We implemented GoD (Gene ranking based On Diseases), an algorithm that ranks a given set of genes based on ontology annotations. The algorithm orders genes by the semantic similarity between the annotations of each gene and those describing the selected disease. The current version of GoD enables the prioritization of a list of input genes for a selected disease. It uses the HPO (Human Phenotype Ontology), GO (Gene Ontology), and DO (Disease Ontology) ontologies to compute the ranking. It takes as input a list of genes or gene products annotated with GO, HPO, or DO terms, and a selected disease described by GO, HPO, or DO annotations (users may also provide novel annotations). It produces as output the ranking of those genes with respect to the input disease. The package consists of three main functions: hpoGoD (for HPO-based prioritization), goGoD (for GO-based prioritization), and doGoD (for DO-based prioritization).
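The ranking idea described above can be illustrated with a minimal Python sketch. It is not the GoD implementation: the abstract does not state which semantic similarity measure is used, so the sketch assumes a simple Jaccard overlap between ancestor-extended term sets as a stand-in. The toy ontology, the term identifiers, the gene names, and the function names (with_ancestors, similarity, rank_genes) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy ontology (hypothetical identifiers): term -> set of direct parent terms.
PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:a": {"T:root"},
    "T:b": {"T:root"},
    "T:a1": {"T:a"},
    "T:a2": {"T:a"},
    "T:b1": {"T:b"},
}


def with_ancestors(terms: Set[str]) -> Set[str]:
    """Return the given terms plus all of their ancestors in the toy ontology."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, set()))
    return seen


def similarity(gene_terms: Set[str], disease_terms: Set[str]) -> float:
    """Jaccard overlap of ancestor-extended term sets (an assumed stand-in measure)."""
    a = with_ancestors(gene_terms)
    b = with_ancestors(disease_terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(
    gene_annotations: Dict[str, Set[str]], disease_terms: Set[str]
) -> List[Tuple[str, float]]:
    """Order genes by decreasing similarity to the disease; ties broken by gene name."""
    scored = [(g, similarity(t, disease_terms)) for g, t in gene_annotations.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical gene annotations and disease description.
    genes = {"G1": {"T:a1"}, "G2": {"T:b1"}, "G3": {"T:a2"}}
    disease = {"T:a1"}
    for gene, score in rank_genes(genes, disease):
        print(gene, round(score, 3))
```

In this toy setting, a gene annotated with the same term as the disease ranks first, a gene annotated with a sibling term ranks next, and a gene from an unrelated branch ranks last. Any other semantic similarity measure could be swapped into similarity without changing the ranking step.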
We tested GoD on Gene Regulatory Networks (GRNs). Biological network inference aims to reconstruct the network of interactions (or associations) among genes from experimental observations. We selected three expression datasets: Dataset 1 (GDS3285), related to breast cancer; Dataset 2 (GDS5072), related to prostate cancer; and Dataset 3 (GDS5093), related to Dengue virus (DENV) infection. First, the experimental data are given as input to five GRN inference algorithms (ARACNE, CLR, MRNET, GENIE3, and GGM), producing five inferred GRNs. For each inferred GRN, GoD receives as input the list of top genes and computes, for each gene, a semantic similarity value with respect to a selected disease using one of the ontologies above (e.g., the Disease Ontology). For each GRN, the genes are ranked by the computed semantic similarity; comparing these rankings makes it possible to rank each GRN inference method with respect to the selected disease.
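The comparison step can be sketched in Python as follows. The abstract does not specify how the per-GRN rankings are aggregated, so this sketch assumes the mean semantic similarity of each method's top genes as the comparison criterion. The function name rank_methods, the method labels, the gene names, and the scores are hypothetical toy values, not results from the study.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_methods(
    top_genes: Dict[str, List[str]], gene_scores: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank inference methods by the mean disease similarity of their top genes.

    Genes without a score (e.g. no annotation available) are counted as 0.0,
    which is an assumption of this sketch.
    """
    summary = []
    for method, genes in top_genes.items():
        scores = [gene_scores.get(g, 0.0) for g in genes]
        summary.append((method, mean(scores) if scores else 0.0))
    # Highest mean similarity first; ties broken by method label.
    return sorted(summary, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Toy inputs: top genes per (hypothetical) inference method and
    # per-gene semantic similarity to the selected disease.
    top_genes = {
        "method_A": ["G1", "G2"],
        "method_B": ["G1", "G3"],
        "method_C": ["G2", "G4"],
    }
    gene_scores = {"G1": 1.0, "G2": 0.2, "G3": 0.5}
    for method, score in rank_methods(top_genes, gene_scores):
        print(method, round(score, 3))
```

Other aggregation choices, such as the median or a rank-weighted average, would fit the same structure; the mean is used here only because it is the simplest option.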