An efficient and scalable Spark preprocessing methodology for genome-wide association studies


The importance of high-performance software frameworks for analyzing omics data obtained from High-Throughput (HT) assays is widely recognized. HT methodologies comprise microarrays, Genome-Wide Association Studies (GWAS), and Next Generation Sequencing (NGS), each of which produces a vast amount of data per experiment. Each HT vendor provides users only with the software frameworks and proprietary libraries for the annotation and summarization of raw data. Consequently, algorithms for the preprocessing and analysis of omics data are needed. GWAS aims to highlight associations between genetic variants and diseases by examining single nucleotide polymorphisms (SNPs) that differ in a statistically significant way between cases and controls. The effectiveness of GWAS analysis increases with the number of samples analyzed per experiment. Statistical methods applied to GWAS data can detect associations between a single allelic variant and the clinical conditions of the samples. To overcome this limitation and to discover multiple associations among allelic variants, Association Rule Mining (ARM) can be used. This in turn requires scalable ARM algorithms able to analyze GWAS data, and hence a high-performance data analytics framework. For this purpose, we propose a software framework called GARMS (GWAS Association Rule Mining in Spark), built on top of Apache Spark, for the preprocessing of GWAS data sets and the mining of association rules from them. GARMS comprises a two-step analysis methodology: (i) in the first step, the GWAS data are preprocessed and the frequent itemsets are identified; (ii) in the second step, the frequent itemsets are used to mine association rules without scanning the input data again. We implemented our algorithm and tested it on synthetic GWAS data sets. Preliminary results suggest that our method can extract relevant association rules from GWAS data while reducing the computational time.
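The two-step idea described in the abstract can be illustrated with a minimal, single-machine sketch in plain Python. It is not the GARMS implementation and does not use Apache Spark; the function names, the brute-force itemset counting, the thresholds, and the item encoding (`rs1=AA`, `status=case`) are all assumptions made for illustration. The point it shows is that once the frequent itemsets and their supports are known, the rules can be derived from that table alone, without reading the samples a second time.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: compute the support of every itemset and keep the frequent ones.

    Brute-force enumeration, suitable only for toy-sized data.
    """
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def association_rules(supports, min_confidence):
    """Step 2: derive rules from the support table only.

    The transactions are not scanned again: every subset of a frequent
    itemset is itself frequent, so its support is already in `supports`.
    """
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), k):
                antecedent = frozenset(combo)
                confidence = supp / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical samples: each one is a set of "SNP=genotype" items plus a status label.
transactions = [
    {"rs1=AA", "rs2=AG", "status=case"},
    {"rs1=AA", "rs2=AG", "status=case"},
    {"rs1=AA", "rs2=GG", "status=control"},
    {"rs1=AG", "rs2=GG", "status=control"},
]

supports = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(supports, min_confidence=1.0)

for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

In a distributed setting, the first step is the part that would need to touch the full data set; the second step operates only on the much smaller table of frequent itemsets.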

Agapito G.; Guzzi P. H.; Cannataro M.

2020-01-01



Year: 2020
ISBN: 978-1-7281-6582-0
Keywords: Apache Spark; Association Rules; Distributed Computing; GWAS; Preprocessing; SNPs
Publication type: 4.1 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12317/62315
