Data mining and life sciences applications on the grid

Guzzi P; Sarica A; Cannataro M
2013-01-01

Abstract

Data mining (DM) is increasingly used to analyze data generated in the life sciences, including biological data from disciplines such as genomics and proteomics, medical data from clinical practice, and administrative data from health care. Mining such data is difficult for two reasons. First, life sciences data are inherently heterogeneous, ranging from molecular-level data to clinical and administrative data. Second, they are produced at an increasing rate, and data repositories are becoming very large. As a result, the management and analysis of such data are becoming a major bottleneck in biomedical research. The goal of this paper is to review the principal methodologies for mining life sciences data and the ways they are coupled with high-performance infrastructures and systems to enable efficient analysis. The paper recalls basic concepts of DM, grids, and distributed DM on grids, and reviews the main approaches to mining biomedical data on high-performance infrastructures, with a special focus on the analysis of genomics, proteomics, and interactomics data and on the exploration of magnetic resonance images in the neurosciences. It may interest both bioinformaticians, who can learn how to exploit high-performance infrastructures to mine life sciences data, and computer scientists, who can address the heterogeneity and high volume of life sciences data at the data management, algorithm, and user interface layers.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12317/3210