Running genome wide data analysis using a parallel approach on a cloud platform

Andrea Demartini, Davide Capozzi, Alberto Malovini, Riccardo Bellazzi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Hierarchical Naïve Bayes (HNB) is a multivariate classification algorithm that can be used to forecast the probability of a specific disease by analysing a set of Single Nucleotide Polymorphisms (SNPs). In this paper we present an implementation of HNB that uses a parallel approach based on the Map-Reduce paradigm, built natively on the Hadoop framework and relying on the Amazon Cloud Infrastructure. We tested our approach on two GWAS datasets aimed at identifying the genetic bases of Type 1 (T1D) and Type 2 Diabetes (T2D). Both datasets include individual-level data of 1,900 cases and 1,500 controls with ~420,000 SNPs. For T2D the best results were obtained using the complete set of SNPs, whereas for T1D the best performance was reached using a few SNPs selected through standard univariate association tests. Our cloud-based implementation allows running genome-wide simulations while cutting down computational time and overall infrastructure costs.
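
As a rough illustration of how a Bayes-type classifier over SNP genotypes can be split into map and reduce steps, the sketch below trains a plain Naive Bayes model by counting. It is not the authors' HNB implementation and does not use Hadoop: the hierarchical part of HNB is omitted, the map and reduce steps run in a single Python process, and the function names (`map_record`, `reduce_counts`, `predict_proba`), the 0/1/2 genotype coding, and the Laplace smoothing parameter `alpha` are all assumptions made for the example.

```python
from collections import defaultdict
from math import exp, log


def map_record(record):
    """Map step: emit (key, 1) pairs for the class and for each SNP genotype.

    A record is (label, genotypes); genotypes are assumed to be coded 0/1/2.
    """
    label, genotypes = record
    yield ("class", label), 1
    for snp_index, genotype in enumerate(genotypes):
        yield ("snp", snp_index, genotype, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(records):
    """Run the map step over all records, then reduce to a table of counts."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)


def predict_proba(counts, genotypes, labels=(0, 1), n_genotypes=3, alpha=1.0):
    """Return P(label | genotypes) under Naive Bayes with Laplace smoothing."""
    n_total = sum(counts.get(("class", c), 0) for c in labels)
    log_scores = {}
    for c in labels:
        n_c = counts.get(("class", c), 0)
        score = log((n_c + alpha) / (n_total + alpha * len(labels)))
        for snp_index, genotype in enumerate(genotypes):
            n_gc = counts.get(("snp", snp_index, genotype, c), 0)
            score += log((n_gc + alpha) / (n_c + alpha * n_genotypes))
        log_scores[c] = score
    # Normalise in log space to avoid underflow when many SNPs are used.
    top = max(log_scores.values())
    weights = {c: exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


# Toy data (made up): label 1 = case, label 0 = control, two SNPs each.
toy_records = [(1, [2, 0]), (1, [2, 1]), (0, [0, 0]), (0, [0, 1])]
toy_counts = train(toy_records)
toy_posterior = predict_proba(toy_counts, [2, 0])
```

Because the reduce step only sums counts per key, the same keys could in principle be partitioned across many workers, which is the property a Map-Reduce framework exploits at genome-wide scale.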

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 188-192
Number of pages: 5
Volume: 9105
ISBN (Print): 9783319195506
DOIs: https://doi.org/10.1007/978-3-319-19551-3_25
Publication status: Published - 2015
Event: 15th Conference on Artificial Intelligence in Medicine, AIME 2015 - Pavia, Italy
Duration: Jun 17 2015 to Jun 20 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9105
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th Conference on Artificial Intelligence in Medicine, AIME 2015
Country: Italy
City: Pavia
Period: 6/17/15 to 6/20/15

Fingerprint

Single nucleotide Polymorphism
Nucleotides
Polymorphism
Data analysis
Genome
Genes
Hierarchical Bayes
Diabetes
Medical problems
Infrastructure
MapReduce
Classification Algorithm
Univariate
Forecast
Paradigm
Costs
Simulation

Keywords

  • Cloud computing
  • Data mining algorithm
  • Genome-wide association studies
  • Map reduce

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Demartini, A., Capozzi, D., Malovini, A., & Bellazzi, R. (2015). Running genome wide data analysis using a parallel approach on a cloud platform. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9105, pp. 188-192). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9105). Springer Verlag. https://doi.org/10.1007/978-3-319-19551-3_25

@inproceedings{32d84c0684f4443dae5f2e41c29b7cff,
title = "Running genome wide data analysis using a parallel approach on a cloud platform",
abstract = "Hierarchical Na{\"i}ve Bayes (HNB) is a multivariate classification algorithm that can be used to forecast the probability of a specific disease by analysing a set of Single Nucleotide Polymorphisms (SNPs). In this paper we present an implementation of HNB that uses a parallel approach based on the Map-Reduce paradigm, built natively on the Hadoop framework and relying on the Amazon Cloud Infrastructure. We tested our approach on two GWAS datasets aimed at identifying the genetic bases of Type 1 (T1D) and Type 2 Diabetes (T2D). Both datasets include individual-level data of 1,900 cases and 1,500 controls with ~420,000 SNPs. For T2D the best results were obtained using the complete set of SNPs, whereas for T1D the best performance was reached using a few SNPs selected through standard univariate association tests. Our cloud-based implementation allows running genome-wide simulations while cutting down computational time and overall infrastructure costs.",
keywords = "Cloud computing, Data mining algorithm, Genome-wide association studies, Map reduce",
author = "Andrea Demartini and Davide Capozzi and Alberto Malovini and Riccardo Bellazzi",
year = "2015",
doi = "10.1007/978-3-319-19551-3_25",
language = "English",
isbn = "9783319195506",
volume = "9105",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "188--192",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Running genome wide data analysis using a parallel approach on a cloud platform

AU - Demartini, Andrea

AU - Capozzi, Davide

AU - Malovini, Alberto

AU - Bellazzi, Riccardo

PY - 2015

Y1 - 2015

N2 - Hierarchical Naïve Bayes (HNB) is a multivariate classification algorithm that can be used to forecast the probability of a specific disease by analysing a set of Single Nucleotide Polymorphisms (SNPs). In this paper we present an implementation of HNB that uses a parallel approach based on the Map-Reduce paradigm, built natively on the Hadoop framework and relying on the Amazon Cloud Infrastructure. We tested our approach on two GWAS datasets aimed at identifying the genetic bases of Type 1 (T1D) and Type 2 Diabetes (T2D). Both datasets include individual-level data of 1,900 cases and 1,500 controls with ~420,000 SNPs. For T2D the best results were obtained using the complete set of SNPs, whereas for T1D the best performance was reached using a few SNPs selected through standard univariate association tests. Our cloud-based implementation allows running genome-wide simulations while cutting down computational time and overall infrastructure costs.

AB - Hierarchical Naïve Bayes (HNB) is a multivariate classification algorithm that can be used to forecast the probability of a specific disease by analysing a set of Single Nucleotide Polymorphisms (SNPs). In this paper we present an implementation of HNB that uses a parallel approach based on the Map-Reduce paradigm, built natively on the Hadoop framework and relying on the Amazon Cloud Infrastructure. We tested our approach on two GWAS datasets aimed at identifying the genetic bases of Type 1 (T1D) and Type 2 Diabetes (T2D). Both datasets include individual-level data of 1,900 cases and 1,500 controls with ~420,000 SNPs. For T2D the best results were obtained using the complete set of SNPs, whereas for T1D the best performance was reached using a few SNPs selected through standard univariate association tests. Our cloud-based implementation allows running genome-wide simulations while cutting down computational time and overall infrastructure costs.

KW - Cloud computing

KW - Data mining algorithm

KW - Genome-wide association studies

KW - Map reduce

UR - http://www.scopus.com/inward/record.url?scp=84947933778&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84947933778&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-19551-3_25

DO - 10.1007/978-3-319-19551-3_25

M3 - Conference contribution

AN - SCOPUS:84947933778

SN - 9783319195506

VL - 9105

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 188

EP - 192

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

PB - Springer Verlag

ER -