BACKGROUND: High-throughput sequencing technologies can identify the full genomic variation of an individual. Gene-targeted and whole-exome experiments focus mainly on coding sequence variants that affect a single nucleotide or multiple nucleotides. Analyzing the biological significance of this multitude of genomic variants is challenging and computationally demanding.
RESULTS: We present PaPI, a new machine-learning approach to classify and score human coding variants by estimating the probability that they damage the function of the encoded protein. The novelty of this approach lies in its use of pseudo amino acid composition, through which wild-type and mutated protein sequences are represented in a discrete model. A machine-learning classifier was trained on a set of known deleterious and benign coding variants, with the aim of scoring unobserved variants by taking into account hidden sequence patterns in the human genome that potentially lead to disease. We show that the combination of amphiphilic pseudo amino acid composition, evolutionary conservation and homologous-protein-based methods outperforms several prediction algorithms and can also score complex variants such as deletions, insertions and indels.
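As an illustration of the discrete representation mentioned above, the following is a minimal sketch of a simplified pseudo amino acid composition vector. It is not PaPI's implementation: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale, chosen here as a stand-in) rather than the amphiphilic variant, and the function name `pseaac`, the parameters `lam` and `weight`, and the two example sequences are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (stand-in property; an assumption,
# not necessarily the scale used by PaPI).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Standardize the property to zero mean and unit variance over the 20 residues.
_values = list(KYTE_DOOLITTLE.values())
_mu, _sd = mean(_values), pstdev(_values)
H = {aa: (v - _mu) / _sd for aa, v in KYTE_DOOLITTLE.items()}


def pseaac(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector.

    The first 20 components are normalized residue frequencies; the last
    `lam` components are weighted sequence-order correlation factors.
    """
    seq = seq.upper()
    if any(aa not in H for aa in seq):
        raise ValueError("sequence contains a non-standard residue")
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    counts = Counter(seq)
    freqs = [counts[aa] / n for aa in AMINO_ACIDS]

    # theta_j: mean squared property difference between residues j apart.
    thetas = [
        sum((H[seq[i]] - H[seq[i + j]]) ** 2 for i in range(n - j)) / (n - j)
        for j in range(1, lam + 1)
    ]

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Hypothetical wild-type and mutated sequences (made up for this example).
wild = "MKTAYIAKQRQISFVKSHFSRQ"
mutated = "MKTDYIAKQRQISFVKSHFSRQ"  # A -> D substitution at position 4

wild_vec = pseaac(wild)
mutated_vec = pseaac(mutated)
```

Because both sequences map to vectors of the same fixed length, a wild-type and a mutated sequence can be compared or passed to a classifier directly, including when an insertion or deletion changes the sequence length.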
CONCLUSIONS: This paper describes a machine-learning approach to predict the deleteriousness of human coding variants. A freely available web application (http://papi.unipv.it) implementing the presented method has been developed; it can score up to thousands of variants in a single run.