PaPI: pseudo amino acid composition to score human protein-coding variants

Ivan Limongelli, Simone Marini, Riccardo Bellazzi

Research output: Contribution to journal › Article › peer-review


BACKGROUND: High-throughput sequencing technologies can identify the whole genomic variation of an individual. Gene-targeted and whole-exome experiments focus mainly on coding sequence variants affecting single or multiple nucleotides. Analyzing the biological significance of this multitude of genomic variants is challenging and computationally demanding.

RESULTS: We present PaPI, a new machine-learning approach that classifies and scores human coding variants by estimating the probability that they damage the function of the encoded protein. The novelty of the approach lies in its use of pseudo amino acid composition, through which wild-type and mutated protein sequences are represented in a discrete model. A machine-learning classifier was trained on a set of known deleterious and benign coding variants, with the aim of scoring unobserved variants by taking into account hidden sequence patterns in the human genome that potentially lead to disease. We show that the combination of amphiphilic pseudo amino acid composition, evolutionary conservation, and homologous-protein-based methods outperforms several prediction algorithms and can also score complex variants such as deletions, insertions, and indels.
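As an illustration of the discrete representation mentioned above, the following is a minimal sketch of an amphiphilic pseudo amino acid composition in the general form described by Chou: 20 normalized residue frequencies plus 2λ sequence-order correlation terms built from a hydrophobicity and a hydrophilicity scale. It is not PaPI's implementation. The function name `amphiphilic_pseaac`, the parameters `lam` and `weight`, the choice of the Kyte-Doolittle and Hopp-Woods scales, and the example sequence are all assumptions made for this sketch; the abstract does not state which scales or settings PaPI uses.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (assumed stand-in for the hydrophobicity scale).
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hopp-Woods hydrophilicity values (assumed choice of hydrophilicity scale).
HYDROPHILICITY = {
    "A": -0.5, "R": 3.0, "N": 0.2, "D": 3.0, "C": -1.0,
    "Q": 0.2, "E": 3.0, "G": 0.0, "H": -0.5, "I": -1.8,
    "L": -1.8, "K": 3.0, "M": -1.3, "F": -2.5, "P": 0.0,
    "S": 0.3, "T": -0.4, "W": -3.4, "Y": -2.3, "V": -1.5,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


H1 = standardize(HYDROPHOBICITY)
H2 = standardize(HYDROPHILICITY)


def amphiphilic_pseaac(seq, lam=3, weight=0.5):
    """Return a fixed-length vector of 20 + 2*lam components for a protein sequence."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    if any(aa not in H1 for aa in seq):
        raise ValueError("sequence contains a non-standard residue")

    # Conventional amino acid composition: normalized occurrence frequencies.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

    # Sequence-order correlation factors for residues k positions apart.
    taus = []
    for k in range(1, lam + 1):
        pairs = list(zip(seq, seq[k:]))
        taus.append(mean(H1[a] * H1[b] for a, b in pairs))  # hydrophobicity term
        taus.append(mean(H2[a] * H2[b] for a, b in pairs))  # hydrophilicity term

    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


# Hypothetical sequences, for illustration only.
wild = "MKTAYIAKQRQISFVKSHFSRQ"
substitution = wild[:4] + "W" + wild[5:]  # single-residue change
deletion = wild[:4] + wild[6:]            # two residues removed

for name, seq in [("wild", wild), ("substitution", substitution), ("deletion", deletion)]:
    print(name, len(seq), len(amphiphilic_pseaac(seq)))
```

Because the vector length depends only on `lam` and not on the sequence length, substitutions, insertions, and deletions all map to vectors of the same dimension, which is one plausible reason such a representation is convenient as classifier input.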

CONCLUSIONS: This paper describes a machine-learning approach to predicting the deleteriousness of human coding variants. A freely available web application has been developed with the presented method; it can score up to thousands of variants in a single run.

Original language: English
Pages (from-to): 123
Number of pages: 1
Journal: BMC Bioinformatics
Publication status: Published - 2015

ASJC Scopus subject areas

  • Medicine (all)

