A 'non-parametric' version of the naive Bayes classifier

Daniele Soria, Jonathan M. Garibaldi, Federico Ambrogi, Elia M. Biganzoli, Ian O. Ellis

Research output: Contribution to journal › Article

54 Citations (Scopus)

Abstract

Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-normal distributions are observed.
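The abstract's exact non-parametric estimator is not specified here, but the core idea it describes — keeping the naive Bayes structure (class priors plus per-feature conditional independence) while dropping the Gaussian likelihood — can be sketched by replacing each per-class, per-feature normal density with a kernel density estimate. The synthetic data, function names, and the choice of Gaussian KDE below are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy data: two classes with one strongly non-normal (exponential) feature,
# the kind of distribution that violates Gaussian naive Bayes assumptions.
rng = np.random.default_rng(0)
X0 = np.column_stack([rng.exponential(1.0, 200), rng.normal(0.0, 1.0, 200)])
X1 = np.column_stack([rng.exponential(3.0, 200), rng.normal(2.0, 1.0, 200)])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

def fit_kde_nb(X, y):
    """Fit class priors and one 1-D KDE per (class, feature) pair.

    The naive independence assumption is retained: features are modelled
    separately, only the Gaussian density is swapped for a KDE.
    """
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),
            "kdes": [gaussian_kde(Xc[:, j]) for j in range(X.shape[1])],
        }
    return model

def predict_kde_nb(model, X):
    """Classify by maximum log prior + sum of per-feature log densities."""
    classes = np.array(list(model.keys()))
    scores = []
    for c in classes:
        m = model[c]
        logp = np.log(m["prior"]) + sum(
            np.log(kde(X[:, j]) + 1e-12)  # small floor avoids log(0)
            for j, kde in enumerate(m["kdes"])
        )
        scores.append(logp)
    return classes[np.argmax(np.array(scores), axis=0)]

model = fit_kde_nb(X, y)
pred = predict_kde_nb(model, X)
print((pred == y).mean())  # training accuracy on the toy data
```

Swapping in any other 1-D density estimator (histograms, splines, etc.) changes only the `kdes` entries; the decision rule itself is unchanged from standard naive Bayes.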

Original language: English
Pages (from-to): 775-784
Number of pages: 10
Journal: Knowledge-Based Systems
Volume: 24
Issue number: 6
DOI: 10.1016/j.knosys.2011.02.014
Publication status: Published - Aug 2011

Keywords

  • Breast cancer
  • Logistic regression
  • Naive Bayes
  • Supervised learning
  • UCI data sets

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Management Information Systems
  • Information Systems and Management

Cite this

Soria, D., Garibaldi, J. M., Ambrogi, F., Biganzoli, E. M., & Ellis, I. O. (2011). A 'non-parametric' version of the naive Bayes classifier. Knowledge-Based Systems, 24(6), 775-784.
@article{9039b15812df4f46a727133990ac718c,
title = "A 'non-parametric' version of the naive Bayes classifier",
abstract = "Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-normal distributions are observed.",
keywords = "Breast cancer, Logistic regression, Naive Bayes, Supervised learning, UCI data sets",
author = "Daniele Soria and Garibaldi, {Jonathan M.} and Federico Ambrogi and Biganzoli, {Elia M.} and Ellis, {Ian O.}",
year = "2011",
month = aug,
doi = "10.1016/j.knosys.2011.02.014",
language = "English",
volume = "24",
pages = "775--784",
journal = "Knowledge-Based Systems",
issn = "0950-7051",
publisher = "Elsevier",
number = "6",
}

TY - JOUR

T1 - A 'non-parametric' version of the naive Bayes classifier

AU - Soria, Daniele

AU - Garibaldi, Jonathan M.

AU - Ambrogi, Federico

AU - Biganzoli, Elia M.

AU - Ellis, Ian O.

PY - 2011/8

Y1 - 2011/8

AB - Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-normal distributions are observed.

KW - Breast cancer

KW - Logistic regression

KW - Naive Bayes

KW - Supervised learning

KW - UCI data sets

UR - http://www.scopus.com/inward/record.url?scp=79957522106&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957522106&partnerID=8YFLogxK

U2 - 10.1016/j.knosys.2011.02.014

DO - 10.1016/j.knosys.2011.02.014

M3 - Article

VL - 24

SP - 775

EP - 784

JO - Knowledge-Based Systems

JF - Knowledge-Based Systems

SN - 0950-7051

IS - 6

ER -