Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification

an open, web-based, international, diagnostic study

Philipp Tschandl, Noel Codella, Bengü Nisa Akay, Giuseppe Argenziano, Ralph P. Braun, Horacio Cabo, David Gutman, Allan Halpern, Brian Helba, Rainer Hofmann-Wellenhof, Aimilios Lallas, Jan Lapins, Caterina Longo, Josep Malvehy, Michael A. Marchetti, Ashfaq Marghoob, Scott Menzies, Amanda Oakley, John Paoli, Susana Puig, Christoph Rinner, Cliff Rosendahl, Alon Scope, Christoph Sinz, H. Peter Soyer, Luc Thomas, Iris Zalaudek & Harald Kittler

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Background: Whether machine-learning algorithms can diagnose all pigmented skin lesions as accurately as human experts is unclear. The aim of this study was to compare the diagnostic accuracy of state-of-the-art machine-learning algorithms with human readers for all clinically relevant types of benign and malignant pigmented skin lesions. Methods: For this open, web-based, international, diagnostic study, human readers were asked to diagnose dermatoscopic images selected randomly in 30-image batches from a test set of 1511 images. The diagnoses from human readers were compared with those of 139 algorithms created by 77 machine-learning labs that participated in the International Skin Imaging Collaboration 2018 challenge and received a training set of 10 015 images in advance. The ground truth of each lesion fell into one of seven predefined disease categories: intraepithelial carcinoma including actinic keratoses and Bowen's disease; basal cell carcinoma; benign keratinocytic lesions including solar lentigo, seborrheic keratosis, and lichen planus-like keratosis; dermatofibroma; melanoma; melanocytic nevus; and vascular lesions. The two main outcomes were the differences in the number of correct specific diagnoses per batch between all human readers and the top three algorithms, and between human experts and the top three algorithms. Findings: Between Aug 4, 2018, and Sept 30, 2018, 511 human readers from 63 countries had at least one attempt in the reader study. 283 (55·4%) of 511 human readers were board-certified dermatologists, 118 (23·1%) were dermatology residents, and 83 (16·2%) were general practitioners. When comparing all human readers with all machine-learning algorithms, the algorithms achieved a mean of 2·01 (95% CI 1·97 to 2·04; p<0·0001) more correct diagnoses (17·91 [SD 3·42] vs 19·92 [4·27]). The 27 human experts with more than 10 years of experience achieved a mean of 18·78 (SD 3·15) correct answers, compared with 25·43 (1·95) correct answers for the top three machine-learning algorithms (mean difference 6·65, 95% CI 6·06–7·25; p<0·0001). The difference between human experts and the top three algorithms was significantly lower for images in the test set that were collected from sources not included in the training set (human underperformance of 11·4%, 95% CI 9·9–12·9, vs 3·6%, 0·8–6·3; p<0·0001). Interpretation: State-of-the-art machine-learning classifiers outperformed human experts in the diagnosis of pigmented skin lesions and should have a more important role in clinical practice. However, a possible limitation of these algorithms is their decreased performance for out-of-distribution images, which should be addressed in future research. Funding: None.
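The primary outcome reported above is the mean difference in correct specific diagnoses per 30-image batch, with a normal-approximation 95% CI. The following sketch illustrates that style of paired comparison; the per-batch scores are simulated around the summary statistics quoted in the abstract (they are not study data), and all variable names are hypothetical.

```python
# Illustrative sketch, not the study's analysis code: compare paired
# reader vs algorithm scores (correct answers out of 30 per batch) and
# report the mean difference with a normal-approximation 95% CI.
import math
import random
import statistics

random.seed(0)

# Hypothetical per-batch scores, simulated around the abstract's summary
# statistics (readers: mean 17.91, SD 3.42; algorithms: mean 19.92, SD 4.27).
reader_scores = [random.gauss(17.91, 3.42) for _ in range(500)]
algo_scores = [random.gauss(19.92, 4.27) for _ in range(500)]

# Paired differences per batch (algorithm minus reader).
diffs = [a - r for a, r in zip(algo_scores, reader_scores)]
mean_diff = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(len(diffs))

# Normal-approximation 95% CI for the mean paired difference.
ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)
print(f"mean difference: {mean_diff:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

With simulated data of this size the mean difference lands near the reported 2·01; the study itself used a more elaborate model to account for repeated attempts and batch composition.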

Original language: English
Pages (from-to): 938-947
Number of pages: 10
Journal: The Lancet Oncology
Volume: 20
Issue number: 7
DOIs: https://doi.org/10.1016/S1470-2045(19)30333-X
Publication status: Published - Jul 1 2019

Fingerprint

Skin
Machine Learning
Seborrheic Keratosis
Lentigo
Benign Fibrous Histiocytoma
Bowen's Disease
Actinic Keratosis
Keratosis
Pigmented Nevus
Lichen Planus
Basal Cell Carcinoma
Carcinoma in Situ
Dermatology
General Practitioners
Blood Vessels
Melanoma

ASJC Scopus subject areas

  • Oncology

Cite this

Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. / Tschandl, Philipp; Codella, Noel; Akay, Bengü Nisa; Argenziano, Giuseppe; Braun, Ralph P.; Cabo, Horacio; Gutman, David; Halpern, Allan; Helba, Brian; Hofmann-Wellenhof, Rainer; Lallas, Aimilios; Lapins, Jan; Longo, Caterina; Malvehy, Josep; Marchetti, Michael A.; Marghoob, Ashfaq; Menzies, Scott; Oakley, Amanda; Paoli, John; Puig, Susana; Rinner, Christoph; Rosendahl, Cliff; Scope, Alon; Sinz, Christoph; Soyer, H. Peter; Thomas, Luc; Zalaudek, Iris; Kittler, Harald.

In: The Lancet Oncology, Vol. 20, No. 7, 01.07.2019, p. 938-947.

Research output: Contribution to journal › Article

Tschandl, P, Codella, N, Akay, BN, Argenziano, G, Braun, RP, Cabo, H, Gutman, D, Halpern, A, Helba, B, Hofmann-Wellenhof, R, Lallas, A, Lapins, J, Longo, C, Malvehy, J, Marchetti, MA, Marghoob, A, Menzies, S, Oakley, A, Paoli, J, Puig, S, Rinner, C, Rosendahl, C, Scope, A, Sinz, C, Soyer, HP, Thomas, L, Zalaudek, I & Kittler, H 2019, 'Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study', The Lancet Oncology, vol. 20, no. 7, pp. 938-947. https://doi.org/10.1016/S1470-2045(19)30333-X
Tschandl, Philipp; Codella, Noel; Akay, Bengü Nisa; Argenziano, Giuseppe; Braun, Ralph P.; Cabo, Horacio; Gutman, David; Halpern, Allan; Helba, Brian; Hofmann-Wellenhof, Rainer; Lallas, Aimilios; Lapins, Jan; Longo, Caterina; Malvehy, Josep; Marchetti, Michael A.; Marghoob, Ashfaq; Menzies, Scott; Oakley, Amanda; Paoli, John; Puig, Susana; Rinner, Christoph; Rosendahl, Cliff; Scope, Alon; Sinz, Christoph; Soyer, H. Peter; Thomas, Luc; Zalaudek, Iris; Kittler, Harald. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. In: The Lancet Oncology. 2019; Vol. 20, No. 7. pp. 938-947.
@article{e44abae6306e4e03ac95dc937109a954,
title = "Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study",
author = "Philipp Tschandl and Noel Codella and Akay, {Beng{\"u} Nisa} and Giuseppe Argenziano and Braun, {Ralph P.} and Horacio Cabo and David Gutman and Allan Halpern and Brian Helba and Rainer Hofmann-Wellenhof and Aimilios Lallas and Jan Lapins and Caterina Longo and Josep Malvehy and Marchetti, {Michael A.} and Ashfaq Marghoob and Scott Menzies and Amanda Oakley and John Paoli and Susana Puig and Christoph Rinner and Cliff Rosendahl and Alon Scope and Christoph Sinz and Soyer, {H. Peter} and Luc Thomas and Iris Zalaudek and Harald Kittler",
year = "2019",
month = "7",
day = "1",
doi = "10.1016/S1470-2045(19)30333-X",
language = "English",
volume = "20",
pages = "938--947",
journal = "The Lancet Oncology",
issn = "1470-2045",
publisher = "Lancet Publishing Group",
number = "7",

}

TY - JOUR

T1 - Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification

T2 - an open, web-based, international, diagnostic study

AU - Tschandl, Philipp

AU - Codella, Noel

AU - Akay, Bengü Nisa

AU - Argenziano, Giuseppe

AU - Braun, Ralph P.

AU - Cabo, Horacio

AU - Gutman, David

AU - Halpern, Allan

AU - Helba, Brian

AU - Hofmann-Wellenhof, Rainer

AU - Lallas, Aimilios

AU - Lapins, Jan

AU - Longo, Caterina

AU - Malvehy, Josep

AU - Marchetti, Michael A.

AU - Marghoob, Ashfaq

AU - Menzies, Scott

AU - Oakley, Amanda

AU - Paoli, John

AU - Puig, Susana

AU - Rinner, Christoph

AU - Rosendahl, Cliff

AU - Scope, Alon

AU - Sinz, Christoph

AU - Soyer, H. Peter

AU - Thomas, Luc

AU - Zalaudek, Iris

AU - Kittler, Harald

PY - 2019/7/1

Y1 - 2019/7/1


UR - http://www.scopus.com/inward/record.url?scp=85068050449&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068050449&partnerID=8YFLogxK

U2 - 10.1016/S1470-2045(19)30333-X

DO - 10.1016/S1470-2045(19)30333-X

M3 - Article

VL - 20

SP - 938

EP - 947

JO - The Lancet Oncology

JF - The Lancet Oncology

SN - 1470-2045

IS - 7

ER -