Abstract
In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decision. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is however computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biologically/clinically relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
Original language | English |
---|---|
Article number | e79720 |
Journal | PLoS One |
Volume | 8 |
Issue number | 11 |
DOIs | |
Publication status | Published - Nov 20 2013 |
Fingerprint
ASJC Scopus subject areas
- Agricultural and Biological Sciences(all)
- Biochemistry, Genetics and Molecular Biology(all)
- Medicine(all)
Cite this
Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices. / De Campos, Cassio P.; Rancoita, Paola M V; Kwee, Ivo; Zucca, Emanuele; Zaffalon, Marco; Bertoni, Francesco.
In: PLoS One, Vol. 8, No. 11, e79720, 20.11.2013.Research output: Contribution to journal › Article
}
TY - JOUR
T1 - Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices
AU - De Campos, Cassio P.
AU - Rancoita, Paola M V
AU - Kwee, Ivo
AU - Zucca, Emanuele
AU - Zaffalon, Marco
AU - Bertoni, Francesco
PY - 2013/11/20
Y1 - 2013/11/20
N2 - In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decision. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is however computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biologically/clinically relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
AB - In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decision. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is however computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biologically/clinically relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
UR - http://www.scopus.com/inward/record.url?scp=84894283376&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894283376&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0079720
DO - 10.1371/journal.pone.0079720
M3 - Article
C2 - 24278162
AN - SCOPUS:84894283376
VL - 8
JO - PLoS One
JF - PLoS One
SN - 1932-6203
IS - 11
M1 - e79720
ER -