MINT and IntAct contribute to the Second BioCreative challenge

Serving the text-mining community with high quality molecular interaction data

Andrew Chatr-Aryamontri, Samuel Kerrien, Jyoti Khadake, Sandra Orchard, Arnaud Ceol, Luana Licata, Luisa Castagnoli, Stefano Costa, Cathy Derow, Rachael Huntley, Bruno Aranda, Catherine Leroy, Dave Thorneycroft, Rolf Apweiler, Gianni Cesareni, Henning Hermjakob

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

Background: In the absence of consolidated pipelines to archive biological data electronically, information dispersed in the literature must be captured by manual annotation. Unfortunately, manual annotation is time consuming and the coverage of published interaction data is therefore far from complete. The use of text-mining tools to identify relevant publications and to assist in the initial information extraction could help to improve the efficiency of the curation process and, as a consequence, the database coverage of data available in the literature. The 2006 BioCreative competition was aimed at evaluating text-mining procedures in comparison with manual annotation of protein-protein interactions.

Results: To aid the BioCreative protein-protein interaction task, IntAct and MINT (Molecular INTeraction) provided both the training and the test datasets. Data from both databases are comparable because they were curated according to the same standards. During the manual curation process, the major cause of data loss in mining the articles for information was ambiguity in the mapping of the gene names to stable UniProtKB database identifiers. It was also observed that most of the information about interactions was contained only within the full-text of the publication; hence, text mining of protein-protein interaction data will require the analysis of the full-text of the articles and cannot be restricted to the abstract.

Conclusion: The development of text-mining tools to extract protein-protein interaction information may increase the literature coverage achieved by manual curation. To support the text-mining community, databases will highlight those sentences within the articles that describe the interactions. These will supply data-miners with a high quality dataset for algorithm development. Furthermore, the dictionary of terms created by the BioCreative competitors could enrich the synonym list of the PSI-MI (Proteomics Standards Initiative-Molecular Interactions) controlled vocabulary, which is used by both databases to annotate their data content.

Original language: English
Article number: S5
Journal: Genome Biology
Volume: 9
Issue number: SUPPL. 2
DOI: https://doi.org/10.1186/gb-2008-9-s2-s5
Publication status: Published - Sep 1 2008

Fingerprint

Data Mining
protein-protein interactions
Databases
Proteins
Publications
Controlled Vocabulary
Molecular Sequence Annotation
Chromosome Mapping
Information Storage and Retrieval
Proteomics
Names
extracts
gene
testing

ASJC Scopus subject areas

  • Genetics
  • Cell Biology
  • Ecology, Evolution, Behavior and Systematics

Cite this

Chatr-Aryamontri, A., Kerrien, S., Khadake, J., Orchard, S., Ceol, A., Licata, L., ... Hermjakob, H. (2008). MINT and IntAct contribute to the Second BioCreative challenge: Serving the text-mining community with high quality molecular interaction data. Genome Biology, 9(SUPPL. 2), [S5]. https://doi.org/10.1186/gb-2008-9-s2-s5

@article{91fb95ea248146ffbcbc52a686bc4197,
title = "MINT and IntAct contribute to the Second BioCreative challenge: Serving the text-mining community with high quality molecular interaction data",
author = "Andrew Chatr-Aryamontri and Samuel Kerrien and Jyoti Khadake and Sandra Orchard and Arnaud Ceol and Luana Licata and Luisa Castagnoli and Stefano Costa and Cathy Derow and Rachael Huntley and Bruno Aranda and Catherine Leroy and Dave Thorneycroft and Rolf Apweiler and Gianni Cesareni and Henning Hermjakob",
year = "2008",
month = "9",
day = "1",
doi = "10.1186/gb-2008-9-s2-s5",
language = "English",
volume = "9",
journal = "Genome Biology",
issn = "1474-760X",
publisher = "BioMed Central Ltd.",
number = "SUPPL. 2",

}

TY - JOUR

T1 - MINT and IntAct contribute to the Second BioCreative challenge

T2 - Serving the text-mining community with high quality molecular interaction data

AU - Chatr-Aryamontri, Andrew

AU - Kerrien, Samuel

AU - Khadake, Jyoti

AU - Orchard, Sandra

AU - Ceol, Arnaud

AU - Licata, Luana

AU - Castagnoli, Luisa

AU - Costa, Stefano

AU - Derow, Cathy

AU - Huntley, Rachael

AU - Aranda, Bruno

AU - Leroy, Catherine

AU - Thorneycroft, Dave

AU - Apweiler, Rolf

AU - Cesareni, Gianni

AU - Hermjakob, Henning

PY - 2008/9/1

Y1 - 2008/9/1

UR - http://www.scopus.com/inward/record.url?scp=51049101706&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=51049101706&partnerID=8YFLogxK

U2 - 10.1186/gb-2008-9-s2-s5

DO - 10.1186/gb-2008-9-s2-s5

M3 - Article

VL - 9

JO - Genome Biology

JF - Genome Biology

SN - 1474-760X

IS - SUPPL. 2

M1 - S5

ER -