The CHEMDNER corpus of chemicals and drugs and its annotation principles

Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M. Lowe, Roger A. Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktäschel, Sérgio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S. V. Ramanan, Senthil Nathan, Slavko Žitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A. Akhondi, Jan A. Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M. Dieb, Miji Choi, Karin Verspoor, Madian Khabsa, C. Lee Giles, Hongfang Liu, Komandur Elayavilli Ravikumar, Andre Lamurias, Francisco M. Couto, Hong Jie Dai, Richard Tzong Han Tsai, Caglar Ata, Tolga Can, Anabel Usié, Rui Alves, Isabel Segura-Bedmar, Paloma Martínez, Julen Oyarzabal, Alfonso Valencia

Research output: Contribution to journal (Article)

53 Citations (Scopus)

Abstract

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the minimum information required about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
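The abstract reports a percentage agreement between annotators but does not state the formula behind it. The sketch below shows one common way such a figure could be computed, assuming agreement is defined as the mentions marked by both annotators divided by the mentions marked by either. The `Mention` type, the `percentage_agreement` function, and the sample offsets are hypothetical and are not taken from the CHEMDNER distribution.

```python
from typing import Iterable, NamedTuple


class Mention(NamedTuple):
    """One annotated chemical mention (hypothetical layout, for illustration)."""
    doc_id: str       # e.g. a PubMed identifier
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    sacem_class: str  # one of the seven SACEM classes named in the abstract


def percentage_agreement(
    first: Iterable[Mention],
    second: Iterable[Mention],
    match_class: bool = False,
) -> float:
    """Percent of mentions shared by both annotators out of all mentions by either.

    Assumption: this overlap-over-union definition is only one possible reading
    of "percentage agreement"; the source does not specify its formula.
    With match_class=False only document and offsets must match;
    with match_class=True the SACEM class must match as well.
    """
    def key(m: Mention):
        return m if match_class else (m.doc_id, m.start, m.end)

    keys_first = {key(m) for m in first}
    keys_second = {key(m) for m in second}
    either = keys_first | keys_second
    if not either:
        return 100.0  # no mentions at all: the annotators trivially agree
    return 100.0 * len(keys_first & keys_second) / len(either)


# Made-up annotations for a single abstract.
annotator_a = [
    Mention("1", 0, 7, "TRIVIAL"),
    Mention("1", 20, 24, "FORMULA"),
    Mention("1", 40, 52, "SYSTEMATIC"),
]
annotator_b = [
    Mention("1", 0, 7, "TRIVIAL"),
    Mention("1", 20, 24, "ABBREVIATION"),
    Mention("1", 60, 66, "FAMILY"),
]

print(percentage_agreement(annotator_a, annotator_b))                    # 50.0
print(percentage_agreement(annotator_a, annotator_b, match_class=True))  # 20.0
```

In this toy data the two annotators share two of four distinct spans, and only one of five distinct span-and-class pairs, which illustrates why a reported agreement figure should always state whether the class label had to match.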

Original language: English (US)
Article number: S2
Journal: Journal of Cheminformatics
Volume: 7
DOI: https://doi.org/10.1186/1758-2946-7-S1-S2
State: Published - Jan 1 2015

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Physical and Theoretical Chemistry
  • Computer Graphics and Computer-Aided Design
  • Library and Information Sciences

Cite this

Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., ... Valencia, A. (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7, [S2]. https://doi.org/10.1186/1758-2946-7-S1-S2
@article{466784ba6a544936a479b3208366e6b3,
title = "The CHEMDNER corpus of chemicals and drugs and its annotation principles",
author = "Martin Krallinger and Obdulia Rabal and Florian Leitner and Miguel Vazquez and David Salgado and Zhiyong Lu and Robert Leaman and Yanan Lu and Donghong Ji and Lowe, {Daniel M.} and Sayle, {Roger A.} and Batista-Navarro, {Riza Theresa} and Rafal Rak and Torsten Huber and Tim Rockt{\"a}schel and S{\'e}rgio Matos and David Campos and Buzhou Tang and Hua Xu and Tsendsuren Munkhdalai and Ryu, {Keun Ho} and Ramanan, {S. V.} and Senthil Nathan and Slavko Žitnik and Marko Bajec and Lutz Weber and Matthias Irmer and Akhondi, {Saber A.} and Kors, {Jan A.} and Shuo Xu and Xin An and Sikdar, {Utpal Kumar} and Asif Ekbal and Masaharu Yoshioka and Dieb, {Thaer M.} and Miji Choi and Karin Verspoor and Madian Khabsa and Giles, {C. Lee} and Hongfang Liu and Ravikumar, {Komandur Elayavilli} and Andre Lamurias and Couto, {Francisco M.} and Dai, {Hong Jie} and Tsai, {Richard Tzong Han} and Caglar Ata and Tolga Can and Anabel Usi{\'e} and Rui Alves and Isabel Segura-Bedmar and Paloma Mart{\'i}nez and Julen Oyarzabal and Alfonso Valencia",
year = "2015",
month = "1",
day = "1",
doi = "10.1186/1758-2946-7-S1-S2",
language = "English (US)",
volume = "7",
journal = "Journal of Cheminformatics",
issn = "1758-2946",
publisher = "Chemistry Central",

}

TY - JOUR

T1 - The CHEMDNER corpus of chemicals and drugs and its annotation principles

AU - Krallinger, Martin

AU - Rabal, Obdulia

AU - Leitner, Florian

AU - Vazquez, Miguel

AU - Salgado, David

AU - Lu, Zhiyong

AU - Leaman, Robert

AU - Lu, Yanan

AU - Ji, Donghong

AU - Lowe, Daniel M.

AU - Sayle, Roger A.

AU - Batista-Navarro, Riza Theresa

AU - Rak, Rafal

AU - Huber, Torsten

AU - Rocktäschel, Tim

AU - Matos, Sérgio

AU - Campos, David

AU - Tang, Buzhou

AU - Xu, Hua

AU - Munkhdalai, Tsendsuren

AU - Ryu, Keun Ho

AU - Ramanan, S. V.

AU - Nathan, Senthil

AU - Žitnik, Slavko

AU - Bajec, Marko

AU - Weber, Lutz

AU - Irmer, Matthias

AU - Akhondi, Saber A.

AU - Kors, Jan A.

AU - Xu, Shuo

AU - An, Xin

AU - Sikdar, Utpal Kumar

AU - Ekbal, Asif

AU - Yoshioka, Masaharu

AU - Dieb, Thaer M.

AU - Choi, Miji

AU - Verspoor, Karin

AU - Khabsa, Madian

AU - Giles, C. Lee

AU - Liu, Hongfang

AU - Ravikumar, Komandur Elayavilli

AU - Lamurias, Andre

AU - Couto, Francisco M.

AU - Dai, Hong Jie

AU - Tsai, Richard Tzong Han

AU - Ata, Caglar

AU - Can, Tolga

AU - Usié, Anabel

AU - Alves, Rui

AU - Segura-Bedmar, Isabel

AU - Martínez, Paloma

AU - Oyarzabal, Julen

AU - Valencia, Alfonso

PY - 2015/1/1

Y1 - 2015/1/1

UR - http://www.scopus.com/inward/record.url?scp=84925647188&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84925647188&partnerID=8YFLogxK

U2 - 10.1186/1758-2946-7-S1-S2

DO - 10.1186/1758-2946-7-S1-S2

M3 - Article

C2 - 25810773

AN - SCOPUS:84925647188

VL - 7

JO - Journal of Cheminformatics

JF - Journal of Cheminformatics

SN - 1758-2946

M1 - S2

ER -