Reducing noise in labels and features for a real world dataset: Application of NLP corpus annotation methods

Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, Zhi An Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to a problem of ranking vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as "serious event". The machine learning depended on the ticket classes, but it was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.

Original language: English (US)
Title of host publication: Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings
Pages: 86-97
Number of pages: 12
DOIs: https://doi.org/10.1007/978-3-642-00382-0_7
State: Published - Jul 21 2009
Event: 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009 - Mexico City, Mexico
Duration: Mar 1 2009 to Mar 7 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5449 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009
Country: Mexico
City: Mexico City
Period: 3/1/09 to 3/7/09

Fingerprint

Annotation
Learning systems
Labels
Machine Learning
Grid
Document Classification
Knowledge Transfer
Information Extraction
Vulnerability
Ranking
Model
Corpus
Class

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Passonneau, R. J., Rudin, C., Radeva, A., & Liu, Z. A. (2009). Reducing noise in labels and features for a real world dataset: Application of NLP corpus annotation methods. In Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings (pp. 86-97). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5449 LNCS). https://doi.org/10.1007/978-3-642-00382-0_7
@inproceedings{da85c343206c4d1390754fc802cdbd0c,
title = "Reducing noise in labels and features for a real world dataset: Application of NLP corpus annotation methods",
abstract = "This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to a problem of ranking vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as serious event. The machine learning depended on the ticket classes, but it was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.",
author = "Passonneau, {Rebecca J.} and Cynthia Rudin and Axinia Radeva and Liu, {Zhi An}",
year = "2009",
month = "7",
day = "21",
doi = "10.1007/978-3-642-00382-0_7",
language = "English (US)",
isbn = "3642003818",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "86--97",
booktitle = "Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings",

}

TY - GEN
T1 - Reducing noise in labels and features for a real world dataset
T2 - Application of NLP corpus annotation methods
AU - Passonneau, Rebecca J.
AU - Rudin, Cynthia
AU - Radeva, Axinia
AU - Liu, Zhi An
PY - 2009/7/21
Y1 - 2009/7/21
AB - This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to a problem of ranking vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as serious event. The machine learning depended on the ticket classes, but it was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.

UR - http://www.scopus.com/inward/record.url?scp=67650535515&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650535515&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-00382-0_7
DO - 10.1007/978-3-642-00382-0_7
M3 - Conference contribution
AN - SCOPUS:67650535515
SN - 3642003818
SN - 9783642003813
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 86
EP - 97
BT - Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings
ER -
