Combining machine learning with linguistic heuristics for Chinese word segmentation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

This paper describes a hybrid model that combines machine learning with linguistic heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag that indicates its position in a word, and a merging component that transforms a POC-tagged character sequence into a word-segmented sentence. The tagging component uses a support vector machine based tagger to produce an initial tagging of the text and a transformation-based tagger to improve the initial tagging. In addition to the POC tags assigned to the characters, the merging component incorporates a number of linguistic and statistical heuristics to detect words with regular internal structures, recognize long words, and filter non-words. Experiments show that, without resorting to a separate unknown word identification mechanism, the model achieves an F-score of 95.0% for word segmentation and a competitive recall of 74.8% for unknown word recognition.
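
As a rough illustration of the merging step described above, the following minimal sketch turns a POC-tagged character sequence into words. The four-tag scheme (`B`, `M`, `E`, `S`), the function name `merge_poc`, and the example sentence are assumptions made for illustration; the abstract does not specify the tag set, and the linguistic and statistical heuristics of the actual merging component are not modeled here.

```python
from typing import List, Tuple

# Assumed four-tag POC scheme (not stated in the abstract):
#   "B" = word-initial, "M" = word-medial,
#   "E" = word-final,   "S" = single-character word.


def merge_poc(tagged: List[Tuple[str, str]]) -> List[str]:
    """Merge (character, POC tag) pairs into a list of words.

    Purely tag-driven; the heuristics mentioned in the abstract
    (regular internal structures, long words, non-word filtering)
    are deliberately left out of this sketch.
    """
    words: List[str] = []
    current = ""
    for char, tag in tagged:
        if tag in ("B", "S") and current:
            # A new word starts: flush what was accumulated,
            # even if the previous word never received an "E" tag.
            words.append(current)
            current = ""
        current += char
        if tag in ("E", "S"):
            # The word is complete.
            words.append(current)
            current = ""
    if current:
        # Flush a trailing word that was left open.
        words.append(current)
    return words


example = [("我", "S"), ("喜", "B"), ("欢", "E"),
           ("北", "B"), ("京", "M"), ("大", "M"), ("学", "E")]
segmented = merge_poc(example)
```

With these assumed tags, `segmented` holds three words: 我, 喜欢, and 北京大学. Because an unknown word is simply a character span whose tags form a word, no separate unknown word identification pass is needed in this sketch, which mirrors the integration the abstract describes.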

Original language: English (US)
Title of host publication: Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007
Pages: 241-246
Number of pages: 6
State: Published - Dec 28 2007
Event: 20th International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007 - Key West, FL, United States
Duration: May 7 2007 to May 9 2007

Publication series

Name: Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007

Other

Other: 20th International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007
Country: United States
City: Key West, FL
Period: 5/7/07 to 5/9/07

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
