ECON: An approach to extract content from web news page

Yan Guo, Huifeng Tang, Linhai Song, Yu Wang, Guodong Ding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
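The snippet-node / summary-node procedure described in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the snippet-node is chosen, when backtracking stops, or what counts as noise, so the rules used here (longest directly contained text, stop when the parent adds no new text, link-heavy or non-content tags as noise) are assumptions, and the names `Node`, `TreeBuilder`, and `extract_content` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Assumption: tags treated as noise regardless of their content.
NOISE_TAGS = {"head", "script", "style", "nav", "aside", "footer", "form"}


class Node:
    """Minimal DOM-like node; children are a mix of str (text) and Node."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def own_text(self):
        return " ".join(c.strip() for c in self.children
                        if isinstance(c, str) and c.strip())

    def text(self):
        parts = []
        for c in self.children:
            t = c.strip() if isinstance(c, str) else c.text()
            if t:
                parts.append(t)
        return " ".join(parts)

    def walk(self):
        yield self
        for c in self.children:
            if isinstance(c, Node):
                yield from c.walk()


class TreeBuilder(HTMLParser):
    """Builds a simplified tree; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def is_noise(node):
    # Assumption: a subtree is noise if most of its text is link text.
    total = len(node.text())
    linked = sum(len(n.text()) for n in node.walk() if n.tag == "a")
    ratio = linked / total if total else 1.0
    return node.tag in NOISE_TAGS or ratio > 0.5


def find_snippet_node(root):
    # Assumption: the snippet-node is the element with the longest own text.
    candidates = [n for n in root.walk()
                  if n.tag not in NOISE_TAGS and n.tag != "a"]
    return max(candidates, key=lambda n: len(n.own_text()))


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = find_snippet_node(builder.root)
    # Backtrack towards the root, pruning noisy siblings on the way up.
    while node.parent is not None:
        parent = node.parent
        parent.children = [c for c in parent.children
                           if isinstance(c, str) or c is node or not is_noise(c)]
        if len(parent.text()) == len(node.text()):
            break  # parent adds no new text: treat node as the summary-node
        node = parent
    return node.text()


if __name__ == "__main__":
    page = ("<html><body><div><a href='/'>Home</a></div>"
            "<div><h1>Title</h1><p>A long first paragraph of the story.</p>"
            "<p>More text.</p><div><a href='/r'>Related</a></div></div>"
            "</body></html>")
    print(extract_content(page))
```

The stopping rule here is deliberately crude: on a real page, any sibling block with non-link text (comments, captions) would pull the summary-node further up the tree, so a practical version would need a stricter criterion than the one assumed above.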

Original language: English (US)
Title of host publication: Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010
Pages: 314-320
Number of pages: 7
DOI: https://doi.org/10.1109/APWeb.2010.11
State: Published - Jul 9 2010
Event: 12th International Asia Pacific Web Conference, APWeb 2010 - Busan, Korea, Republic of
Duration: Apr 6 2010 to Apr 8 2010

Publication series

Name: Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010

Other

Other: 12th International Asia Pacific Web Conference, APWeb 2010
Country: Korea, Republic of
City: Busan
Period: 4/6/10 to 4/8/10

Fingerprint

Websites

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications

Cite this

Guo, Y., Tang, H., Song, L., Wang, Y., & Ding, G. (2010). ECON: An approach to extract content from web news page. In Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010 (pp. 314-320). [5474120] (Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010). https://doi.org/10.1109/APWeb.2010.11
@inproceedings{22696f3200aa4511a7dcdac43fee8cdd,
title = "ECON: An approach to extract content from web news page",
abstract = "This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.",
author = "Yan Guo and Huifeng Tang and Linhai Song and Yu Wang and Guodong Ding",
year = "2010",
month = "7",
day = "9",
doi = "10.1109/APWeb.2010.11",
language = "English (US)",
isbn = "9780769540122",
series = "Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010",
pages = "314--320",
booktitle = "Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010",

}

