CiteSeerX: AI in a digital library search engine

Jian Wu, Kyle Williams, Hung Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

14 Citations (Scopus)

Abstract

CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g., to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.

Original language: English (US)
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Publisher: AI Access Foundation
Pages: 2930-2937
Number of pages: 8
ISBN (Electronic): 9781577356806
State: Published - Jan 1 2014
Event: 28th AAAI Conference on Artificial Intelligence, AAAI 2014, 26th Innovative Applications of Artificial Intelligence Conference, IAAI 2014 and the 5th Symposium on Educational Advances in Artificial Intelligence, EAAI 2014 - Quebec City, Canada
Duration: Jul 27 2014 to Jul 31 2014

Publication series

Name: Proceedings of the National Conference on Artificial Intelligence
Volume: 4

Other

Other: 28th AAAI Conference on Artificial Intelligence, AAAI 2014, 26th Innovative Applications of Artificial Intelligence Conference, IAAI 2014 and the 5th Symposium on Educational Advances in Artificial Intelligence, EAAI 2014
Country: Canada
City: Quebec City
Period: 7/27/14 to 7/31/14

Fingerprint

Digital libraries
Search engines
Artificial intelligence
Metadata

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Wu, J., Williams, K., Chen, H. H., Khabsa, M., Caragea, C., Ororbia, A., ... Giles, C. L. (2014). CiteSeerX: AI in a digital library search engine. In Proceedings of the National Conference on Artificial Intelligence (pp. 2930-2937). (Proceedings of the National Conference on Artificial Intelligence; Vol. 4). AI Access Foundation.
Wu, Jian; Williams, Kyle; Chen, Hung Hsuan; Khabsa, Madian; Caragea, Cornelia; Ororbia, Alexander; Jordan, Douglas; Giles, C. Lee. / CiteSeerX: AI in a digital library search engine. Proceedings of the National Conference on Artificial Intelligence. AI Access Foundation, 2014. pp. 2930-2937 (Proceedings of the National Conference on Artificial Intelligence).
@inproceedings{65e594577b34484d9ef25bd93e54e64b,
title = "CiteSeerX: AI in a digital library search engine",
abstract = "CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g., to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.",
author = "Jian Wu and Kyle Williams and Chen, {Hung Hsuan} and Madian Khabsa and Cornelia Caragea and Alexander Ororbia and Douglas Jordan and Giles, {C. Lee}",
year = "2014",
month = "1",
day = "1",
language = "English (US)",
series = "Proceedings of the National Conference on Artificial Intelligence",
publisher = "AI Access Foundation",
pages = "2930--2937",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",
address = "United States",

}

Wu, J, Williams, K, Chen, HH, Khabsa, M, Caragea, C, Ororbia, A, Jordan, D & Giles, CL 2014, CiteSeerX: AI in a digital library search engine. in Proceedings of the National Conference on Artificial Intelligence. Proceedings of the National Conference on Artificial Intelligence, vol. 4, AI Access Foundation, pp. 2930-2937, 28th AAAI Conference on Artificial Intelligence, AAAI 2014, 26th Innovative Applications of Artificial Intelligence Conference, IAAI 2014 and the 5th Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, Quebec City, Canada, 7/27/14.

CiteSeerX: AI in a digital library search engine. / Wu, Jian; Williams, Kyle; Chen, Hung Hsuan; Khabsa, Madian; Caragea, Cornelia; Ororbia, Alexander; Jordan, Douglas; Giles, C. Lee.

Proceedings of the National Conference on Artificial Intelligence. AI Access Foundation, 2014. p. 2930-2937 (Proceedings of the National Conference on Artificial Intelligence; Vol. 4).

TY - GEN

T1 - CiteSeerX

T2 - AI in a digital library search engine

AU - Wu, Jian

AU - Williams, Kyle

AU - Chen, Hung Hsuan

AU - Khabsa, Madian

AU - Caragea, Cornelia

AU - Ororbia, Alexander

AU - Jordan, Douglas

AU - Giles, C. Lee

PY - 2014/1/1

Y1 - 2014/1/1

N2 - CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g., to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.

AB - CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g., to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.

UR - http://www.scopus.com/inward/record.url?scp=84908200442&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84908200442&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84908200442

T3 - Proceedings of the National Conference on Artificial Intelligence

SP - 2930

EP - 2937

BT - Proceedings of the National Conference on Artificial Intelligence

PB - AI Access Foundation

ER -

Wu J, Williams K, Chen HH, Khabsa M, Caragea C, Ororbia A et al. CiteSeerX: AI in a digital library search engine. In Proceedings of the National Conference on Artificial Intelligence. AI Access Foundation. 2014. p. 2930-2937. (Proceedings of the National Conference on Artificial Intelligence).