TY - GEN
T1 - A metadata generation system for scanned scientific volumes
AU - Lu, Xiaonan
AU - Kahle, Brewster
AU - Wang, James Z.
AU - Giles, C. Lee
PY - 2008
Y1 - 2008
AB - Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The increasing amount of digitized resources, including scanned books and scientific publications, requires the development of tools and methods that efficiently analyze and manage large collections of digitized resources. In this work, we tackle the problem of extracting metadata from scanned volumes of journals. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access functionality to digital library users. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We show the performance of our system on scanned bound historical documents nearly two centuries old. We have developed the system and integrated it into an operational digital library, the Internet Archive, for real-world usage.
UR - http://www.scopus.com/inward/record.url?scp=57649210210&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57649210210&partnerID=8YFLogxK
U2 - 10.1145/1378889.1378918
DO - 10.1145/1378889.1378918
M3 - Conference contribution
AN - SCOPUS:57649210210
SN - 9781595939982
T3 - Proceedings of the ACM International Conference on Digital Libraries
SP - 167
EP - 176
BT - JCDL'08
T2 - 8th ACM/IEEE-CS Joint Conference on Digital Libraries 2008, JCDL'08
Y2 - 16 June 2008 through 20 June 2008
ER -