Identifying and Merging Related Bibliographic Records

by Jeremy A. Hylton

The conversion to HTML failed to include figures and tables. I will make them available as soon as possible.

Abstract

Bibliographic records freely available on the Internet can be used to construct a high-quality digital finding aid that helps users discover paper and electronic documents. The key challenge in providing such a service is integrating bibliographic records of mixed quality that come from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet.
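
To make the idea of n-gram-based approximate matching concrete, the following is a minimal sketch, not the algorithm developed in the thesis. It assumes character trigrams, a Dice similarity coefficient, and a threshold of 0.8; the function names, the record layout (dictionaries with "author" and "title" keys), and all parameter values are hypothetical choices made only for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string.

    Normalization (lowercasing, collapsing whitespace) is an assumption
    of this sketch, not a step taken from the thesis.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n are treated as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def same_work(rec_a, rec_b, threshold=0.8):
    """Treat two records as related when both author and title are close.

    The threshold is an illustrative value, not one reported in the thesis.
    """
    return (similarity(rec_a["author"], rec_b["author"]) >= threshold
            and similarity(rec_a["title"], rec_b["title"]) >= threshold)


if __name__ == "__main__":
    clean = {"author": "Hylton, Jeremy A.",
             "title": "Identifying and Merging Related Bibliographic Records"}
    typo = {"author": "Hylton, Jeremy A.",
            "title": "Identifying and Merging Related Bibliografic Records"}
    print(same_work(clean, typo))
```

Because a single misspelling changes only the few n-grams that overlap it, most n-grams of the two titles still coincide, which is why this kind of measure tolerates small cataloging errors. Comparing every pair of records this way would be expensive for a large collection; the abstract indicates that a full-text search engine is also used in building the clusters, and that part is not modeled here.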

  1. Introduction
  2. Cataloging and the Computer Science Collection
  3. Identifying Related Records
  4. Merging Related Records
  5. Presenting Relations and Clusters
  6. Automatic Linking
  7. Conclusions and Future Directions
  8. References


Jeremy A. Hylton
Mon Feb 19 15:33:12 1996