Identifying and Merging Related Bibliographic Records

Master of Engineering Thesis
MIT Department of Electrical Engineering and Computer Science
Submitted February 13, 1996

Thesis Supervisor: Prof. Jerome H. Saltzer

Abstract

Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an $n$-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet.
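The clustering rule described above (group records only when both author and title match approximately) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes character trigrams, a Dice coefficient as the similarity measure, a 0.8 threshold, and an all-pairs comparison in place of the full-text search engine the thesis uses to find candidates. The sample records and the names ngrams, similarity, and cluster are hypothetical.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    # Normalize: lowercase, turn punctuation into spaces, collapse whitespace.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over n-gram sets (an assumed measure, not the thesis's)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def cluster(records, threshold=0.8):
    """Group record indices whose author AND title both match approximately."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: quadratic, so suitable only for small examples.
    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if (similarity(a["title"], b["title"]) >= threshold
                and similarity(a["author"], b["author"]) >= threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical records: 0 and 1 differ by a typo and punctuation,
# 2 shares the title but not the author, 3 shares the author but not the title.
records = [
    {"author": "Smith, Jane Q.", "title": "Approximate matching of bibliographic records"},
    {"author": "Smith, Jane Q", "title": "Approximate Matching of Bibliographic Recrods"},
    {"author": "Jones, Robert", "title": "Approximate matching of bibliographic records"},
    {"author": "Smith, Jane Q.", "title": "A survey of full-text search engines"},
]

print(cluster(records))  # [[0, 1], [2], [3]]
```

Requiring both fields to match keeps records 2 and 3 out of the first cluster even though each shares one field exactly with record 0. At the scale of a 250,000-record collection, an all-pairs loop like this one would be impractical, which is presumably why candidate records are first retrieved with a search engine.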

Availability

PostScript and gzipped PostScript.

Page images in DEC SRC's Lectern format. See the Virtual Paper project for more information and a viewer.

Text version. (Converted using LaTeX2HTML; this version does not include the figures.)

LCS Technical Report MIT/LCS/TR-678. gzipped PostScript.

DIFWICS

The Digital Index for Works in Computer Science demonstrates the results of the thesis. It is an index of 255,000 bibliographic records, which have been processed to identify records that describe the same work.

Last updated 4/16/96 by jeremy. Links updated 11/97 by jhs.