Introduction

The growth in the volume of information available via computer networks has increased both the usefulness of network information and the difficulty of managing it. The information freely available today offers the opportunity to provide library-like services. Organizing the information, which is available from many sources and in many forms, is one of the key challenges of building a useful information service. This thesis presents a system for integrating bibliographic information from many heterogeneous sources that identifies related bibliographic records and presents them together. The system has been used to construct the Digital Index for Works in Computer Science (DIFWICS), a 240,000-record catalog of the computer science literature.

What is a digital library?

The term digital library is used widely, but little consensus exists about what exactly a digital library is. Before discussing the kind of information sources used to construct the library and how these sources were integrated, it is useful to clarify the term and explain how the particular vision of a digital library affected the design of the system.

The digital library envisioned here is a computer system that provides a single point of access to an organized collection of information. The digital library is, of course, not a single system, but a loose federation of many systems and services; nonetheless, in many cases it should appear to operate as a single system. Providing interoperability among these systems remains a major research topic [25].

One kind of interoperability that is sometimes overlooked is the interoperation between physical and digital objects in the library. Most documents exist today only on paper, and digital libraries must provide access to both paper and digital documents to be able to satisfy most users' needs.

The components of the digital library serve many purposes, including storing information in digital repositories, helping users discover information that satisfies their needs, and locating the information that users want. The integrated database of bibliographic records addresses the discovery process. It offers an index of document citations and abstracts.

Integrating bibliographic databases

The emphasis of the research reported here is to make effective use of the diversity of bibliographic information freely available on the Internet. At least 450,000 bibliographic records describing the computer science literature are freely available, mostly in the form of Bibtex citations; they cover a large part of that literature, but they vary widely in quality and accuracy. Nonetheless, when combined with papers available from researchers' Web pages and servers and with on-line library catalogs, these records provide enough raw data to build a fairly useful library.

The bibliographic records present many challenges for creating an integrated information service. The records contain typographical and cataloging errors; there are many duplicate records; and there are few shared standards for entering information in the records. The records come from many sources: Some are prepared by librarians for their patrons, others come from individual authors' personal collections or are assembled from Usenet postings.

Although the heterogeneity of the source records poses a problem for combining all the records into a single collection, it can also be exploited to improve the quality of the overall collection. The heterogeneity provides considerable leverage on the problems of extracting the best possible information from records and providing links between closely related records (an observation made by Buckland, et al. [8]). By identifying related records, a union record can be created that combines the best information from each of the source records.
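The idea of a union record can be illustrated with a minimal sketch. It assumes that records are plain dictionaries and that the "best" value for a field is simply the longest one; this heuristic, the function name union_record, and the sample records are hypothetical and stand in for whatever rules a real system would apply.

```python
def union_record(records):
    """Merge related records into one, keeping the longest value per field.

    "Longest value wins" is an illustrative assumption, not the thesis's rule.
    """
    merged = {}
    for record in records:
        for field, value in record.items():
            value = value.strip()
            # Keep the candidate only if it carries more text than what we have.
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


# Hypothetical source records describing the same work.
cluster = [
    {"author": "J. Smith", "title": "Merging Catalogs", "year": "1995"},
    {"author": "Jane Smith", "title": "Merging catalogs", "pages": "10-20"},
]
print(union_record(cluster))
```

The merged record takes the fuller author name from the second source and the year from the first, so it carries more information than either source record alone.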

One of the primary contributions of this thesis is an algorithm for identifying related bibliographic records. The algorithm finds records that share the same author and title fields and groups them together in a cluster of records that all describe the same work; the algorithm works despite variations and errors in the source records. The clusters produced by the algorithm may include several different but related documents, such as a paper published as a technical report and later as a journal article. A catalog of the records, with record clusters presented together, forms the core of DIFWICS.
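To make the notion of a cluster concrete, the following sketch groups records whose titles nearly match and whose author surnames overlap. It is not the algorithm of Chapter 3: the approximate comparison uses SequenceMatcher from Python's standard difflib module as a stand-in, and the 0.9 threshold, the function names, and the sample records are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop everything except letters, digits, and spaces."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def last_names(author_field):
    """Return the set of surnames from a Bibtex-style 'A and B' author field."""
    names = set()
    for author in author_field.split(" and "):
        author = author.strip()
        if not author:
            continue
        # Handle both "Last, First" and "First Last" orderings.
        surname = author.split(",")[0] if "," in author else author.split()[-1]
        names.add(normalize(surname))
    return names


def related(a, b, threshold=0.9):
    """Assumed rule: titles nearly match and at least one surname is shared."""
    score = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return score >= threshold and bool(last_names(a["author"]) & last_names(b["author"]))


def cluster_records(records, threshold=0.9):
    """Greedy clustering: add each record to the first cluster it matches."""
    clusters = []
    for record in records:
        for group in clusters:
            if related(record, group[0], threshold):
                group.append(record)
                break
        else:
            clusters.append([record])
    return clusters


# Hypothetical records: the second has a typo and a different name order.
records = [
    {"author": "Doe, Jane", "title": "Indexing Technical Reports"},
    {"author": "Jane Doe", "title": "Indexing technical repotrs."},
    {"author": "Roe, John", "title": "Caching Web Documents"},
]
for group in cluster_records(records):
    print([r["title"] for r in group])
```

The first two records fall into one cluster despite the typographical error and the differing name order, while the third record forms a cluster of its own.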

DIFWICS is intended primarily as an aid to the discovery process, helping users to explore the information collected in the library. In addition to the index of bibliographic records, it provides a simple system for locating cited documents when they are available in digital form. This second system works equally well as a means of locating documents in a traditional library's catalog, which helps to locate physical copies, and provides the groundwork for a more ambitious automatic linking project.

The preliminary automatic linking work, presented in Chapter 6, uses Web indexes like Alta Vista to find copies of papers on the Web. When papers are available on the Web, they are usually referenced by at least one other Web page, which includes a standard, human-readable citation along with a hypertext link. Some of the same techniques used to identify related bibliographic records in DIFWICS can be used to find related citations on the Web and to link the Web pages to the bibliographic records.

Two important characteristics of my work set it apart from related work. First, the collection is assembled without specific coordination among the information providers; this limits the overhead involved for authors and publishers to make their documents available and makes data from many existing services available for use in the current system. The second novel feature of this research is the algorithm for identifying related records. Systems for identifying related records in library catalogs and database systems use a single key for record comparison; my algorithm uses a full-text index of the records to help identify related records. The details of the duplicate detection algorithm are presented in Chapter 3.
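The contrast between a single comparison key and a full-text index can be sketched as follows. This is an illustration under stated assumptions rather than the Chapter 3 algorithm: the index is a simple in-memory inverted index over the author and title words, and the min_shared cutoff and all function names are hypothetical.

```python
from collections import defaultdict


def tokens(record):
    """Split the author and title fields into a set of lowercase words."""
    text = f"{record.get('author', '')} {record.get('title', '')}".lower()
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def build_index(records):
    """Map each word to the ids (list positions) of the records containing it."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for word in tokens(record):
            index[word].add(record_id)
    return index


def candidates(record_id, records, index, min_shared=3):
    """Return ids of other records sharing at least min_shared words (assumed cutoff)."""
    shared = defaultdict(int)
    for word in tokens(records[record_id]):
        for other in index[word]:
            if other != record_id:
                shared[other] += 1
    return sorted(other for other, count in shared.items() if count >= min_shared)


# Hypothetical records; a key built from the author field as written
# ("Jane Doe" versus "Doe, J.") would not bring the first two together.
records = [
    {"author": "Jane Doe", "title": "Indexing Technical Reports"},
    {"author": "Doe, J.", "title": "Indexing of technical reports"},
    {"author": "John Roe", "title": "Caching Web Documents"},
]
index = build_index(records)
print(candidates(0, records, index))
```

Because any shared word can bring two records together, a variation in one field does not prevent the pair from being considered; a more careful comparison can then be applied to the candidates only.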

Overview of related work

This thesis builds on research in both computer science and library science. The problem of duplicate detection, for example, occurs in somewhat different form in the database community (the multidatabase or semantic integration problem) and in the library community (creating union catalogs and the de-duplication problem). This section gives an overview of some related areas of research.

Networked information discovery and retrieval

Networked information discovery and retrieval (NIDR) is a broad category encompassing nearly any kind of information access using a large-scale computer network [26]. DIFWICS is an example of an NIDR application and is informed by a variety of work in this area.

This thesis touches at least tangentially on many NIDR issues (including locating, naming, and cataloging network resources), but the clearest connection is to several projects that have used the World-Wide Web and other Internet information resources to provide access to some part of the computer science literature. These systems work primarily with technical reports, because they are often freely available and organized for Internet access by the publishing organization.

The Unified Computer Science Technical Report Index (UCSTRI) [43] automatically collects information about technical reports distributed on the Internet and provides an index of that information with links to the original report.

The Harvest system [6] is a more general information discovery system that combines tools for building local, content-specific indexes and sharing them to build indexes that span many sites; these tools include support for replication and caching. The Harvest implementors developed a sample index of computer science technical reports. Harvest was designed to illustrate the principles for scalable access to Internet resources described in Bowman, et al. [7].

A third system is the Networked Computer Science Technical Report Library (NCSTRL) [12], which uses the Dienst protocol [21]. Dienst provides a repository for storing documents, a distributed indexing scheme, and a user interface for the documents in a repository.

All three systems rely, in varying degrees, on publishers to make documents available and to provide bibliographic information. The publisher-centric model is limiting. Information not organized by publishers, including most online information, lacks standards for naming and cataloging.

Other systems, like Alta Vista and Lycos, index large quantities of information available via the World-Wide Web. They are not selective about the material they index, and are somewhat less useful for information retrieval purposes as a result. If a user is looking for a particular document, however, and has information like the author and title, these indexes can be quite useful for finding it (if it is available). Not only do these Web indexes cover the papers themselves, but they also cover pages that point to the papers. Chapter 5 discusses some possibilities for integrating these citations with the more traditional bibliographic citations used to build the computer science library.

Libraries and cataloging

The recent development of library-like services for using network information, like UCSTRI or Harvest, parallels the traditional library community's development of large-scale union catalogs and online public access catalogs in the late 1970s and early 1980s. The OCLC Online Union Catalog merged several million bibliographic records and developed one of the first duplicate detection systems [18].

More recently, the library community has begun to re-evaluate its cataloging standards. Several papers [48][40][15][5] suggest that catalogers should focus more on describing "works" (particular, identifiable intellectual works) rather than "documents" (the particular physical versions of a work). For example, Shakespeare's play Hamlet is a clearly identifiable work; it has been published in many editions, each a "document."

This thesis makes use of this distinction when it labels as duplicates records for different documents that instantiate a particular work. Levy and Marshall [24] raise similar questions about how people actually use libraries in their discussion of future digital libraries.

Database systems

Heterogeneous databases differ from more conventional database systems because they include distributed components that do not all share the same database model; the component databases may have different data models, query languages, or schemas [13]. One of the problems that arises in multidatabase systems is the integration of the underlying schemas to provide users with a standard interface.

Duplicate detection is closely related to the integration of heterogeneous databases, but is complicated by the fact that bibliographic formats impose little structure on the data they contain; the wide variations in quality and accuracy that typify collections of Bibtex records further complicate the problem. Papakonstantinou, et al. [30] present a more thorough discussion of the differences between the integration of databases and the integration of information systems like bibliographic record collections. Hernández and Stolfo [17] describe the merge/purge problem and implement a duplicate detection system for mailing lists that copes with variations and errors in the underlying data by making multiple passes over the data, each time using a different key to compare the records.
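The multiple-pass idea can be sketched in a few lines. The code below is a simplified illustration in the spirit of that approach, not a reproduction of the system in [17]: each pass sorts the records by one key and compares only records that lie close together in sort order, and the matches from all passes are pooled. The window size, the keys, the matching rule, and the sample mailing-list records are assumptions.

```python
def multipass_pairs(records, key_funcs, is_match, window=3):
    """Return the set of matching id pairs found across all passes."""
    pairs = set()
    for key in key_funcs:
        # Sort record ids by this pass's key so similar records sit close together.
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Compare each record only with the next window-1 records in sort order.
            for j in order[pos + 1 : pos + window]:
                if is_match(records[i], records[j]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs


def two_fields_agree(a, b):
    """Assumed matching rule: at least two of the three fields are identical."""
    return sum(a[f] == b[f] for f in ("last", "first", "zip")) >= 2


# Hypothetical mailing-list entries: record 1 has a typo in the last name,
# record 2 has a typo in the zip code, record 3 is an unrelated person.
people = [
    {"last": "Doe", "first": "Jane", "zip": "02139"},
    {"last": "Dow", "first": "Jane", "zip": "02139"},
    {"last": "Doe", "first": "Jane", "zip": "92139"},
    {"last": "Dole", "first": "Sam", "zip": "10001"},
]

by_last = lambda r: r["last"]
by_zip = lambda r: r["zip"]

print(multipass_pairs(people, [by_last], two_fields_agree, window=2))
print(multipass_pairs(people, [by_last, by_zip], two_fields_agree, window=2))
```

With the last name as the only key, the record with the misspelled name sorts away from its duplicate and is missed; the second pass, keyed on the zip code, places those two records side by side and recovers the match.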

Mediators [45] are a different approach to the problem of integrating information from multiple sources. Mediators are part of a model of the networked information environment that includes database access as its lowest level and users and information gathering applications at the top level. The mediators operate in between the databases and the users, providing an abstraction boundary that captures enough knowledge about various underlying databases to present a new, uniform view of that data to users.

Data warehousing extends the mediator model by creating a new database that contains the integrated contents of other databases rather than providing a dynamic mediation layer on top of them. DIFWICS integrates distributed bibliographies in a similar way.
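The difference between the two models can be shown with a small sketch. It assumes that each source is represented by a pair of functions, one that fetches raw records and one that translates them into a common form; the class name Mediator, the function build_warehouse, the field names, and the sample data are all hypothetical.

```python
class Mediator:
    """Answers queries by translating each source's records on demand."""

    def __init__(self, sources):
        # sources: list of (fetch, translate) pairs; both are callables.
        self.sources = sources

    def search(self, word):
        """Yield uniform records whose title contains the given word."""
        word = word.lower()
        for fetch, translate in self.sources:
            for raw in fetch():
                record = translate(raw)
                if word in record["title"].lower():
                    yield record


def build_warehouse(sources):
    """Copy every source into one list of uniform records, once, up front."""
    return [translate(raw) for fetch, translate in sources for raw in fetch()]


# Two hypothetical sources with different field names.
library = [{"TI": "Union Catalogs", "AU": "Doe, J."}]
bibtex = [{"title": "Caching Web Documents", "author": "John Roe"}]
sources = [
    (lambda: library, lambda r: {"title": r["TI"], "author": r["AU"]}),
    (lambda: bibtex, lambda r: dict(r)),
]

mediator = Mediator(sources)
warehouse = build_warehouse(sources)

# A record added to a source later is visible through the mediator,
# but not in the warehouse until the warehouse is rebuilt.
bibtex.append({"title": "Web Indexes", "author": "A. Poe"})
print(len(list(mediator.search("web"))))
print(len([r for r in warehouse if "web" in r["title"].lower()]))
```

The warehouse trades freshness for the ability to process the combined records as a whole, which is what an integrated catalog with clustered records requires.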

Information retrieval

Duplicate detection in information retrieval is at the opposite end of the spectrum from database schema integration; information retrieval deals with the full text of documents that have little or no formal structure. The term duplicate takes on a wider meaning in this field. Consider a search for stories about an earthquake in a collection of newspaper articles; there are probably many stories about the earthquake, from many different news sources. The stories are duplicates because their content overlaps substantially and not because of some external feature like their title or date of publication. Yan and Garcia-Molina [49] provide a more detailed discussion of duplicate detection in this context.
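One generic way to quantify content overlap is to compare the sets of short word sequences that two texts contain. The sketch below uses three-word sequences and the Jaccard coefficient; it is offered only as an illustration of content-based comparison and is not the method of [49]. The sequence length, the function names, and the sample stories are assumptions.

```python
def shingles(text, size=3):
    """Return the set of consecutive word sequences of the given length."""
    words = text.lower().split()
    return {tuple(words[i : i + size]) for i in range(len(words) - size + 1)}


def overlap(a, b, size=3):
    """Jaccard coefficient of the two texts' shingle sets (0.0 if either is empty)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical news stories: the first two report the same event.
story_a = "a strong earthquake struck the coast early on monday morning"
story_b = "a strong earthquake struck the coast late on sunday night"
story_c = "the city council approved a new budget for public libraries"

print(overlap(story_a, story_b) > overlap(story_a, story_c))
```

The two earthquake stories share several word sequences and score well above the unrelated story, even though nothing like a title or publication date is consulted.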


Jeremy A Hylton
Mon Feb 19 15:33:12 1996