M.I.T. Department of Electrical Engineering and Computer Science
Jeremy Hylton
Thesis Supervisor: Prof. Jerome Saltzer
The explosive growth of the World-Wide Web over the last two years underscores the realization that the technology exists -- or will soon exist -- to build the digital libraries envisioned since the earliest days of computing. The experience of using the Web also makes it clear that linking and resource discovery are complicated problems, at least in part because of the sheer quantity of data available.
Resource discovery is difficult not only because there is so much data to sift through, but also because so many varying copies of resources exist. When performing a search with Archie [ED91], an index of anonymous ftp sites, a query for a particular application will often discover dozens of copies of the same resource and return a long search result set that contains little information.
One technique that can be used to manage the discovery and location of large quantities of data is deduplication. Exactly as the name suggests, deduplication is the process that narrows a list of resources by eliminating any duplicate references to a resource.
Deduplication is an interesting problem because it is difficult to decide when two resources are duplicates. Consider two possible uses of a digital library: A student searches for a copy of Hamlet in the library; the digital library actually has many copies of Hamlet -- some by different editors, some in different collections -- but the student only wants to know that a copy of Hamlet is to be had. For the student, deduplication of the list to one or two entries provides a valuable simplification. On the other hand, a Shakespeare scholar may want to find a copy of one of the early printed editions of Hamlet. For that scholar, deduplication needs to work within a narrow sense of equality. The scholar does not need to know that there are three libraries with copies of the first quarto, but does need to know that copies of the first and second quartos exist.
These two brief examples suggest that decisions about when to deduplicate and how to do it are decisions that need to be made within a particular context.
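The two contexts can be modeled as two different equivalence keys applied to the same result list. The following Python sketch illustrates the idea; the record fields and key functions are assumptions for illustration, not part of any proposed system:

```python
# Each record describes one copy of a work held somewhere.
records = [
    {"title": "Hamlet", "edition": "first quarto",        "site": "Library A"},
    {"title": "Hamlet", "edition": "first quarto",        "site": "Library B"},
    {"title": "Hamlet", "edition": "second quarto",       "site": "Library A"},
    {"title": "Hamlet", "edition": "Arden (ed. Jenkins)", "site": "Library C"},
]

def dedup(records, key):
    """Keep one record per equivalence class defined by `key`."""
    seen, result = set(), []
    for r in records:
        k = key(r)
        if k not in seen:
            seen.add(k)
            result.append(r)
    return result

# The student's context: any copy of the work will do.
student_view = dedup(records, key=lambda r: r["title"])

# The scholar's context: editions are distinct, but multiple
# sites holding the same edition are duplicates.
scholar_view = dedup(records, key=lambda r: (r["title"], r["edition"]))
```

Here the student's list collapses to a single entry, while the scholar's list keeps one entry per edition; the context is captured entirely in the choice of key.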
The developing information infrastructure will need to support a general class of information objects, and deduplication will be useful in that context. However, I propose to explore deduplication only within the context of the digital library. By dealing with libraries, I will focus primarily on documents like those typically collected in libraries now -- books, journals, technical reports, etc.
The Library 2000 group is exploring the architectural and systems engineering issues associated with a large-scale digital library of computer science technical reports. The project provides a base of several thousand documents that will be available online. Saltzer [Sa92] has a good account of the vision of digital libraries.
Though computer science technical reports might seem like a relatively simple class of data without the many problems of editions and printings that complicate the world of Shakespeare's texts, the collection of reports actually has many complications: reports have been published by more than one organization; authors have published several versions of the same report; reports have later been published in books and journals.
The Library 2000 group is developing an architectural overview of a digital library system [HKCS94], [Hy94]. The overview describes several potentially independent services that interoperate to provide the appearance of a single library system for users. This architecture, and others similar to it [KW94], use long-lived, global names to refer to documents and exchange information about them. For example, when the discovery service needs to communicate a result set to a client that performed a search, it should return a list of document names.
One of the more vexing issues in the architecture is the role of so-called "meta-information," which is a term used to describe information about a document. Meta-information can be used to provide many different kinds of information, ranging from bibliographic information to location information. In general, meta-information is information about a document or one of its instantiations that is used within the system to decide how to manage the document; thus, the librarian uses the bibliography for discovery, the client software uses representation information to decide how to print the document, and the archive service uses access control information to decide who can retrieve a document and how much to charge them. Many other examples exist. Grass [Gr94] has a good discussion of many of these issues.
Because meta-information is used in different ways by different components of the system, such information will be stored throughout the system rather than in a central location; services that return document names may also pass meta-information to clients in the form of hints that may prove to be invalid.
Meta-information can also describe a group of documents. It can be used to create objects that unify existing versions or otherwise capture similarities across a group of documents. For example, a meta-information object could contain the names of all the versions of a particular document. Both Grass and Buckland [Bu94] discuss the utility of such a meta-object.
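One simple realization of such a meta-object is a record that names a work and lists the global names of its known versions. A sketch follows; the field names and the document names are hypothetical, not real identifiers from any system:

```python
# A meta-object unifying the versions of one document. The names
# are invented global document names, used only for illustration.
version_set = {
    "work": "doc:mit-lcs/tr-123",
    "versions": [
        "doc:mit-lcs/tr-123/v1",    # original technical report
        "doc:mit-lcs/tr-123/v2",    # revised report
        "doc:acm/journal-xyz/p45",  # later journal publication
    ],
}

def versions_of(meta_object):
    """Return every version named by the meta-object."""
    return list(meta_object["versions"])
```

A discovery service could return the meta-object's name in place of the three version names, deferring the choice among versions until retrieval time.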
The key insight for understanding deduplication is that the equality relationship between two documents depends on the context of the comparison. The simple case described above is the difference between the general reader and the specialist. But many other kinds of duplicates can exist within the digital library -- multiple copies of a document at independent sites, copies in different formats, the annotated, condensed, and unexpurgated editions, copies in different languages, copies published by different companies at different times, etc.
Many of these problems of duplication are related to the problem of naming resources. Each representation of a digital document needs a name that allows access to it, but creating these names introduces the problem of duplicates. The issues of naming and deduplication are closely coupled: Names have a particular scope that specifies how to resolve them, and names are a way of specifying differences between two things. By saying an object has name A, you suggest that it is different from an object with name B. The rules governing equivalence in namespaces will have implications for deduplication.
Sollins [So85] discusses the use of contexts and of aggregates, the collection of contexts that an individual uses for resolution, in terms that relate to the problems of deduplication. Austin [Au79] describes some of the philosophical issues associated with naming and distinguishing named things.
If names provide a start for the deduplication process, name resolution and retrieval provide the end-point for deduplication. In fact, retrieval significantly complicates the deduplication issue. When a user decides to retrieve a document from a deduplicated list, the system will have to decide what copy to retrieve. There are design decisions involving how much information to associate with the deduplicated list and how to provide it to the retrieval service.
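One plausible policy for that design decision is to rank the copies behind a deduplicated entry by per-context preferences, such as representation format and site. The sketch below assumes invented record fields and a fixed format ranking; it is one possible policy, not a proposed design:

```python
# Copies hiding behind one deduplicated entry. Names and fields
# are hypothetical, for illustration only.
copies = [
    {"name": "doc:tr-9/a", "format": "scanned-image", "site": "remote"},
    {"name": "doc:tr-9/b", "format": "postscript",    "site": "local"},
    {"name": "doc:tr-9/c", "format": "postscript",    "site": "remote"},
]

# Lower rank is better; unknown formats sort last.
FORMAT_RANK = {"postscript": 0, "scanned-image": 1}

def choose_copy(copies):
    """Pick the copy to retrieve: prefer better formats, then local sites."""
    return min(copies, key=lambda c: (FORMAT_RANK.get(c["format"], 99),
                                      c["site"] != "local"))
```

The point is that the retrieval service needs enough information -- a list of candidate copies and some basis for ranking them -- to make this choice after deduplication has hidden the candidates from the user.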
One place to look for a similar problem is the library community and the process of creating union records and information dossiers, as described by Buckland [Bu94]. A union record accumulates information from many bibliographic entries into a single more complete entry. A union record has a relatively narrow definition of equality; on the other hand, an information dossier is a more general kind of union record that incorporates information from several distinct documents into a single meta-object.
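The accumulation that a union record performs can be sketched as a field-wise merge of several bibliographic entries. The merge rules and field names below are assumptions chosen for illustration, not a description of any library system:

```python
def union_record(entries):
    """Merge bibliographic entries into one union record: single-valued
    fields keep the first value seen; holdings accumulate without repeats."""
    merged = {"holdings": []}
    for e in entries:
        for field, value in e.items():
            if field == "holdings":
                for h in value:
                    if h not in merged["holdings"]:
                        merged["holdings"].append(h)
            elif field not in merged:
                merged[field] = value
    return merged

entries = [
    {"title": "Hamlet", "holdings": ["Library A"]},
    {"title": "Hamlet", "author": "Shakespeare, W.", "holdings": ["Library B"]},
]
merged = union_record(entries)
```

An information dossier would relax the precondition: the merged entries need not describe the same document, only related ones.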
I propose a course of research to determine how deduplication should be performed within the proposed architecture of the digital library. The main goal of this research is to understand the tradeoffs involved in various deduplication schemes and their implications for the use of meta-information within the digital library.
I will build working deduplication services for a small number of deduplication contexts. The services would operate primarily on the data assembled for the CS-TR project. They would also make limited use of other computer science technical report information available on the Internet to widen the numbers and kinds of duplicates that must be handled.
A working deduplication service will make use of several digital library components that are in some cases not in place yet. The services will require the testbed to supply rudimentary support for namespace management, discovery, and navigation. These areas are not the focus of my research. The level of functionality required should not necessitate a major commitment, and continuing work by the Library 2000 group should not interfere with the deduplication project.
My prototype deduplication services should provide insights into a number of meta-information problems and design issues.
The digital library community is a segment of a larger community concerned with the emerging information infrastructure. There is a great deal of current research that considers meta-information, naming, and linking within that larger community. Though many of these projects are tackling a much larger problem, their progress may provide insight into the specific problem of deduplication.
The Information Mesh [So92] is exploring the operating system and network abstractions necessary to support widely-distributed information. A primary focus of this research is how to link nodes of information. The notions of long-lived names, strong but evolvable typing, and hints bear strong similarities to the architecture being discussed for digital libraries.
The URI working group of the Internet Engineering Task Force is working to develop an information infrastructure that focuses on naming and discovering information resources, and determining how and where they can be accessed. The URI group has designed a system with three components: uniform resource names (URN), uniform resource characteristics (URC), and uniform resource locators (URL). The URN is a globally persistent name for an information resource. The URL identifies a particular location for a resource named by a URN, but does not have the persistence properties of URNs.
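The relationship among the three components can be pictured as a small resolution mapping: a URN resolves, through its meta-information (URC), to one or more URLs. The record shape and the names below are invented for illustration and are not drawn from the Internet drafts:

```python
# Hypothetical resolution table: URN -> URC (meta-information
# plus current locations). All names here are made up.
resolution = {
    "urn:example:hamlet-q1": {
        "title": "Hamlet, first quarto",
        "urls": [
            "ftp://archive-a.example.edu/hamlet/q1.txt",
            "ftp://archive-b.example.org/shakespeare/hamlet-q1.txt",
        ],
    },
}

def resolve(urn):
    """Map a persistent name to its current (possibly stale) locations."""
    urc = resolution[urn]
    return urc["urls"]
```

Note that the two URLs above are exactly the kind of duplicate pair that a deduplication service must recognize: one persistent name, several locations.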
The URC component of the URI architecture is the least well-defined. It is currently defined as a collection of meta-information that is used to aid discovery and navigation. Though the use of meta-information is similar to the digital library data model, the URI model of how meta-information is used differs from the model I have described. The current Internet drafts by Daniels and Mealling [DM94] and Mealling [Me94] describe a single data structure used for both discovery and navigation.
The URI working group's efforts are reported in drafts by Sollins and Masinter [SM94], Weider and Deutsch [WD94], and Berners-Lee [Be94].
Recently, a small group of researchers met with members of the Internet Activities Board to discuss the network-level support needed for the information infrastructure. Two working groups met under the auspices of the IAB Information Infrastructure Workshop and are now preparing reports on their meetings [iiw].
The core model of information architecture described by the report includes a description of meta-information that is closer to the model I have described than to the URI model. It identifies discovery and navigation as two potential uses of meta-information, but recognizes their different access and update requirements.
Many of the issues relevant to deduplication -- including equality of named objects and typing of objects -- are discussed in the report.
The distributed database community aims to integrate information from pre-existing, heterogeneous local databases and provide global, uniform access to them. An active area of research in the distributed database community is merging semantically similar information with different representations into a global schema. Breitbart [Br90] and Bright, Hurson, and Pakzad [BHZ92] provide overviews of the multidatabase research and discuss some of the integration problems; the latter paper discusses the role of naming in a global system.
When possible, I have included pointers to online versions of documents listed below. It is quite possible that the URLs for these works will stop working at some point.