M.I.T. Department of Electrical Engineering and Computer Science
Jeremy Hylton
Thesis Supervisor: Prof. Jerome Saltzer
The explosive growth of the World-Wide Web over the last two years underscores the realization that the technology exists -- or will soon exist -- to build the digital libraries envisioned since the earliest days of computing. The experience of using the Web also makes it clear that linking and resource discovery are complicated problems, at least in part because of the sheer quantity of data available.
Resource discovery is difficult not only because there is so much data to sift through, but also because so many varying copies of resources exist. When performing a search with Archie [ED91], an index of anonymous ftp sites, a query for a particular application will often discover dozens of copies of the same resource and return a long search result set that contains little information.
One technique that can be used to manage the discovery and location of large quantities of data is deduplication. Exactly as the name suggests, deduplication is the process that narrows a list of resources by eliminating any duplicate references to a resource.
Deduplication is an interesting problem because it is difficult to decide when two resources are duplicates. Consider two possible uses of a digital library: A student searches for a copy of Hamlet in the library; the digital library actually has many copies of Hamlet -- some by different editors, some in different collections -- but the student only wants to know that a copy of Hamlet is to be had. For the student, deduplication of the list to one or two entries provides a valuable simplification. On the other hand, a Shakespeare scholar may want to find a copy of one of the early printed editions of Hamlet. For that scholar, deduplication needs to work within a narrow sense of equality. The scholar does not need to know that there are three libraries with copies of the first quarto, but does need to know that copies of the first and second quartos exist.
These two brief examples suggest that decisions about when to deduplicate and how to do it are decisions that need to be made within a particular context.
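The two contexts can be modeled as two different equivalence keys applied to the same result list. The following Python sketch illustrates the idea; the record fields and key functions are assumptions for illustration, not part of any proposed system:

```python
# Each record describes one copy of a work held somewhere.
records = [
    {"title": "Hamlet", "edition": "first quarto",        "site": "Library A"},
    {"title": "Hamlet", "edition": "first quarto",        "site": "Library B"},
    {"title": "Hamlet", "edition": "second quarto",       "site": "Library A"},
    {"title": "Hamlet", "edition": "Arden (ed. Jenkins)", "site": "Library C"},
]

def dedup(records, key):
    """Keep one record per equivalence class defined by `key`."""
    seen, result = set(), []
    for r in records:
        k = key(r)
        if k not in seen:
            seen.add(k)
            result.append(r)
    return result

# The student's context: any copy of the work will do.
student_view = dedup(records, key=lambda r: r["title"])

# The scholar's context: editions are distinct, but multiple
# sites holding the same edition are duplicates.
scholar_view = dedup(records, key=lambda r: (r["title"], r["edition"]))
```

Here the student's list collapses to a single entry, while the scholar's list keeps one entry per edition; the context is captured entirely in the choice of key.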
The developing information infrastructure will need to support a general class of information objects, and deduplication will be useful in that context. However, I propose to explore deduplication only within the context of the digital library. By dealing with libraries, I will focus primarily on documents like those typically collected in libraries now -- books, journals, technical reports, etc.
The Library 2000 group is exploring the architectural and systems engineering issues associated with a large-scale digital library of computer science technical reports. The project provides a base of several thousand documents that will be available online. Saltzer [Sa92] has a good account of the vision of digital libraries.
Though computer science technical reports might seem like a relatively simple class of data without the many problems of editions and printings that complicate the world of Shakespeare's texts, the collection of reports actually has many complications: reports have been published by more than one organization; authors have published several versions of the same report; reports have later been published in books and journals.
The Library 2000 group is developing an architectural overview of a digital library system [HKCS94], [Hy94]. The overview describes several potentially independent services that interoperate to provide the appearance of a single library system for users. This architecture, and others similar to it [KW94], use long-lived, global names to refer to documents and exchange information about them. For example, when the discovery service needs to communicate a result set to a client that performed a search, it should return a list of document names.
One of the more vexing issues in the architecture is the role of so-called "meta-information," which is a term used to describe information about a document. Meta-information can be used to provide many different kinds of information, ranging from bibliographic information to location information. In general, meta-information is information about a document or one of its instantiations that is used within the system to decide how to manage the document; thus, the librarian uses the bibliography for discovery, the client software uses representation information to decide how to print the document, and the archive service uses access control information to decide who can retrieve a document and how much to charge them. Many other examples exist. Grass [Gr94] has a good discussion of many of these issues.
Because meta-information is used in different ways by different components of the system, such information will be stored throughout the system rather than in a central location; services that return document names may also pass meta-information to clients in the form of hints that may prove to be invalid.
Meta-information can also describe a group of documents. It can be used to create objects that unify existing versions or otherwise capture similarities across a group of documents. For example, a meta-information object could contain the names of all the versions of a particular document. Both Grass and Buckland [Bu94] discuss the utility of such a meta-object.
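One simple realization of such a meta-object is a record that names a work and lists the global names of its known versions. A sketch follows; the field names and the document names are hypothetical, not real identifiers from any system:

```python
# A meta-object unifying the versions of one document. The names
# are invented global document names, used only for illustration.
version_set = {
    "work": "doc:mit-lcs/tr-123",
    "versions": [
        "doc:mit-lcs/tr-123/v1",    # original technical report
        "doc:mit-lcs/tr-123/v2",    # revised report
        "doc:acm/journal-xyz/p45",  # later journal publication
    ],
}

def versions_of(meta_object):
    """Return every version named by the meta-object."""
    return list(meta_object["versions"])
```

A discovery service could return the meta-object's name in place of the three version names, deferring the choice among versions until retrieval time.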
The key insight for understanding deduplication is that the equality relationship between two documents depends on the context of the comparison. The simple case described above is the difference between the general reader and the specialist. But many other kinds of duplicates can exist within the digital library -- multiple copies of a document at independent sites, copies in different formats, the annotated, condensed, and unexpurgated editions, copies in different languages, copies published by different companies at different times, etc.
Many of these problems of duplication are related to the problem of naming resources. Each representation of a digital document needs a name that allows access to it, but creating these names introduces the problem of duplicates. The issues of naming and deduplication are closely coupled: Names have a particular scope that specifies how to resolve them, and names are a way of specifying differences between two things. By saying an object has name A, you suggest that it is different from an object with name B. The rules governing equivalence in namespaces will have implications for deduplication.
Sollins [So85] discusses the use of contexts and of aggregates, the collection of contexts that an individual uses for resolution, in terms that relate to the problems of deduplication. Austin [Au79] describes some of the philosophical issues associated with naming and distinguishing named things.
If names provide a start for the deduplication process, name resolution and retrieval provide the end-point for deduplication. In fact, retrieval significantly complicates the deduplication issue. When a user decides to retrieve a document from a deduplicated list, the system will have to decide what copy to retrieve. There are design decisions involving how much information to associate with the deduplicated list and how to provide it to the retrieval service.
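One plausible policy for that design decision is to rank the copies behind a deduplicated entry by per-context preferences, such as representation format and site. The sketch below assumes invented record fields and a fixed format ranking; it is one possible policy, not a proposed design:

```python
# Copies hiding behind one deduplicated entry. Names and fields
# are hypothetical, for illustration only.
copies = [
    {"name": "doc:tr-9/a", "format": "scanned-image", "site": "remote"},
    {"name": "doc:tr-9/b", "format": "postscript",    "site": "local"},
    {"name": "doc:tr-9/c", "format": "postscript",    "site": "remote"},
]

# Lower rank is better; unknown formats sort last.
FORMAT_RANK = {"postscript": 0, "scanned-image": 1}

def choose_copy(copies):
    """Pick the copy to retrieve: prefer better formats, then local sites."""
    return min(copies, key=lambda c: (FORMAT_RANK.get(c["format"], 99),
                                      c["site"] != "local"))
```

The point is that the retrieval service needs enough information -- a list of candidate copies and some basis for ranking them -- to make this choice after deduplication has hidden the candidates from the user.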
One place to look for a similar problem is the library community and the process of creating union records and information dossiers, as described by Buckland [Bu94]. A union record accumulates information from many bibliographic entries into a single more complete entry. A union record has a relatively narrow definition of equality; on the other hand, an information dossier is a more general kind of union record that incorporates information from several distinct documents into a single meta-object.
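The accumulation that a union record performs can be sketched as a field-wise merge of several bibliographic entries. The merge rules and field names below are assumptions chosen for illustration, not a description of any library system:

```python
def union_record(entries):
    """Merge bibliographic entries into one union record: single-valued
    fields keep the first value seen; holdings accumulate without repeats."""
    merged = {"holdings": []}
    for e in entries:
        for field, value in e.items():
            if field == "holdings":
                for h in value:
                    if h not in merged["holdings"]:
                        merged["holdings"].append(h)
            elif field not in merged:
                merged[field] = value
    return merged

entries = [
    {"title": "Hamlet", "holdings": ["Library A"]},
    {"title": "Hamlet", "author": "Shakespeare, W.", "holdings": ["Library B"]},
]
merged = union_record(entries)
```

An information dossier would relax the precondition: the merged entries need not describe the same document, only related ones.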
I propose a course of research to determine how deduplication should be performed within the proposed architecture of the digital library. The main goal of this research is to understand the tradeoffs involved in various deduplication schemes and their implications for the use of meta-information within the digital library.
I will build working deduplication services for a small number of deduplication contexts. The services would operate primarily on the data assembled for the CS-TR project. They would also make limited use of other computer science technical report information available on the Internet to widen the numbers and kinds of duplicates that must be handled.
A working deduplication service will make use of several digital library components that are in some cases not in place yet. The services will require the testbed to supply rudimentary support for namespace management, discovery, and navigation. These areas are not the focus of my research. The level of functionality required should not necessitate a major commitment, and continuing work by the Library 2000 group should not interfere with the deduplication project.
My prototype deduplication services should provide insights into a number of meta-information problems and design issues.
The digital library community is a segment of a larger community concerned with the emerging information infrastructure. There is a great deal of current research that considers meta-information, naming, and linking within that larger community. Though many of these projects are tackling a much larger problem, their progress may provide insight into the specific problem of deduplication.
The Information Mesh [So92] is exploring the operating system and network abstractions necessary to support widely-distributed information. A primary focus of this research is how to link nodes of information. The notions of long-lived names, strong but evolvable typing, and hints bear strong similarities to the architecture being discussed for digital libraries.
The URI working group of the Internet Engineering Task Force is working to develop an information infrastructure that focuses on naming and discovering information resources, and determining how and where they can be accessed. The URI group has designed a system with three components: uniform resource names (URN), uniform resource characteristics (URC), and uniform resource locators (URL). The URN is a globally persistent name for an information resource. The URL identifies a particular location for a resource named by a URN, but does not have the persistence properties of URNs.
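The relationship among the three components can be pictured as a small resolution mapping: a URN resolves, through its meta-information (URC), to one or more URLs. The record shape and the names below are invented for illustration and are not drawn from the Internet drafts:

```python
# Hypothetical resolution table: URN -> URC (meta-information
# plus current locations). All names here are made up.
resolution = {
    "urn:example:hamlet-q1": {
        "title": "Hamlet, first quarto",
        "urls": [
            "ftp://archive-a.example.edu/hamlet/q1.txt",
            "ftp://archive-b.example.org/shakespeare/hamlet-q1.txt",
        ],
    },
}

def resolve(urn):
    """Map a persistent name to its current (possibly stale) locations."""
    urc = resolution[urn]
    return urc["urls"]
```

Note that the two URLs above are exactly the kind of duplicate pair that a deduplication service must recognize: one persistent name, several locations.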
The URC component of the URI architecture is the least well-defined. It is currently defined as a collection of meta-information that is used to aid discovery and navigation. Though the use of meta-information is similar to the digital library data model, the URI model of how meta-information is used differs from the model I have described. The current Internet drafts by Daniels and Mealling [DM94] and Mealling [Me94] describe a single data structure used for both discovery and navigation.
The URI working group's efforts are reported in drafts by Sollins and Masinter [SM94], Weider and Deutsch [WD94], and Berners-Lee [Be94].
Recently, a small group of researchers met with members of the Internet Activities Board to discuss the network-level support needed for the information infrastructure. Two working groups met under the auspices of the IAB Information Infrastructure Workshop and are now preparing reports on their meetings [iiw].
The core model of information architecture described by the report includes a description of meta-information that is closer to the model I have described than to the URI model. It identifies discovery and navigation as two potential uses of meta-information, but recognizes their different access and update requirements.
Many of the issues relevant to deduplication -- including equality of named objects and typing of objects -- are discussed in the report.
The distributed database community aims to integrate information from pre-existing, heterogeneous local databases and provide global, uniform access to them. An active area of research in the distributed database community is merging semantically similar information with different representations into a global schema. Breitbart [Br90] and Bright, Hurson, and Pakzad [BHZ92] provide overviews of the multidatabase research and discuss some of the integration problems; the latter paper discusses the role of naming in a global system.
When possible, I have included pointers to online versions of documents listed below. It is quite possible that the URLs for these works will stop working at some point.