Presenting Relations and Clusters

This chapter describes some preliminary work on presenting a collection of 240,000 records after author-title clusters are identified and information dossiers constructed. The Web-based interface allows simple full-text queries of the collection, presents related records together on the screen, and automatically creates links that invoke searches of Web indices, technical report archives, and the local library collection.

Although the user interface is only a prototype, it does illustrate two important design principles.

The results of a search should display the works that match the query, presenting the union records for different versions of the same work together. The presentation should cull the most complete and accurate information available from any duplicate records for a document, but should not hide variations in the underlying records.
The related records in a particular cluster should be used to help the user find an easily accessible copy of a document. The combination of bibliographic information, which allows well-defined queries into Web indices, library catalogs, and other databases, and clusters, which link related documents, should increase a user's ability to find a particular document (or a related one that represents the same abstract work).

The basic Web interface

The basic interface to the collection is a full-text index of all 240,000 source records that allows basic Boolean queries. (The underlying search engine allows for queries that look for words only when they appear in certain fields; a Web interface for this feature is underway.) The records are stored in a database that maintains a unique identifier for each record, a separate name space for clusters, and a mapping from record id to cluster id. The index uses record ids internally, but returns a list of clusters in response to a user query.
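The step from record ids to clusters can be sketched as follows. This is only an illustration: the two dictionaries are hypothetical in-memory stand-ins for the record database and the full-text index, and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the database mapping and the full-text index.
RECORD_TO_CLUSTER = {"r1": "c1", "r2": "c1", "r3": "c2"}
WORD_INDEX = {
    "remote": {"r1", "r2"},
    "procedure": {"r1", "r2", "r3"},
}

def search_records(words):
    """Boolean AND query: ids of records that contain every word."""
    postings = [WORD_INDEX.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

def clusters_for_query(words):
    """Run the query on record ids, then group the hits by cluster id."""
    hits = defaultdict(list)
    for record_id in sorted(search_records(words)):
        hits[RECORD_TO_CLUSTER[record_id]].append(record_id)
    return dict(hits)
```

The index itself never sees cluster ids; the grouping happens only when results are returned, which matches the separation between the two name spaces described above.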

A query using the Web interface returns a summary showing the title, author, and year of all the matching works, followed by an expanded entry for each cluster. Figure 5-1 shows the expanded entry for the paper ``Lightweight Remote Procedure Call.''

The first two lines of the cluster entry show the title and authors (the two defining characteristics of the document), followed by entries for the abstract and keywords, which will be the same for each document in the cluster.

A bulleted list of three document citations follows the keywords; the citations are produced from the union records. The first two lines of each document citation are formatted to look like a citation in a traditional bibliography. The first entry, an article in ACM Transactions on Computer Systems, shows the date the article was published, the volume and number, and the pagination.

The third line of the document citation contains a hypertext link to a search of the local reading room catalog. The search will look for the particular document being cited. In the first entry, it searches for the journal the article appeared in; for the second entry, a conference paper, it searches for the proceedings.

The last line of the document citation contains two links to more information about the records for the document. The link to the expanded record displays a full union record, which includes nonstandard fields that are not displayed in the regular citation. The second link returns the source records used to make the union record; the link indicates the number of source records and returns them in their original form.

The last line of the cluster entry (following the bulleted document citations) incorporates links to several other search services. There are links to the Alta Vista and Excite Web indices, which contain queries based on the title field and the authors' last names. When the work has been published as a technical report, links to the NCSTRL and UCSTRI technical report indices are included.
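A minimal sketch of how such a link might be generated from the title field and the authors' last names is shown below. The base URL, the query parameter name, and the rule for extracting a last name are all assumptions made for illustration; they are not the actual query formats of any of the services named above.

```python
from urllib.parse import urlencode

def last_name(author):
    """Assumed rule: text before a comma, otherwise the final word."""
    author = author.strip()
    if "," in author:
        return author.split(",", 1)[0].strip()
    return author.split()[-1]

def search_link(base_url, title, authors, param="q"):
    """Build a URL whose query holds the title plus each author's last name."""
    terms = [title] + [last_name(a) for a in authors]
    return base_url + "?" + urlencode({param: " ".join(terms)})

# Illustrative record and a placeholder service address.
link = search_link(
    "http://search.example/query",
    "An Example Title",
    ["Jane Q. Doe", "Smith, John"],
)
```

Because the query is built from fields that are the same for every document in the cluster, one link per service is enough for the whole cluster entry.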

Figure: Sample author-title cluster display from Web interface

Assessing the quality of union records

The most serious shortcoming of the current interface is that it provides no warning to the user when there are differences between the union record and the source records. Preliminary work on detecting and measuring these differences is presented here, but the results are not integrated into the current interface.

Because the displayed document citations are based on the union records created by merging duplicate records, there is a chance that the citation will contain mistakes or hide useful information that is contained in the source records. Problems could be the result of a record that describes a different document being included in the cluster by mistake. However, most problems arise because the source records contain different and conflicting information about the document. (These problems were discussed in greater detail in Chapter 4.)

When a record is included in an author-title cluster by mistake, two different kinds of failure result.

In a cluster that contains several citations for a document type, the mis-matched record will be hidden. The union record will not describe the hidden record, and any query that matches the record will return the cluster it is mistakenly included in. The only way to discover the hidden record is to look at the source records.
In a cluster with only a few citations, the merger process may create a union record that includes some fields from the mis-matched record and some fields from the good records, because an arbitrary value is selected when a most frequently occurring field value cannot be identified. The result is a confusing union record that mixes information from different documents or works.
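The second failure can be illustrated with a simplified stand-in for the merger. The sketch assumes that each record is a dictionary of field values and that "arbitrary" means "the first value seen" when no value is most frequent; the real merger, described in Chapter 4, may choose differently.

```python
from collections import Counter

def merge_field(values):
    """Most frequent value; on a tie, the first value seen (an assumption)."""
    counts = Counter(values)
    # max() returns the first maximal key in insertion order.
    return max(counts, key=counts.get)

def merge_records(sources):
    """Build a union record field by field from the source records."""
    names = []
    for source in sources:
        for name in source:
            if name not in names:
                names.append(name)
    return {
        name: merge_field([s[name] for s in sources if name in s])
        for name in names
    }

# Two records that describe different reports but share a cluster.
mixed = merge_records([
    {"title": "A", "year": "1990"},
    {"title": "B", "year": "1990", "number": "TR-1"},
])
```

Here `mixed` takes its title from the first record and its report number from the second, so the union record describes neither document correctly.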

The ``source match'' ratio can identify the first problem. It measures how closely the fields in the union record match the fields in a particular source record, and is computed by comparing the standard fields in the union record with the same fields in the source record. (Fields not defined in the source record are ignored.) The value is a tuple of the number of fields that match and the number of fields that do not match; e.g., (4,3) indicates that four fields match and three do not.
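A minimal sketch of this computation follows. The list of standard fields is a hypothetical placeholder, records are assumed to be dictionaries of already-normalized field values, and skipping fields missing from the union record is an added assumption.

```python
# Hypothetical list of standard fields; the real set is not given here.
STANDARD_FIELDS = ("author", "title", "year", "institution", "number")

def source_match(union, source, fields=STANDARD_FIELDS):
    """Return (matching, non-matching) field counts for one source record."""
    matches = mismatches = 0
    for name in fields:
        # Fields the source record does not define are ignored.
        if name not in source or name not in union:
            continue
        if union[name] == source[name]:
            matches += 1
        else:
            mismatches += 1
    return (matches, mismatches)
```

Returning the two counts separately, rather than a single fraction, keeps the number of compared fields visible: (1,1) and (4,4) are the same proportion but carry different weight.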

The second problem can be identified with the ``field consensus'' ratio, which measures how many of the source records contain the same value as the union record for a particular field.
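The field consensus ratio can be sketched in the same style. Reporting it as a tuple of agreeing and disagreeing source records, and ignoring source records that lack the field, are assumptions made to parallel the source match ratio.

```python
def field_consensus(union, sources, field):
    """Return (agreeing, disagreeing) source-record counts for one field."""
    agree = differ = 0
    for source in sources:
        # Assumed: a source record without the field casts no vote.
        if field not in source:
            continue
        if source[field] == union.get(field):
            agree += 1
        else:
            differ += 1
    return (agree, differ)
```

A field whose union value is backed by only one of two source records, such as (1,1), is a sign that the value may have been chosen arbitrarily.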

For each union record, we can compute the source match value for each source record and the field consensus value for each of the major fields. The result in each case is a list of ratios, one source match ratio for each source record and one field consensus ratio for each standard field.

Unfortunately, these ratios provide only a very rough measure of the differences between records or fields. The variations in BibTeX records that make identifying duplicate and related records hard also complicate the analysis of union records. Variations such as abbreviations and typos cause fields to fail to match. To limit the effects of formatting errors, the field values are normalized before comparison: all non-alphanumeric characters are eliminated and all letters are converted to lowercase.
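The normalization step is simple enough to sketch directly. The function name is an assumption, and the example strings are illustrative.

```python
def normalize(value):
    """Lowercase the value and drop every non-alphanumeric character."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

# Differences in punctuation, spacing, and case disappear...
same = normalize("Light-weight Remote  Procedure Call") == normalize(
    "lightweight remote procedure call"
)
# ...but an abbreviated value still fails to match the full form.
different = normalize("ACM Trans. Comput. Syst.") != normalize(
    "ACM Transactions on Computer Systems"
)
```

Both `same` and `different` are true, which shows why normalization limits formatting noise without removing the ambiguity caused by abbreviations and typos.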

These problems make the ratios hard to interpret. A source match ratio of (2,5) could mean that the source describes a different document than does the union record. But it could just as well mean that the source record contains a number of non-standard abbreviations or typographical errors.

Despite the problems with creating and interpreting these ratios, some informal tests suggest that they are helpful for identifying clusters that contain false matches. Figure 5-2 shows a sample cluster where the statistics help to identify a false match. The cluster contains four technical report citations, which the union record reports as ``Process Migration in the Sprite Operating System.'' In fact, the cluster also contains a single citation for a different technical report by the same author, titled ``Transparent Process Migration in the Sprite Operating System.'' The problem is clear in the source match ratio, which shows that the third source record shares one common field value with the union record and differs on five other fields. The single differences that appear in the field consensus ratios also suggest that one of the source records may not belong.

Figure: Source match and field consensus ratios for cluster including a false match

In the previous example, the difference between source record and union record was rather pronounced. Often, the statistics are more ambiguous. Figure 5-3 shows a more representative set of ratios. Although the first source record matches on two fields and disagrees on five fields, the cluster is correct.

Figure: Source match and field consensus ratios for correct cluster

Even if the ratios are useful, it is less clear how to use them to detect problems automatically and warn the user. Always presenting the statistics to the user seems cumbersome, because they clutter the display and make it harder to understand. A middle ground that displays the statistics when they show significant variation and suppresses them when there is very little might strike the right balance. A warning indicator that showed one of three values (a mistake is likely, possible, or unlikely) would also display the information concisely.
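One way such an indicator might be derived from the source match ratios is sketched below. The rule (take the worst mismatch fraction over all source records) and both thresholds are untested assumptions chosen only for illustration; they are not results of this work and would need tuning against real clusters.

```python
# Assumed, untuned thresholds on the fraction of mismatched fields.
LIKELY_THRESHOLD = 0.75
POSSIBLE_THRESHOLD = 0.40

def warning_level(source_matches):
    """Map a list of (match, mismatch) tuples to a three-valued warning."""
    fractions = [
        mismatch / (match + mismatch)
        for match, mismatch in source_matches
        if match + mismatch > 0
    ]
    worst = max(fractions, default=0.0)
    if worst >= LIKELY_THRESHOLD:
        return "likely"
    if worst >= POSSIBLE_THRESHOLD:
        return "possible"
    return "unlikely"
```

Under these thresholds a source record with ratio (1,5) yields "likely" and one with ratio (2,5) yields "possible", so the indicator flags an ambiguous cluster without claiming a definite error.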


Jeremy A Hylton
Mon Feb 19 15:33:12 1996