Citation deduplication.
Another problem in searching for documents is that a search sometimes
turns up a large number of non-obvious duplicate citations. Jeremy
Hylton developed a two-tiered deduplication scheme using
Adar's heuristic algorithm at the first tier and vector matching at the
second tier. He applied this scheme to a collection of 250,000
computer science citations and found that he could not only reliably
collect together multiple citations to the same work, but also
identify closely related records, such as work that has appeared in a
workshop, a technical report, and a published paper.
The Digital Index for
Works in Computer Science (DIFWICS) is an experimental demonstration
of this work, which is described in detail in Hylton's thesis,
"Identifying and Merging Related Bibliographic Records" (M.Eng.,
February 1996).
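
The general shape of a two-tier scheme like the one described above can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions, not Hylton's implementation: the first tier here is a simple
stand-in (grouping by first-author surname) rather than Adar's heuristic
algorithm, the second tier is cosine similarity over title word counts, and
the record fields (id, surname, title), the function names, and the 0.8
threshold are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(citations, threshold=0.8):
    """Return clusters of citation ids that appear to describe the same work.

    Tier 1 (assumed stand-in, NOT Adar's heuristic): group records by
    first-author surname so that only plausible candidates are compared.
    Tier 2: inside each group, put a record into the first cluster that
    contains a title whose word-count vector is similar enough.
    """
    blocks = defaultdict(list)
    for citation in citations:
        blocks[citation["surname"].strip().lower()].append(citation)

    clusters = []
    for block in blocks.values():
        block_clusters = []  # each cluster is a list of (id, vector) pairs
        for citation in block:
            vector = Counter(tokens(citation["title"]))
            for cluster in block_clusters:
                if any(cosine(vector, other) >= threshold for _, other in cluster):
                    cluster.append((citation["id"], vector))
                    break
            else:
                # No existing cluster matched, so start a new one.
                block_clusters.append([(citation["id"], vector)])
        clusters.extend([cid for cid, _ in cluster] for cluster in block_clusters)
    return clusters


if __name__ == "__main__":
    sample = [
        {"id": 1, "surname": "Hylton",
         "title": "Identifying and Merging Related Bibliographic Records"},
        {"id": 2, "surname": "hylton",
         "title": "Identifying and merging related bibliographic records."},
        {"id": 3, "surname": "Hylton",
         "title": "A Different Paper on Text Compression"},
    ]
    print(deduplicate(sample))
```

The cheap first tier keeps the number of pairwise comparisons small, which
matters for a collection of this size; the more expensive vector comparison
runs only inside each candidate group. Lowering the threshold would tend to
pull in closely related records as well as exact duplicates, at the cost of
more false merges.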