From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress Report, October 1--December 31, 1994

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Papers, theses, talks, and meetings

-------------------------------

A.  Scanning and other operational activities:

A.1.  Scanning of LCS and AI technical reports and memoranda continued
as a production operation.  During the quarter, 86 technical reports
and memos were scanned, totaling 2,800 pages.  The cumulative total
is now 188 documents and 8,904 pages.

As expected, we have found that the oldest technical reports are also
the most problematic to scan.  First-generation originals are
missing from the paper files, ink has faded and paper has yellowed,
and other strange problems crop up with much higher frequency on
thirty-year-old documents.  Since the emphasis is now on increasing the
scanning production rate, we have modified our strategy for the
moment to work from the most recent documents backward, rather than
from the oldest documents forward.  Once we have achieved a satisfactory
production rate we expect to introduce old documents into the
production line to get more insight into the disruption they cause.

All scanning has now been migrated to the Fujitsu 3906G production
scanner, using Optix software.  This combination is now working
quite smoothly and has a higher production rate than any other
combination we have tried.  To reduce disk and backup bottlenecks as
the production rate increases, a second Digital Audio Tape (DAT) drive and
two 4.0-Gbyte disk drives were added to the scanning station configuration,
giving it a total of 12 Gbytes of disk space.  To use the 4 Gbyte
drives properly required installation of Mac Operating System 7.5,
which in turn raised a number of compatibility problems with other
software, so as of the end of the quarter only one of the 4 Gbyte
drives was in use, and it was temporarily configured as two 2-Gbyte
file systems.  Resolution of the System 7.5 compatibility problems
was proceeding well at the close of the quarter, with the sources
of all the problems identified and fixes in the pipeline.  To avoid
further disruption of the production operation, we are installing 
the fixes one at a time and observing the result long enough to gain
confidence in it before moving on to the next.

The larger disk space has allowed us to stop compressing
scanned images before archiving them to DAT.  Instead, the raw,
unprocessed bit map images (15 Mbytes per page) are now being
preserved, at least in the off-line medium.  The use of compression in
archiving has been a point of considerable discussion.  In this case,
the time spent compressing the images was larger than the time saved
in archiving, so the overall production rate increased when we
abandoned compression.  In addition, we have always been concerned
about increased fragility that comes with removing redundancy from
archived materials.  The discovery that some previously archived and
compressed images, when retrieved and decompressed, were unopenable as
images convinced us that our concern for fragility was well-placed.
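
The reasoning behind the production-rate part of that decision can be
stated as a simple rule: compression pays off only when the time it
takes is smaller than the tape-writing time it saves.  A sketch of the
rule follows, with illustrative names; the actual rates depend on the
scanning station and the DAT drive.

    def compression_worthwhile(compress_seconds, raw_bytes,
                                compressed_bytes, tape_bytes_per_second):
        # Tape-writing time saved by shipping fewer bytes to the DAT.
        saved_tape_seconds = ((raw_bytes - compressed_bytes)
                              / float(tape_bytes_per_second))
        # Compress only if it costs less time than it saves on the tape.
        return compress_seconds < saved_tape_seconds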

We also began exploring alternative archiving methods, on the grounds
that DATs are likely to have a relatively short lifetime.  The
exploration includes recordable CDs and, in discussions
initiated with Eastman Kodak, digital microfilm storage.


A.2.  The previous Quarterly Report mentioned under technical/research
topics that work was underway on defining a document scanning record.
During the quarter, the production scanning operation began preparing
document scanning records for each document.  A document scanning
record contains, in a standard, computer- and human-readable form, the
conditions under which the scanning was done--hardware, software,
identity of the report, name of the operator, resolution and other
hardware and software settings, and a map that lists each image file
and a descriptor of its contents (e.g., "image file 14 contains
report page number 7").  The scan record is prepared by the scanning
operator while the scan is taking place, using an Excel spreadsheet,
which turned out to provide both a good template/default-form
mechanism and a quick way of building the map.  The form
of a scanned document now consists of a folder containing:

     the document scanning record
     a text specification of the document scanning record
     a set of image files
     a file containing sizes and checksums of all the other files

The checksum file is intended to be merged into the document scanning
record, but current software tools aren't up to the job, so for the
moment it is a separate file.

The document scanning record has gone through a number of versions,
and is likely to be refined further as we gain more experience with
its use and with its ability to support archiving and document browsing.
The current version is identified as CSTR.1.3, and a specification of
its contents can be found in

    http://ltt-www.lcs.mit.edu/ltt-www/Public/scanrec/scanrec.html
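
As an illustration of the kind of information the record carries, the
following sketch builds a scanning record and the companion checksum
file.  The field names, layout, and choice of checksum algorithm (MD5
here) are illustrative only; the authoritative definition is the
CSTR.1.3 specification cited above.

    import hashlib
    import os

    def build_scan_record(report_id, operator, resolution_dpi, page_map):
        # page_map is a list of (image file name, description) pairs,
        # e.g. ("image014.tif", "report page number 7").
        return {
            "record_version": "CSTR.1.3",
            "report":         report_id,
            "operator":       operator,
            "scanner":        "Fujitsu 3906G",
            "software":       "Optix",
            "resolution_dpi": resolution_dpi,
            "map": [{"image_file": f, "contents": d} for f, d in page_map],
        }

    def build_checksum_file(folder):
        # One line per file in the document folder: name, size, checksum.
        lines = []
        for name in sorted(os.listdir(folder)):
            data = open(os.path.join(folder, name), "rb").read()
            lines.append("%s %d %s" % (name, len(data),
                                       hashlib.md5(data).hexdigest()))
        return "\n".join(lines)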

A.3.  Document tracking system.  As production scanning went into full gear
it became apparent that we needed something better than a paper logging
system to keep track of what has been done, what is in the pipeline, and
what remains to be done.  Jeremy Hylton developed a simple document
tracking system using a World-Wide Web form page as the interface to drive
a small database program.  The database has been primed with a list of all
known AI and LCS technical reports; as each report is located in the archives
it begins a series of checkoffs that ends with its return to the
archives.  As of the end of the quarter the design was being reviewed by
the various parties who handle documents in the scanning process, and
production use of the tracking system was expected to begin shortly.
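
The heart of such a system is a small status database keyed by report
number.  The sketch below illustrates the idea; the stage names are
placeholders rather than the actual checkoff list, and the real system
is driven through a World-Wide Web form rather than called directly.

    STAGES = ["located in archives", "scanned", "scan record prepared",
              "images archived", "returned to archives"]   # illustrative

    class DocumentTracker:
        def __init__(self, report_ids):
            # Prime the database with every known AI and LCS report.
            self.completed = {rid: [] for rid in report_ids}

        def checkoff(self, report_id, stage):
            # Record one step of the scanning pipeline for a report.
            if stage not in STAGES:
                raise ValueError("unknown stage: " + stage)
            self.completed[report_id].append(stage)

        def in_pipeline(self):
            # Reports that have started but not finished the checkoffs.
            return [rid for rid, done in self.completed.items()
                    if 0 < len(done) < len(STAGES)]

        def remaining(self):
            # Reports not yet located in the archives.
            return [rid for rid, done in self.completed.items() if not done]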

A.4.  Dienst at MIT.  We have for some time been running an
infrastructure-level, local implementation of Dienst.  At Cornell's
request, we attempted to install the full Cornell implementation.  We
found that the distribution package did not build properly in our
environment, and after spending a fair amount of time attempting to
get it to build, gave Carl Lagoze of Cornell an account on our system
so that he could work on it.  As of the end of the quarter, the problems
had not yet been resolved, and we were awaiting the next major revision 
of the software package.


B.  Technical and research activities

B.1.  An Interchange Standard and System for Browsing Digital Documents.
Continuing work on the scanned image record mentioned above, Andrew
Kass undertook an M.Eng. thesis to develop a standard document type
that can include scanned images, together with a document browser that
can take advantage of the information carried by that document
type.  The following two paragraphs are from the introduction to
his thesis proposal.

As libraries become increasingly computerized, not just card
catalogs but entire books will be stored on-line.  In addition, with
the advent of fast global digital communication networks, information
will increasingly be delivered in electronic form. For immediate
transfer of data, a plethora of methods, formats, and protocols may be
used, both for the transfer and for the content itself.
However, for long-term archival of documents and transfer between
different systems, a standard way of representing a document is
needed. In addition, documents carry additional information that is
not necessarily reflected in the content of the document, such as
copyrights, publisher, ISBN, etc. This
information, even if present, needs to be available in an architected
way for display. This "meta-information" needs to travel with the
content of the document, regardless of the format the content is in.

By defining a digital document and creating a standard for its
representation, it is possible to solve all of these problems. By
allowing the content to be in any format, but having the digital
document describe the format, it is possible to use any format,
present or future, for the content itself. By creating a standard way
to describe the document's meta-information and content format, it is
possible to archive documents for long periods of time without having
to worry about what format they are stored in. Finally, documents can be
stored and delivered in multiple formats, making it possible to view
both the scanned images and ASCII text of a document.

The complete proposal may be found at

http://ltt-www.lcs.mit.edu/ltt-www/Public/Proposals/andrew-thesis-proposal.html
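
The central idea of the proposal--meta-information that travels with
the content, and content instances that each declare their own
format--can be sketched as follows.  The field names and layout here
are ours for illustration and are not taken from the thesis.

    digital_document = {
        "meta": {
            "title":     "An Example Technical Report",
            "publisher": "MIT Laboratory for Computer Science",
            "copyright": "1994",
            "isbn":      None,            # present when the document has one
        },
        "contents": [
            # Each content instance names its own format, so any format,
            # present or future, can be carried.
            {"format": "scanned image (TIFF)",
             "files": ["image001.tif", "image002.tif"]},
            {"format": "ASCII text",
             "files": ["report.txt"]},
        ],
    }

    def available_formats(document):
        # A browser chooses among whatever formats the document carries.
        return [part["format"] for part in document["contents"]]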

B.2.  Deduplication and the Use of Meta-Information in the Digital
Library.  A second M.Eng. thesis was undertaken by Jeremy Hylton, to
explore methods of handling the problem that a search for information
usually provides several leads to what seems to be the same thing;
deduplication of these leads is important to make a search system
usable.  The next four paragraphs are from the introduction of that
thesis proposal.

The explosive growth of the World-Wide Web over the last two years
underscores the realization that the technology exists -- or will soon
exist -- to build the digital libraries envisioned since the earliest
days of computing. The experience of using the Web also makes it clear
that linking and resource discovery are complicated problems, at least
in part because of the sheer quantity of data available.

Resource discovery is difficult not only because there is so much data
to sift through, but also because so many varying copies of resources
exist. A query to Archie, an index of anonymous ftp sites, for a
particular application will often discover dozens of copies of the same
resource and return a long search result set that contains little
information.

One technique that can be used to manage the discovery and location of
large quantities of data is deduplication.  Exactly as the name
suggests, deduplication is the process that narrows a list of
resources by eliminating any duplicate references to a resource.

Deduplication is an interesting problem because it is difficult to
decide when two resources are duplicates.  Consider two possible uses
of a digital library: A student searches for a copy of Hamlet in the
library; the digital library actually has many copies of Hamlet --
some by different editors, some in different collections -- but the
student only wants to know that a copy of Hamlet is to be had. For the
student, deduplication of the list to one or two entries provides a
valuable simplification. On the other hand, a Shakespeare scholar may
want to find a copy of one of the early printed editions of Hamlet.
For that scholar, deduplication needs to work within a narrow sense of
equality. The scholar does not need to know that there are three
libraries with copies of the first quarto, but does need to know that
copies of the first and second quartos exist.

The complete proposal may be found at

http://ltt-www.lcs.mit.edu/ltt-www/Public/Proposals/jeremy-thesis.html
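
The Hamlet example can be made concrete: deduplication collapses a
result list under a chosen notion of equality, and the student and the
scholar simply choose different equality keys.  The sketch below is
illustrative only and is not taken from the thesis.

    def deduplicate(records, key):
        # Keep one representative per equivalence class defined by key().
        representatives = {}
        for record in records:
            representatives.setdefault(key(record), record)
        return list(representatives.values())

    search_results = [
        {"title": "Hamlet", "edition": "first quarto",  "library": "A"},
        {"title": "Hamlet", "edition": "first quarto",  "library": "B"},
        {"title": "Hamlet", "edition": "first quarto",  "library": "C"},
        {"title": "Hamlet", "edition": "second quarto", "library": "A"},
    ]

    # The student's broad sense of equality: one entry for the work.
    student_view = deduplicate(search_results, key=lambda r: r["title"])

    # The scholar's narrower sense: one entry per printed edition.
    scholar_view = deduplicate(search_results,
                               key=lambda r: (r["title"], r["edition"]))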

B.3.  CSTR Architecture definition.  During the quarter, a series of
weekly teleconferences run by Jim Davis, with the participation of
William Arms, Judith Press, John Garrett, and Jerry Saltzer,
developed first-cut definitions, scenarios, and a specification of an
architecture for a digital library.  Parallel to that effort, the MIT
research group created a draft set of definitions and scenarios, which
may be found at

http://ltt-www.lcs.mit.edu/ltt-www/Public/Architecture/defs.html

Both the MIT architecture description in that document and the CSTR
joint architecture description that Jim Davis is coordinating are still
works in progress.

B.4.  Other continuing activities.  Quite a bit of effort in the quarter
went into studying various ways to increase the rate at which bits can
be transferred out of the scanning station, which is Macintosh-based, and
into the processing stream, which is carried out in a Unix environment. 
The bottleneck in the FTP/TCP path has been identified as arising from an
interaction between application TCP buffering and Macintosh OS scheduling. 
Since the latter is hard to change, we are currently focusing on obtaining
application versions that use much larger TCP buffers, and thereby invoke
the OS scheduling mechanism much less often.
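
The principle is that larger application buffers mean fewer buffer
fill-and-drain cycles, and therefore fewer trips through the Macintosh
scheduling mechanism per megabyte transferred.  In terms of the
standard sockets interface the change amounts to something like the
following sketch; the buffer size shown is an arbitrary illustration,
and the applications actually involved are existing FTP programs
rather than code we write ourselves.

    import socket

    BUFFER_SIZE = 256 * 1024   # illustrative; much larger than the default

    def open_bulk_transfer_connection(host, port):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Ask for large send and receive buffers before connecting, so the
        # transfer blocks on the operating system far less often.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER_SIZE)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER_SIZE)
        s.connect((host, port))
        return s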

Mitchell Charity, with other members of the group, is continuing work
on a replication architecture for very long term storage, on meta-data
representation, and on naming and infrastructure architecture.  The
three are closely related.  People want a variety of properties from
their information storage and distribution environment--low latency,
broad replication, consistency, high security, ease of creation and
modification, archival stability.  Their requirements are inescapably
in tension.  While good system design can minimize tradeoff costs,
this divergence drives system complexity.  To contain this complexity,
we have been forced toward specializing services (such as
very-long-term, `no other properties matter' storage), maintaining
their independence so diverse data needs can be met, and creating
cohesion with simple naming services and extensible data typing.  This
separation of concerns seems more promising than, and has helped
clarify the weaknesses of, approaches which mix naming, typing,
and storage.


C.  Papers, Theses, Talks, and Meetings (a quiet quarter)

C.1.  Meetings.

Mitchell Charity participated in the IAB Information Infrastructure
Workshop in Tyson's Corner, Virginia, October 12-14, 1994.

Jerry Saltzer and Mitchell Charity attended a meeting of the CS-TR
project participants at CNRI in Reston, Virginia, on November 9--10,
1994.
