From:       MIT component of CS-TR project
            (Jerry Saltzer and Greg Anderson)

To:         Bob Kahn, CNRI

Subject:    Quarterly Progress report, April 1--June 30, 1995

A.  Scanning and other operational activities
B.  Technical and research activities
C.  Papers, Theses, Talks, and Meetings

-------------------------------

A.  Scanning and other operational activities:

A.1.  High-resolution scanning

Production Statistics                   reports       pages

  Cumulative total, March 31, 1995        334         17344

  Scanning in current quarter:

       LCS Reports                         57          5830
        AI Reports                        108          6256
  total for quarter                       165         12086

  Cumulative total, June 30, 1995         499         29430

In the previous quarter we doubled the total number of scanned pages, and
in the current quarter we increased the average scanning rate another 50%. 
  This increase in volume comes from our efforts to remove bottlenecks from
the scanning process.  The average rate for the quarter was about 1000
pages per week and the quarter ended with a rate of 1250 pages per week. 
Our next target is 2000 pages per week, but a bottleneck that has recently
emerged is the frequency of paper feed jams.  In many cases the originals
of the TR's have warped slightly during storage, or for some reason stick
together, leading to jams in the automatic paper feeder of the scanner.  We
have seen this problem coming for some time, and have tried a number of
strategies to deal with it, so far without success.

Reports placed on-line:
                            reports              pages
  post-scan processing         90                 4800
  postscript conversion        25                 2300
  quarter total               115                 7100
  cumulative total            215                13900

These numbers indicate that not all scanned reports have been placed
on-line.  Images are scanned during the day; that night, during unattended
operation, we make archive copies on Digital Audio Tape (DAT) and transfer
the image files by FTP to be processed for on-line placement.
However, if for some reason unattended FTP is not successful within the
next 24 to 48 hours, it is necessary to delete the scanned images from disk
to make room for the next day's scanning--typically we generate four to six
gigabytes of raw data per day.  By quarter's end, individual reports of up
to four gigabytes (250 pages) were being FTP'ed and processed.  The miss
rate has been declining, and missed reports will eventually be
retrospectively restored from DAT and processed.
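
For illustration, the deletion policy amounts to roughly the following
sketch, written here in Python.  The directory layout, the TRANSFERRED
marker file, and the exact cutoff are assumptions made for the example;
this is not our actual operations script.

  import os
  import shutil
  import time

  SCAN_DIR = "/scans/outgoing"   # hypothetical staging area for the day's scans
  MAX_AGE_HOURS = 48             # not transferred within 24-48 hours => delete

  def purge_stale_scans(scan_dir=SCAN_DIR, max_age_hours=MAX_AGE_HOURS):
      # Each report is assumed to occupy its own subdirectory, with a
      # marker file named TRANSFERRED written there after a successful
      # unattended FTP run.
      cutoff = time.time() - max_age_hours * 3600
      for name in os.listdir(scan_dir):
          report_dir = os.path.join(scan_dir, name)
          if not os.path.isdir(report_dir):
              continue
          transferred = os.path.exists(os.path.join(report_dir, "TRANSFERRED"))
          if not transferred and os.path.getmtime(report_dir) < cutoff:
              # Deleting is safe only because a DAT archive copy was made
              # the night the report was scanned; the images can later be
              # restored from tape and processed retrospectively.
              shutil.rmtree(report_dir)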

The 500 reports and 29,000 pages scanned so far amount to about 440
gigabytes of archived data on tape.
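
This figure is consistent with the per-page raw-image size reported in the
post-scan processing section below (about 15 megabytes per page); as a
rough check:

  # Rough check: ~15 MB of raw data per page (400 dpi, 8-bit grayscale),
  # times the 29,430 pages scanned to date.
  pages = 29430
  mb_per_page = 15
  print(pages * mb_per_page / 1000)   # ~441 GB, matching the ~440 GB on tape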

A.2.  Hardware and operations

  Four DAT drives are now installed for unattended overnight archiving, but
one of the older ones is failing, and the two newest ones do not meet their
expected performance specifications.  We are working through the vendor for
warranty service on the first and for no-cost replacement of the other two
with a different model.

Similarly, one of the two four-Gigabyte disk drives on our scanning station
suffered an apparent power supply failure and was returned to the vendor
for repair.  The repaired drive has been reinstalled and is working again. 
Although we have been careful to install only disk drives that have had
several months of field experience elsewhere, that is not enough to ensure
that all design problems have been found.  It would be better to use drives
that have 18 months or more of field experience, but the disk drive
business currently is changing too rapidly for that approach to be
economic.  We believe that the six-month experience accumulation that we
currently require represents the best trade-off between up-to-dateness and
seasoning, though a production activity that did not have the same level of
local technical expertise might find it more effective to insist on a
longer delay.

We acquired a Hewlett Packard model 4SiMx, 600 dpi printer to support
scanning quality control, and have added one part-time (student) staff
person to examine the results of scanning.  For this purpose, Mitchell
Charity configured one of our IBM RS/6000 work stations as a quality
control station adjacent to the scanning area.  Jerry Saltzer, Mitchell
Charity, Keith Glavash, Jack Eisan, Mike Cook, and Justin Waite have been 
meeting regularly to develop quality control criteria and workflow.  A
first draft description of the quality control activity is under review. 
Currently, images, file transmission, and processing activity are being
monitored.  Mechanisms for reporting problems, recording quality control
completion, and archiving the results are being investigated.
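
As one illustration of the kind of record under discussion, a quality
control result might be logged as follows.  The field names and file
format are assumptions for the sketch; the real mechanism is still being
designed.

  import json
  import time

  def record_qc_result(log_path, report_id, inspector, problems):
      # Append one quality-control result to a shared log file.
      entry = {
          "report": report_id,      # e.g. "MIT-LCS-TR-123" (hypothetical form)
          "inspector": inspector,
          "time": time.strftime("%Y-%m-%d %H:%M:%S"),
          "problems": problems,     # e.g. ["page 12 skewed"]
          "passed": not problems,   # passes only if no problems were noted
      }
      with open(log_path, "a") as log:
          log.write(json.dumps(entry) + "\n")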


B.  Technical and research activities

B.1.  An Interchange Standard and System for Browsing Digital Documents. 
(M. Eng. thesis of Andrew J. Kass.)  This thesis was completed, providing a
complete definition of a MIME type for a "digiment", an on-line document
consisting of scanned images or another displayable form.  The scheme used
differs from that used by Cornell's Dienst in that meta-information about a
document is provided as an architected part of the document object, rather
than as data in response to protocol inquiries.  We think that Andrew's
approach may be more universal and more easily evolved than the
alternative.  The following two sections, from the introduction to Andrew's
thesis, provide more background.

Abstract:  With the advent of fast global digital communication networks,
information will increasingly be delivered in electronic form. In addition,
as libraries become increasingly computerized, not just card catalogs
but entire books will be stored on-line. When delivering digital documents
from on-line libraries, two problems must be addressed: delivering a
document which may be available in multiple formats, and preserving
relationships among the different formats. This thesis describes a system
for delivering digital documents, called digiments, which can be in
multiple arbitrary formats, while preserving the relationships between the
parts and meta-data about the document itself.

The Need for a Digital Document:  For immediate transfer of data, a
plethora of methods, formats and protocols may be used, both for the
transfer, as well as for the content itself. However, for transferring
between different systems, or transferring a document that may have content
in arbitrary or multiple formats, a standard way of representing a document
is needed. In addition, documents carry information that is not
necessarily reflected in the content of the document, such as copyright,
publisher, ISBN, etc. Furthermore, this information, even if present,
needs to be available in an architected way for display. This
"meta-information" needs to travel with the content of the document,
regardless of the format the content is in.

By defining a digital document and creating a standard for its
transmission, it is possible to solve these problems. By allowing the
content to be in any format, but having the digital document describe the
format, it is possible to use any format, present or future, for the
content itself. By creating a standard way to describe the document's
meta-information and content format, it is possible to transmit the
document without having to worry about what format it is stored in.
Finally, documents can be stored and delivered in multiple formats, making
it possible to view both the scanned images and ASCII text of a document.

Digital Documents:  In order to create on-line documents, we need to define
what a digital document is. In this model, a digital document is data in
arbitrary format accompanied by meta-information, which is part of a
document, but not necessarily contained in the document itself. In order to
easily transfer digital documents, I am creating a set of MIME [1] types,
which describe the format of a document and contain the meta-information,
while still allowing arbitrary data formats.
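
To make the idea concrete, a digiment rendered as a MIME object has
roughly the following shape, sketched here with Python's standard email
library.  The part layout and all names below are placeholders of ours;
the thesis defines its own MIME types, which we do not reproduce here.

  from email.message import EmailMessage

  digiment = EmailMessage()

  # Meta-information travels as an architected part of the document object
  # itself, rather than as data returned by separate protocol inquiries.
  digiment.set_content(
      "Title: An Example Technical Report\n"
      "Author: A. Student\n"
      "Publisher: M.I.T. Laboratory for Computer Science\n"
      "Copyright: 1995 Massachusetts Institute of Technology\n"
  )

  # Content parts may be in any format, present or future: here, a scanned
  # page image and a Postscript rendition of the same document.
  digiment.add_attachment(b"...page image bytes...",
                          maintype="image", subtype="tiff",
                          filename="page-001.tif")
  digiment.add_attachment(b"%!PS-Adobe-3.0\n...",
                          maintype="application", subtype="postscript",
                          filename="report.ps")

  print(digiment.get_content_type())   # multipart/mixed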

What exactly do we mean by a document?  A document can be a book.  It can
be a magazine or periodical. It can be a bibliography of a technical field.
 A technical report is a document.  Instruction manuals, TV listings,
newspapers, rulebooks, and catalogs can all be considered documents.  What
do we mean by a digital document, or a document stored on a computer?  A
digital document can be composed of a collection of images in GIF, JPEG,
TIFF, or other format.
It can be a Postscript, LaTeX or ASCII text file or set of files.  It can
be a set of images, the ASCII text, and a text to image map that relates
the two. It can include sound.  In short, there is no easy definition for a
digital document. Nor do digital documents correspond well to what we call
documents in the physical world. 

For these reasons, we need to create a new type of object, called a
digiment, for digital document.  Defining a digiment will allow us to
achieve some of the goals expressed above.  It will provide a standard way
to pass around digital documents, regardless of the format in which the
content itself is stored.  It will provide a standard for archiving and
retrieving digital
documents.  It will provide for the use of multiple representations of a
single "document" that can be linked together, as with a text to image map.
 It is intended to accommodate the use of future data formats
transparently. And it will provide meta-information about a document, such
as author, publisher, and copyrights.  For comparison, in a book, the text
on the pages of the book is the content, while the meta-information is
found on the inside of the cover page, such as ISBN number, copyright and
publisher. 

The digiment standard is not a specific data format for storing or
transmitting the content of digital documents.  Rather, it is a container
for transmitting or storing documents in arbitrary formats.  A digiment
consists of the data itself, which can be in any form, along with
associated meta-information and structural information.  This additional
information is what differentiates a digiment from a simple set of images
or data files.  The structural information specifies the format in which
the data itself is stored, whether image, text, Postscript, etc., and
specifies how the different data parts of the digiment are related to each
other.  The meta-information includes bibliographic information about the
digiment, such as the author, publisher, copyrights, distribution rights,
and relationships between different formats.  By specifying the digiment
in this way, it is possible to use any type of format for storing the
actual data, including formats which may be created in the future.
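
Abstracting away the transport details, the three-part division just
described--content, structural information, and meta-information--can be
pictured as a container like the following.  The field names are our own
illustration, not definitions from the thesis.

  from dataclasses import dataclass, field

  @dataclass
  class ContentPart:
      # One rendition of the content, in an arbitrary present or future format.
      format: str   # e.g. "image/tiff", "application/postscript", "text/plain"
      data: bytes

  @dataclass
  class Digiment:
      # Meta-information: bibliographic facts about the document itself.
      title: str
      author: str
      publisher: str = ""
      copyright: str = ""
      # Content: one or more renditions (scanned images, Postscript, ASCII...).
      parts: list = field(default_factory=list)
      # Structural information: how the parts relate to one another, e.g. a
      # text-to-image map linking the ASCII text of a page to its image.
      relations: dict = field(default_factory=dict)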

The complete thesis has been deposited in the M.I.T. library, and will be
available as an M.I.T. Laboratory for Computer Science Technical Report.


B.2.  Post-scanning processing

This quarter, post-scan processing was moved into production.  It is now
automated, and provides support for quality control.

To review, report pages are scanned to maximize information capture--400
pixels per inch of optical resolution, at 8 bit gray scale (2 or 3 of these
bits may be noise).  These scanned images are about 15 Megabytes of raw
data, and half that when compressed.  Thus a 250 page report, with the
addition of a few extra test patterns, cover and document tracking images,
occupies about four gigabytes of disk space.  During post-scan processing,
pages and meta-information are slightly shuffled, and images are converted
to the primary 600 pixel per inch black and white, and to a lower
resolution grayscale which is used for quality control and as a
supplementary form for pages with gray or color copy.  The post-scan
processing system is now fully automated, and is set up to permit intensive
quality control.  Processing has increasingly kept pace with scanning, with
the main difficulties being limitations on report size and continued
problems with Macintosh network throughput (discussed ad nauseam in past
progress reports).
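
The two derived forms can be sketched with a present-day imaging library
(Pillow, in Python).  This illustrates the conversions described above; it
is not our production pipeline, and the fixed threshold and the quality
control resolution are assumptions.

  from PIL import Image

  def derive_forms(raw_path):
      # A raw page: 400 dpi, 8-bit grayscale.  At 8.5 x 11 inches that is
      # roughly 3400 x 4400 pixels, about 15 megabytes uncompressed.
      gray = Image.open(raw_path)
      w, h = gray.size

      # Primary form: 600 pixel-per-inch black and white.  Scale the
      # 400 dpi image up by 1.5x, then threshold to one bit per pixel.
      big = gray.resize((int(w * 1.5), int(h * 1.5)), Image.LANCZOS)
      bilevel = big.point(lambda p: 255 if p > 128 else 0).convert("1")
      bilevel.save(raw_path + ".bw.tif", dpi=(600, 600))

      # Supplementary form: lower-resolution grayscale, used for quality
      # control and for pages with gray or color copy (100 dpi assumed).
      small = gray.resize((w // 4, h // 4), Image.LANCZOS)
      small.save(raw_path + ".qc.tif", dpi=(100, 100))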

Since we had acquired a four-gigabyte disk on the scanning station, during
the quarter we began scanning larger reports, up to 250 pages in size.
Increasing this limit led us to encounter a new round of bottlenecks in
archiving and post-scan processing.  The post-scan processing was improved,
and now both processing and scanning share the same 250 page (four
gigabyte) limit.


C.  Papers, Theses, Talks, and Meetings

C.1.  Papers

1.  Eytan Adar and Jeremy Hylton.  On-the-fly Hyperlink Creation for Page
Images.  Scheduled to appear in Proceedings of the Second International
Conference on the Theory and Practice of Digital Libraries, College
Station, Tex., June 11-13, 1995.

2.  Jeremy Hylton and Eytan Adar.

C.2.  Theses

Andrew Kass.  An Interchange Standard and System for Browsing Digital
Documents.  M. Eng. thesis, M.I.T. Department of EECS, May 1995.

C.3.  Meetings and talks

Greg Anderson and Jeremy Hylton attended the Digital Library Workshop
sponsored by the Information Infrastructure Technology and Applications
task group of the High Performance Computing and Communications program,
May 18-19, 1995, in McLean, Va.

Greg Anderson and Mitchell Charity attended a meeting of the CS-TR project
participants, June 1-2, in Berkeley, California.  Greg distributed draft
copies of a prospective Technical Report on library issues raised by the
CS-TR project.

Two meetings were held with Bob Kahn and John Garrett regarding
experimental electronic copyright registration, followed up by internal
meetings to discuss possible participation of various M.I.T. organizations
in the experiment.

Greg Anderson made a presentation, and other staff members gave a
demonstration of the CS-TR project, to the M.I.T. Corporation Visiting
Committee on the M.I.T. Libraries on April 4, 1995.
