From: MIT component of CS-TR project
(Jerry Saltzer and Greg Anderson)
To: Bob Kahn, CNRI
Subject: Quarterly Progress report, January 1--March 31, 1995
A. Scanning and other operational activities
B. Technical and research activities
C. Talks, papers, and meetings
-------------------------------
A. Scanning and other operational activities:
A.1. High-resolution scanning of LCS and AI technical reports and
memoranda accelerated during the quarter as we have gradually learned how
to overcome various bottlenecks in the production flow. At the beginning
of the quarter we were handling about 200 pages/week; by the end of the
quarter the volume had increased to about 1000 pages/week, and an
intermediate target of 2000 pages/week appears to be achievable. This
increased production volume is apparent in the table below: this quarter's
operations nearly doubled the cumulative totals.
Reports scanned:
                  reports     pages
    quarter           146      8440
    cumulative        334     17344
Post-scan processing is also accelerating, but it has not been able to keep
up with the increased production volume. The primary bottleneck has been
network transfer of the scanned data (2 Gigabytes/day) from the Macintosh
scanning environment to the UNIX post-processing environment. On a second
processing front, we have now begun routine direct conversion to image of
recently-issued technical reports for which authoritative PostScript is
available. The numbers of reports and pages placed in our on-line library
during the quarter were:
Reports placed on-line:
                              reports     pages
    post-scan processing           31      1900
    PostScript conversion          33      2300
    quarter total                  64      4200
    cumulative total              110      7300
A.2. All of the Macintosh System 7.5 compatibility problems described in the
previous quarterly report were resolved early in the quarter, and that
system is now in production, allowing us to make better use of 4-Gbyte disk
drives. However, both of the 4-Gbyte drives we had purchased failed, were
returned to the supplier, and replaced. As best we can tell from
inspection of the replacements, the vendor has increased the air flow
through the drive cabinets to reduce their internal temperature; the
replaced drives have now been in service for two months with no problems,
and we acquired another identical drive for use on our development system.
The scanning station was relocated to an adjacent room to eliminate
proximity to high-volume copiers and other noxious equipment. This
relocation required installation of a new network drop, with the
serendipitous result that network performance increased dramatically. We
suspect that a defect in the previous drop was the cause of frequent lost
packets and consequent poor file transfer performance that we had been
experiencing. While we were at it, we installed a second network drop
adjacent to the scanning station to facilitate testing and possible
addition of an adjacent quality control station.
As part of the general campaign to reduce bottlenecks and increase scanning
volume, a third DAT drive was put into service and plans were made to add a
fourth. In addition, a copy of QuicKeys was added to the system; this
package, in conjunction with AppleScript, allows us to increase the scope of
unattended (overnight) processing, a strategy we are using to
increase the overall production rate. Finally, Andrew Kass installed
upgraded disk drivers and upgraded the operating system to version 7.5.1,
which provided the dramatic improvement in file transfer performance
described below in item B.5.
A.3. Document tracking system. As mentioned in the previous report,
Jeremy Hylton developed a document tracking system using a World-Wide Web
form page as the interface to drive a small database program. This
quarter, the system was upgraded to include a record of which archive tape
contains each report and also to track post-scanning processing. This
system was placed into production operation and is now our primary
reporting mechanism. In conjunction with its coming into production, we
performed an inventory of archive tapes and archive records and compared
them with the scanning logs to make sure that everything is accounted for.
As expected, the records contained a number of confusing entries,
primarily because we changed procedures several times as we learned
how to organize the scanning work.
A.4. Dienst at MIT. Dienst is now fully operational, thanks to a long,
collaborative series of installation attempts on our RS/6000 system driven
by Jeremy Hylton, ending finally with a new release from Cornell.
B. Technical/Research
B.1. An Interchange Standard and System for Browsing Digital Documents.
(M. Eng. thesis of Andrew Kass.) Andrew has continued work on this
project, adopting the strategy of developing a new MIME type to represent
the metadata that needs to surround a document. An important class of
metadata that he is capturing is the information that shows how the
various components of a document are related to one another: this image is
the page that follows that one; this image contains the 24-bit color inset
that belongs in that 8-bit grey-scale scan; etc. His first cut at a type
definition is nearly complete and he is now working on a demonstration
implementation of a browser that understands how to interpret the new MIME
type.
B.2. Deduplication and the Use of Meta-Information in the Digital
Library. (M. Eng. thesis of Jeremy Hylton.) Jeremy generated a new set of
Technical Report bibliographic records from the publications databases.
The new records are an improvement over the previous batch because the
publications databases have been improved. For example, the AI Laboratory
records now include authors' full names instead of last name and initial;
we also may have information about grants and retrieval. This work was
undertaken so as to have a body of material on which to perform
deduplication experiments. Jeremy also did some preliminary work on a tool
for finding duplicate bibliographic records in the MIT card catalog and in
the CS-TR database. The tool manipulates MARC records in OCLC format and RFC
1357 records and produces merged RFC 1357 records as output.
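The matching-and-merging idea can be illustrated with a minimal sketch. It
is not Jeremy's tool: the record layout (plain dictionaries with "title"
and "authors" fields), the matching key, and the "keep the longest value"
merge rule are all assumptions made for illustration, and no MARC or RFC
1357 parsing is attempted.

```python
import re
from collections import defaultdict

def match_key(record):
    """Hypothetical matching key: normalized title plus author surnames.
    Case, punctuation, and extra spaces are ignored."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    title = " ".join(title.split())
    surnames = tuple(sorted(a.split(",")[0].strip().lower()
                            for a in record["authors"]))
    return title, surnames

def merge_duplicates(records):
    """Group records that share a key and merge each group into one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    merged = []
    for recs in groups.values():
        out = {}
        for rec in recs:
            for field, value in rec.items():
                # Assumed rule: keep the longest value seen for a field,
                # e.g. full author names in preference to initials.
                if field not in out or len(str(value)) > len(str(out[field])):
                    out[field] = value
        merged.append(out)
    return merged

# Made-up example records, one per source.
catalog = {"title": "A Fast Parser.", "authors": ["Smith, J."]}
report = {"title": "a fast  parser", "authors": ["Smith, John"],
          "id": "EXAMPLE//TR-0"}
print(merge_duplicates([catalog, report]))
```

A key this strict misses duplicates whose titles differ by more than case
and punctuation; a real tool would need fuzzier comparison.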
B.3. Jeremy Hylton developed a better demonstration of automatic hypertext
link construction. Building on earlier work on text-image maps, he created
a Web-based browser that constructs text-image maps on the fly by loading
data from Dienst servers. In addition, it recognizes gross features such as
the table of contents and the bibliography, and calculates an offset
between page numbers and image numbers. The complete demonstration now
allows one to approach a scanned technical report by starting from a list
of (automatically generated) links to particular points in the document
such as the first numbered page, the table of contents, or the
bibliography. Looking, for example, at the scanned image of the table of
contents, one can select a page of interest and jump to that page using a
link that is generated on demand from the selection point. Looking instead
at a citation of another Technical Report found in the bibliography, one
can select it, invoke a search, confirm the result, and ask to see the
second report. A paper describing this work, co-authored with Eytan Adar,
was submitted to Digital Libraries '95.
The complete demonstration, usable from any modern World-Wide Web browser,
can be found at
http://ltt1.lcs.mit.edu:8008/Public/TIMap/demo
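The offset between page numbers and image numbers mentioned above can be
estimated in several ways. The following is a minimal sketch that assumes
a printed page number has already been recognized on some of the images
and takes a majority vote over the differences; the voting rule and the
function names are our illustration, not a description of the browser's
actual method.

```python
from collections import Counter

def estimate_offset(recognized):
    """recognized maps image number -> printed page number, for those
    images on which a page number was found. Returns the most common
    value of (image number - page number), or None if the map is empty."""
    diffs = Counter(img - page for img, page in recognized.items())
    if not diffs:
        return None
    return diffs.most_common(1)[0][0]

def image_for_page(page, offset):
    """Image number expected to hold the given printed page."""
    return page + offset

# Made-up example: front matter occupies 8 images, and the page number
# on image 12 was misread as 14.
recognized = {9: 1, 10: 2, 11: 3, 12: 14, 13: 5}
offset = estimate_offset(recognized)
print(offset, image_for_page(3, offset))
```

Voting tolerates an occasional misread page number, but a single offset
cannot describe a report in which unnumbered pages are inserted partway
through.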
B.4. Andrew Kass developed a better demonstration of our navigation
service. This demonstration includes both an overview and technical
description of how the Navigation service works and what it is used for.
The service itself is based on the Internet Domain Name System (DNS). The
demo allows one to look up any Technical Report that is on-line using its
RFC 1357 ID. It then returns the URL of the Technical Report, as well as
the formats available. In addition, it shows the actual DNS queries used to
obtain the information. A second demonstration shows the practical use of
the Navigation service to find out information, as opposed to querying the
server for the Technical Report each time; the former is more than an order
of magnitude faster. In support of the demonstration,
Andrew also cleaned up the DNS code, made it more robust, and modified it
to work around some changes in the Dienst protocol. This demonstration can
be found at
http://ltt-www.lcs.mit.edu/ltt-www/Public/navigate/navigate.html
B.5. Post-scanning processing. In the previous report we mentioned that
some effort had gone into understanding the limitations of network file
transfer performance on the Macintosh, as part of removing bottlenecks in
the processing of 2 Gbytes/day of scanned images. With the help of a
packet trace and special versions of the ftp program provided by its
Australian author, we established that the network hardware and network
drivers of the Macintosh operating system are actually very highly tuned,
allowing it to drive a 10 Mbit/sec Ethernet at essentially full capacity.
On the other hand, the shared file system is exceptionally clumsy and turns
out to be the bottleneck on large-file transfers. A new Macintosh
operating system update, called System Update 1, provided some major relief
in this area, with the result that our file transfer performance increased
by about a factor of four, to nearly 2 Mbit/sec. However, dividing 2
Gbytes/day by 2 Mbit/sec indicates that 8000 seconds/day (about 2.2 hours)
are still required for the overnight transfer of that much data--and then
only if nothing goes wrong.
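The estimate above is simple unit arithmetic. A minimal sketch, assuming
decimal units (1 Gbyte = 10^9 bytes, 1 Mbit = 10^6 bits) and the round
figures quoted in the text; the function name is ours:

```python
def transfer_seconds(gbytes_per_day, mbits_per_sec):
    """Seconds needed to move one day's scans at a sustained rate."""
    bits = gbytes_per_day * 1e9 * 8          # bytes -> bits
    return bits / (mbits_per_sec * 1e6)

after_update = transfer_seconds(2, 2)        # rate after the update
before_update = transfer_seconds(2, 0.5)     # roughly a factor of four slower
print(after_update, before_update)
```

At the earlier rate of roughly 0.5 Mbit/sec, the same calculation gives
32000 seconds, which would not fit comfortably in an overnight window.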
Mitchell Charity has been working on automating the processing that goes on
after scanning. In addition to initiating the file transfer operation, this
processing includes converting the 400 dpi 8 bits/pixel originals to 600
dpi 1 bit/pixel form suitable for laser-printing and also to 100 dpi 5
bits/pixel for display, correlating each document with the corresponding
bibliographic record, and placing it on the server. These operations, previously done
manually, are now automatically triggered each night by the presence of
completely scanned documents on the scanning station.
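As an illustration of the display conversion, the sketch below reduces a
400 dpi 8 bits/pixel greyscale image to 100 dpi 5 bits/pixel by averaging
4 x 4 blocks and requantizing. The box-average method, the list-of-rows
image layout, and the function name are assumptions; this report does not
specify the resampling method actually used. The 600 dpi 1 bit/pixel
conversion, which needs a 1.5x enlargement and a threshold, is not shown.

```python
def downsample_for_display(pixels, factor=4, bits=5):
    """Average factor x factor blocks of an 8-bit greyscale image
    (a list of rows of 0..255 values), then requantize each average
    to the range 0..2**bits - 1. Partial edge blocks are dropped."""
    levels = (1 << bits) - 1                 # 31 for 5 bits/pixel
    rows = len(pixels) // factor
    cols = len(pixels[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [pixels[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            mean = sum(block) / len(block)   # still on the 0..255 scale
            row.append(round(mean * levels / 255))
        out.append(row)
    return out

# Made-up 4 x 8 test image: left half black, right half white.
image = [[0] * 4 + [255] * 4 for _ in range(4)]
print(downsample_for_display(image))
```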
C. Papers, Theses, Talks, and Meetings
C.1 Papers.
Eytan Adar and Jeremy Hylton. On-the-fly Hyperlink Creation for Page
Images. Scheduled to appear in Proceedings of the Second International
Conference on the Theory and Practice of Digital Libraries, College
Station, Tex., June 11-13, 1995.
C.2 Meetings.
Jerry Saltzer and Mitchell Charity attended a meeting of the CS-TR
project participants at CNRI in Reston, Virginia, on March 1-2,
1995.