Moving the MIT CS-TR Project into a Sustainable, Production/Service Environment
Greg Anderson, Information Systems
Keith Glavash, Libraries
Section 1: Setting the Context
In 1992 the Advanced Research Projects Agency (ARPA) funded a three-year
grant to investigate the questions related to large-scale, distributed,
digital libraries. The award focused research on Computer Science Technical
Reports (CS-TR) and was granted to the Corporation for National Research
Initiatives (CNRI) and five research Universities: Carnegie Mellon,
Cornell, MIT, Stanford, UC Berkeley.
The research goals of the project varied with each participant. The M. I.
T. Participation in an Electronic Library Plan (10 November 1992), however,
enumerated most of the key project points involving technical,
organizational, service, and data:
* to obtain early experience with a core function of the distributed
electronic library of the future,
* to work with a database that is readily available, that has a critical
time-sensitive value, and that is already well-known and valued by
its target audience,
* to explore the architecture, design, and work-flow issues associated
with making information available in digital form,
* to work within the research/prototype domain with a volume of
information large enough to be useful and interesting and that can scale to
an operational system,
* to provide an important service to an audience of researchers,
faculty, and students who are motivated and likely to have access to
appropriately powerful workstations to use the library from their offices.
The experimental project will conclude in May 1996. The purpose of this
document is to outline the current situation and to propose a strategy for
continuing this work in an operational mode. With the impetus and
technology provided through the alliance of the MIT Laboratory for Computer
Science and the MIT Libraries, the CS-TR project should become one of the
core electronic collections that are available to the MIT community. The
progress and learning gained in this effort is a seed crystal for the
creation and service of other page-image electronic collections at MIT.
Within the Library 2000 project in the MIT Laboratory for
Computer Science, the CS-TR experiment was one component of a large
testbed environment where investigations related to digital libraries could
be explored. The range of these explorations is impressive: very large
system architectures for library systems; pursuing the hypothesis that
spinning magnetic storage of materials will soon be cheaper than the
storage of the paper analog; deduplication services, creating search
mechanisms from citation data, operational scanning services, document
scanning (metadata) records, document tracking, automatic link generation,
and navigation services (Annual Progress Report: Library 2000, July 1,
1994--June 30, 1995, available, along with other Library 2000 research
documents, via URL:
For the MIT Libraries and Document Services, participation in the CS-TR
project has helped realize a vision for Document Services that a Libraries
study group determined in 1990: moving the Microreproduction Laboratory
(MRL; the former name for Document Services) into a digital scanning
service. The blending of microreprographics and digital services has
positioned DS as a key supplier of information to the MIT community and
beyond. Without the CS-TR project, it is unlikely that this transformation,
which has provided equipment and staffing and has given Document Services
and the entire Libraries access to world-class research and development,
would have occurred at this pace.
As the CS-TR project comes to a close, Document Services has a robust,
high-quality scanning service in place. It has a solid work-flow, quality
control, and technology infrastructure to provide massive quantities of
page-image content to the MIT networked environment. The question is how
to maintain, grow, and renew this environment as an operational service.
Many academic/research libraries are faced with this basic question: how
to participate in development projects and contribute to their success and
how to prepare for the transition of those successful projects into a
service environment? In the CS-TR project, MIT has an opportunity to
continue a valuable service and to take a leadership role in addressing
When the Libraries and Library 2000 first began collaborative work on this
project, we were fully aware of the requirement to end the experiment and
to move the content and services created in the project into a sustainable
service. It was also recognized that the project funding would cease and
that operation of the service should properly move to the service
organization; i.e. the Libraries. This is a basic issue involved in the
collaboration between research and service organizations.
The MIT CS-TR project has accomplished or has made significant progress
toward each of the goals listed above. A very high quality body of
valuable electronic content has been created. A high quality, high
production scanning operation has been created in Document Services because
of this project. The CS-TR effort has helped Document Services realize the
vision set for it as part of the Libraries' Microreproduction Services Group
study issued in December 1990.
For the Libraries, IS, and for MIT, there is a commitment here that should
be honored, even in the face of reduced resources. Researchers want to
see ideas and experiments realized and used; in the CS-TR project, all of
the elements are there to move successfully from the laboratory to
operational service. Beyond the service, it is important for the
Libraries and IS to maintain the trust and integrity established through
collaborative work with MIT researchers.
There are three processes that must be more clearly delineated and
understood during and after the transition of the current CS-TR and future
page image services in the Libraries:
* Acquisition - continued scanning of TR's and new scanning projects
as identified; e.g., theses.
* Archiving - preservation and management of the images.
* Delivery - operation of the online service, including integration
of the service with the Libraries' other electronic services such as the
catalog, citation indexes, etc.
Although this transition plan focuses
primarily on continuing services based in Document Services, it is
clear that the full range of the Libraries' organizational structure may
become involved. In addition, the infrastructure requirements of this
service point toward a high level of participation from Information
Services. For the near term and the long term, it will be useful to
maintain the distinctions in these three processes for operational and
This proposed strategy builds upon the relationships crafted through the
Distributed Library Initiative. Through contributions from the Library 2000
project, the MIT Libraries, and MIT Information Systems, the CS-TR project
will be moved into a production environment beginning in June 1996. At
that point, support from the CS-TR sponsored research will end and shared
support for the service provided by the Libraries and IS will begin. Below
we note the contributions from each group.
The timing for this strategy has three components. For the six-month
period, Jan. - June 1996, transition activities will move forward through
collaborative work among Library 2000 staff, Libraries staff, and IS staff.
For FY97, the Libraries and IS will jointly run and manage the system and
services based upon contributions from each organization. Finally,
beginning with FY98, the CS-TR (and other services based upon this
framework) will be operated in a regular, normalized environment based upon
service agreements between IS and the Libraries.
Section 2: Where we are now in Document Services
Although the CS-TR project formally started in July 1992, production
scanning did not begin until early 1994. Scanning operations have now
settled in at a fairly constant pace of about 2,000 images per week. After
the project's scheduled end in June 1995, a no-cost extension was granted
by CNRI to continue scanning until committed funds run out, no later than
By the end of this month about 35% (or 63,000 out of 175,000 pages) of the
materials in the CS-TR collection will have been scanned. Adding images at
a rate of 8,000 per month, we estimate that a maximum of 40,000 more can be
scanned before funding expires, bringing the total to about 59% (103,000 of
175,000 pages).
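The estimate above is simple arithmetic on the figures quoted in this section. As a rough sketch (the names below are illustrative only and are not part of any project tooling), it can be checked and extended as follows:

```python
# Figures quoted in the text; the names are illustrative only.
TOTAL_PAGES = 175_000      # pages in the CS-TR collection
SCANNED_SO_FAR = 63_000    # pages scanned by the end of this month
MAX_ADDITIONAL = 40_000    # upper bound on pages scanned before funding expires
PAGES_PER_MONTH = 8_000    # current scanning rate


def percent_scanned(scanned: int, total: int = TOTAL_PAGES) -> float:
    """Return the share of the collection scanned, as a percentage."""
    return 100.0 * scanned / total


def months_of_scanning(pages: int, rate: int = PAGES_PER_MONTH) -> float:
    """Return the number of months needed to scan `pages` at `rate` pages per month."""
    return pages / rate


if __name__ == "__main__":
    at_funding_end = SCANNED_SO_FAR + MAX_ADDITIONAL
    remaining = TOTAL_PAGES - at_funding_end
    print(f"Scanned now:            {percent_scanned(SCANNED_SO_FAR):.1f}%")
    print(f"Scanned at funding end: {percent_scanned(at_funding_end):.1f}% (upper bound)")
    print(f"Funded scanning lasts:  {months_of_scanning(MAX_ADDITIONAL):.1f} months")
    print(f"Pages left afterwards:  {remaining:,}")
    print(f"Months to finish:       {months_of_scanning(remaining):.1f}")
```

If the full 40,000 additional pages are scanned, about 72,000 pages would remain, which is roughly nine more months of work at the current rate.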
After that point, it is critical to have in place a process that will
enable continuation of scanning and storage so that the entire collection
eventually will be available in this format. Additionally, we want to
extend the process to more research publications at MIT, such as technical
reports from other labs and centers, as well as the thesis collection.
While virtually all MIT publications are currently created in electronic
format, very few are systematically collected or even retained as such.
Thus the conversion back to electronic form must continue until electronic
collection standards and policies are put in place Institute-wide. There
are as many as 200,000 existing research publications which are candidates
for retrospective conversion. The CS-TR project has provided a model for
this process, but we look now to transform that research project into an
ongoing production activity, sustained by resources from several areas at
Through the CS-TR project, Document Services has acquired the technology
necessary to run a large-scale, operational scanning service. The Scan
Station acquired through the project includes:
* Workstation: Quadra 840 AV (Apple), with 64 MB RAM, internal CD-ROM,
230 MB Internal hard disk, and a Nanao Trinitron Flexscan monitor T560i
* Fujitsu 3096G scanner w/ADF (gray scale)
* Hewlett Packard ScanJet IIcx scanner (color)
* FWB Jackhammer SCSI card
* 12 Gigabytes of external disk drives
* 4 external DAT drives
In addition, Document Services also has another workstation used for
restore activities; this machine is an Apple PowerPC 7100/66, with 40 MB
RAM, 250 MB internal hard disk, an internal CD-ROM, 8 GB of external disk
storage, and one external DAT drive. There are associated network
capabilities for the technology, and there is a strong infrastructure of
other technologies now in place: printing, digital scanning from fiche,
etc. In sum, Document Services is positioned to offer a wide array of
scanning services to the community.
Section 3: Recommended Near-Term Actions: Jan 1996 - June 1997
3.1. Accept Lib 2000 equipment: server/storage
As the Library 2000 project comes to a close, Professor Jerome H. Saltzer
has offered the Libraries some of the equipment purchased for the CS-TR
project. The following equipment is currently located in LCS and could be
relocated for the CS-TR service in the Libraries:
* One or possibly two RS/6000 models 520 and 530, each with 512 MBytes
of RAM and three internal 600 MByte disks,
* A couple of dozen GBytes of external disk in external boxes,
* A couple of spare RS/6000 workstations similar to the one that Library
2000 installed in DS.
As always, accepting equipment brings expectations and responsibilities.
The RS/6000's are approximately three years into a five year life-cycle
(the life-cycle could become shorter but it should not become longer).
The machines, however, are reliable, low-maintenance machines (so low in
fact that the hardware maintenance service contract has been cancelled).
With their large RAM, they are still quite powerful for some
applications. According to Professor Saltzer, Library 2000 can transfer
one or maybe two of these machines over to the library system or IS if it
will help provide an interim storage facility.
3.2 Incorporate Relevant Components of the MIT TULIP Project
MIT's participation in the Elsevier Science Publisher TULIP project ended
in December 1995 (for the final report on the TULIP project at MIT see:
ftp: athena-dist.mit.edu:/pub/elib/TULIP_FINAL_1 ). The page image
delivery system developed for TULIP provides a robust, scalable delivery
mechanism that can be readily adapted to the CS-TR images. In addition,
the Libraries have purchased approximately 40 gigabytes of spinning
magnetic storage for TULIP. As the TULIP datasets are erased, this storage
would become available for the CS-TR images. Finally, TULIP has been
running on an older RS/6000 provided by Information Systems. The RS/6000
servers from LCS would provide additional servers for the short-term.
3.3 Continuation of scanning
In order for this service to remain valuable and viable, it must continue
to grow. In the Libraries' current fiscal environment, however, scanning
will cease when project funding ends unless other sources of support are
tapped. These are the issues:
* Staff in Document Services are off budget, supported only by
cost-recovery operations. There is no longer any flexibility to assign
staff to activities which are not revenue producing.
* Equipment for scanning, processing, storing and serving images
needs to be maintained and updated.
A possible scenario in the June 96 - June 97 timeframe would address these
issues and would enable scanning to continue:
* The Libraries would fund part or all of a position in DS to continue
scanning on a part- or full-time basis;
* DS would fund the administrative/management portion of the scanning
* The Libraries would fund a portion of a Systems position needed to
build and maintain image links.
3.4 Joint Libraries/IS Development Effort
By piecing together the equipment infrastructure and by incorporating the
page image delivery system, the Libraries and IS could begin to deliver
page images of MIT CS-TR's quickly. In addition, through the continuing
and expanding scanning efforts of Document Services, we would ensure that
this valuable body of information would continue to grow and would begin to
represent a comprehensive collection of materials created and used by the
MIT community. The goal would be to make these images available through
bibliographic search clients (such as Willow, WebZ, the new GEAC client,
etc.) and through Web browsers.
In order to be of optimum use to researchers, the CS-TR images must be
linked to the most likely bibliography(ies) in which they are contained.
For these materials that means the Libraries' online catalog, Barton.
There should be a direct link from the Barton record of the publication to
its images, and the ability to scroll back and forth within the document
Undoubtedly some development effort would be required to tune the page
image delivery system to the CS-TR images. As part of the DLI, the
Libraries and IS should be willing to devote resources to these efforts.
3.5. Document the current Library 2000 processes
In order for the Libraries and IS to assume the responsibilities for the
CS-TR work, relevant components of the existing system should be
documented so that they can be supported in an operational mode. Beginning
in January, IS should provide technical writing resources to work with
Library 2000 staff and Document Services staff to capture the work flow,
the technical development, and the features and issues that must be
accommodated in an operational flow. For example, the UNIX scripts
written to facilitate file transfer from DS to LCS must be documented in
order to transfer the image files from DS to IS. Similarly, there are
scripts for processing the original, large image files down to a more
manageable size for viewing and printing.
This requirement means that the documentation must be continuously
updated to reflect changes in the system. For example, the CS-TR project
currently scans at a high resolution (400 pixels per inch with 8 bits of
grey scale) which requires the scripts to render these large files (about
16 Mbytes per scanned page) into more manageable sizes for actual use by
customers. As this approach is evaluated and discussed after the project
concludes, any changes in approach would imply a change in the scripts.
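The file size quoted above follows from the scanning parameters. The sketch below is a back-of-the-envelope calculation only: the US Letter page size (8.5 x 11 inches) and the 100 pixels-per-inch viewing resolution are assumptions made for illustration, not settings taken from the project's scripts.

```python
def raw_image_bytes(width_in: float, height_in: float,
                    ppi: int = 400, bits_per_pixel: int = 8) -> int:
    """Return the uncompressed size in bytes of one scanned page.

    width_in and height_in are the page dimensions in inches.
    """
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bits_per_pixel // 8


def collection_bytes(pages: int, bytes_per_page: int) -> int:
    """Return the total storage needed for `pages` images of equal size."""
    return pages * bytes_per_page


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    master = raw_image_bytes(8.5, 11.0)            # 400 ppi, 8-bit grey scale
    viewing = raw_image_bytes(8.5, 11.0, ppi=100)  # hypothetical viewing copy
    print(f"Master image:  {master:,} bytes")
    print(f"Viewing copy:  {viewing:,} bytes")
    # Whole collection at the 16 Mbytes per page quoted in the text.
    total = collection_bytes(175_000, 16_000_000)
    print(f"Collection:    {total:,} bytes")
```

Under these assumptions a Letter-size page comes to about 15 million bytes, in line with the figure quoted above, and a 100 pixels-per-inch copy is one sixteenth of that. At 16 Mbytes per page, the full 175,000-page collection would occupy roughly 2.8 terabytes if every master image were kept uncompressed, which is why any change in scanning resolution has direct consequences for both the scripts and the storage plan.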
3.6 Arrange for IS to provide service level agreement in W91
With the transfer of equipment to the DLI environment, one of the early
Libraries/IS principles comes to bear: rationalize library computer
operations. The approach over the past few years has been to write a
Service Level Agreement between the Libraries and IS to ensure the proper
level of support for the electronic library service located in the W91 Data
Center. For example, the service level agreement for the Geac system is
much more demanding than that for the TULIP service because it is imperative
that high levels of service are always available for the library system.
In this near term proposal, IS would provide the appropriate service level
for the CS-TR effort without charge to the Libraries. Both the Libraries
and IS would plan to incorporate in their respective FY98 budgets
sufficient resources to support this service.
3.7 Near Term Presentation of this Strategy
With the arrival of Ann Wolpert, the Libraries/IS Steering Group should be
re-activated and this proposal should be brought to that group for
discussion and agreement. Professor Saltzer should be involved in those
discussions because of his contributions to the effort. Keith Glavash
should also be involved to contribute his perspective. Given the time-frame
involved and the competition with other priorities, it seems reasonable
for these discussions to begin in late January.
Section 4: Recommended Long-Term Actions: June 1997 -
4.1 Libraries/IS Prepare Budgets for ongoing services
In preparation for the FY98 budget cycle which will begin in the fall of
1996, the Libraries and IS should devote planning and strategic discussions
toward the long-term measures that will incorporate, sustain, and renew
electronic library services to MIT. The scope of these preparations
should include equipment renewal/replacement, new equipment, staffing
issues, and a model for normal operational services that will leverage the
best of the Libraries' skills with the best of the IS skills.
4.2 Deliberations concerning continued research efforts and
collaboration between the DLI and research groups at MIT
With this experience and the TULIP experience, the Libraries and IS should
begin to assess what has been learned, the value of those experiences, and
their combined interest in pursuing similar research/development efforts in
the future. Of particular interest is participation in a Libraries/LCS
effort to explore a regional virtual library infrastructure which would
build upon the successes here at MIT and would lower the threshold for
colleague libraries in New England to enter the Digital Library
environment. The linkage to this current discussion is that such a project
would provide a deeper technical infrastructure for capture, storage,
transmission, and display of page image scholarly content.
Section 5: End State - Long-Term
In June 1996, the Libraries and Information Systems should have a firm
understanding and a solid technical and administrative structure in place
in order to offer MIT an operational service for page image scholarly
materials. The internal support for this service will be comprised of a
set of agreements between the Libraries and IS. The initial system will
be housed to a large degree on hardware donated by the Library 2000 project
which will be ending at that time.
In the fall of 1996 the Libraries and Information Systems will present
either jointly or as key components of their organizational budgets a plan
and request to support the ongoing operations of a page image capture,
storage, transmission, and presentation service. The initial corpus of
that service will be CS-TR's, MIT theses, and other MIT publications of
By July 1997, if not sooner, the CS-TR project will have evolved and
expanded to realize a vision of an electronic repository for all MIT
research publications, supported by a page image delivery architecture that
provides access to these electronic materials through a variety of linking
mechanisms: from bibliographic records searchable in Barton or other
sources or via browsable lists. This repository of MIT publications is
comprised of materials obtained through known and supported paths. The
materials may be collected and deposited in at least two forms: page
image format reflecting current and past MIT publications scanned retrospectively
from the paper source, and electronic format reflecting publications whose
source is electronic.
As a core of the information commons available to all of MIT and beyond,
the organizational responsibilities for this flow of publications reside in
the Libraries for the collection (Archives), intellectual organization
and linking (Bibliographic Access Services), scanning and hardcopy and
distribution (Document Services), and system management (Library
Systems). Information Systems' responsibilities include the IT service
and support processes for this system, integration responsibilities with
other information delivery efforts at MIT, and leadership and support in
the ongoing renewal and development of this system.
Through these collaborative efforts and discussions, MIT will provide its
immediate and all related communities access to its valuable corpus of
scholarly information. In leading this effort to provide an operational,
sustainable digital library, MIT will be able to share its experience and
give other institutions a way into a true digital library environment.