
General comments on digital reproductions of textual materials for American Memory

Introduction

Applicants for DLI-Phase II who would like to make use of the textual materials converted by the National Digital Library Program (NDLP) for American Memory should be aware that these materials are heterogeneous in digital format.

The resource includes different genres of original material, including pamphlets, typed and handwritten pages, theater programs, sheet music, and full-length books. These materials have been converted at various times since 1990, during a period when scanning and delivery technology advanced rapidly and the NDLP's understanding of what was desirable and feasible changed. The earliest conversions were made with dissemination on CD-ROM in mind, since the World Wide Web had not yet provided a network-based mechanism for accessing multimedia materials. The NDLP believes that heterogeneity presents a challenge that digital libraries must be able to handle; developing technology will continue to provide better solutions for capture, storage, and representation, but materials captured in the 1990s must still be accessible in years to come.

Some documents are available online only as page-images, others only as searchable text marked up in SGML (Standard Generalized Markup Language), and others in both forms. The resolution and digital format of page-images vary. In contrast to some other digital library projects, the page-images were all scanned without disbinding the source volumes; this constrains the spatial resolution that can be achieved, and most page-images have been captured at between 150 and 300 dots per inch (dpi). In some cases, separate images have been created for illustrations using different digital procedures or formats, to avoid the moiré effects obtained by scanning printed half-tones and to provide images that can be printed acceptably on inexpensive printers. The choice of approach has been affected by several factors: the state of technology at the time of capture; characteristics of the original documents (such as number of pages and degree of graphical presentation); physical characteristics of the original material or surrogate (such as microfilm) which could be scanned; and cost.

More detailed information relating to the Library's current practices can be found in a 1996 Request for proposals: Digital Images from Original Documents -- Text Conversion and SGML-Encoding. Sections C (Description/Specification/Work Statement) and J (Attachments) contain the technical sections.

Page-images

Typography or line art is often successfully reproduced in a bitonal image, which usually also provides adequate printed output. The bitonal images used for American Memory have typically been captured at 200-300 dpi and are stored as TIFF images with ITU (formerly CCITT) Group 4 compression. Some materials converted earlier use Group 3 compression. For documents that have not also been converted to searchable text, there is usually also a screen-sized GIF or JPEG image.

The Library has been experimenting with tonal (color and grayscale) reproduction of manuscript and older printed documents, partly after noting shortcomings in some bitonal (black and white) images produced during the American Memory pilot. The tonal images are usually at 8 bits per pixel for grayscale and 24 bits for color, with a spatial resolution of 200-400 dpi. The highest quality image may be an uncompressed TIFF image or a lightly compressed JPEG.
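To give a rough sense of what these bit depths and resolutions mean for storage, the following sketch computes the uncompressed raster size of a single page. The 6 x 9 inch page size and the function name are assumptions chosen only for illustration; they do not come from the Library's specifications, and actual file sizes depend on the compression used.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed raster size, in bytes, of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 6 x 9 inch page scanned at 300 dpi.
for label, bits in [("bitonal", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_bytes(6, 9, 300, bits)
    print(f"{label:9s} {bits:2d} bits/pixel: {size / 1_000_000:.1f} MB")
```

Under these assumptions, a 24-bit color image is 24 times the size of a bitonal image of the same page before compression, which is one reason the choice between bitonal and tonal capture has cost implications.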

For an illustration or a table in a book, there may be an additional image (either of the illustration or of the page containing it). The illustrations may be available both in a printing-quality image and as a GIF sized for presentation inline in an HTML version of the searchable text.

Searchable text

The Library is converting a wide array of documents to searchable form, including books, pamphlets, legal materials, serial articles and manuscripts. The texts are encoded with SGML, using the American Memory document type definition (DTD), which is based on the guidelines for humanities texts developed by the Text Encoding Initiative (TEI).

The American Memory Document Type Definition (ammem.dtd) was developed to accommodate a broad range of materials by conceptualizing a generalized humanities text, rather than seeking to describe specific document types and subtypes, or text genres. Simple, streamlined models and flexible structure are characteristic of the American Memory DTD. The Library's approach to tagging identifies basic text structure, but little content-related matter. The American Memory DTD is not optimized to accommodate "value-added" material such as editorial comments and annotations, and there are no plans at this time to expand the Library's approach in this direction.

The Library's transcription requirement for contractors has usually been 99.95 percent accuracy compared to the original. [In future contracts, 99.995 percent accuracy will be required for some items.] The Library does not stipulate the method to be used for text conversion, only that the final product meet our accuracy requirements. To date, almost all documents have been rekeyed; OCR has, so far, performed rather poorly with historical materials that contain a wide variety of non-standard type fonts and typographical design.
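The two accuracy levels can be restated as expected error counts. The sketch below is only an arithmetic illustration; the figure of 2,000 characters per page is an assumption, not a number taken from the Library's contracts.

```python
from fractions import Fraction


def expected_errors(characters, accuracy_percent):
    """Expected number of wrong characters at a given accuracy level.

    accuracy_percent is passed as a string (e.g. "99.95") so that the
    arithmetic is exact rather than subject to floating-point rounding.
    """
    error_rate = (100 - Fraction(accuracy_percent)) / 100
    return characters * error_rate


PAGE_CHARS = 2000  # hypothetical characters per page

for accuracy in ("99.95", "99.995"):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy}% accuracy: about {float(errors):g} errors per {PAGE_CHARS}-character page")
```

In other words, 99.95 percent accuracy permits one error in every 2,000 characters, while 99.995 percent permits one in every 20,000.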

Links from the marked-up texts to page-images and illustrations use SGML entity references to provide a level of indirection. Links are marked by one of three elements (corresponding to pages, illustrations, and tables) and identified in an entity attribute that names the external entity within which the graphic image is stored. Each document instance, stored in a file with the extension ".sgm", has an associated file, with the extension ".ent", that contains the entity declarations for the system entities containing the graphics; these entity declarations, in effect, map the entity attribute values in the document instance to the filenames of the referenced digital objects. For each document, the .sgm and .ent files and all image files are stored in a single directory.
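The indirection described above can be sketched as a lookup table built from the entity declarations. The example assumes that declarations follow the standard SGML form `<!ENTITY name SYSTEM "file" NDATA notation>`; the entity names, filenames, and notation names in the sample are hypothetical and are not taken from actual American Memory files.

```python
import re

# Standard SGML external entity declaration with an optional NDATA notation.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?P<name>[\w.-]+)\s+SYSTEM\s+"(?P<file>[^"]+)"'
    r'(?:\s+NDATA\s+(?P<notation>[\w.-]+))?\s*>',
    re.IGNORECASE,
)


def parse_entities(ent_text):
    """Map each declared entity name to the filename it refers to."""
    return {m.group("name"): m.group("file") for m in ENTITY_DECL.finditer(ent_text)}


# Hypothetical contents of a document's .ent file.
sample_ent = """
<!ENTITY p0001 SYSTEM "0001.tif" NDATA tiff>
<!ENTITY i0001 SYSTEM "0001.gif" NDATA gif>
"""

entities = parse_entities(sample_ent)

# An entity attribute value found in the .sgm file resolves to an image file
# stored in the same directory as the document.
print(entities["p0001"])
```

Because the document instance names only the entity, an image file can be renamed or replaced by editing the .ent file alone, without touching the marked-up text.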

A style-sheet (ammem.ssh), a navigator (amhead.nav), and the other support files needed for use with Panorama (a viewer for SGML) are also available.


Textual materials available for use in DLI - Phase II

The following American Memory collections, currently released or in an advanced state of production, include textual material. More detailed technical summaries are available for each collection.

Collection title
    Characteristics

African American Perspectives: Pamphlets from the Daniel A. P. Murray Collection, 1818-1907
    Varied documents, searchable text (with illustrations and covers)

An American Ballroom Companion: Dance Instruction Manuals, ca. 1600-1920
    Searchable text, with page-images

American Life Histories: Manuscripts from the Federal Writers' Project, 1936-1940
    Searchable text, with page-images

The American Variety Stage: Vaudeville and Popular Entertainment, 1870-1920
    Playscripts and programs, page-images only

California As I Saw It: First-Person Narratives of California's Early Years, 1849-1900
    Books, searchable text (with illustrations)

A Century of Lawmaking for a New Nation: U.S. Congressional Documents and Debates, 1774-1873
    Includes some documents converted as searchable text with page-images and others only as page-images

Documents from the Continental Congress and the Constitutional Convention, 1774-1789
    Searchable text, with page-images

The Evolution of the Conservation Movement, 1850-1920
    Includes some documents converted as searchable text with page-images and others only as page-images

George Washington Papers at the Library of Congress, 1741-1799
    Includes hand-written page-images for all documents and searchable text transcriptions for some

Music for the Nation: American Sheet Music, 1870-1885
    Sheet music, page-images only

Pioneering the Upper Midwest: Books from Michigan, Minnesota, and Wisconsin, ca. 1820-1920
    Books, searchable text, with page-images

Poet At Work: Recovered Notebooks from the Thomas Biggs Harned Walt Whitman Collection
    Page-images only, hand-writing

Votes for Women: Selections from the National American Woman Suffrage Association Collection, 1848-1921
    Varied documents, including books, searchable text, with page-images

Words & Deeds in American History: Selected Documents Celebrating the Manuscript Division's First 100 Years
    Varied documents, page-images only, including hand-writing

Two of the collections of sound recordings are ethnographic field collections. They include documents such as transcriptions and field-notes. Some are converted as page-images and others as searchable text.


Challenges faced by NDLP

Even when text is searchable, differences in vocabulary provide barriers to retrieval. Search terms selected by today's users may not correspond to the language of older historical documents. Can effective tools be developed to allow teachers to locate material to illustrate broad topics and themes in a prescribed curriculum although the words and phrases in the texts themselves are very specific?

How can digital reproductions of books and documents be presented to be as convenient and as useful as possible for different classes of users, including scholars, teachers, schoolchildren, and the general public?

