American Memory
The National Digital Library Program: Archived Documentation

The Library of Congress / Ameritech National Digital Library Competition (1996-1999)

Digital Formats for Content Reproductions


Carl Fleischhauer
Technical Coordinator, National Digital Library Program, Library of Congress
August 20, 1996


See also: July 1998 version.

I. Introduction

This document and two others are intended to take the place of the 1994 document titled Elements of Digital Historical Collections. The new documents have been prepared, in part, to offer guidance to applicants in the Library of Congress/Ameritech competition. The other two documents are Digital Historical Collections: Types, Elements, and Construction and Access Aids and Interoperability.

The trio of documents presents a snapshot of the Library of Congress digital conversion activity as of August 1996. The ideas and approaches outlined represent the collection-digitization effort of the American Memory pilot program (1990-1994) and the operational National Digital Library Program (1995-1996) that has followed the pilot. The institution recognizes that many avenues remain unexplored and that new technology will lead to changing practices.

Most of the formats listed below are in use for World Wide Web access to American Memory collections released or in production in 1996; a few are planned for use in the near future or are alternatives that other institutions have employed. The Library's selections represent an attempt to balance quality of reproduction, convenient accessibility for the general public over the World Wide Web, likely longevity of format (using standard formats where possible and proprietary formats only where widely deployed), and production cost. For digital reproductions of original items, the greatest stability and public accessibility are found in images that reproduce manuscript documents, printed matter, and pictorial materials, and in searchable texts, including those that employ Standard Generalized Markup Language (SGML). Formats that reproduce the time-based content of sound recordings and moving-image collections, however, have a less-certain future.

II. Pictorial Materials

For pictorial collections, the Library produces three image types:

Thumbnail
A small image presented with the bibliographic record, to allow users to judge whether they wish to take the time to retrieve a higher quality image.

Tonal depth: 8 bits per pixel
Format: GIF
Compression: Native to GIF
Spatial resolution: Circa 200x200 pixels


Reference
The "fetchable" higher quality image. In current projects, only one reference image is provided; future collections may offer two (or more) at varying levels of resolution.

Tonal depth: Grayscale: 8 bits per pixel; color: 24 bits per pixel
Format: JFIF (JPEG File Interchange Format)
Compression: JPEG (generally about 10:1 compression)
Spatial resolution: Moderate class ranges from about 500x400 to about 1000x700 pixels; higher resolution class (future) will range from 2000x1400 to 4000x3000; both moderate and higher resolution will be offered to users

Archive
An uncompressed (or, in the future, lossless-compressed) image free of the artifacts resulting from lossy compression, provided to users for reproduction or held for future reprocessing as compression standards change. Not provided at this time; may be provided to users as a downloadable file in the future.

Tonal depth: Grayscale: 8 bits per pixel; color: 24 bits per pixel
Format: TIFF (Tagged Image File Format)
Compression: Uncompressed
Spatial resolution: Moderate class ranges from about 500x400 to about 1000x700 pixels; higher resolution class (LC examples coming in future) will range from 2000x1400 to 4000x3000; only the highest resolution will be archived
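
As a rough illustration of the storage implications of the reference and archive choices above, the following sketch estimates the uncompressed size of an image from its pixel dimensions and tonal depth, and an approximate size after JPEG compression at the 10:1 ratio cited for reference images. The function names are illustrative only, file-header overhead is ignored, and actual JPEG sizes will vary with image content.

```python
def uncompressed_bytes(width, height, bits_per_pixel):
    """Size in bytes of the raw pixel data (file headers excluded)."""
    return width * height * bits_per_pixel // 8


def estimated_jpeg_bytes(width, height, bits_per_pixel, ratio=10):
    """Approximate compressed size, assuming a fixed compression ratio."""
    return uncompressed_bytes(width, height, bits_per_pixel) // ratio


# Upper bounds of the two resolution classes, as 24-bit color.
for label, (w, h) in (("moderate", (1000, 700)), ("higher", (4000, 3000))):
    raw = uncompressed_bytes(w, h, 24)
    jpeg = estimated_jpeg_bytes(w, h, 24)
    print(f"{label}: {raw / 1e6:.1f} MB uncompressed, about {jpeg / 1e6:.2f} MB at 10:1")
```

Under these assumptions, the gap between the archive and reference versions of the same image is roughly an order of magnitude, which is one reason the uncompressed image is treated as a separate archive type.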

Alternative format
Several organizations have used the Kodak PhotoCD (Image Pac) format in their imaging projects. Originally associated only with CD-ROM disks, this multi-resolution format may now be written to other storage media. The Library has not had extensive experience with PhotoCD/Image Pac. Archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use PhotoCD technology should either determine how direct access to those images may be provided to WWW clients or plan to reprocess the Image Pac images to produce GIF and JFIF/JPEG images for WWW access in association with the American Memory site.

III. Textual Materials Reproduced as Searchable Text and Images

Searchable transcriptions can be a tremendous aid to a researcher seeking instances of particular words or phrases in a textual work. Transcribed text, especially when encoded with markup language, can also facilitate the researcher's navigation of a longer document. The cost of providing perfect or near-perfect transcriptions is very high, however, and, for many researchers, proper understanding of a document may depend upon seeing a facsimile (and in some cases, the original). For these reasons, the Library has experimented with the presentation of manuscript and printed matter items as a coordinated set of page-images and searchable text. In some American Memory pilot-period collections, separate images of tables and illustrations were provided in addition to, or in lieu of, page-image sets.

The Library encodes its documents using Standard Generalized Markup Language (SGML), as described below. Since the Library always places SGML texts online together with bibliographic records or a finding aid, the headers within the documents contain minimal bibliographic information. For a more detailed description of the Library's approach to text-reproduction using SGML, see American Memory DTD for Historical Documents.

Full-function SGML viewers for the WWW are not available free or as shareware at this time. For this reason, the Library derives an HTML version of the text from the SGML version and places both online.

The page images included in the text reproduction sets employ the formats for tonal and bitonal images described below.

Searchable text
The Library's transcription requirement for contractors is 99.95 percent accuracy compared to the original (in future contracts, 99.995 percent will be required for some items). The texts are encoded with SGML, using the American Memory document type definition (DTD), which conforms to the international guidelines for humanities texts, the Text Encoding Initiative (TEI). Entity values in the document consist of the filename, without extension, for each page image, illustration, or table. The entity file lists entity declarations (which include entity values) and their corresponding filenames with extensions.

SGML Document Type Definition: American Memory DTD
Character sets: ISO 646; the IBM extended-character sets are represented by the publicly declared entity reference sets in ISO 8879.
Associated file: Entity file
Filename extensions: sgm (for the text), ent (for the entity file)
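
To give a sense of what the accuracy requirements stated above mean in practice, the following sketch converts an accuracy percentage into an error allowance for a text of a given length. It assumes that accuracy is measured per character, which this document does not state; the function name is illustrative. Decimal arithmetic is used so that percentages such as 99.95 are handled exactly.

```python
from decimal import Decimal


def allowed_errors(characters, accuracy_percent):
    """Maximum number of errors permitted in a text of the given length.

    accuracy_percent is passed as a string (e.g. "99.95") so that it can be
    converted to an exact decimal value.
    """
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return int(error_rate * characters)


for accuracy in ("99.95", "99.995"):
    print(accuracy, "percent:", allowed_errors(100_000, accuracy), "errors per 100,000 characters")
```

On this reading, the tighter requirement planned for future contracts permits one tenth as many errors as the current one.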

Alternate formats:
Texts marked up with HTML or Text Encoding Initiative-conformant SGML DTDs other than that developed by American Memory. Archives that wish to produce collections interoperable with those at the Library of Congress and that plan to use an alternate approach should (a) make available the associated DTD, style sheets, and navigators for use by the general public and (b) provide sufficient title and identifier information within the document header to facilitate integration into the American Memory interface.

IV. Textual Materials Reproduced as Images

The following discussion of text page-images applies to images associated with searchable texts (see preceding section) and image-only presentations of manuscript or printed documents.

The Library has been experimenting with tonal (color and grayscale) reproduction of manuscript and older printed documents, partly after noting shortcomings in some bitonal (black and white) images produced during the American Memory pilot. Original items with a mixture of lighter and darker markings are often more successfully reproduced in a tonal rather than in a bitonal image. Typography or line art, however, is often successfully reproduced in a bitonal image. Bitonal images usually provide better printed output. Thus, some collections may warrant the production of both types of images. Multiple versions of an image of a page may also be needed to provide a WWW browser-based means for paging through documents (see section V.).

Tonal images of manuscript and printed documents.
At this writing, the Library's only online example of tonal document reproduction is the small collection of Walt Whitman notebooks. A demonstration project to refine a tonal-image approach to manuscripts is underway with a portion of the Federal Theater Project collection. In this latter project, the best-quality (or "archival") version of the image is tonal and the access ("reference") image is bitonal.

Tonal image types:
Reference
In the current Whitman presentation, this is the image "fetched" from the page menu. See Section II.D on browser-based paging for an alternate approach that requires GIF (Graphic Interchange Format) files.

Tonal depth: Grayscale: 8 bits per pixel; color: 24 bits per pixel
Format: JFIF
Compression: JPEG
Spatial resolution: 150 dpi (for the Whitman notebooks)

Archive (uncompressed)
An uncompressed (or, in the future, lossless-compressed) image free of the artifacts resulting from lossy compression, provided to users for reproduction or held for future reprocessing as compression standards change. May be provided to users as a downloadable file in the future.

Tonal depth: Grayscale: 8 bits per pixel; color: 24 bits per pixel
Format: TIFF
Compression: Uncompressed
Spatial resolution: 300 dpi (for the Whitman notebooks)
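
Because the document images are specified in dots per inch rather than in pixels, the pixel dimensions depend on the size of the original. The sketch below shows the conversion for a hypothetical 4 x 6 inch page; that page size is an assumption made for illustration and is not a measurement of the Whitman notebooks.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a scan of a page measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical page size; not taken from any Library item.
PAGE_IN = (4, 6)

for label, dpi in (("reference", 150), ("archive", 300)):
    width, height = pixel_dimensions(*PAGE_IN, dpi)
    print(f"{label} at {dpi} dpi: {width}x{height} pixels ({width * height:,} pixels in all)")
```

Doubling the resolution from 150 to 300 dpi quadruples the pixel count, and therefore the uncompressed file size, for the same page.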


Archive (compressed)
The project to demonstrate approaches for manuscript digitization is testing the production of a compressed tonal image for archiving (or highest quality display) together with a bitonal image for reference access. A steering committee argued that legibility was the highest goal and that modest compression artifacts could be tolerated for the sake of smaller file sizes.

Tonal depth: Grayscale: 8 bits per pixel; color: 24 bits per pixel
Format: JFIF
Compression: JPEG
Spatial resolution: 300 dpi

Alternate formats:
PDF (Portable Document Format from Adobe Corporation). The Library has not had extensive experience with this format; archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use PDF must be capable of helping to guide their implementation. (See also Section II.D on browser-based paging.)

Bitonal images of manuscript and printed documents.
The use of the lossless CCITT (FAX) compression for bitonal images may mean that one image may serve both reference and archiving needs. For some items, however, higher resolution may be desired for an archival copy. Projects patterned on the book-reformatting work pioneered at the Cornell University Library may fall into this category.

Image types:
Fetchable Reference

Tonal depth: Black and white, 1 bit per pixel
Format: TIFF
Compression: CCITT Group 4
Spatial resolution: 300 dpi in LC examples; potential range 150-300 dpi

Archive
Higher resolution version if needed.

Tonal depth: Black and white, 1 bit per pixel
Format: TIFF
Compression: CCITT Group 4
Spatial resolution: No LC examples; potential range 300-1200 dpi

Alternate formats:
PDF (Portable Document Format from Adobe Corporation), the new JBIG standard, or other proprietary but widely used and widely supported formats. The Library has not had extensive experience with these formats; archives wishing to produce collections that are interoperable with those at the Library of Congress and who plan to use them must be capable of helping to guide their implementation. (See also Section II.D on browser-based paging.)

The special problem of printed halftone illustrations.
Printed halftones present special problems in reproduction because of interference between the spatial frequency of the halftone dot pattern and the spatial frequency applied by scanning and/or output devices. The interference "waves" caused by the intersection of the two frequencies manifest themselves as moiré patterns that degrade the image. There are a number of treatments that can mitigate or correct this degradation but not all are practical in a production-line environment. Possible treatments include the following:

Descreening and rescreening
This approach removes the halftone dots and converts the image to grayscale, then rescreens it to produce a new halftone. Xerox has incorporated this approach in some of its advanced scanning devices and it has also been employed in the Cornell University Library's book-reformatting projects. In the implementations known to the Library, the process seems to depend upon "four-square" capture of the source items. This requires the placement of flat sheets of paper on a scanner's glass surface, which in turn requires that books be disbound. Furthermore, if a page containing both text block and illustration is captured, the system (or operator) must zone the page and capture text and illustration separately. The Library of Congress has been digitizing books for access and not for preservation. Since the volumes have not been disbound, the Library has not had the opportunity to employ this technique.

Capture at high enough resolution to reproduce the halftone dots
This approach requires capture resolutions at one or more multiples of the original halftone screen. Thus, for books with high-quality illustrations, the capture might be at 600 dpi or higher. In order to reproduce the scanned image without loss, the screen display or printer must also offer high resolution. In order to produce reduced-resolution (smaller) images for access, a post process consisting of descreening/rescreening, converting to grayscale, or dot-randomization would have to be applied. The Library has not availed itself of this technique.

Grayscale reproduction
For many illustrations, this approach offers a reasonable onscreen rendering at moderate resolutions. Since printed output from a typical laser printer requires that grayscale images be halftones, paper copies produced from these grayscale images may suffer from moiré patterns. If a page containing both text block and illustration is captured in grayscale at moderate levels of resolution (e.g., 200-400 dpi), the grayscale treatment that benefits the illustration may injure the clarity of the typography. Thus, one may wish to zone the page and capture illustration and text separately or capture two versions of the page image. The Library has not availed itself of this technique.

Randomization of scanner "dot pattern"
This process reproduces printed halftones as bitonal images to which a special diffuse dithering treatment is applied at scan time (or in post-processing a grayscale image). This reduces but does not eliminate moiré patterns. The effect on typography is not as severe as the effect produced by grayscale capture, although it adds speckles to white areas surrounding the type. The Library has employed this approach in a number of collections, capturing images at 300 dpi. The resulting images print on a laser printer with good results but do not rescale well for screen display. In order to provide a screen display of the whole illustration for which no rescaling is required, the Library has also created thumbnail versions of the larger image. These are not incorporated in the current American Memory WWW presentation at this time.
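
The Xerox diffuse dithering treatment is not specified in this document. As a general illustration of how an error-diffusion dither converts grayscale values into a 1-bit image, the sketch below uses the widely known Floyd-Steinberg algorithm. It is offered as a stand-in for the class of technique and is not necessarily what the scanner software applies; the function name and the list-of-rows image representation are illustrative.

```python
def floyd_steinberg(gray):
    """Dither a grayscale image (rows of 0-255 values) to 1 bit per pixel.

    Returns rows of 0 (black) or 1 (white).
    """
    height, width = len(gray), len(gray[0])
    # Work on a float copy so that diffused error can accumulate.
    work = [[float(v) for v in row] for row in gray]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255.0 if old >= 128 else 0.0
            out[y][x] = 1 if new else 0
            err = old - new
            # Spread the quantization error over pixels not yet visited.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# Demonstration: a horizontal gradient from black to white.
gradient = [[x * 255 // 15 for x in range(16)] for _ in range(8)]
for row in floyd_steinberg(gradient):
    print("".join("." if pixel else "#" for pixel in row))
```

Because the error from each pixel is passed to its neighbors, the density of black dots tracks the original tone without following a regular grid, which is why treatments of this kind tend to reduce, though not eliminate, moiré.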

Image types for printed halftones:
Fetchable Reference
The Library's randomized-dot-pattern images have been captured on a Xerox K5200 scanner. When the diffuse dithering treatment is applied, this scanner's software creates files in the PCX format (a format associated with ZSoft's PC PaintBrush software).

Tonal depth: Black and white, 1 bit per pixel
Treatment: Xerox diffuse dithering
Format: PCX (some have been converted to TIFF)
Compression: Native to PCX or none
Spatial resolution: 300 dpi in LC examples

Archive: None created by the Library for halftone illustrations

Bitonal thumbnail image (not offered on WWW at this time)
The images described here are bitonal; the Library is considering creating grayscale thumbnails for printed halftones in the future.

Tonal depth: Black and white, 1 bit per pixel
Treatment: Xerox diffuse dithering
Format: PCX (some have been converted to TIFF)
Compression: Native to PCX or none
Spatial resolution: Within window of about 500x400

V. Images for Use in Browser-based Paging Sets

The companion document Digital Historical Collections: Types, Elements, and Construction outlines some of the devices that may be used to provide users with the many images (or other digital objects) that are linked to a single reference, including browser-based paging. If browser-based paging is adopted, images in the GIF format must be produced. The Library's plans call for the production of GIF images with reduced resolution (when compared to the source image) and altered tonality (bitonal made into grayscale). For example, in the case of the 150 dpi, 256-shade grayscale images of the Walt Whitman notebook pages offered online, the GIF images might be at 75 dpi and offer only 16 shades of gray. In the case of a 300 dpi bitonal image of a page of printed matter, the GIF image might be from 75-100 dpi and in 16 or 32 shades of gray, produced with a process that shades the bitonal image to gray at the same time that it is rescaled.
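
The process that "shades the bitonal image to gray at the same time that it is rescaled" is not described further. One simple way to obtain that effect, shown here as an assumption rather than as the Library's method, is block averaging: each output pixel takes its gray level from the proportion of white pixels in the corresponding block of the source. Reducing 300 dpi to 75 dpi corresponds to 4x4 blocks, which the sketch maps onto 16 shades. The function name and image representation are illustrative.

```python
def bitonal_to_gray(bitmap, factor=4, shades=16):
    """Downscale a 1-bit image by block averaging, returning gray levels.

    bitmap holds rows of 0 (white) or 1 (black) pixels; its dimensions are
    assumed to be multiples of factor. Output values run from 0 (black) to
    shades - 1 (white).
    """
    height, width = len(bitmap), len(bitmap[0])
    block_area = factor * factor
    out = []
    for by in range(0, height, factor):
        row = []
        for bx in range(0, width, factor):
            black = sum(
                bitmap[y][x]
                for y in range(by, by + factor)
                for x in range(bx, bx + factor)
            )
            white_fraction = 1 - black / block_area
            row.append(round(white_fraction * (shades - 1)))
        out.append(row)
    return out


# An 8x4 source: the left block is one-quarter black, the right block all white.
source = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(bitonal_to_gray(source))
```

The intermediate grays soften the edges of type at the reduced size, which is the reason a grayscale GIF can remain legible at 75-100 dpi where a directly subsampled bitonal image would not.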

VI. Maps

The Library's Geography and Map Division is developing an approach for digitizing map collections, with the advice of the division's Center for Geographic Information. For most of the maps selected for the Library's program, the preliminary finding is that good legibility will be afforded in tonal images at a spatial resolution of 300 dpi. Archival copies will be stored without compression (or with lossless compression). There are many challenges associated with Internet transmission, display, and printing of very large images and the Library has not formulated plans for the presentation of maps in the WWW environment.

VII. Recorded Sound

The large files required to reproduce audio have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the digital audio formats suitable for the WWW are less stable than those for text and pictorial images; thus the audio files produced today will become obsolete more quickly. When materials are remastered, the moderate fidelity of current audio formats will mean that the source material or an intermediate format, e.g., DAT (digital audio tapes), must be used (again) to create files in the new format.

The audio selections included in American Memory collections at this time are "downloadable," meaning that the browser must copy a file to the local computer before it can begin to play them. Since the files are large (the four-minute recordings in the Nation's Forum collection run about 2 megabytes each), this is space- and time-consuming. In order to address this problem, the Library is preparing "streaming" files that will begin to play as they are transmitted through the network. Although more convenient, these files are of lower fidelity than the downloadable examples.

Audio file types:

Downloadable files (online today)
The Library plans to replace these files with the downloadable type described below by early 1997.

Attributes: 11.025 kHz sample rate, 8 bit word, mono
Format (file type): AU (Sun Microsystems format developed for UNIX systems)

Downloadable files (planned for early 1997)

Attributes: 22.05 kHz sample rate, 16 bit word, mono
Format (file type): WAVE (Microsoft format)
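
The file sizes that follow from these attributes can be estimated directly. The sketch below assumes that each sample occupies a whole number of bytes with no further compression and ignores file headers; the download estimate assumes a 14.4 kilobit-per-second modem running at its nominal rate with no protocol overhead. The function names are illustrative.

```python
def audio_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of the audio data for a recording (headers excluded)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


def download_seconds(size_bytes, bits_per_second):
    """Idealized transfer time at a given line rate."""
    return size_bytes * 8 / bits_per_second


FOUR_MINUTES = 4 * 60

current = audio_bytes(11025, 8, 1, FOUR_MINUTES)
planned = audio_bytes(22050, 16, 1, FOUR_MINUTES)
print(f"online today: {current / 1e6:.2f} MB")
print(f"planned:      {planned / 1e6:.2f} MB")
print(f"download of current file at 14.4 kbps: {download_seconds(current, 14400) / 60:.1f} minutes")
```

The result for the files online today is consistent with the figure of about 2 megabytes cited above for a four-minute recording. The planned format quadruples that size, which adds weight to the case for a streaming alternative.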

Streaming files (planned for early 1997)

Encoding: For 14.4 modems
Format (file type): RA (RealAudio format from Progressive Networks)

VIII. Moving-image Materials

The large files required to reproduce motion pictures and video have launched the computer multimedia industry on a constant search for new and better compression and playback schemes. For this reason, the moving-image formats suitable for the WWW are less stable than those for text and pictorial images; thus the moving-image files produced today will become obsolete more quickly. When materials are remastered, the moderate quality of current moving-image formats will mean that the source material or an intermediate format, e.g., videotape, must be used (again) to create files in the new format.

Moving-image file types:

Files (online today)
The Library plans to supplement these files with the types described below by late 1996.

Encoding: Indeo 3.2 (Intel)
Image size: 320x240 pixels
Frame rate: 15 fps
Data rate: ca. 1 megabit/second
Format: AVI (Audio Video Interleaved from Microsoft)

Moderate resolution files (planned for late 1996)

Image size: 320x240 pixels
Frame rate: 30 fps
Data rate: ca. 1.2 megabits/second
Compression: MPEG-1
Format: mpg

Low resolution files (planned for late 1996)

Image size: 160x120 pixels
Color depth: 24 bits/pixel
Data rate: ca. 100 kilobytes/second
Format: QuickTime (Apple Computer format)
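
The data rates above are stated in mixed units (megabits and kilobytes per second). To compare them, the sketch below converts each to bits per second and then to megabytes per minute of playing time. It assumes decimal prefixes (1 kilobyte = 1,000 bytes, 1 megabyte = 1,000,000 bytes); the labels and the function name are illustrative.

```python
# Approximate data rates from the listings above, in bits per second.
RATES_BITS_PER_SECOND = {
    "AVI / Indeo 3.2": 1_000_000,
    "MPEG-1": 1_200_000,
    "QuickTime": 100_000 * 8,  # listed as ca. 100 kilobytes/second
}


def megabytes_per_minute(bits_per_second):
    """Storage consumed by one minute of playback at a constant data rate."""
    return bits_per_second / 8 * 60 / 1_000_000


for name, rate in RATES_BITS_PER_SECOND.items():
    print(f"{name}: about {megabytes_per_minute(rate):.1f} MB per minute")
```

Expressed this way, the three file types are of the same order of magnitude, and even the low resolution files are several megabytes per minute.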

IX. File Headers

The Library plans to add data to the file headers for all of its reproductions over time. For now, a preliminary implementation exists for the four types listed below. Header content will almost certainly be a part of, or interplay with, the administrative and structural metadata associated with the repository described in Digital Historical Collections: Types, Elements, and Construction. The development and implementation of headers will keep pace with the Library's overall design process for metadata.

TIFF image files
The most fully developed image header scheme applies to the archival version of pictorial image files (as distinct from text pages). The Library has been using TIFF version 5.0 but expects to begin using version 6.0 very soon. It is worth noting that the Library's use of TIFF formats and headers has not always gone smoothly, perhaps the inevitable result of using an industry standard rather than a true standard. This accounts for some of the uncertainties embedded in the description that follows. The Library has used the following TIFF tags. Contractors have been asked to provide typical or expected data for most tags; exceptions to the norm are noted in the comments column.

Regarding the fields that have pixel counts or other dimensional information, it is worth noting that most of the Library's pictorial collections have been digitized from negatives, copy negatives, or copy prints; for most items, the actual dimensions of original prints or artists' works (as displayed) are neither known nor easily incorporated in scan-time data.

Description                Tag   Comments
NewSubfileType             254
ImageWidth                 256   Actual pixel count
ImageLength                257   Actual pixel count
BitsPerSample              258
Compression                259
PhotometricInterpretation  262
DocumentName               269   This data is usually the collection identifier and filename, e.g., from the Yanker poster collection: yan/1a12345u.tif. An alternative would be to use the file identifier rather than the name (i.e., exclude the "u" for "uncompressed" and the extension): yan/1a12345
StripOffsets               273
SamplesPerPixel            277
RowsPerStrip               278
StripByteCounts            279
XResolution                282   (see following note)
YResolution                283   (see following note)
ResolutionUnit             296   (see following note)
DateTime                   306   Date and time scanned
Artist                     315   Library of Congress

NOTE: At least two options exist for tags 282, 283, and 296.

Option 1 (has been used for full size uncompressed images)

XResolution     282   actual pixel count
YResolution     283   actual pixel count
ResolutionUnit  296   1 (no unit specified)

Option 2 (has been used for thumbnail images)

XResolution     282   dots per inch
YResolution     283   dots per inch
ResolutionUnit  296   2 (inch)
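
For readers who wish to inspect these tags in a file, the following sketch reads a handful of them from the first image file directory (IFD) of a TIFF byte stream, using only the Python standard library. It handles just the single-value SHORT, LONG, RATIONAL, and ASCII cases needed here and is not a complete TIFF reader. The sample file assembled by build_sample is hypothetical (tags only, no pixel data, Option 2 resolution values at 300 dpi) and does not represent an actual Library file.

```python
import struct

ASCII, SHORT, LONG, RATIONAL = 2, 3, 4, 5  # TIFF field type codes
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 282: "XResolution",
             296: "ResolutionUnit", 315: "Artist"}


def read_tags(data):
    """Return {tag_number: value} for the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]  # byte-order mark
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        pos = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, count = struct.unpack_from(order + "HHI", data, pos)
        value_pos = pos + 8
        if ftype == SHORT and count == 1:
            (tags[tag],) = struct.unpack_from(order + "H", data, value_pos)
        elif ftype == LONG and count == 1:
            (tags[tag],) = struct.unpack_from(order + "I", data, value_pos)
        elif ftype == RATIONAL and count == 1:
            (offset,) = struct.unpack_from(order + "I", data, value_pos)
            num, den = struct.unpack_from(order + "II", data, offset)
            tags[tag] = num / den
        elif ftype == ASCII:
            if count > 4:  # longer strings are stored elsewhere in the file
                (value_pos,) = struct.unpack_from(order + "I", data, value_pos)
            raw = data[value_pos:value_pos + count]
            tags[tag] = raw.rstrip(b"\0").decode("ascii")
    return tags


def build_sample():
    """Assemble a tiny, hypothetical tags-only TIFF for demonstration."""
    artist = b"Library of Congress\0"
    entries = [(256, LONG, 1, 1000), (257, LONG, 1, 700),
               (282, RATIONAL, 1, None), (296, SHORT, 1, 2),
               (315, ASCII, len(artist), None)]
    rational_offset = 8 + 2 + 12 * len(entries) + 4  # just past the IFD
    artist_offset = rational_offset + 8
    body = struct.pack("<H", len(entries))
    for tag, ftype, count, value in entries:
        if ftype == RATIONAL:
            field = struct.pack("<I", rational_offset)
        elif ftype == ASCII:
            field = struct.pack("<I", artist_offset)
        elif ftype == SHORT:
            field = struct.pack("<HH", value, 0)  # value is left-justified
        else:
            field = struct.pack("<I", value)
        body += struct.pack("<HHI", tag, ftype, count) + field
    body += struct.pack("<I", 0)  # no further IFDs
    header = b"II" + struct.pack("<HI", 42, 8)
    return header + body + struct.pack("<II", 300, 1) + artist


for tag, value in sorted(read_tags(build_sample()).items()):
    print(TAG_NAMES.get(tag, tag), value)
```

A reader of this kind makes the difference between the two options visible: under Option 2 the resolution tags carry dots per inch with a unit of 2, whereas under Option 1 the same tags would carry pixel counts with a unit of 1.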

SGML text files
The American Memory DTD for historical texts includes a simplified Text Encoding Initiative (TEI) header. The header is an integral part of the SGML-encoded document. Since the Library accompanies its marked-up texts with bibliographic records or finding aids, the header contains only a handful of MARC field equivalents: title statement and statement of responsibility (MARC field 245), copyright registration number (MARC field 017), and the Library of Congress catalog card number (LCCN), when one exists.

WAVE audio files
The Library plans to use the following Resource Interchange File Format (RIFF) INFO list chunk data with its WAVE files:

INAM (name/title): identifier for the item
ICRD (creation date): date digitized by vendor as YYMMDD
IARL (archival location): Library of Congress, identifier for collection or project
ICOP (copyright): "see collection restriction statement"

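As an illustration of how the INFO list chunk data listed above is laid out in a RIFF file, the sketch below assembles a LIST chunk of type INFO from the four sub-chunks. Each sub-chunk consists of a four-character identifier, a little-endian 32-bit length, and a null-terminated string, padded to an even number of bytes. The item identifier, collection identifier, and date shown are placeholders invented for the example, and the function names are illustrative.

```python
import struct


def info_subchunk(chunk_id, text):
    """Encode one INFO sub-chunk: ID, little-endian size, text, padding."""
    payload = text.encode("ascii") + b"\0"  # null-terminated string
    chunk = chunk_id.encode("ascii") + struct.pack("<I", len(payload)) + payload
    if len(payload) % 2:  # chunks are aligned to even byte boundaries
        chunk += b"\0"
    return chunk


def info_list(fields):
    """Build a complete LIST chunk of type INFO from (id, text) pairs."""
    body = b"INFO" + b"".join(info_subchunk(cid, text) for cid, text in fields)
    return b"LIST" + struct.pack("<I", len(body)) + body


# Placeholder values; the identifiers and date are hypothetical.
chunk = info_list([
    ("INAM", "ITEM-IDENTIFIER"),
    ("ICRD", "960820"),  # YYMMDD
    ("IARL", "Library of Congress, COLLECTION-IDENTIFIER"),
    ("ICOP", "see collection restriction statement"),
])
print(len(chunk), "bytes")
```

In a WAVE file, a chunk of this kind sits inside the enclosing RIFF chunk alongside the format and data chunks, so software that does not recognize it can skip it by reading its length.
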
RealAudio files

The Library plans to use the RealAudio header:

Title: identifier for the item; date digitized by vendor as YYMMDD
Author: Library of Congress, identifier for collection or project
Copyright: "see collection restriction statement"