American Memory
The National Digital Library Program: Archived Documentation

The Library of Congress / Ameritech National Digital Library Competition (1996-1999)



Digital Historical Collections: Types, Elements, And Construction

NOTE: Links to resources outside the Library of Congress are to URLs that were active when this set of archived documentation was actively maintained. Some links may no longer be active because resources have been removed. If a link is active, the resource may have changed substantially since the documentation was created. No attempt will be made to trace the linked resources or to suppress bad links. The URLs are being retained for their value as historical evidence.


Carl Fleischhauer
Technical Coordinator, National Digital Library Program, Library of Congress
August 21, 1996




Introduction

This document and two others are intended to take the place of the 1994 document titled Elements of Digital Historical Collections. The new documents have been prepared, in part, to offer guidance to applicants in the Library of Congress/Ameritech competition. The other two documents are Digital Formats for Content Reproductions and Access Aids and Interoperability.

The trio of documents presents a snapshot of the Library of Congress digital conversion activity as of August 1996. The ideas and approaches outlined represent the collection-digitization effort of the American Memory pilot program (1990-1994) and the operational National Digital Library Program (1995-1996) that has followed the pilot. The institution recognizes that many avenues remain unexplored and that new technology will lead to changing practices.

Types of Collections

Historical collections. At the Library of Congress, the fundamental building block for online historical materials is the collection, a coherent body of related materials. Collection types include:

Archival groupings
Examples include personal papers, e.g., the forthcoming collection The Papers of Marian and Edward MacDowell, or a photographic company's negative files, e.g., The Detroit Publishing Company Photographs, 1850-1920.
Accumulated sets
Examples include sets of prints and documents within a single genre or category and gathered by the Library's special collections divisions over the years, e.g., the early motion pictures selected from the Library's larger Paper Print collection or the forthcoming collection American Political Prints, 1780-1876.
Special compilations
Groups of related materials assembled especially for a digital project, e.g., The Evolution of the Conservation Movement, 1850-1920, or a selection of books about a region's history, e.g., the forthcoming collection California As I Saw It, 1850-1900.

Elements in the Library's Digital Collections

Each Library of Congress online collection consists of the following elements:

Framework. A set of illustrated HTML (HyperText Markup Language) text files that serve as the collection home page or home page family. The framework is so named because it provides an intellectual frame for the collection and embraces the other collection elements. The family of texts includes scope and content notes, chronologies, technical notes, and bibliographies. The framework may be compared to a book's main title page, table of contents, front matter, and back matter.

Access aid. The data set that describes the items in the collection and supports searching and browsing by users. The access aid may be thought of as descriptive metadata. In Library of Congress online collections, the most familiar and frequent access aid is a collection-specific catalog consisting of bibliographic records. For a few Library collections, access is provided by sets of menus that serve as rough-and-ready finding aids. In the future, access to some collections will be provided by fully realized finding aids that conform to the emerging Encoded Archival Description (EAD) model. (See also Access Aids and Interoperability.)

Reproductions. The digital images, searchable texts, recorded-sound files, and moving-image files that reproduce original items. Some items are reproduced by a combination of the above. When stored in a computer system, the digital reproductions contain and/or are associated with administrative and structural metadata that may be used to track the location of the reproductions in the system, report the specific digital type or format, control access, indicate relationships for multipart reproductions, and provide other technical information. See also Digital Formats for Content Reproductions.

Supplementary programs. Elements that introduce collections to users, explaining what they contain and how they may be used. Called Special Presentations at the Library's WWW site, these elements are, in their simplest form, brief texts or "slide shows"; at their most elaborate, they take the form of interactive multimedia programs.

Diverse Access Aids

Digital historical collections include a variety of elements that may be used for access. These include not only the catalog records and finding aids mentioned above, but also full texts (when, say, a printed book has been converted to searchable text), and expository introductory materials. Although the Library has not yet instituted a search system that embraces all of these elements (indeed, at this writing, no EAD finding aids linked to digital content are in place at the Library's WWW site), the Library foresees that a search system with a broader embrace will be established in the mid-term future. (See also Access Aids and Interoperability.)

When the access aid takes the form of bibliographic records, the level of cataloging perfection may vary. In some cases, Library staff members produced full cataloging adhering to relevant rules and authorities; in others, a variety of simplifications, variations, and unorthodoxies were employed or tolerated. For a few of the collections, item-level prose annotations have been provided. To the degree possible (when sufficient resources exist), subject terms have been drawn from Library of Congress Subject Headings (LCSH) or Library of Congress Thesaurus for Graphic Materials (LCTGM).

The existence of a diverse array of access elements inevitably diminishes the effectiveness of "classic" bibliographic records--records that employ a structured vocabulary and make careful use of authorities for names and subjects--within the larger corpus. Thesaurus-governed subject terms, for example, will not always be present in a working database that provides item-level access within a single collection, in EAD finding aids that resemble traditional manuscript registers, or in the searchable full texts provided for some manuscript and many printed-matter collections, whose vocabulary consists of "ordinary English" words and terms.

The Library hopes to develop additional aids to access as well as a search system powerful enough to plumb the jumble of structured and unstructured vocabulary. A first step will be to create a catalog of collection-level bibliographic records for the historical collections. This database of collection-level records will serve as a point of entry for many users.

Identifiers, Logical Names, and the Digital Repository

The Library is developing an approach to assigning names or identifiers to digital reproductions. These identifiers link the reproductions and the access aid, just as library call numbers link catalog records to the books on the shelf. The discussion below describes current Library practice and the use of a two-part identifier. (See also Identifiers for Digital Resources.)

Why are identifiers needed? In the Library of Congress digital repository, reproductions move from one device to another, e.g., when data are backed up or when obsolete hardware is replaced. This migration of data requires that identifiers be logical rather than physical names. The physical location of the files in the repository will be recorded in a document locator system. Thus a query made of the access aid will yield the identifier (logical name) for the desired item. When sent to the locator system, the logical name will be translated into the current physical name, and the reproduction will be retrieved.
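The lookup described above can be sketched as a small table that maps logical names to their current physical locations. This is a minimal illustration in Python, not the Library's locator system: the class name, the identifier values, and the paths are all hypothetical, and the two-part key simply mirrors the two-part identifier mentioned earlier.

```python
# Hypothetical sketch of a document locator: logical names stay fixed,
# while the physical location recorded for each one may change.

class DocumentLocator:
    def __init__(self):
        # Maps a logical name (aggregate, item) to a physical location string.
        self._locations = {}

    def register(self, aggregate, item, physical_path):
        """Record or update the current physical location of a reproduction."""
        self._locations[(aggregate, item)] = physical_path

    def resolve(self, aggregate, item):
        """Translate a logical name into its current physical location."""
        try:
            return self._locations[(aggregate, item)]
        except KeyError:
            raise LookupError(f"no location recorded for {aggregate}/{item}") from None


locator = DocumentLocator()
# Invented example values; not real Library identifiers or paths.
locator.register("photos", "0001", "/disk1/photos/0001.tif")
print(locator.resolve("photos", "0001"))

# After a migration to new hardware only the locator entry changes;
# the identifier stored in the access aid is untouched.
locator.register("photos", "0001", "/disk7/archive/photos/0001.tif")
print(locator.resolve("photos", "0001"))
```

The point of the design is that the access aid never stores a path, so moving files requires updating only the locator's table.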

The Library's extensive use of contractors to create the reproductions has led to the practice of using identifiers as physical names at delivery time. The digital materials produced by contractors (and often those produced by staff as well) spend part of the life cycle in DOS operating-system computers; e.g., contractors often deliver data on write-once CD-ROMs. Thus the Library requires that directory names and filenames adhere to DOS naming requirements.
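A check of this kind could be automated before delivery. The sketch below assumes that "DOS naming requirements" refers to the familiar 8.3 pattern (up to eight characters, optionally followed by a dot and up to three more) and accepts only a conservative subset of characters. It is an illustration in Python, not a Library tool, and the function names and sample filenames are invented.

```python
import re

# Conservative 8.3 pattern: 1-8 characters, then an optional dot plus 1-3
# characters. Only letters, digits, underscore, and hyphen are accepted here;
# DOS permits some further punctuation, which this sketch deliberately rejects.
_DOS_NAME = re.compile(r"[A-Za-z0-9_-]{1,8}(\.[A-Za-z0-9_-]{1,3})?")

def is_dos_name(name):
    """Return True if a single file or directory name fits the 8.3 pattern."""
    return _DOS_NAME.fullmatch(name) is not None

def is_dos_path(path):
    """Check every component of a backslash-separated relative path."""
    parts = path.split("\\")
    return all(is_dos_name(part) for part in parts)

# Invented example names, for illustration only.
for candidate in ["0001.TIF", "PHOTOS\\0001.TIF", "poster_highres.tiff"]:
    print(candidate, is_dos_path(candidate))
```

The sketch does not check for reserved DOS device names, which a production validator would also need to reject.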

The current identifier system works within the closed system of the Library's servers and does not provide pointers to the institution itself, e.g., the address for the server that holds the data (including the domain name "loc" for Library of Congress). Thus, if access aids with the current identifiers were loaded into, say, the catalog of another library, the data could not be used to link back to reproductions in the Library's servers. The development of the Library's repository is intended to address this circumstance. The Library is analyzing the option of treating the two-part identifier as a single unit, adding the missing pointer, and formalizing the result as a Uniform Resource Name (URN). If adopted, the use of URNs will change the structure of the identifier and its placement in the bibliographic record but not the explanatory concepts outlined here. Readers will recognize that this proposed use of the URN springs from the same impulse that led to the definition of Persistent Uniform Resource Locators (PURLs), which may serve other archives as identifiers. (See also Access Aids and Interoperability and Handle Server: Overview.)
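To make the idea concrete, the sketch below joins a namespace pointer and the two-part identifier into a single URN-style name and splits it apart again. Everything about the layout is an assumption made for illustration: the text above does not specify a structure, "example" is only a placeholder namespace, and the colon-separated arrangement and function names are hypothetical.

```python
# Hypothetical sketch: a two-part identifier plus a namespace pointer,
# combined into one URN-style name of the form urn:<namespace>:<aggregate>:<item>.

def make_urn(namespace, aggregate, item):
    """Join a namespace pointer and the two-part identifier into one name."""
    for part in (namespace, aggregate, item):
        # Restrict parts to ASCII letters, digits, and hyphens to keep the sketch simple.
        if not (part.isascii() and part.replace("-", "").isalnum()):
            raise ValueError(f"unsupported characters in {part!r}")
    return f"urn:{namespace}:{aggregate}:{item}"

def split_urn(urn):
    """Recover (namespace, aggregate, item) from a name built by make_urn."""
    parts = urn.split(":")
    if len(parts) != 4 or parts[0].lower() != "urn":
        raise ValueError(f"not a four-part URN-style name: {urn!r}")
    return tuple(parts[1:])

# Invented values; "example" stands in for whatever pointer would name the institution.
name = make_urn("example", "photos", "0001")
print(name)
print(split_urn(name))
```

Because the namespace travels with the identifier, a record copied into another library's catalog would still carry enough information to be routed back to the holding institution's resolver.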

Items, Aggregates, and Identifiers

The Library is assigning the identifiers described in the preceding section to both items and groups of items, called aggregates. These are, in fact, the identifier's two parts. In the context of digital collections, the Library uses the term item in a flexible way. Depending upon the archivist's judgment and collection design, an item may be:

The archivist's definition of an item will reflect his or her determination of the level of detail desired (or affordable) in the access aid at hand. When high levels of detail are to be offered, the collection item will be a truly discrete unit; when group-level description is provided, the term item will name what is in fact a group of items. Readers should note, therefore, that in this context item is defined as the unit of content referenced by an individual bibliographic record or by a line in a finding aid. In the Library's bibliographic records, items are referenced in MARC field 856 subfield f. The Library has not yet established a parallel convention for EAD finding aids.

When archived in the digital repository, reproduction items are placed in groups called aggregates. Smaller, single-medium collections may be represented by a single aggregate, while larger, complex collections may be represented by multiple aggregates. An emerging Library practice is to place like items in separate aggregates, i.e., one aggregate will contain reproductions of photographs, another books, and still another sound recordings.

The establishment of separate aggregates for groupings of items in like original formats permits the Library to regularize the invocation of presentation and display routines. (At this writing, this emerging practice has not been fully implemented.) The online presentation and display of, say, a photograph is different from the presentation and display of a book. When the Library presents the bibliographic record for a photograph, for example, the system accompanies the record with a display of a thumbnail image. In the case of a book, the access aid provides links to a searchable full text which in turn links to the set of page images. By having photographs in one aggregate and books in another, the system knows the item's type and can activate the needed delivery routines. In bibliographic records, aggregates are referenced in MARC field 856 subfield d. The Library has not yet established a parallel convention for EAD finding aids.
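The relationship between aggregate, item, and display routine can be sketched as a simple dispatch. The Python below assumes a "$"-delimited display form for the 856 field, with subfield d carrying the aggregate and subfield f the item as described above. The aggregate names, sample values, and function names are invented for illustration and are not drawn from Library records.

```python
# Hypothetical sketch: read aggregate ($d) and item ($f) from a MARC 856 field
# written in a "$"-delimited display form, then choose a display routine.

def parse_subfields(field_text):
    """Split '$d photos $f 0001' into {'d': 'photos', 'f': '0001'}."""
    subfields = {}
    for chunk in field_text.split("$")[1:]:   # text before the first '$' is skipped
        chunk = chunk.strip()
        if chunk:
            code, value = chunk[0], chunk[1:].strip()
            subfields.setdefault(code, value)  # keep the first occurrence of each code
    return subfields

def show_photograph(item):
    return f"bibliographic record with thumbnail for {item}"

def show_book(item):
    return f"link to searchable text and page images for {item}"

# Invented aggregate names mapped to the routine suited to that original format.
DISPLAY_ROUTINES = {"photos": show_photograph, "books": show_book}

def present(field_text):
    """Pick the display routine from the aggregate and apply it to the item."""
    subfields = parse_subfields(field_text)
    aggregate, item = subfields["d"], subfields["f"]
    routine = DISPLAY_ROUTINES[aggregate]
    return routine(item)

print(present("$d photos $f 0001"))
print(present("$d books $f 0042"))
```

Keeping like items in one aggregate is what lets a single table lookup stand in for per-item type detection.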

As noted above, the Library is investigating the possibility of using a URN (Uniform Resource Name) for linking; if adopted, the URN would be placed in a to-be-determined subfield in MARC field 856.

For additional discussion of the topics covered in this section, see also Access Aids and Interoperability.

Single Reference, Many Parts or Versions

Since a collection item may be multipart, the online presentation must offer the user some means of retrieving all of the parts from the single identifier found in the access aid. Although examples like multi-image manuscript folders spring readily to mind, one-to-many links often exist even when the item described is a singular original. A large theater poster, for example, may be reproduced by (1) high- and moderate-resolution versions (two reproductions) or (2) an image containing the whole poster and a set of six image "tiles," each of which contains a segment of the original (seven reproductions). Similarly, a book may be described in a single bibliographic record but--due to its length--be presented online as a set of chapters.

A number of devices may be employed to provide users with the many reproductions (or reproduction "parts") that are linked via the one reference. The simplest albeit least elegant device is to create a table of contents or menu (see the current version of the American Memory Walt Whitman notebooks collection). Here, an HTML file lists the pages by number and links each one to its reproduction image. As an alternate option, some observers have suggested consideration of multipage PDF files (the Portable Document Format was developed by the Adobe corporation and PDF files are viewed in Acrobat software).
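The table-of-contents device is simple enough to generate mechanically. The Python sketch below builds a bare HTML page that lists pages by number and links each to its image file. It is only an illustration of the approach: the function name, title, and filenames are invented, and the markup does not reproduce the Library's actual pages.

```python
from html import escape

def build_contents_page(title, page_files):
    """Return an HTML page that links each page number to its image file.

    page_files is an ordered list of image filenames, one per page.
    """
    lines = [
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
        "<ul>",
    ]
    for number, filename in enumerate(page_files, start=1):
        href = escape(filename, quote=True)
        lines.append(f'<li><a href="{href}">Page {number}</a></li>')
    lines += ["</ul>", "</body>", "</html>"]
    return "\n".join(lines)

# Invented title and filenames, for illustration only.
print(build_contents_page("Notebook", ["0001.gif", "0002.gif", "0003.gif"]))
```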

Meanwhile, other organizations are developing models for paging through a document using a set of images that are displayed "inline" (in the browser). The TEI planners are developing an envelope, while a group at the library at the University of California, Berkeley, is testing what they call Ebind. A paging approach has been applied to a set of serials in the Making of America project organized by Cornell University and the University of Michigan. These presentations provide tables of contents, sets of inline page images (in the GIF format, at moderate resolution), tools for navigation ("paging"), and (in some cases) fetchable or printable access to higher-resolution versions of the images.

Each of these one-to-many devices has its own advantages and disadvantages; the Library foresees that it may employ all three over time, making its selection after analyzing the needs of a particular collection. While it implements additional approaches for paging through or selecting the parts of a multi-part item, the Library will continue to rely upon HTML table-of-contents files.

Producing the Elements, Assembling the Collection

Library staff produce frameworks, access aids, and introductory programs. Generally speaking, the frameworks, access aids, and introductory programs are produced by Library of Congress staff, typically specialists in the Special Collections divisions. A variety of software is used: the Library's mainframe-based cataloging system (MUMS), MARC software (Minaret, running on DOS or UNIX systems), and various database and word-processing programs.

When data records destined to serve as bibliographic records have been created in word-processing or database software, they are marked with tags and field delimiters, and are subsequently loaded into Minaret or other MARC software. In addition, some records created in MUMS are edited in Minaret, at which time certain fields or data not supported by MUMS can be added. Meanwhile, the Library is very interested in the move toward cataloging simplification represented by the development of the Dublin Core set of fields for bibliographic information and hopes to begin to implement this approach in the near future.

The Library is also exploring other options for access aids. Current production activities have suggested a role for relational databases, the content of which may have some of the characteristics of bibliographic records. Meanwhile, the Library is bringing its first content-linked EAD finding aids online and is investigating software options for their efficient production.

For additional discussion of the topics covered in this section, see also Access Aids and Interoperability.

Contractors and staff produce reproductions. Most reproductions are produced by specialist contractors; some are produced by Library staff. In both cases, specialized equipment and techniques are used. Requirements are outlined in the statements of work (to be placed online in the very near future) found in the requests for proposal (RFPs) associated with Library of Congress digitizing contracts.

Multiple reproductions. The Library's projects often produce multiple digital reproductions. For example, an historical photograph may be digitized as:

  1. a high resolution, uncompressed digital image (excellent quality)
  2. a high resolution, compressed digital image (very good quality)
  3. a moderate resolution, compressed digital image (good quality)
  4. a thumbnail image (for presentation with the bibliographic record)

For works converted to searchable-text form:

  1. a text marked up with SGML
  2. a text marked up with HTML
  3. a set of page images to accompany the machine-readable texts.
(We look forward to a time when browsers support SGML or when HTML versions can be generated on the fly.)

In some cases, a project will produce analog as well as digital reproductions. For example, the desire to create long-lasting, high resolution copies for preservation has led photographic curators to recommend a production path that, first, produces a film intermediate and, second, a digital reproduction. Since the future of current-day digital moving-image formats is uncertain, and since most digital-movie formats are derived from analog video, the Library's plans for motion picture collections always include, first, the production of a video master tape and, second, the production of the digital versions.

The Library archives all versions of the reproductions but generally selects one or two for online access. In the examples noted in the paragraph above, the Library would provide online access to the compressed digital images of the photograph and the digital movie file.

For additional discussion of the topics covered in this section, see also Digital Formats for Content Reproductions.

Preservation Contributions

The preceding discussion highlights the uncertainties that surround the role of digital reproductions in the Library's preservation activities. When and how can the digital project contribute to preservation? The NDLP staff believes that a digital project can make three contributions:

  1. For some materials, digital reproductions faithfully reproduce the original and, if they endure, will rival or surpass analog reproductions as preservation copies. The NDLP believes that this is the case for manuscripts, printed matter, and sound recordings. It may also be the case for photographs up to a certain size, e.g., 35mm negatives.

  2. Closely related to the preceding, but when the reproduction is slightly less than faithful, digital copies may serve all typical uses. For example, digital images of photographs with a spatial resolution of 4,000x3,000 pixels may not capture every nuance of the original but are capable of serving virtually every user's needs. Thus, for a collection of historical images with moderate artistic value, a set of such digital images, if they endure, will provide the same service as a set of copy negatives. In a time of scarce resources, such copies may suffice for some collections. This approach will be especially appealing when the original materials can be conserved, e.g., by providing cold storage to halt the deterioration of older negatives.

  3. For some projects, the production scheme will create both an analog preservation copy and a digital reproduction. For example, a project to preserve a collection of nitrate negatives may create a set of analog interpositives which are then scanned to produce the digital reproduction.

The phrase "if they endure" in 1 and 2 above reveals the principal anxiety about digital information: will obsolescent hardware, software, and media render digital information unreadable? The Library shares that anxiety but at the same time hopes that the repository for digital information alluded to elsewhere in this document--with its accompanying routines for backing up and migrating data--will succeed in sustaining digital information.

Assembling the Elements

Once produced, the framework, access aid, reproductions, and introductory program are ready to be loaded into retrieval software; once loaded, the various links are actualized and the assembled collection is ready for use.

The Library's vehicle for patron access is the World Wide Web (WWW). Collections are currently loaded for WWW hyperlinking; the access aids are indexed for searching in Inquery software. Meanwhile, the Library welcomes private-sector initiatives that will lead to other forms of dissemination, e.g., CD-ROM publications or commercial online services.