12. Phase II production plan

Image types. The plan for Phase II, the demonstration project's production phase, reflected the Phase I discussions and called for the capture of the following image types:

For the record, in a post-project activity carried out by the Library, a second set of access images was produced:

Producing the preservation images. The initial image capture of the 10,000 testbed images was carried out intermittently through 1996. The scanning took place in the Music Division at the Library of Congress. At scan time, each sheet was placed on the platen of an HP ScanJet 2cx flatbed scanner and captured at 300 dpi. Grayscale images were scanned at 8 bits per pixel; if the operator saw any intentional color on the document, it was scanned in color at 24 bits per pixel. Backs of sheets were scanned if they contained significant (or potentially significant) marks of any kind.

Approximately 20 percent of the images were captured in color. Understanding capture to include finding the next sheet, placing it on the platen, scanning, removing the sheet and refiling it, and examining the image on screen, the average time to capture a grayscale image was 1 to 2 minutes; a color image took 3 to 4 minutes. The greater amount of time required for color is worth noting; it could be a factor to consider when weighing the pros and cons of color capture for a large-scale production project. With the percentages of grayscale and color images indicated, and allowing for various administrative and other activities, the operator was generally able to capture approximately 200 to 300 images in an eight-hour day.

Contrast stretching. The SCANJPEG program developed to control scanning performed contrast stretching under operator guidance. After scanning and before compression, while the entire luminance image was in memory (after RGB to YCbCr conversion for color scans), a gray-level histogram was calculated and shown to the operator as a graph. The operator used the mouse to set the dark and light cut points at the grayscale values beyond which no significant number of pixels occurred.

In most of the scanning, both the dark and light cut points were set. The software then created a look-up table that remapped each original grayscale value, stretching the range of values actually used over the full 0 to 255 scale and thus brightening the image and increasing its contrast. The cut points were written into the log file for the batch. The brightened image was immediately displayed on screen for the operator as it was compressed and written to disk.
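Assuming the scanned page is held as an 8-bit grayscale array, a minimal sketch of this histogram-and-stretch step might look as follows. The function names and the NumPy implementation are illustrative, not the SCANJPEG code; setting dark_cut to 0 corresponds to the white-end-only stretching adopted later.

    import numpy as np

    def gray_histogram(gray: np.ndarray) -> np.ndarray:
        """256-bin histogram of an 8-bit grayscale image (what the operator saw graphed)."""
        return np.bincount(gray.ravel(), minlength=256)

    def stretch_lut(dark_cut: int, light_cut: int) -> np.ndarray:
        """Look-up table mapping [dark_cut, light_cut] onto the full 0 to 255 range."""
        levels = np.arange(256, dtype=float)
        scaled = (levels - dark_cut) * 255.0 / max(light_cut - dark_cut, 1)
        return np.clip(scaled, 0, 255).astype(np.uint8)

    def stretch(gray: np.ndarray, dark_cut: int = 0, light_cut: int = 255) -> np.ndarray:
        """Remap every pixel through the look-up table; dark_cut=0 stretches only at the white end."""
        return stretch_lut(dark_cut, light_cut)[gray]

For example, stretch(page, dark_cut=0, light_cut=220) would brighten a page whose lightest background pixels sit near 220 while leaving the dark end of the scale untouched.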

Late in the scanning, it was decided that a more correct procedure was to stretch only at the white end of the scale, since a seemingly insignificant number of darker pixels (too small to register on the graph) might actually play a role in rendering tonal distinctions. Aggressive clipping at the dark end could partially eliminate some of those distinctions, slightly lowering quality.

Following the JPEG specification, the color images were first converted from RGB (red, green, blue) to YCbCr (luminance, chrominance-blue, chrominance-red). Each image was then compressed at approximately a 10:1 to 20:1 ratio using the JPEG baseline algorithm. The compression took place in software on the scanning workstation, and the compressed image was written to the workstation's hard disk. These stored images constituted the set of preservation-quality images.
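As a rough illustration of this step, the sketch below compresses one scanned page to baseline JPEG using the Pillow library, which performs the RGB-to-YCbCr conversion internally when writing color JPEGs. The quality setting and the printed ratio check are assumptions standing in for whatever parameters the project actually used.

    import io
    from PIL import Image

    def jpeg_compress(path: str, quality: int = 75) -> bytes:
        """Compress one scanned page to baseline JPEG and report the compression ratio."""
        img = Image.open(path)               # 8-bit grayscale ("L") or 24-bit color ("RGB")
        raw_size = img.width * img.height * len(img.getbands())
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)   # RGB is converted to YCbCr internally
        data = buf.getvalue()
        print(f"compression ratio: {raw_size / len(data):.1f}:1")
        return data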

In batches, the preservation-quality images were recorded on writable CD-ROM disks for shipment to the contractor's facility in California. The directory naming structure followed specifications developed by the Library for the Federal Theatre Project collection curatorial staff. The names of individual files represented exposures or "pages" and consisted of an incrementing sequence of numbers generated by the scanning software.

Producing the access images. When the batches of preservation-quality images arrived in California, they were processed to create the set of binary access images. The process began by decompressing the JPEG preservation image (in the case of the grayscale examples) or by decompressing only the luminance or Y portion of the image (in the case of the color images). Next, a sophisticated thresholding algorithm was used to create a binary access image at 300 dpi. Finally, the binary access images were compressed via the CCITT Group 4 algorithm in software and written to CD-ROM disks for delivery to the Library.
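Assuming Pillow for JPEG decoding and Group 4 TIFF writing, the pipeline can be sketched roughly as below. The convert("L") call approximates "taking only the luminance," and the simple global cutoff is only a placeholder for the edge thresholding algorithm described next; the file names and cutoff value are illustrative.

    from PIL import Image

    def make_access_image(jpeg_path: str, tiff_path: str, cutoff: int = 160) -> None:
        """Decompress a preservation JPEG, binarize it, and write a CCITT Group 4 TIFF."""
        # Keep only the luminance information; for color scans this stands in
        # for decoding just the Y component of the YCbCr image.
        gray = Image.open(jpeg_path).convert("L")
        # Placeholder binarization: black where the pixel is darker than the cutoff.
        binary = gray.point(lambda v: 0 if v < cutoff else 255, mode="1")
        # Group 4 compression is defined only for bilevel images, hence mode "1".
        binary.save(tiff_path, format="TIFF", compression="group4")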

The edge thresholding algorithm is based on techniques developed by Picture Elements. It has been implemented both in software and in high-speed hardware for integration into scanners by manufacturers. It produces very detailed and clear binary images by finding significant edges within the image and placing the black/white boundaries of the binary image at those precise edge locations, yielding highly accurate binary images with stroke widths exactly matching those in the original grayscale images. A variation incorporated into the algorithm also reproduces broad, fuzzy lines (such as those produced on copies by carbon paper) with high sensitivity.
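The Picture Elements algorithm itself is proprietary and is not reproduced here; the sketch below only illustrates the general idea of letting detected edges decide where the black/white boundary falls, using a Sobel gradient and a local midpoint cut near edges. The window size and gradient cutoff are arbitrary illustration values.

    import numpy as np
    from scipy import ndimage

    def edge_guided_binarize(gray: np.ndarray, window: int = 15,
                             edge_cutoff: float = 30.0) -> np.ndarray:
        """Return a boolean image (True = black) from an 8-bit grayscale page."""
        gray = gray.astype(float)
        # Gradient magnitude marks where significant edges occur.
        gx = ndimage.sobel(gray, axis=1)
        gy = ndimage.sobel(gray, axis=0)
        edges = np.hypot(gx, gy) > edge_cutoff

        # Local midpoint between the darkest and lightest pixels near each point.
        lo = ndimage.minimum_filter(gray, size=window)
        hi = ndimage.maximum_filter(gray, size=window)
        midpoint = (lo + hi) / 2.0

        # Near edges, cut at the local midpoint so the black/white boundary tracks
        # the edge; pixels far from any edge are treated as background (white).
        near_edge = ndimage.binary_dilation(edges, iterations=window // 2)
        return near_edge & (gray < midpoint)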

Iterative thresholding. Picture Elements first attempted to threshold the images with a single setting, but some of the images had low contrast information, while others had higher contrast "noise" (especially the ones with onionskin paper), requiring either manual inspection and re-thresholding or automatic analysis and re-thresholding. Picture Elements chose the latter approach.

Two approaches were used to identify images with excessive noise. The first was based on the observation that noisy images tend to have a large number of very small specks and a smaller number of larger specks. Small specks can be removed from the image via speckle deletion, but large specks can be as large as fine print, so they cannot be speckle-deleted without risking information loss and are therefore more of a problem. Reducing the thresholding sensitivity tends to reduce both sizes of speckle, with the larger specks being eliminated first, so the goal is to determine how much the sensitivity needs to be reduced to produce a clean image. Fortunately, images tend not to have both low-contrast information and noisy backgrounds at the same time, so the highest sensitivity with acceptably low noise tends to produce the best binary image possible for the page.

If the number of small specks is low, then there are probably very few large specks in the image, and a high sensitivity can be used to pick up any low-contrast information. An easy way of estimating the number of small specks is to threshold the image twice, once with speckle deletion enabled and once with it disabled, and to compare the compressed file sizes. If the ratio of the file sizes is close to 1 (e.g., 1.05), then there must have been few small specks in the image, and so probably few large specks as well. A large ratio of file sizes (e.g., 1.5 to 1) indicates that there is a lot of noise in the image and that a less sensitive threshold is needed to produce a clean image.

The noisy images also had very large compressed file sizes. A manuscript page will typically have fewer than 4,000 typewritten characters, but a noisy image can have 100,000 noise specks. A clean image can compress to 20 KB (kilobytes), while a noisy image can exceed 100 KB, with the average at around 50 KB. Decreasing the threshold sensitivity rarely causes a large percentage of the characters to drop out, but it can cause all of the noise to disappear. Therefore a test is made to see whether the compressed file size is over 100 KB. A second test examines the ratio of the file sizes produced by slightly differing threshold sensitivities. If the ratio is small, then reducing the sensitivity probably does not eliminate much noise (i.e., the noise is already gone) but could be dropping out some information, so the more sensitive threshold is used.

Hardware thresholding and image compression allow many alternative settings to be evaluated automatically, starting with the most sensitive and stopping when it is determined that a reasonably clean thresholded image has been produced. In the algorithm employed, a total of eight threshold sensitivities are available (each with and without speckle deletion), with the noisiest documents requiring the most iterations.
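A sketch of this decision loop is given below under the same assumptions as the earlier examples. A plain global threshold and a connected-component despeckler stand in for the hardware edge thresholder; the 1.05 ratio guide and the 100 KB ceiling follow the figures quoted above, and everything else (function names, the sensitivity-to-cutoff mapping, the speck-size limit) is illustrative.

    import io
    import numpy as np
    from PIL import Image
    from scipy import ndimage

    SENSITIVITIES = range(8)      # 0 = most sensitive, 7 = least sensitive
    SIZE_CEILING = 100 * 1024     # compressed pages above this are treated as noisy
    CLEAN_RATIO = 1.05            # with/without-despeckle size ratio of a clean page

    def binarize(gray: np.ndarray, sensitivity: int, despeckle: bool) -> np.ndarray:
        """Stand-in binarizer: a global cutoff that is lowered as sensitivity decreases."""
        binary = gray < (160 - 10 * sensitivity)          # True = black ink
        if despeckle:
            # Drop connected components smaller than a few pixels (small specks only).
            labels, count = ndimage.label(binary)
            sizes = ndimage.sum(binary, labels, range(1, count + 1))
            small_ids = np.nonzero(sizes < 4)[0] + 1
            binary = binary & ~np.isin(labels, small_ids)
        return binary

    def group4_size(binary: np.ndarray) -> int:
        """Compressed size in bytes of the binary page under CCITT Group 4."""
        image = Image.fromarray(np.where(binary, 0, 255).astype(np.uint8)).convert("1")
        buf = io.BytesIO()
        image.save(buf, format="TIFF", compression="group4")
        return buf.getbuffer().nbytes

    def choose_binary(gray: np.ndarray) -> np.ndarray:
        """Start at the most sensitive setting and back off until the page looks clean."""
        for s in SENSITIVITIES:
            plain = binarize(gray, s, despeckle=False)
            cleaned = binarize(gray, s, despeckle=True)
            cleaned_size = group4_size(cleaned)
            ratio = group4_size(plain) / max(cleaned_size, 1)
            if ratio <= CLEAN_RATIO and cleaned_size <= SIZE_CEILING:
                return cleaned            # acceptably clean at this sensitivity
        return cleaned                    # fall back to the least sensitive result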

While the first images were processed using a software routine, most of the images were binarized using the same algorithm implemented in hardware. In hardware, the process took from less than one to a few seconds per image.

This type of thresholding algorithm is complex and too slow to run at commercially viable speed in software-only implementations (even without iteration). Hardware implementations of related (though non-iterative) algorithms are now seen in binary scanners from BancTec, IBML, Bell & Howell and Kodak.

