Sunday, August 9, 2009

Standards for Genealogical Scanning


In a recent article published in the Netherlands, "The Current State-of-art in Newspaper Digitization, A Market Perspective," in D-Lib Magazine, Edwin Klijn summarizes the current standards for professional scanning. Since so much of the source material for genealogical research is being scanned and put online, I thought it important that individuals who are scanning for their own research know of these international standards. Quoting from the article:

Most companies use specialized equipment for scanning from microfilm and paper originals. Sometimes this is commercially available hardware such as standard A0 or A1 flatbed scanners. Some companies use custom-made large-format scanners purposely built to digitize newspapers. To create master images the consensus approach is to scan at 300ppi. The preferred format is uncompressed lossless TIFF, although some respondents also suggest using JPEG (quality 10) or JPEG2000. Scanning from the originals is generally acknowledged to produce higher quality master images. There is some disagreement amongst the survey respondents as to whether one should scan in colour or greyscale. Scanning in colour produces a master that is closer to the original newspaper (more 'authentic') than greyscale. Also, according to some respondents colour images may lead to better OCR results, or at least provide better 'raw materials' to improve the OCR in due course. Choosing the appropriate format is also closely related to the issue of storage. A master image in TIFF format requires approximately twice as much storage space as a JPEG2000 (lossless) image and ten times as much as a JPEG (quality 10) image requires.

Frequently applied image enhancement technologies include tools for deskewing, despeckling, rotation, cropping, noise removal, balancing white backgrounds and image splitting. These tools are often used in semi-automated processes, with manual correction performed at the end. Some companies optimize images in order to improve OCR results. In their workflow they clearly distinguish between images produced for viewing and images that are specifically prepared for OCR processing. In this context the alternative of so-called hybrid PDFs is suggested. These PDFs embed different quality levels within a single file, e.g. one image optimized for the plain text and delivered as a bitonal image, and another image for the illustrations on the page, delivered in greyscale.

As the derivative for web delivery, most respondents recommend JPEG, mainly because of its efficient compression rate and zooming potential. Three respondents mention the JPEG2000 format as a suitable derivative. ISO-standard JPEG2000 is considered to be an efficient compression format because it produces relatively small files. One large digitization company strongly advises against using JPEG and – to a lesser degree – JPEG2000. It argues that in the case of bitonal and greyscale images, such as those with line-art drawings, JPEG compression can lead to low-quality images. According to this respondent, PNG is preferable to JPEG because it is presently more widely supported than the promising – but not yet generally accepted – JPEG2000. This view is supported by another respondent who believes that PNG provides the optimum compression for B&W and text 'images'. Two other respondents suggest PDF as an alternative format for derivatives. Since the majority of all users are familiar with PDF files, delivering newspaper pages or articles in PDF is a common feature of most newspaper web delivery systems.

This corresponds with my own experience in scanning over the past ten or fifteen years. I would add, however, that delivery in PDF format will not be as useful to genealogists until the lineage-linked database programs start supporting the inclusion of PDF files.
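
For anyone planning disk space for their own scanning, the storage ratios quoted above can be turned into a rough estimate. The following is only a sketch in Python: the function names are my own, the page size is an assumed example, and the one-half and one-tenth factors are the article's approximate figures, not measurements. Actual file sizes will vary with the content of each page.

```python
def uncompressed_bytes(width_in, height_in, ppi=300, channels=3, bits_per_channel=8):
    """Approximate size of an uncompressed scan, ignoring file headers.

    channels=3 is 24-bit color; channels=1 is 8-bit greyscale.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * channels * bits_per_channel // 8


def estimate_storage(width_in, height_in, ppi=300, color=True):
    """Rough per-page storage in bytes for each master format.

    Assumption: the ratios come from the quoted article (TIFF is about
    twice a lossless JPEG2000 and about ten times a quality-10 JPEG).
    """
    channels = 3 if color else 1
    tiff = uncompressed_bytes(width_in, height_in, ppi, channels)
    return {
        "tiff": tiff,
        "jpeg2000_lossless": tiff // 2,
        "jpeg_q10": tiff // 10,
    }


if __name__ == "__main__":
    # Hypothetical example: a letter-size page (8.5 x 11 inches) in color.
    for fmt, size in estimate_storage(8.5, 11).items():
        print(f"{fmt}: {size / 1_000_000:.1f} MB")
```

By this estimate, a letter-size page scanned in color at 300ppi comes to about 25 MB as an uncompressed TIFF, so a few hundred pages will fill several gigabytes. Scanning in greyscale cuts each figure to one third.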

1 comment:

  1. I have to speak up for digitizing in color. I deal with 16th century Spanish documents digitized by the General Archive of the Indies in Seville, as well as other Spanish archives. The greyscale images can be awfully difficult to decipher. One document I dealt with is Juan de la Bandera's "long" account of the Juan Pardo Expedition (Cataloged as SANTO_DOMINGO,224,R.1,N.8). On the second of 35 folios, the background was speckled to the degree that the writing was extremely difficult to decipher. If the image had been scanned in color, I am sure that it would have been easier to read, as the background would have had less interference.

    I also have had experience with images rendered as PDF, and I am not fond of them. I've dealt with the microfilmed version of the East Florida Papers, which date from Florida's Second Spanish Period (1783-1820), and were colonial Spanish papers seized by the U.S. government upon taking over Florida from the Spanish. The originals are in the Library of Congress. I used the scanning microfilm reader at the Jacksonville Public Library to capture images from the library's microfilm edition of the Papers for use as illustrations in my book on the colonial, territorial, and state censuses of Florida, which comes out this fall. The only way the library's equipment could render the image was as a PDF. I would much rather have had a JPG image, because the PDF was extremely stark and contrasty.

    The East Florida Papers are being digitized, and an individual I did some translation for gave me a copy of an image in color, and it was much easier to read than if it had been in greyscale. Evidently, the scanning project is using the originals.

    Good examples of color digitization can be found on the Florida State Archives' Florida Memory Project, which has color scans of Spanish land grants and Confederate Civil War pension files, http://www.floridamemory.com/Collections/.

    My two cents from my experience with digitized documents.
