Some people eat, sleep and chew gum, I do genealogy and write...

Monday, November 26, 2012

How do you handle thousands of images?

Harvest Time - one of thousands of images scanned from original glass and acetate negatives from the Margaret Godfrey Jarvis Overson Photographic Collection by James L. Tanner
Oh, you say, I only have a few images. But think about it: if you can handle a few, you already know the basics, and if you can't handle a few, then you need to learn the basics before you begin scaling up to hundreds or thousands. So there are a number of levels you have to go through before you are ready to work with a large number of images. First, you need the basics. Just a short time ago, I wrote about Geoff Rasmussen's new book on scanning, Digital Imaging Essentials, a very helpful book that has a good introduction to the basics. (The book is available in both PDF and paper editions.) But here, I am talking about what you do where Geoff left off, that is, organizing thousands of images (or tens of thousands or more).

In thinking about this subject, I didn't want to seem elitist or special in any way. Whether you have one document or thousands may depend on your particular family. I just happen to come from a family of savers. They saved boxes and boxes of everything. I have samples of everything from hair (yes, cut hair) and crocheted doilies to medals, thousands of letters and thousands of photographs. You may not have anything yet, but if you are persevering, you may ultimately have to think about how to preserve and organize all your documents, photos and other stuff.

I would summarize the levels of experience and knowledge you need as follows:

Entry - learning about digital files and why some file types are better than others for archiving and preservation.
Beginning - learning the basics of scanning; operating the scanner and/or a high resolution camera and saving files to your computer's hard drive or to an external hard drive. This includes knowing fundamental file maintenance and backup techniques.
Intermediate - Acquiring and using a basic image organizing program such as Google's Picasa or Photoshop Elements.
Advanced - Utilizing more advanced professional level programs such as Adobe Lightroom and Apple's Aperture and becoming involved in learning about digital preservation. In addition, investigating repositories where the records can be permanently maintained.

Where do you start to need advanced tools? That is hard to say. Will you reach that level with a thousand images? Ten thousand? Fifty thousand? Whatever level you choose, at some point you will find that you have lost control of the number and variety of images you have on your computer's various hard drives and you are in serious need of help. At that point, Picasa or Photoshop Elements no longer do the job. In my experience, Picasa worked for me for a while, then I moved on to Adobe Bridge. As the number of files continued to increase, I got a more powerful (read: harder to learn and use) program, Adobe Lightroom, and I found that the time savings alone were worth the expense and effort to learn the program.
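If you are not sure how many images you actually have, a short script can take a rough inventory before you choose a tool. The following is only a minimal sketch in Python, not a substitute for an image organizing program. The function name inventory_images and the list of file extensions are my own assumptions for illustration; adjust the extensions to match the formats you actually use.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; edit this to match your own files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff", ".png", ".dng", ".jp2"}


def inventory_images(root):
    """Count image files and total bytes per extension under a folder.

    Walks the folder and all of its subfolders. Extensions are compared
    in lower case, so "PHOTO.JPG" and "photo.jpg" are counted together.
    """
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in IMAGE_EXTENSIONS:
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return counts, sizes
```

Calling inventory_images on the top folder of a drive returns two tallies, one of file counts and one of bytes, keyed by extension. Running it once per drive gives a quick picture of how many images you have and how they are spread across formats.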

Another factor is the speed of the computer. This is one issue that has driven me over the years to buy the newest technology. I was always pushing the envelope with larger and more complex files, as well as increasing numbers of images. Presently, all of my scans and digitized images are downloaded from my camera as RAW images and stored in Adobe's Digital Negative (.dng) format with the RAW file embedded. The images are between 20 MB and 50 MB each. Now, if this doesn't mean anything to you, then you have a lot to learn. Usually, most instruction books and basic courses are aimed at the novice or beginner. This is not an area for either. If you are going to work with 100,000 images, you need a lot more than some basic training. You need some high-powered software tools as well as the fastest computer you can afford and the knowledge to use both.

Here is a simple question: how long would it take your computer and hard drive to load 100,000 images onto a 3 TB hard drive? It takes mine, one of the fastest available, over eight hours if there are no problems. If I had tried to do this even a year ago, I would have exceeded the capacity of my largest hard drive and bogged down my computer. I could get faster equipment now, but there is a trade-off between the time it takes for certain tasks and the cost. I buy a new computer when the old one makes me nervous because it is too slow. That cycle seems to take about three or four years. As for hard drives, I am already in the market for a 4 TB drive if the price comes down, or a larger drive if one becomes available. Right now, 6 TB drives are quite expensive and cost more than two 3 TB drives.
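To put rough numbers on that question, here is a small back-of-the-envelope sketch in Python. The 30 MB average file size is an assumption picked from inside the 20 MB to 50 MB range mentioned above, and the 100 MB per second sustained copy speed is a hypothetical figure, not a measurement of my equipment. The function names are only for illustration.

```python
def total_size_tb(image_count, avg_mb):
    """Total storage needed in decimal terabytes (1 TB = 1,000,000 MB)."""
    return image_count * avg_mb / 1_000_000


def copy_hours(image_count, avg_mb, throughput_mb_per_s):
    """Hours needed to copy the images at a steady throughput."""
    total_mb = image_count * avg_mb
    return total_mb / throughput_mb_per_s / 3600


# Assumed figures: 100,000 images averaging 30 MB each,
# copied at a hypothetical sustained 100 MB per second.
size = total_size_tb(100_000, 30)
hours = copy_hours(100_000, 30, 100)
print(f"{size:.1f} TB, about {hours:.1f} hours")
```

Under those assumptions the collection comes to 3 TB and the copy takes a little over eight hours. Real transfers are usually slower than the rated speed of a drive, so treat the result as a best case and plug in your own numbers.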

I read a blog post recently that recommended scanning images as follows:
At the very least, you should scan at 300 dpi (dots per inch) if you never intend to print larger than the resulting digital record size. 1200 or greater dpi is recommended if you think you will ever want to print a larger version of the record. The scanning device you purchase will have software that allows you to set the desired dpi.
The article does not identify the author, other than "Guest Blogger." But statements like the one above indicate a less than complete understanding of archival requirements. Here is a link to a pretty good explanation of the real issue with scanners.  Scanning resolution is a watershed issue: how well you understand it shows how much knowledge you need before you can be serious about digital preservation. The blog post quote is misleading and not specific enough to be helpful. If you skip over the basics of the physical requirements for archival results, you may do a lot of work that will ultimately be less useful than it could have been.
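Part of what the quoted advice leaves out is the plain arithmetic connecting the size of the original, the scanning resolution and the resulting file. The sketch below, in Python, shows only that arithmetic and is not an archival recommendation. The 4 x 6 inch original, the 300 pixels per inch print resolution and the 3 bytes per pixel (24-bit color, uncompressed) are assumptions chosen for illustration, as are the function names.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions produced by scanning an original at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def max_print_inches(pixels_wide, pixels_high, print_ppi=300):
    """Largest print size, in inches, at the assumed print resolution."""
    return pixels_wide / print_ppi, pixels_high / print_ppi


def uncompressed_mb(pixels_wide, pixels_high, bytes_per_pixel=3):
    """Uncompressed image size in decimal megabytes."""
    return pixels_wide * pixels_high * bytes_per_pixel / 1_000_000


# Assumed example: a 4 x 6 inch print scanned at two resolutions.
for dpi in (300, 1200):
    w, h = scan_pixels(4, 6, dpi)
    pw, ph = max_print_inches(w, h)
    print(f"{dpi} dpi: {w} x {h} pixels, "
          f"prints up to {pw:.0f} x {ph:.0f} in, "
          f"{uncompressed_mb(w, h):.2f} MB uncompressed")
```

The same 4 x 6 inch original yields 1200 x 1800 pixels at 300 dpi and 4800 x 7200 pixels at 1200 dpi, which is sixteen times the data. The right resolution therefore depends on the size and detail of the original, not just on a single rule of thumb.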

Another example of incomplete information from the above blog post is a reference to JPEG 2000 as an archival file format. Here is the Library of Congress listing of file formats for the JPEG and JPEG 2000 families. As you can see, this is not quite as simple as the blog post may suggest:

  • JPEG
      • JPEG, JPEG Image Encoding Family
      • JPEG_DCT_BL, JPEG DCT Compression Encoding, Baseline
      • JPEG_DCT_PRG, JPEG DCT Compression Encoding, Progressive
      • JPEG_DCT_EXT, JPEG DCT Compression Encoding, Extensions
      • JPEG_orig_LL, JPEG Original Lossless Compression (ISO/IEC 10918)
      • JPEG-LS, JPEG Lossless Compression (ISO/IEC 14495)
      • JFIF_1_02, JPEG File Interchange Format, Version 1.02
      • SPIFF, Still Picture Interchange Format
      • JPEG_EXIF, JPEG Encoded File with Exif Metadata
  • JPEG 2000 Encodings
      • J2K_C, JPEG 2000 Part 1, Core Coding System
      • J2K_C_LL, JPEG 2000 Part 1, Core Coding, Lossless Compression
      • J2K_C_LSY, JPEG 2000 Part 1, Core Coding, Lossy Compression
      • J2K_C_Profile_0, JPEG 2000 Part 1, Core Coding, Profile 0
      • J2K_C_Profile_1, JPEG 2000 Part 1, Core Coding, Profile 1
      • J2K_C_Profile_3, JPEG 2000 Part 1, Core Coding, Profile 3
      • J2K_C_Profile_4, JPEG 2000 Part 1, Core Coding, Profile 4
      • J2K_C_BIIF_01_00, JPEG 2000 Part 1, Core Coding, BIIF Profile (v. 01.00)
      • J2K_C_NDNP, JPEG 2000 Part 1, Core Coding, NDNP Profile
      • J2K_EXT, JPEG 2000 Part 2, Coding Extensions
  • JPEG 2000 File Formats
      • JP2_FF, JPEG 2000 Part 1 (Core) jp2 File Format
      • JPX_FF, JPEG 2000 Part 2 (Extensions) jpf File Format
      • JPM_FF, JPEG 2000 Part 6 (Compound) jpm File Format
  • JPEG 2000 File Formats with Encoded Bitstreams
      • JP2_J2K_C_LL, JP2 File Format with JPEG 2000 Core Coding, Lossless
      • JP2_J2K_C_LSY, JP2 File Format with JPEG 2000 Core Coding, Lossy
      • JP2_J2K_C_Profile_0, JP2 File Format with JPEG 2000 Core Coding, Profile 0
      • JP2_J2K_C_Profile_1, JP2 File Format with JPEG 2000 Core Coding, Profile 1
      • JP2_J2K_C_Profile_3, JP2 File Format with JPEG 2000 Core Coding, Profile 3
      • JP2_J2K_C_Profile_4, JP2 File Format with JPEG 2000 Core Coding, Profile 4
      • JP2_J2K_C_BIFF_01_00, JP2 File Format with JPEG 2000 Core Coding, BIIF Profile (v. 01.00)
      • JP2_J2K_C_NDNP, JP2 File Format with JPEG 2000 Core Coding, NDNP Profile


As a further example, here is a quote from the Library of Congress about JPEG 2000:
At the same time, JPEG 2000 encoding is not generally built into still-photography camera chips nor is JPEG 2000 decoding native to Web browsers, and this has led some commentators to compare JPEG 2000 unfavorably to JPEG_DCT in terms of adoption. JPEG_DCT is native to virtually all still-image digital cameras and Web browsers. Meanwhile, however, JPEG 2000 has begun to appear as a built-in option in moving image cameras.
You see, JPEG 2000 is a specialized format. For example, it is not currently supported by Adobe Photoshop without a specialized plug-in.

These comments are not made with the intention of discouraging anyone from becoming involved in digital preservation. I can get really technical, really fast. This is also an area with a huge number of differing opinions on file formats, image resolution, and other controversial issues. My point is that if you are going to spend the time to digitize thousands of images, you need to know what you are doing and make sure the results of your scanning activities are productive, in the sense that the images are useful and the file formats sustainable.

Where to go to start? How about the Library of Congress website? I suggest as a minimum that you become familiar with the issues involved in digital preservation.

There is a whole lot more to be said on this subject. I would be glad to help anyone facing the problem of archiving thousands of photographs. You can contact me through Facebook.

