Some people eat, sleep and chew gum, I do genealogy and write...

Saturday, December 6, 2014

What percentage of the world's records have actually been digitized?

Among a certain group of genealogists, there is an automatic reaction to statements about online digitization. The issue is whether all of the world's records have been, or could ever be, completely digitized. This question comes up in the context of the belief held by some that "everything has been digitized and is online," as opposed to a currently popular and circulating figure that "only 10% of the world's records are online."

The real question involves a more complex issue altogether. I have been thinking about the question of digitization for a long time. Off and on, I have done some research to see if determining a reasonable answer to the question is even possible. Earlier this year, FamilySearch published the infographic, shown above, with some speculation about the number of records and the progress towards complete digitization. When I see any statement that quotes round-number statistics, I automatically reject those statistics as false or at least bad estimates. Now, I probably need to explain my reasons and also express my own opinion about the status of digitization as it applies to the genealogical community.

First, a note about the categories of records used by genealogists. Here is a tentative list of those categories:
  • Unique records that are presently only on paper or the equivalent, such as private letters, documents, photographs, or similar records. By their unique nature, these documents exist only in one location.
  • Records, on paper or the equivalent, that originally had multiple copies stored in multiple locations such as books, newspapers, other periodicals, short-run family histories, government documents etc.
  • Documents that are available on microfilm, microfiche or some other duplication method and are primarily stored in some sort of institutional repository. 
  • Documents that have been digitized and are available only at one unique location, such as private digitization efforts undertaken by individuals.
  • Documents that are digitized, but still remain generally available in paper copies, such as books, newspapers, and other widely disseminated records.
  • Documents that have been digitized but the original record, in whatever format, has been destroyed or is no longer available. 
  • Documents that are digitized and available only to certain people who have "official access." This can be anything from classified government documents to restricted collections in university libraries.
  • Documents that are digitized but are available only by subscription.
  • Documents that are digitized and freely available on the Internet. 
It should be immediately pointed out that the availability of a record for examination by a genealogist is an entirely separate issue from the issue of whether or not the document has been digitized. Many private business and government records around the world are held in electronic format, but are totally unavailable to the "public" without special permission for access. The mere fact that a record is digitized does not mean that it will necessarily be any more available than that same record on paper.

Digitization itself does not ensure that any given record will be any more available than it was before the document or record was digitized. Preservation of a document and its availability are two entirely separate issues. From this standpoint, digitization becomes only one of many methods of preservation. As a side note, I am frequently asked whether original paper records should be thrown away or otherwise disposed of after the records are digitized. In every case, I plead that the original paper records be preserved, if at all possible. If we are at all interested in preserving records, the originals should always be preserved and care taken to also preserve copies of the originals. Moving the document from one format to another does not ensure availability of the document and, in the case of digitization, the document can be lost as easily as (or perhaps more easily than) an original on paper.

So where are we, as genealogists, in this process of converting documents to a computer-readable and electronically storable format? We should be very cautious about blithely throwing around numbers about the percentage of documents that are and have been put in electronic format. The infographic above is a good example of the problems associated with ignoring this advice. Look very closely and critically at what is and what is not presented by this one view expressed in the infographic.

This infographic represents the position of one digitizing entity: FamilySearch. On its face, it says that FamilySearch has 5.3 billion records preserved. It does not say that all of those records have been digitized. So the question is, how many of the 10 billion more records "from the regions in black" are preserved but not digitized, and how many of the already preserved records are available to the genealogical community? Where does the number of 60 billion additional records come from? How was it estimated? How many of those records have already been digitized by other entities? How many of those records are available to genealogists?

These types of questions need to be asked whenever this subject comes up. Although this infographic has several other entities' logos along the bottom, it makes no attempt to estimate the number of records in the possession of these other entities. But fundamentally, what is a record? What is considered a record? Is one book a record or is each page in the book a record?

If you go to and look at the Historical Record Collections, you will see the term "Collections" and the term "Records." In many instances, it appears that the term "record" means individual entries. For example, a U.S. Census form may have 50 lines for information. Is each line a record, is each of the Census sheets a record, or is an entire Census year a record? The choice of meaning for the word "record" makes a huge difference in the total number of records already digitized and the total number of "records" left to be digitized.

I decided to see if the numbers of digitized records on the's record collections come anywhere close to the number of 5.3 billion on the infographic. So I got out a calculator and started to add up the numbers of "records" listed for the "collections." To get an estimate, I only added up the numbers in 1 million record increments, so I was looking for 5,300 million. I was immediately stopped in my efforts by the fact that there were no record counts for the collections that had only images. But I decided to throw in the number of images with the number of records just to see what happened. Arguably, an "image" could have only one record or dozens, but counting the images in with the records would give a lower number than an accurate one.

After a few minutes of trying out this exercise, I got really frustrated. For example, without knowing whether each line in a census sheet was considered a record, if there were over a million images, did that represent a million records or 50 million? Obviously, at this point the issue of the definition of a record becomes absolutely crucial. Guess what? In the FamilySearch Infographic there are no definitions.
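To make the size of that ambiguity concrete, here is a minimal sketch in Python. It assumes, as in the census example above, that a single image holds anywhere from 1 to 50 records. The function name and the per-image bounds are my own illustration and do not come from FamilySearch.

```python
def record_range(images, min_per_image=1, max_per_image=50):
    """Return the (low, high) record counts implied by a number of images.

    The per-image bounds are assumptions: 1 if each image is one record,
    50 if each line of a 50-line census sheet counts as a record.
    """
    return images * min_per_image, images * max_per_image


low, high = record_range(1_000_000)
print(f"{low:,} to {high:,} records")  # 1,000,000 to 50,000,000 records
```

The same million images can honestly be reported as anything between one million and fifty million "records," depending on a definition that is never stated.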

But if the Infographic is taken at its face value, there are approximately 70 billion records in the world and 5.3 billion of those have been preserved by FamilySearch alone! That is something over 7% in one repository alone. But what about,,,, and on and on and on? Guess what? None of these entities use the same criteria to measure the number of records (whatever that is) that have been digitized. For example, presently uses the number of 5.7 billion+ records on its website while lists the total of collections but not the total number of records. I could add up over 4 billion records on from just the first two pages of their Card Catalog. So three entities,, and have an estimated 15 billion records.
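For anyone who wants to check the arithmetic, here is a rough sketch in Python. It reads the infographic's figures as billions (5.3 billion preserved out of roughly 70 billion in total) and uses the three site totals quoted above. The variable names are my own, and the sum ignores duplicates and the fact that the sites do not count "records" the same way, so the result is only as good as the inputs.

```python
# All figures are in billions of "records", taken at face value.
world_estimate = 70.0          # approximate total implied by the infographic
site_totals = [5.3, 5.7, 4.0]  # the three site figures quoted in the post


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


first_share = share(site_totals[0], world_estimate)
combined = sum(site_totals)
combined_share = share(combined, world_estimate)

print(f"First repository alone: {first_share:.1f}%")
print(f"Three sites combined: {combined:.1f} billion, {combined_share:.1f}%")
```

Even this simple sum lands a long way from the circulating "10%" figure, which is one more reason to distrust any single round number.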

Now I hear the squawks. What about duplicates in each of these websites? Yes, some of the records are duplicated. But I did not include the 2 billion entries on Ancestry's Public Member Family Trees or the 1.5 billion plus on MyHeritage's trees etc. 

This post is getting pretty long. What is my conclusion? Easy. No one has a clue as to the actual number of records or the actual number that have been digitized and no matter what the actual numbers turn out to be, we will all need to be looking at paper for a long time. But from my perception, the average genealogist will probably never look at paper again during their entire lifetime. Most of the so-called genealogists will find all they think they need online. In those rare cases when they get so frustrated that they actually need to look at paper, they won't know what to do. 

A final note. I have worked in the Mesa FamilySearch Library and the Brigham Young University Family History Library for a combined total of more than ten years now. I find that both have huge, valuable collections of paper-based books. But hardly anyone touches them. I have yet to see one patron at the BYU Family History Library pull a reference book off the shelf, and very, very few of those books are readily available in digital copies. Yes, paper will be around for a long time, but will it be used?
