Some people eat, sleep and chew gum, I do genealogy and write...

Friday, October 29, 2010

Genealogy, pages, images, books, documents and records -- What???

In my recent post about statistics for FamilySearch, a comment by Randy Seaver of Genea-Musings got me thinking about the statistics and terminology used online by all of those huge records collections, everybody from FamilySearch to the Library of Congress. One of the most influential books I have ever read is a small 144-page treatise written in 1954 entitled "How to Lie with Statistics." Here is the complete bibliographical information:

Huff, Darrell, and Irving Geis. How to Lie with Statistics. New York: Norton, 1954.

Now, don't get me wrong, I am not saying that any of the record collections are lying. But it is very useful to understand how statistics, and in the case of genealogical records, the numbers, can be manipulated to show a specific result. Numbers and statistics are quoted every day by news outlets, public relations organizations and even individuals, and skewing the numbers for a particular purpose is rampant in all of the media and especially on the Internet. In the genealogy world, online providers use their statistics to show how large they are in relation to other providers. This tendency is not limited to subscription or pay-as-you-copy services, but is common throughout the entire online world. I guess that the large numbers are supposed to impress potential users with the usefulness of the database, or to serve other motivational or advertising purposes.

Let's look at some of the words commonly used to describe and quantify online collections. As we review these words, ask yourself whether they really have any quantitative meaning.

Name: You would think that counting the number of names in a database would be unambiguous. Maybe. It is possible that the same individual is mentioned many times in the same document. Is each instance of the person's name counted as a "name"? Who determines if there are duplicates? Is there an unwarranted and unsupported assumption that each name represents a unique individual?

Individuals: Well, what can I say? How do they know how many individuals there are in the database unless they take duplication into account? For example, there are databases in which my great-grandfather's name shows up dozens, perhaps hundreds, of times. How many individuals are there in the database? Does anyone really know?
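To make the difference between "names" and "individuals" concrete, here is a minimal sketch in Python. The index entries are entirely made up for illustration and do not come from any real database; the three counting rules are only my assumptions about how a provider might count, not a description of how any provider actually does.

```python
# Hypothetical index entries: (name as written, birth year, place).
# All of this data is invented for illustration only.
mentions = [
    ("John Smith", 1850, "Utah"),
    ("John Smith", 1850, "Utah"),     # same person indexed twice?
    ("J. Smith", 1850, "Utah"),       # same person, abbreviated? Unknown.
    ("John Smith", 1872, "Arizona"),  # same name, probably a different person
    ("Mary Jones", 1861, "Utah"),
]

# Rule 1: every mention counts as a "name".
name_mentions = len(mentions)

# Rule 2: count each distinct spelling once.
distinct_names = len({name for name, _, _ in mentions})

# Rule 3: count each distinct (name, year, place) combination once.
distinct_entries = len(set(mentions))

print("Name mentions:   ", name_mentions)
print("Distinct names:  ", distinct_names)
print("Distinct entries:", distinct_entries)
```

The same five entries give three different totals depending on the rule chosen, and none of the rules can say whether "J. Smith" and "John Smith" are one person. That still takes a researcher.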

Page or image: Probably the least ambiguous of the terms. I guess it is supposed to be understood that each "page" equals one scan or one image. You will obviously note that one page might have a hundred or more names. So what does it mean if one page has one name and the next page has over one hundred names? What if a large number of the "images" have no names or other useful information at all?

Document: If page is the least ambiguous, then document is the most. What is a document? A document might have one page or 1000 or more. What use is it to count documents? Any given document may have no genealogical information or be extremely useful. Counting the number of documents in a database really gives no information as to the value of the information contained.

Book: See document. Books can have no genealogical value or be chock full of useful information. Knowing the number of books in a library, for example, might give you some idea of the size of the building, but what if the books are all for children and only have a dozen pages? Almost every library website, somewhere, lists the number of books in the library. Other than as a general comparison, counting the number of books in a library is like the time I sat out on the wood pile and numbered all the pieces of wood. An interesting activity but not too useful.

Records: See books, documents, and images.

Items: See books, records, documents and images.

Collections: The most ambiguous term of them all. How do you define a collection? Is a collection on FamilySearch Record Search equivalent to a collection on If so, what are the criteria for comparison? If not, what is the meaning of the word?

It is likely that each of the larger database providers has its own internal definitions of each of these qualifiers. What is apparent is that no one outside of the organization has any idea what those definitions are.

OK, now how do the various entities use these numbers? For example, in its annual report for fiscal year 2008, the Library of Congress reports that it recorded a total of 141,847,810 items in the collections. What is an item and what is a collection? Neither word is defined in the report. However, the report goes on to state the following:
  • 21,218,408 cataloged books in the Library of Congress classification system
  • 11,599,606 books in large type and raised characters, incunabula (books printed before 1501), monographs and serials, music, bound newspapers, pamphlets, technical reports and other printed material
  • 109,029,796 items in the nonclassified (special) collections, including:
    • 3,005,028 audio materials, such as discs, tapes, talking books and other recorded formats
    • 62,778,118 manuscripts
    • 5,357,385 maps
    • 16,086,572 microforms
    • 5,674,956 pieces of printed sheet music
    • 14,388,175 visual materials, as follows:
      • 1,207,776 moving images
      • 12,536,764 photographs
      • 98,288 posters
      • 545,347 prints and drawings
Sounds like a lot of detail, doesn't it? The large numbers are very impressive and might be useful in obtaining Congressional funding. But think about what is actually being said. In each case there is an implication that someone on the staff of the Library has actually counted an item as distinct. What about duplication? Would your opinion of the collections at the Library of Congress be affected by knowing that many of the items came from the Copyright Office and that there is a requirement as follows:
•    All works under copyright protection that are published in the United States are subject to the mandatory deposit provision of the copyright law.
•    This law requires that two copies of the best edition of every copyrightable work published in the United States be sent to the Copyright Office within three months of publication.
Is there anything in the report that talks about the duplicate copies of all of the copyrighted works? Do they count or discount the duplicates? Hmmm. Good question.
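The figures quoted from the report above can at least be checked for internal consistency. Here is a minimal sketch in Python that simply adds them up. The grouping is my own reading of the list, not something the report spells out in the excerpt: I am assuming that audio materials through visual materials are subcategories of the special collections, and that the last four entries are subcategories of visual materials.

```python
# Figures as quoted above from the fiscal year 2008 annual report.
total_reported = 141_847_810

top_level = {
    "cataloged books": 21_218_408,
    "other printed material": 11_599_606,
    "special collections": 109_029_796,
}

# Assumed subcategories of the special collections (my reading of the list).
special_listed = {
    "audio materials": 3_005_028,
    "manuscripts": 62_778_118,
    "maps": 5_357_385,
    "microforms": 16_086_572,
    "printed sheet music": 5_674_956,
    "visual materials": 14_388_175,
}

# Assumed subcategories of visual materials (my reading of the list).
visual = {
    "moving images": 1_207_776,
    "photographs": 12_536_764,
    "posters": 98_288,
    "prints and drawings": 545_347,
}

top_level_sum = sum(top_level.values())
visual_sum = sum(visual.values())
special_listed_sum = sum(special_listed.values())
# Special-collection items not covered by the subcategories quoted above.
unlisted_special = top_level["special collections"] - special_listed_sum

print("Top-level sum matches total:", top_level_sum == total_reported)
print("Visual sum matches subtotal:", visual_sum == special_listed["visual materials"])
print("Special items not itemized: ", f"{unlisted_special:,}")
```

Under that reading, the three top-level figures add up exactly to the reported total and the visual materials add up exactly to their subtotal, while about 1.7 million special-collection items fall outside the subcategories quoted (the list does say "including"). So the arithmetic holds together, but adding the numbers up tells us nothing about what was counted as an item or whether duplicate copies are in there.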

It would be a really good idea to discount large numbers. It is really difficult, if not impossible, to understand or even comprehend 141 million items. Large numbers do not mean anything if the items are not accessible. When was the last time you went to the Library of Congress?

1 comment:

  1. The words "item" and "collection" have specific meanings in library world and are not meant to be ambiguous. You can find definitions in the Online Dictionary for Library and Information Science at A library can have multiple copies of one item in its collection. As for accessibility, library collections now include many electronic resources, which do not require an on-site visit.