Some people eat, sleep and chew gum, I do genealogy and write...

Wednesday, February 1, 2012

How many records? A dilemma

Online digitized record repositories' claims regarding their collections are hopelessly muddled and create a dilemma for genealogists trying to compare the websites. At the root of the problem is the total lack of consistency in the use of such terms as record, file, document, names, individuals, collections, and many other similar terms. Unfortunately, users of the various sites sometimes judge the relative usefulness of the information based on the way the size of the database is expressed. This is especially true of family tree or user-contributed sites. Large is equated with good and useful even if this is not necessarily so.


Let's look at one site for an introductory example.

WeRelate.org is an extremely useful and focused site for displaying individual and family information. The wiki format allows for intensive sourcing and inclusion of media. I highly recommend the site. Now, what about their claims regarding records? Here is a statement from the WeRelate.org startup page, where WeRelate claims to be "the world's largest genealogy wiki with pages for over 2,153,200 people and growing." This is quite an impressive number until you look at the WeRelate.org Special:Statistics page of the wiki. Here is the quote from the page:

There are 6,111,311 total pages in the database. This includes "talk" pages, pages about WeRelate, minimal "stub" pages, redirects, and others that probably don't qualify as content pages. Excluding those, there are 2,876 pages that are probably legitimate content pages. (emphasis in the original).

What is a "legitimate content page" and how does that differ from the claim to "over 2,153,200 people"? Nowhere is the seeming discrepancy explained. What value are the "people pages" if they contain no content?


This issue of content is not at all unique to WeRelate.org (and I am not picking on that website at all, merely using it as an example). Take for example a comparison between two huge online genealogy giants: FamilySearch.org and Ancestry.com. A superficial look at the two sites would have you believing these two claims: FamilySearch.org claims 1,033 "collections" and Ancestry.com claims 30,554 collections. Are the two claims accurate and, if they are, do they reflect the relative size differences between the two databases?


It turns out that the two databases apply the term "collection" to their records in substantially different ways. Both websites use the term in a totally ambiguous way that gives the user little information about the amount of information on the website.



FamilySearch.org uses the term "collection" in a loose way to designate geographically related records created in a certain way. The term collection is used to refer to original source records as well as extracted records and indexes. So in one instance, the 1855 Alabama State Census is said to contain 34,978 records, but in this case the collection is an index, so it contains names, not records. It is certainly not clear what is meant by the term record when each index entry is counted as a record. In another example, the 1869 Argentina National Census is said to contain 1,799,773 records on 157,426 images. Apparently, the number of records refers to the entries on the Census records. But in another collection, such as the Argentina, Salta, Catholic Church Records, 1634 - 1972, there is no number for the records, just a reference to 144,293 images. So the total number of "collections" is arbitrary and meaningless. If you drill down into the records, especially those that have images only, you will find that some individual collections are made up of dozens of rolls of microfilm.


Again, I am not criticizing FamilySearch or anyone else, merely commenting on the vague and ambiguous nature of the designations. Why give a number if the number is meaningless?


Ancestry.com has the same issues as FamilySearch.org, but in most cases it is harder to penetrate the confusion. Collections in Ancestry.com are listed with a number of "records," but the number of records is not further defined as pages or individuals or whatever. One number sticks out: the number of records in Member Family Trees is claimed to be 1,838,295,985. Hmm. That is a really big number. How many unique individuals are represented by that number? For example, if I search for one of my ancestors in the Public Member Family Trees, take Henry Tanner for example, I find 57,210 instances of his name. Speculating, if I divide the total number of records claimed by Ancestry.com by the number of duplicate records for Henry Tanner, I get only about 32,000 entries. Most people are surely not duplicated as heavily as he is, so that figure is no doubt far too low, but what is the real number? How many duplicates are there? Isn't this the same problem I started out with on WeRelate.org? Only Ancestry.com does not bother to tell us how many records have content.
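
For anyone who wants to check the division in the paragraph above, here is a rough sketch in Python. The two figures are the ones quoted in this post; the function name and the extra duplication factors in the loop are my own assumptions, chosen only to show how much the answer swings with the guess about duplicates.

```python
# Figures quoted in this post (February 2012).
TOTAL_TREE_RECORDS = 1_838_295_985  # "records" claimed for Member Family Trees
HENRY_TANNER_HITS = 57_210          # search hits for one heavily duplicated ancestor


def implied_unique_people(total_records: int, copies_per_person: float) -> float:
    """Unique individuals implied if every person were duplicated equally often."""
    return total_records / copies_per_person


# Worst case: assume everyone is duplicated as often as Henry Tanner.
estimate = implied_unique_people(TOTAL_TREE_RECORDS, HENRY_TANNER_HITS)
print(f"Duplicated {HENRY_TANNER_HITS:,} times each: {estimate:,.0f} people")

# Hypothetical duplication factors (pure guesses, not from any website).
for copies in (10, 100, 1_000):
    people = implied_unique_people(TOTAL_TREE_RECORDS, copies)
    print(f"Duplicated {copies:,} times each: {people:,.0f} people")
```

The point of the exercise is not the result itself but how sensitive it is to an average duplication rate that no website publishes.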

The numbers of records claimed by both FamilySearch.org and Ancestry.com do not give us any idea of how many duplicate records there are for an individual. For example, my ancestor might appear in multiple family trees, but he may also appear in multiple records, all with exactly the same information, such as a death certificate and an index of deaths.
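
To make the difference between "records" and "individuals" concrete, here is a small sketch in Python. The entries are invented for illustration and do not come from any website, and matching on name plus birth year is only an assumption; real duplicate detection in genealogy is far harder than this.

```python
# Invented sample entries: two people spread across five "records".
entries = [
    {"name": "Person A", "born": 1850, "source": "family tree 1"},
    {"name": "Person A", "born": 1850, "source": "family tree 2"},
    {"name": "Person A", "born": 1850, "source": "death certificate"},
    {"name": "person a", "born": 1850, "source": "index of deaths"},
    {"name": "Person B", "born": 1872, "source": "family tree 1"},
]


def count_unique_individuals(records: list[dict]) -> int:
    """Count distinct people, treating same name and birth year as one person."""
    keys = {(r["name"].strip().lower(), r["born"]) for r in records}
    return len(keys)


record_count = len(entries)
individual_count = count_unique_individuals(entries)
print(f"{record_count} records, {individual_count} individuals")
```

A site could honestly report either number for this little data set, which is exactly why the bare totals tell us so little.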


The confusion in the terms "record" and "document" is even more dramatic. Fold3.com is an example of using all terms interchangeably. For example, Fold3.com has a list of "collections," claims to have 86,022,535 images, and 100,232,144 memorial pages. Fold3.com collections include American Milestone Documents and Matthew Brady Photographs among other collections. How do we compare the numbers to either Ancestry.com or FamilySearch.org? The simple answer is we can't.
Numbers don't lie, but they don't say much either.


Rather than take these numbers, no matter where they originate, with a grain of salt, perhaps we need a whole salt shaker.

1 comment:

  1. I guess, when it comes down to it, it's whether they have the records relevant to what you are searching for at that moment.
