Some people eat, sleep and chew gum, I do genealogy and write...

Saturday, September 21, 2013

A look at the collections of the big online databases

All four of the big online database providers, MyHeritage.com, Ancestry.com, findmypast.com and FamilySearch.org, refer to the content of their accumulation of indexes and sources as "collections." Can you compare the four different entities by looking at the number of their collections? This would be a rather simplistic and easy way to judge the value of their holdings, but it might also be entirely misleading and inappropriate. This is the case even though each of them uses the number of "collections," along with other measurements such as the number of "records," to promote the "value" of its holdings.

Unfortunately, there is no common definition, in this context, of any of the terms used by the various entities. I have written about the ambiguity of these terms in the past, but it is time to review them again. The terms used include the following:

Visits or users
The four companies seldom release information about their site visits. If they do, they may use the term "unique visitors." Use of this term is supposed to mean the number of different users who visit a site during a particular time period. If one person visits the site ten times, that is supposed to be counted as one unique visitor. But without a clear definition of visitors, repeated visits by one user may be counted as separate visitors. Unfortunately, the Internet metrics sites that measure and rank websites don't always share their definitions of the terms they use. Metrics sites may use identifiers such as cookies to differentiate between users, but that only works if the users have cookies enabled. Without a standard definition for these terms and a standard method of measurement, claims of the number of visits or users can be inflated.
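To make the ambiguity concrete, here is a minimal Python sketch using an invented visit log. The identifiers and the log itself are hypothetical and are not data from any of the four sites. It shows how the same traffic produces different totals depending on what is counted and on how visits without a cookie are handled.

```python
# Hypothetical visit log: each entry is a cookie-style identifier for one visit.
# None stands for a visit where no identifier was available (cookies disabled).
visit_log = ["a", "b", "a", "a", None, "c", None, "b"]

# Count 1: every visit, regardless of who made it.
total_visits = len(visit_log)

# Count 2: distinct visitors among the visits that could be identified.
unique_identified = len({v for v in visit_log if v is not None})

# Count 3: visits without an identifier cannot be merged, so a naive
# tally treats each one as a brand new visitor.
unidentified_visits = sum(1 for v in visit_log if v is None)
naive_unique = unique_identified + unidentified_visits

print(total_visits, unique_identified, naive_unique)  # 8 3 5
```

All three numbers describe the same eight visits, and any of them could be reported as the size of the audience.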

People, names or members
Although this term is also not usually defined by any of the four entities, it may include only those who have signed up for the service. There always seems to be some confusion about whether these figures reflect actual paid memberships in the subscription services or whether they also include statistics from the "free" portions of the websites. Of course FamilySearch.org is a free service, but it still has signed-in members. All of the entities have a "free" area that can be viewed without paying a membership fee or signing into the website. It is usually unclear whether the claimed visits or people using the site are paid or registered members or merely casual viewers. Since there is no consistent usage or definition, comparisons between the sites based on the number of members or people are unreliable.

In addition to referring to the number of users of the site, the terms "people" or "names" are used to indicate the number of individual entries in various records. The ambiguity here is whether or not a record such as a death certificate contains one name, the deceased, or six or more names including the attending physician.

Records
This is another commonly used term. For example, in a collection of newspapers, is each newspaper edition a record or each page as digitized? Is each line of the U.S. Census a separate record or each page or each enumeration district or each set of districts filed together on one microfilm roll? Do records equate to the number of names? What about duplicates? For example, Ancestry.com claims it has a collection called Public Member Trees with 2,147,483,647 records as of the date of this post. How many of those "records" are duplicates? I know that some of my ancestors have hundreds of duplicate entries. How do the duplicates affect the total number of records?
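A small Python sketch, using made-up tree entries (the names and years are hypothetical), shows how duplicates inflate a record count. It also checks one property of the figure quoted above: 2,147,483,647 is exactly 2^31 - 1, the largest value a signed 32-bit integer can hold, so the number may reflect a counter limit rather than an actual tally. That is an inference from the arithmetic, not something Ancestry.com has stated.

```python
# Hypothetical member-tree entries as (name, birth year). Not real data.
entries = [
    ("John Smith", 1820),
    ("John Smith", 1820),
    ("Mary Jones", 1845),
    ("John Smith", 1820),
    ("Mary Jones", 1845),
    ("Ann Lee", 1790),
]

claimed_records = len(entries)       # every entry counted as a record
distinct_people = len(set(entries))  # exact duplicates collapsed

print(claimed_records, distinct_people)  # 6 3

# The quoted record count equals the signed 32-bit integer maximum.
quoted = 2_147_483_647
print(quoted == 2**31 - 1)  # True
```

Real duplicates are rarely exact copies, so a simple set like this would undercount them. The point is only that the claimed total and the number of distinct people can differ widely.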

Collections
This is probably the most ambiguous term of all. In any one of the four databases, the word collection can mean a file with 2 billion names or a file with fewer than 100 names. As an extreme hypothetical example, a database might contain 32,000 collections of 100 records each or 32,000 collections of a billion records each. Obviously, this doesn't happen, but it shows that the number of collections says almost nothing about the number of records. And how are we to judge whether there are a large number of duplicates?
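The arithmetic behind that hypothetical can be written out in a few lines of Python. The numbers are the ones from the example, not figures from any real database.

```python
collections = 32_000

small_total = collections * 100            # 100 records per collection
large_total = collections * 1_000_000_000  # a billion records per collection

print(f"{small_total:,}")          # 3,200,000
print(f"{large_total:,}")          # 32,000,000,000,000
print(large_total // small_total)  # 10000000
```

The collection count is identical in both cases, yet the record totals differ by a factor of ten million.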

Now to give some numbers. Here is a summary of the claims as of the date of this post:

Ancestry.com claims 31,378 collections, the largest of which has 2,147,483,647 records and the smallest only 1 record. Two-thirds of those collections have fewer than 500 records and somewhat more than half have fewer than 250 records. Over 10,000 of the collections have about 100 records or fewer.

FamilySearch.org claims 1,648 collections. It is very difficult to determine either the largest or the smallest collection because there is no way to know how many records are in the collections that have yet to be indexed except to click on each one, one at a time. But I did find one collection with 93,331,370 records and one collection with 22 records.

MyHeritage.com's WorldVitalRecords claims a total of 23,208 collections. It appears that they are listed in order by size and so near the end of the list there are collections with no records at all listed.

findmypast.com has a list of "records" but has no numbers associated with the list.

There is no doubt that these are massive databases. It is also apparent that there is no way to judge the relative size of the entries they claim as collections or records. What it comes down to is that either they have the record you are looking for or they do not. In any event, you should take the claimed numbers with a grain of salt.



1 comment:

  1. Re: "Ancestry.com claims it has a collection called Public Member Trees with 2,147,483,647 records..."

    I would question whether their count really happens to be exactly 2**31 - 1, James.
