
Thursday, September 24, 2015

Plumbing the Depths of Online Record Collections -- Part Two

One of my favorite books of all time is the following:

Dunning, Stephen, Edward Lueders, and Hugh Smith. Some Haystacks Don’t Even Have Any Needle: And Other Complete Modern Poems. Glenview, Ill.: Scott, Foresman & Company, 1969.

The title of the book comes from one of the poems. Here the point made by the title applies to the vast online genealogical database companies and is a serious consideration. How many of the huge online databases are missing exactly those records you need to find your ancestors? Do you sometimes feel like you are looking for a needle in a haystack when you face the huge database programs?

Let me ask a few more questions about these huge programs to get started in plumbing their depths.
  • How many of the huge online genealogical databases have complete copies of the United States Federal Census Records?
  • What percentage of the records on each of the websites is duplicated on other similar websites?
  • How many of the records on the large database websites duplicate other similar records on the same website?
  • What percentage of the records on these websites are not source records, but user contributed copies?
These questions point out some significant limitations in relying on the numbers supplied by any one of the websites. I am not picking on any particular site; this is a general issue with all of them. In fact, this issue extends to websites that claim far fewer collections and records. I certainly see the need for a fair degree of redundancy. It is comforting to know that records such as the United States Federal Census are available from several sources online, but when the duplicates start being used to puff up the total numbers, that becomes a concern.
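To make the concern about inflated totals concrete, here is a small sketch of the arithmetic. The collection names and the two "sites" are made up for illustration and do not describe the holdings of any real website; the point is only that adding up advertised totals counts shared collections twice.

```python
# Hypothetical collection lists; not the real holdings of any website.
site_a = {"1850 US Census", "1860 US Census", "NY Passenger Lists", "User Trees"}
site_b = {"1850 US Census", "1860 US Census", "England Parish Registers"}

shared = site_a & site_b                      # collections both sites hold
combined_claimed = len(site_a) + len(site_b)  # what adding the totals suggests
combined_distinct = len(site_a | site_b)      # distinct collections overall
overlap_pct_a = 100 * len(shared) / len(site_a)

print(f"Claimed total: {combined_claimed}")
print(f"Distinct collections: {combined_distinct}")
print(f"Share of site A also on site B: {overlap_pct_a:.0f}%")
```

The same comparison could be run on record counts rather than collection names, if a website published them in enough detail.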

One fact is clear. The total number of original source records being digitized and put online continues to soar. Millions of new, previously unavailable records are being added every day. It is also clear that records are being added from areas around the world that were not previously represented online.

In genealogy, redundancy is absolutely necessary. It is very seldom that complete information about an individual is available from one record. For example, it is naïve to assume that a recorded birth date is accurate without some additional corroboration. Unsophisticated genealogists have a tendency to rely on the reporting of a single event in creating their genealogical view. This narrow focus on a single record engenders an atmosphere of uncertainty. This is especially true in situations that involve distinguishing individuals with similar names.

In most cases, the larger websites provide a window into their inner workings. Usually, this is in the form of a catalog. For example, Ancestry.com has a "Card Catalog." This card catalog lists all of the "collections" on the website individually. In addition, Ancestry.com and some of the other websites provide a method of filtering the list of "collections" to show what is available in any geographic area or on a specific topic. If you would like to know how the various websites compare in their specific holdings, I suggest a close examination of their catalogs. The listing of collections on each of the large websites is usually designated either as a catalog or as a place where you can search or view all of the collections. Sometimes it takes some searching to find the list of all the resources.

Size is far from the only concern with large online genealogical databases. Whatever the size of the database, the quality of the search engine is far more important. One of the most persistent complaints about all of the websites is the apparent lack of responsiveness of the searches. The user enters a search term, such as the name of an ancestor, and the program returns results that vary considerably from what was expected. It is usually the case that the results vary wildly as to geographic areas and time periods. For example, if I search for John Jones in New York in 1850, I do not expect to see John Jones in California in 1900 or even John Jones in England in 1640.

Part of this apparent unresponsiveness of the search engines is actually designed into the program. The process or set of rules implemented by the developers and programmers, often referred to as an algorithm, provides for wider responses if the main search terms are not met. So if the program cannot find "John Jones" in New York in the time period specified by the search, the program will default to providing any John Jones that appears similar. Because of the content of the databases, the results may appear random. Some of the larger databases have tried to avoid this problem by separating the results of the searches into categories, either by awarding each result a rating from 1 to 5 stars or by actually sorting the results into different categories depending on their perceived reliability. The fact that the search turns up a variety of responses reflects the reality that these results are exactly what the programmers anticipated.
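The widening behavior described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the actual algorithm of any website: the `Record` type, the `search` function, the five-year window, and the 5/3/1 star rule are all hypothetical. The name must match, while place and year are allowed to miss, and each result is rated by how many of the search terms it satisfies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    place: str
    year: int


def search(records, name, place, year, window=5):
    """Return (stars, record) pairs, best matches first.

    Hypothetical rating: 5 stars when name, place, and year all match,
    3 stars when the name and one of place/year match, 1 star for name only.
    """
    results = []
    for rec in records:
        if rec.name.lower() != name.lower():
            continue  # the name is the one term this sketch never relaxes
        place_ok = rec.place.lower() == place.lower()
        year_ok = abs(rec.year - year) <= window
        if place_ok and year_ok:
            stars = 5
        elif place_ok or year_ok:
            stars = 3
        else:
            stars = 1
        results.append((stars, rec))
    # The sort is stable, so records keep their original order within a tier.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Made-up sample records for illustration only.
records = [
    Record("John Jones", "California", 1900),
    Record("John Jones", "England", 1640),
    Record("John Jones", "New York", 1852),
    Record("Mary Smith", "New York", 1850),
]

for stars, rec in search(records, "John Jones", "New York", 1850):
    print(stars, rec.place, rec.year)
```

With this sample data, the New York record comes back first with five stars, but the California and England records are still returned with one star each. That is the "random-looking" tail of results: the program is doing what it was written to do.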

If you want to get an idea of the accuracy of any given search engine, just search for something that you know is already in the database. You might be surprised at the results.

Another important factor in this accumulation of digitized documents is the quality of the indexing. Since accurate handwriting recognition is still out of reach and optical character recognition is also not perfect, the online documents have to be indexed by people, one letter at a time. Everyone who has worked with old scripts and bad handwriting knows the challenge of accurate indexing. So amassing huge collections of scanned documents may make for easier access than seeing the same documents on a roll of microfilm, but without accurate indexing the advantage stops there.





1 comment:

  1. Do you ever sleep? How do you find time to keep up with so many different topics of immense interest?
