Friday, December 6, 2013

Analyzing mistakes in family trees -- Part Two

The premise of this series of posts is that online user-contributed family trees are essentially full of duplicated misinformation and errors. These online trees are uniformly disparaged by genealogical researchers, but efforts to explain or qualify the errors are mostly missing from the discussions. As I mentioned in Part One, it is way too easy to simply dismiss the online family trees as irrelevant to "serious research" and only valuable as possible leads to "real research." I wanted to begin a discussion about whether the causes of the deficiencies can be quantified and whether there are corrective measures that could be implemented, either from a programming standpoint or through efforts by the genealogical community.

In the context of this second post in the series, I would remind readers that I am not directing my comments at any particular online family tree program. I see the deficiencies in every one of the programs, more in some, less in others, but uniformly present. I started by discussing the impact of copied family trees and the perpetuation of errors through copies made without any review or correction by the new user/copier.

To start the present analysis, I used a widely available family tree program and searched its family trees using the name of my great-grandfather, Henry Martin Tanner. I chose him as the basis for the search because I am already well aware of the number of family trees in which he is incorporated and the extent of the misinformation. Here are some observations from the search.

The program (not further identified) was searched with the following information:

  • Name: Henry Martin Tanner
  • Birth: 1852
  • Location: California, USA

Let me emphasize: the only really significant issue involving records of this individual is the name of the county where he was born. The county is commonly recorded as San Bernardino County, but that county was not in existence at the time of his birth, so the birthplace should be recorded as Los Angeles County. All of the rest of the correct information is extremely easy to obtain and is recorded in dozens of freely available original sources, as well as books, online articles, blogs and other sources.

The search returned 132,221 results, so I added some additional information to cut down the extraneous returns, if possible. I did not indicate an "exact" search because I wanted to see the non-complying entries. I added his spouse's first name as Eliza. That addition had no effect on the results, so I added a more specific location, San Bernardino, California, USA, without specifying whether it was the county or the city. This addition only reduced the total results by one. I decided to leave the "false positives" in and simply ignore them, since they illustrate the problem with the search engine. I might mention that none of the entries had more than one source listed. It occurred to me that one easy fix for this problem would be for the family tree programs to prioritize those entries with sources.
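To make the idea of prioritizing sourced entries concrete, here is a minimal sketch of how search results might be ordered by the number of sources attached. The `TreeEntry` class, its fields, and the `rank_by_sources` function are all hypothetical names invented for illustration; they do not describe how any actual family tree program works.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeEntry:
    """Hypothetical search result: one person record from one user's tree."""
    tree_id: str
    name: str
    sources: List[str] = field(default_factory=list)


def rank_by_sources(entries: List[TreeEntry]) -> List[TreeEntry]:
    """Return entries ordered so those with the most sources come first.

    sorted() is stable, so entries with the same number of sources
    keep the order in which the search engine returned them.
    """
    return sorted(entries, key=lambda e: len(e.sources), reverse=True)


if __name__ == "__main__":
    results = [
        TreeEntry("tree-a", "Henry Martin Tanner"),
        TreeEntry("tree-b", "Henry Martin Tanner", ["source 1", "source 2"]),
        TreeEntry("tree-c", "Henry M. Tanner", ["source 1"]),
    ]
    for entry in rank_by_sources(results):
        print(entry.tree_id, len(entry.sources))
```

Counting sources says nothing about their quality, so a ranking like this would only be a first step.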

There was some discussion in the comments to the last post that sources were not an indication of accuracy. That is likely very true, but I will have to take up that issue in a subsequent post.

Here is the tally by type of error. (Note: if an entry had more than one error, I counted each error, so the numbers are greater than the total number of entries.)

Name wrong  xx
Name incomplete xx
No birth location xxx
Wrong birth date x

The remaining entries were false positives and useless. Oops, then I realized I had put the birth date into the death date field. I got returns with this wrong search info. Interesting.

Next, I ran the search with the correct information. This time I got 88,755 returns. Of course, I did not go through all of them.

Correct entry xxxxxxxxxxxxxxxxxxxxxx
Wrong county xxx
No birth location xxxxxxxxxxxxxxxxxxx
Incomplete place information xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Wrong name x
Death information wrong x
Incomplete name x

It looks like the trends here are going to continue for a really long time and that the incorrect entries are going to far outnumber the correct ones. Almost none of these entries had more than one source listed and the source was uniformly another family tree. What is strange is that the correct information was certainly available from some of the family trees.

It appears that some programming features that already exist could correct the vast majority of these problems. There are already programs such as Rootsmagic.com that tell the user when they enter a county for a date outside the period when the county existed. The programs could also easily require county information if a date and state were present. The basic question is whether to allow online entries that are incomplete as to one or more of the places where events are listed. Would it be too much to expect that, when an entry is made in a new family tree, duplicate entries in other trees would be checked and questions raised about conflicting information or the correctness of place information? Is uploading GEDCOM files the root cause?
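The two checks suggested above could look something like the following minimal sketch. It assumes a small lookup table of county creation years; the table, the `BirthEntry` class, and the `check_birth_place` function are hypothetical and are not taken from any existing program. A real implementation would need an authoritative database of county boundary history.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative lookup only: (county, state) -> year the county was created.
# A real program would need a complete, authoritative table.
COUNTY_CREATED = {
    ("Los Angeles", "California"): 1850,
    ("San Bernardino", "California"): 1853,
}


@dataclass
class BirthEntry:
    """Hypothetical birth record as a user might type it into a tree."""
    name: str
    year: Optional[int]
    county: Optional[str]
    state: Optional[str]


def check_birth_place(entry: BirthEntry) -> List[str]:
    """Return a list of warnings; an empty list means no problem was found."""
    warnings = []
    # Check 1: ask for a county whenever a date and a state are present.
    if entry.year is not None and entry.state and not entry.county:
        warnings.append("County is missing although a date and state are present.")
    # Check 2: flag a county that did not yet exist in the year of the event.
    if entry.year is not None and entry.state and entry.county:
        created = COUNTY_CREATED.get((entry.county, entry.state))
        if created is not None and entry.year < created:
            warnings.append(
                f"{entry.county} County did not exist in {entry.year} "
                f"(created {created})."
            )
    return warnings


if __name__ == "__main__":
    entry = BirthEntry("Henry Martin Tanner", 1852, "San Bernardino", "California")
    for warning in check_birth_place(entry):
        print(warning)
```

Whether such warnings should block an entry or merely prompt the user is a design choice; the sketch only reports them.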

On to Part Three.
