Sunday, September 27, 2015

Genealogy and Probabilistic Record Linkage

Record linkage, also called data linkage, is one of the critical issues in genealogical database construction. Now, before you nod off to sleep about this topic, let me point out that this is exactly what genealogists do every day they do research. As genealogical researchers, we are immersed in data cleaning, removing duplicates, merging individual-level datasets, and other record linkage activities.

How do you go about recognizing that two records in two different files represent the same person? Additionally, how do you go about recognizing duplicate individuals within the same program? Here are some examples of why I am writing about this subject:


This screenshot shows the results of searching for a duplicate for a person named John Bryant identified by a Person Identifier of LHP9-RZP. This is from the FamilySearch.org Family Tree. How do I determine which, if any, of these suggested duplicates are actually duplicates? Could I design a computer program that would determine the correct solution to this problem? The limitation in designing such a program for genealogy lies in the details present in the records. There are several possible issues with original historical records:

  • The information in the record may be complete and accurate
  • The information may be incomplete and accurate
  • The information may be incomplete and only partially accurate
  • The information may be incomplete and inaccurate
Researchers are faced with this challenge every time they find a name in a record and have to decide whether or not the found record should be included, in whole or in part, into an existing database. 

Let me put this into a hypothetical situation. Suppose that I am doing research to find information about an ancestor and I find an entry in a parish register. I have several variables that need to be considered, such as the following:
  • Spelling variations
  • Variations of dates and places
  • Variations in the identity of associated individuals
Depending on the amount of information in the original record, there could be many other variables as well. Probabilistic record linkage assigns a degree of likelihood to a possible link based on which of those factors agree or disagree with what we consider to be accurate. I should note that this issue is usually discussed in the context of managing large databases. Because there are so many possibilities for error, this process is really at the core of the accuracy of genealogical research. The challenge is whether a computer program can be designed to make these determinations accurately.
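To make the idea of weighing agreements and disagreements concrete, here is a minimal sketch in Python of one common way to score a candidate link, in the style of the Fellegi-Sunter model described in the record linkage literature. Everything specific in it is an assumption made for illustration: the field names, the m and u probabilities, the 0.8 name-similarity threshold, and the one-year tolerance on birth years are invented, not values used by FamilySearch or any other program.

```python
import math
from difflib import SequenceMatcher

# Hypothetical probabilities for each field (illustrative values only):
#   m = chance the field agrees when two records ARE the same person
#   u = chance the field agrees when two records are NOT the same person
FIELD_PROBS = {
    "surname": (0.90, 0.05),
    "given_name": (0.85, 0.10),
    "birth_year": (0.80, 0.02),
    "parish": (0.75, 0.15),
}


def names_agree(a, b, threshold=0.8):
    """Treat two names as agreeing if they are similar enough,
    which tolerates spelling variations such as Bryant/Briant."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def field_agrees(field, a, b):
    """Decide agreement for one field using a field-specific rule."""
    if field in ("surname", "given_name"):
        return names_agree(a, b)
    if field == "birth_year":
        return abs(a - b) <= 1  # allow a one-year discrepancy
    return a == b


def match_weight(rec_a, rec_b):
    """Sum log-likelihood weights over the fields both records contain.
    Agreement adds log2(m/u); disagreement adds log2((1-m)/(1-u)).
    Fields missing from either record are skipped, not penalized."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        if field_agrees(field, a, b):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total
```

A higher total suggests a more likely link and a negative total suggests different people. Where to draw the line between "link," "not a link," and "needs a human to look at it" is still a judgment call, which is exactly the researcher's problem described above.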

It is apparent that some online genealogical database programs have achieved a high degree of accuracy, at least in the area of finding records that match those individuals in a particular family tree. The degree of accuracy depends heavily on the amount of information present in the original record and the accuracy of the information already in the family tree.

For this process to work properly, it is absolutely necessary to go through any database and clean up the data. Returning to my example above involving John Bryant, here is an example of the information that is presently in the Family Tree.


If you examine this data closely, you'll see some anomalies. First, the birth and christening dates are the same. Second, the death and burial dates are the same. This could happen, but it is unlikely. We should also note that there are three alternate names, each of which is designated as a "Birth Name." One of the names is "Thomas Bryant." It is very likely that this is not the same person. To understand why these birth names exist, it is important to review the history of the entire program. Without doing so, I can only conclude that someone has made a mistake. If I delete all three of these "birth names," what will be the consequences? One of the three options appears to be a spelling variation. This could indicate that the information contained in the "Vital Information" section is inaccurate. In this particular case, there are a number of sources listed. Can these questions be answered by examining the sources?

If we focus on the dates involved in this particular example, we realize that spelling variations in names should be expected rather than treated as the exception. The real question is whether all of the considerations that go into resolving the apparent problems with the data in this entry could be programmed into a computer. Haven't we really gotten to the point where we need some additional information? In this particular record, that question certainly arises when we examine the sources and find that one of the children, Sarah Bryant, was born after her father died. By the way, all of the source records show the spelling of the surname as Briant.
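Some of these considerations are easy to program. As a small sketch, the check below flags any child whose birth date falls well after the father's death. The function name, the 300-day allowance for a child born posthumously, and the dates in the usage lines are all assumptions for illustration; they are not taken from the John Bryant entry. The sketch also ignores calendar changes and partial dates, which real historical records often require handling.

```python
from datetime import date, timedelta

# Assumed allowance: a child can be born some months after the father's death,
# so only births later than this window are flagged.
POSTHUMOUS_ALLOWANCE = timedelta(days=300)


def children_born_too_late(father_death, children):
    """Return the names of children born after the father's death date
    plus the allowance. `children` is a list of (name, birth_date) pairs."""
    cutoff = father_death + POSTHUMOUS_ALLOWANCE
    return [name for name, born in children if born > cutoff]


# Usage with invented dates:
flagged = children_born_too_late(
    date(1780, 1, 1),
    [("Child A", date(1780, 6, 1)), ("Child B", date(1782, 3, 1))],
)
print(flagged)
```

A flag like this only tells you that something is wrong. It cannot tell you whether the error is in the father's death date, the child's birth date, or the link between the two, and answering that still takes the sources and a researcher's judgment.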

In the context of a family tree, even before I begin making any corrections to this record from back in 1730, I need to correct the information for more recent individuals so that I have some assurance that I am actually related to this individual. Far too many genealogical researchers rely upon information they have inherited from others. We must also remember that in this particular case, there were three potential duplicates. Any potential match between this individual and a record depends entirely upon the accuracy of the information already in the family tree. Presently, none of the larger online databases or individual genealogical database programs provides "pruning" tools that show you exactly where your family tree ceases to be accurate.

The FamilySearch Family Tree attempts to do this with icons indicating grossly inaccurate information, but ultimately correcting the data relies upon the individual judgment of the researchers.

If you would like to do some reading about probabilistic record linkage, here are a few references:

Australian Bureau of Statistics. Assessing the Quality of Linking School Enrolment Records to 2011 Census Data: Deterministic Linkage Methods, Dec 2013 Research Paper. Canberra: Australian Bureau of Statistics. http://www.abs.gov.au/ausstats/abs@.nsf/cat/1351.0.55.045.

Batini, Carlo, and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Berlin; New York: Springer, 2006.

Dong, Xin Luna, and Divesh Srivastava. Big Data Integration, 2015. http://dx.doi.org/10.2200/S00578ED1V01Y201404DTM040.

Fair, Martha, Statistics Canada, and Canadian Perinatal Surveillance System. Validation Study for a Record Linkage of Births and Infant Deaths in Canada. Ottawa: Statistics Canada, 1999. http://www.statcan.ca/cgi-bin/downpub/listpub.cgi?catno=84F0013XIE.

Herzog, Thomas N, Fritz Scheuren, and William E Winkler. Data Quality and Record Linkage Techniques. New York: Springer, 2007.

Machado, Carla Jorge. Early Infant Morbidity and Infant Mortality in the City of São Paulo, Brazil: A Probabilistic Record Linkage Approach, 2002.

Machado, Carla Jorge, and Kenneth Hill. Probabilistic Record Linkage and an Automated Procedure to Minimize the Undecided-Matched Pair Problem = Relacionamento Probabilístico de Dados e um Procedimento Automático para Minimizar o Problema da Incerteza no Pareamento de Registros. [Rio de Janeiro]: SciELO, 2004. http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=178144.

Newcombe, Howard B, John D Abbatt, and Eldorado Resources Ltd. Probabilistic Record Linkage in Epidemiology: Computer Methods for Searching Death or Cancer Files Yield Risk Data for Large Cohorts. Ottawa, Ont.: Eldorado Resources Ltd., 1983.

Statistics Canada, and Statistics Canada International Symposium on Methodological Issues. Symposium 2010 social statistics: the interplay among censuses, surveys and administrative data : proceedings = Symposium 2010 : statistiques sociales : interaction entre recensements, enquêtes et données administratives : recueil. [Ottawa]: Statistics Canada = Statistique Canada, 2011.
