
Sunday, September 28, 2014

Is Genealogy Big Data?

"Big Data" is a new jargon term for computer programming and technology approaches to handling massive amounts of information. Here is one definition of "Big Data" from Wikipedia:
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. 
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."
As genealogists, we like to think that we are good at what we do: finding ancestors. In fact, this process, we call genealogical research, of finding, evaluating and recording the information found about our ancestors could be done by computer programs. The programs currently available from the large online databases, such as MyHeritage.com and Ancestry.com, closely followed by FamilySearch.org, have already demonstrated that computer programs can find sources more efficiently and at least as accurately as any human researcher. In effect, they are tackling the issues of "Big Data" as they apply to genealogy. 

The main obstacle to computer programs completely overtaking humans is in deciphering handwritten documents. But once the "indexing" process is done, the computers can and will take over.

Human researchers, when confronted with all of the genealogical information online, point out that the online programs "don't have all of the sources." I would add the word "yet." Even though significant amounts of genealogical data are locked up in paper (and other media) around the world, that situation is changing rapidly. 

We only have to look into the future a short time to see that the domination of paper will change. There are presently more than 7.2 billion people on the earth. The percentage of Internet usage varies from over 96% in countries such as Iceland, to somewhat lower numbers in developing countries. The total number of Internet users worldwide is over 2.8 billion. If you think of the average family size, you can see we are almost at the saturation point. 
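The arithmetic behind that last remark can be sketched roughly as follows. The population and user counts are the ones quoted above; the household sizes are purely hypothetical values chosen for illustration, and the household figure is a best-case upper bound that assumes every Internet user lives in a different household (in reality users cluster, so the true reach is lower).

```python
# Figures quoted in the post (2014).
WORLD_POPULATION = 7.2e9
INTERNET_USERS = 2.8e9


def share_of_people_online(users: float, population: float) -> float:
    """Fraction of individuals who are Internet users."""
    return users / population


def max_share_of_households_reached(
    users: float, population: float, household_size: float
) -> float:
    """Best-case fraction of households with at least one Internet user.

    Assumes each user lives in a different household, so this is an
    upper bound rather than an estimate. The result is capped at 1.0.
    """
    households = population / household_size
    return min(1.0, users / households)


if __name__ == "__main__":
    people = share_of_people_online(INTERNET_USERS, WORLD_POPULATION)
    print(f"Individuals online: {people:.1%}")

    # Hypothetical average household sizes, not taken from any source.
    for size in (2.0, 2.5, 3.0, 4.0):
        reach = max_share_of_households_reached(
            INTERNET_USERS, WORLD_POPULATION, size
        )
        print(f"Household size {size}: at most {reach:.1%} of households")
```

Under these assumptions, fewer than four in ten individuals are online, yet the number of users is already about as large as the number of households once the assumed household size reaches roughly 2.6 people. That is the sense in which family-level coverage could be close to saturation.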

What this means is that in the future, there will be no need for "research" about families. All of the family history data about each one of us globally will, for all intents and purposes, be readily available. Right now, if the average genealogist who is in a developed country, or whose ancestors were in a developed country, signs up for Ancestry.com or MyHeritage.com, they can expect the programs to find source records and automatically build a pedigree back two or three, and perhaps four, generations. All the user has to do is confirm the matching record hints. If you think about what will happen two or more generations into the future, you will see that our descendants will automatically have four, five or many more generations provided to them by the computer programs. 

At the same time, these huge databases will keep gobbling up the records of the past at a huge rate. As traditional genealogists, we are fixated on the human-based, purely mechanical process of gathering family history data. We talk endlessly about reasonably exhaustive searches and proof and other issues. All the time, these issues are vanishing right before our eyes. Much of what current genealogists do is duplicate research that has already been done in the past. To the extent that computer systems allow us to avoid this duplication, our efforts can be directed to those areas that really need research. 

Let's suppose that FamilySearch.org, for example, solves the problems of duplication in the Family Tree program. Let's further suppose that the partnership with Ancestry.com, MyHeritage.com and findmypast.com is successful in opening the FamilySearch records to millions more users. Let's further suppose that FamilySearch finishes digitizing all of its existing records going back to 1938. Let's also suppose that the rate of digitization of records increases, with attention being paid to smaller and smaller collections. In addition, let's suppose that a way is created for people who have collections of records in their possession to share those records online. Let's go even further and suppose that there are millions upon millions of people involved in this process, not just the tiny number involved today. 

Do you really think you can avoid this inevitable process? There will still be dead ends and brick walls in the past. But they will be real missing data, not just a failure to do adequate research. Don't underestimate the impact of this process. 

6 comments:

  1. Thanks for this blog post. I think the growing digitization of records - worldwide - is exciting for the future of genealogy and records research. The remaining challenge, I think, will be maintaining a focus on collecting and telling the human stories of our ancestors. For me, that's what adds life to family history and makes it so meaningful.

    Replies
    1. You raise an interesting point. But unless those stories are recorded in some format, they will ultimately be lost. If they are recorded, then they will become part of the Big Data.

  2. Your blog assumes that the "source" has been uploaded to a computer to allow computers to find it.
    No computer can find "sources" that have not been digitised.

    As for "...have already demonstrated that computer programs can find sources more efficiently and at least as accurately as any human researcher", you must be joking. OCR software is still not 100 percent accurate, and is easily thrown when confronted with bleed-through, smudges etc.
    That is why indexing is still carried out by humans, though it has drastically improved over the last ten years.
    I would admit that many of the bulk data sets such as civil registers and censuses are quickly digitised by big companies, but the bulk of more local records will not be digitised in the next 50 or even 100 years, and some never will be digitised.
    Cheers
    Guy

    Replies
    1. You raise some issues that I will address in a subsequent blog post. I will be a little more specific in my answer in the subsequent post.

  3. James, a very thoughtful look forward.

    But "In fact, this process, we call genealogical research, of finding, evaluating and recording the information found about our ancestors could be done by computer programs."

    The "evaluating" part I think is questionable. Do you anticipate that algorithms in the future can correlate groups of kin and associates with geographic areas over time, can interpret motivations, can draw conclusions from deeds, estate inventories and sale bills, and a host of other analytical elements that presently require a human brain?

    Replies
    1. I think we need more discussion on the issue of evaluating documents and what is meant by that. I will think about that subject and will likely have more to say.
