Saturday, June 11, 2016

Genealogy enters the Information Age: A New Paradigm

My recent experiences in working out the conglomeration of entries in the Family Tree have gotten me thinking again about the relationship between genealogy, as an activity directed at information, and the fundamental changes affecting our entire world society because of the amount of information available to an ever-increasing number of people throughout the world. There are several factors that are contributing to the reformation of the previous paradigm experienced by generations of genealogical researchers. The shift in the availability of information of all kinds, including historical, genealogically valuable information, has realigned the more historic distribution of information processors and concentrated the tools and skills in a very small group of researchers.

Over the past hundred years or so, there has been a continuous shift in the way information is processed and distributed. This is an obvious fact that cannot be ignored. Unfortunately, the result of this inexorable shift is that the tools that provide access to this information in a meaningful way have become concentrated in a very small number of people who are in a position to exploit both the tools and the dramatic increase in available information. We are generally aware that, as genealogists, we have access to much more information than our predecessors, but that is only a part of what is actually happening. Availability does not equal usability. Having more information available does not mean that it is equally accessible to everyone.

Normal distribution is illustrated by the familiar "bell curve" diagram:
By Dan Kernler - Own work, CC BY-SA 4.0
This type of diagram illustrates the following point, quoted from Wikipedia: Normal distribution: "In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known." Further quoting from Normal Distribution:
Psychological research involves measurement of behavior. This measurement results in numbers that differ from one another individually but that are predictable as a group. One of the common patterns of numbers involves most of the measurements being clustered together near the mean of the distribution, with fewer cases occurring as they deviate farther from the mean. When a frequency distribution is drawn in pictorial form, the resulting pattern produces the bell-shaped curve that scientists call a normal distribution.
We have become accustomed to viewing genealogy in a "bell curve" fashion. Although interest and participation have never been adequately measured, it is commonly believed that "genealogy is one of the most popular of hobbies (i.e. human leisure activities)." The distribution of interest and participation in genealogy would thus be expected to follow the standard bell curve. We move from those individuals on the left side of the diagram who are not at all interested in genealogy into the perceived large group of individuals who are interested in and involved in genealogy and then into the area of the very few individuals who are extremely involved, i.e. professionals and highly skilled participants. This perception of the distribution of interest is based on the point of view that interest equals activity.

But as I pointed out, the vast increase in information changes the paradigm. Availability does not equal usability. Why is this the case?

My experience over the past few days illustrates the paradigm shift exactly. As is now generally acknowledged, the distribution of information is not conforming to the normal bell curve. The type of distribution today is more usually illustrated by the Pareto Distribution curve.

By Danvildanvil - Own work, CC BY-SA 3.0
What is happening is that both the access to information and the tools to use that access are becoming more and more concentrated in a very few individuals.

What does this mean for genealogists and those "interested in family history"? As simply put as possible, it means that yes, there is an overwhelming amount of genealogically useful information becoming available online, but at the same time, we are also becoming overwhelmed by the accumulation of information from the past. A program such as the Family Tree would not have been possible just a short time ago. Several factors that make such a program possible have been developed only very recently.

  • Computer systems had to develop to the extent that they could handle databases with billions of names and further billions of records
  • The Internet had to develop to the point where such amounts of information could be stored and distributed efficiently to a large number of people
  • Individual access to computers and data processing had to also become available 
  • There had to be individuals who could "take advantage" of these developments and use the information
It is not a coincidence that very few people have an extensive background in computer technology, a background in history and languages and also have access to the fastest consumer computers available with a huge memory and the fastest Internet connections available in the world. How many of those people have also spent their time learning about genealogical records and doing research?

It would be a utopian world if the results of this concentration of knowledge were following the expected bell curve. But reality illustrates the opposite. Here is a quote that illustrates what is happening from Wikipedia: Pareto distribution.
Pareto originally used this distribution to describe the allocation of wealth among individuals since it seemed to show rather well the way that a larger portion of the wealth of any society is owned by a smaller percentage of the people in that society. He also used it to describe distribution of income.[8] This idea is sometimes expressed more simply as the Pareto principle or the "80-20 rule" which says that 20% of the population controls 80% of the wealth.[9] However, the 80-20 rule corresponds to a particular value of α, and in fact, Pareto's data on British income taxes in his Cours d'économie politique indicates that about 30% of the population had about 70% of the income. The probability density function (PDF) graph at the beginning of this article shows that the "probability" or fraction of the population that owns a small amount of wealth per person is rather high, and then decreases steadily as wealth increases. (Note that the Pareto distribution is not realistic for wealth for the lower end. In fact, net worth may even be negative.) This distribution is not limited to describing wealth or income, but to many situations in which an equilibrium is found in the distribution of the "small" to the "large".
I have left in the footnotes for reference. What is happening in genealogy is exactly consistent with this type of analysis. Those who can adequately process the huge amounts of data are in a very small minority. From the same article, here is a list of the types of activities that are seen as Pareto-distributed. I would suggest that genealogy now falls directly into this paradigm.
  • The sizes of human settlements (few cities, many hamlets/villages)[10]
  • File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)[10]
  • Hard disk drive error rates[11]
  • Clusters of Bose–Einstein condensate near absolute zero[12]
  • The values of oil reserves in oil fields (a few large fields, many small fields)[10]
  • The length distribution in jobs assigned supercomputers (a few large ones, many small ones)[citation needed]
  • The standardized price returns on individual stocks [10]
  • Sizes of sand particles [10]
  • Sizes of meteorites
  • Numbers of species per genus (There is subjectivity involved: The tendency to divide a genus into two or more increases with the number of species in it)[citation needed]
  • Areas burnt in forest fires
  • Severity of large casualty losses for certain lines of business such as general liability, commercial auto, and workers compensation.[13][14]
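The Wikipedia passage quoted above says that the 80-20 rule "corresponds to a particular value of α." For readers who want to see what that means, here is a minimal sketch in Python. It assumes the standard textbook result for a Pareto (Type I) distribution with shape α greater than 1, namely that the top fraction p of the population holds a share p^(1 - 1/α) of the total. That formula and the function names top_share and alpha_for_split are my own additions for illustration and do not come from the quoted article. The sketch says nothing about actual participation in genealogy, which, as noted earlier, has never been adequately measured.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p of the population,
    assuming a Pareto (Type I) distribution with shape alpha > 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1 for a finite mean")
    return p ** (1 - 1 / alpha)


def alpha_for_split(p, share):
    """Shape alpha for which the top fraction p holds the given share.

    Solves share = p ** (1 - 1/alpha) for alpha.
    """
    if not 0 < p < share < 1:
        raise ValueError("need 0 < p < share < 1")
    return 1 / (1 - math.log(share) / math.log(p))


# The "80-20 rule": the top 20% hold 80% of the total.
alpha_80_20 = alpha_for_split(0.20, 0.80)

# The split the quoted passage attributes to Pareto's British income data:
# about 30% of the population with about 70% of the income.
alpha_70_30 = alpha_for_split(0.30, 0.70)

print(f"alpha for an 80-20 split: {alpha_80_20:.3f}")
print(f"alpha for a 70-30 split:  {alpha_70_30:.3f}")
print(f"check, top 20% share at that alpha: {top_share(0.20, alpha_80_20):.2f}")
```

Under these assumptions the 80-20 split works out to an α of about 1.16 and the 70-30 split to an α of about 1.42. The smaller the value of α, the more of the total ends up concentrated in a few hands.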
I would suggest that participation in genealogical research now follows the Pareto distribution. Even though vast amounts of information are now available, the number of individuals who have even most of the tools available to take advantage of, utilize, evaluate and sort out that information is very, very small. If the goal is to organize and utilize the information that is being digitized and indexed, then support for those few individuals should be increased rather than fighting the trend and decreasing their support.

At the same time, efforts to incorporate more individuals should be increasingly aimed at those who already have the skills needed to work with the information overload. If we limit our "recruitment" efforts to the general population by assuming that the bell curve distribution still exists, we run the risk of losing those who can actually untangle the information. My conclusions do not mean that present inclusive activities have to cease; there just needs to be a balance where those who can do the work receive the support they need from the overall community.


  1. Excellent insight! "Even though vast amounts of information are now available, the number of individuals who have even most of the tools available to take advantage of, utilize, evaluate and sort out that information is very, very small. If the goal is organize and utilize the information that is being digitized and indexed, then support for those few individuals should be increased rather than fighting the trend and decreasing their support." Why doesn't FamilySearch understand this?

    1. Good question. I have no idea what the answer is.