Some people eat, sleep and chew gum, I do genealogy and write...

Sunday, December 21, 2014

More or Less, Statistically Speaking -- A Look at Statistics for Genealogists

In our present world, statistics are ubiquitous. We can't even hear a weather report without a claim as to the percentage chance of rain or sunshine or whatever. In fact, it was the weather reports that started me thinking about this subject, one that I have reflected on many times over the years. First off, I have to say that I have never had a course in statistics; I am a consumer of the product. But as an attorney, I have been faced with many situations where statistics were used one way or another to persuade people. Over the years, I have read a huge number of different "studies" that intended to prove, by citing statistics, one point or another. Regularly, I have taken it upon myself to investigate the basis for the claimed "facts" as supported by statistics.

One notable and very involved study that I did had to do with the claims of the Bureau of Labor Statistics concerning consumer expenditures. The most common of these statistical findings is summarized in the "Average annual expenditures and characteristics of all consumer units and percent changes" for each year. My interest in this particular set of statistics came from the fact that the numbers seemed unreasonably high given my own personal experience and I was interested to find out why my experience varied so completely from what was being represented as the "average."

Now we are back to the common question with some of my blog posts: what has this got to do with genealogy? If you are patient, you will soon see the connection. One of the things my interest has done is to get me to read as I investigate the different issues. One interesting book is:

Huff, Darrell, and Irving Geis. How to Lie with Statistics. New York: Norton, 1954.

I read this book a number of years ago and it confirmed what I had found out through my own investigations: statistics can be manipulated for a specific purpose. In saying this, I am not accusing anyone specifically of misrepresenting facts; I merely note that those citing specific statistics usually have an agenda they wish to support, whether it be commercial, political or social.

Sometimes it is hard to separate statistical reports from those purporting to be factual. In this regard, claims of growth rates and popularity of a particular product or activity are suspect. Many times, I find that claims about the frequency of certain activities are also manipulated. This is easily done by altering the definition of the activity to include more or fewer participating individuals. I have found this to be the case when people wish to emphasize the importance of any activity, from movie attendance to sales of a particular product. In many instances the people publishing these figures base their claims on "actual attendance" or "actual sales figures" when no such figures have been or can be obtained.

The main culprits here are the terms "average" and "median." Both of these terms as usually employed are misleading. For example, if I have ten people and nine of them have an income of $1 and one has an income of $1000, what is the average income? The answer is simple: $1009 divided by 10, or $100.90, a figure that matches none of the ten people. The median, the middle value, is $1, which hides the one large income entirely. Now, I am fully aware that a "careful" analysis of this data would possibly apply some sort of selection process that might eliminate the highest number or weight the lower numbers, but no matter how the process is applied, the average in this type of situation is misleading. Statisticians try to avoid these types of problems through "random" sampling and other such methods, but in most studies there is always a "margin of error."
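For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the ten-person example. The variable names are my own and are only for illustration; the incomes are the made-up figures from the paragraph above, not real data.

```python
from statistics import mean, median

# The made-up example from the text: nine people earn $1, one earns $1000.
incomes = [1] * 9 + [1000]

average = mean(incomes)    # total income divided by the number of people
middle = median(incomes)   # the middle value once the incomes are sorted

print(f"average (mean): ${average:.2f}")
print(f"median:         ${middle:.2f}")
```

The two summary figures come out far apart, and neither one, quoted alone, tells you that the group is nine people at $1 and one at $1000.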

The main issue arises when statistics are used to predict a certain type of result, such as elections or public opinion. There is less of a problem when statistics are used solely for the purpose of explaining what has happened, but in many cases, the interpretation of the numbers is skewed to show the results desired rather than the facts.

Now to the subject of genealogy. In the past, I have written several blog posts refuting the claim that genealogy is one of the most popular pastimes or hobbies in the world. I have never seen any numbers at all that would substantiate such a claim. Notwithstanding that fact, over the past year or so, I have continued to hear claims about genealogy's popularity purportedly based on some general claim of this sort. In many cases such claims are based on a survey asking a general question such as "are you interested in your family's history?"

The other, much more serious impact of statistics on genealogy is when statistical claims influence the activities of genealogists. One other topic I have written on in the past is the impact of claims of the number of identity theft victims and the growth of identity theft as a crime. I have researched this topic over and over in the statistics provided by the Federal Bureau of Investigation (FBI), the Bureau of Justice Statistics, Uniform Crime Reporting Statistics and the U.S. Census Bureau and find no support for the outrageous and very commonly reported statistics. The problem is that unsupported claims, masquerading as statistics, have caused a significant number of people to be afraid of sharing their family history. If you were to believe the claims, you would soon be afraid to go to a store and buy food.

Why is this important and who cares? Why do we think or do not think that the popularity of genealogy is important? Why should we be or not be afraid of identity theft?

Considering a much less threatening claim, I recently analyzed the claims about how many of the world's records have been or have not been digitized. This is another area that receives a lot of claims as to the percentage based on general claims of the total number of records. There are two things that lead me to believe that the statistics cited are unreliable: either the numbers are suspiciously round, such as exactly 1,000,000 of this or that, or they are so specific as to be impossible, such as a number that is claimed for the total number of books published in the entire history of the world.

It is important to distinguish between numbers that are represented to be statistical claims, or that have to be based on sampling of the total number of items being reported, and numbers that come from actual counts. Even if a number is based on a supposed actual count, the question still remains as to how the count is characterized and what was counted. For example, the very large genealogical database companies regularly report the number of "records" digitized and sometimes claim so many people in their database files. What is a "record" and who counted how many names there were on each record? Did someone really look at each digitized page and count the total number of people or did the reporting entity take a sample and multiply out the number as an estimate? Did the number of records really come out to an even number?
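To show what "take a sample and multiply out" looks like in practice, here is a rough sketch in Python. Everything in it is hypothetical: the collection of 10,000 pages, the number of names per page, and the sample size of 100 are all invented for illustration and do not describe how any particular company produces its figures.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same result each run

# Hypothetical collection: 10,000 digitized pages, each naming 0 to 40 people.
pages = [random.randint(0, 40) for _ in range(10_000)]

# An actual count: look at every page and add up the names.
actual_total = sum(pages)

# An estimate: count the names on 100 randomly chosen pages,
# then multiply out to the size of the whole collection.
sample = random.sample(pages, 100)
estimated_total = sum(sample) * len(pages) // len(sample)

print("actual count:        ", actual_total)
print("estimate from sample:", estimated_total)
```

In this sketch the estimate is always a multiple of 100, because it is a sample count multiplied by 100. That is one way a published total can end in a tidy row of zeros without anyone having counted every name.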

Genealogy is not all that different from any other aspect of our world-wide culture. We have some of the same challenges in evaluating the numbers that are thrown at us each day in the media. We need to learn to be very critical when we are urged to act or not act based on some claim of a study or statistics. This is especially true when the entity making the claim cannot possibly have compiled the numbers claimed.

We need to apply the same sort of skepticism when we are doing our research. Is the information we are finding making any sense? Is it consistent with reality and can any of the numbers be independently supported?
