Some people eat, sleep and chew gum, I do genealogy and write...

Saturday, October 24, 2015

Cluster Research

Very few genealogists realize the importance of cluster research. It seems counterintuitive to spend time researching people who are not obviously related to you in any way. The idea is that information that is entirely missing from your ancestors' records may have been preserved by neighbors. Cluster analysis is very common in a number of disciplines. Here is a definition from Wikipedia:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
One common way of establishing a cluster is to look for class members who live in proximity to each other. From a historical standpoint, the clusters should be defined by sampling the area surrounding the target ancestral family to establish homogeneity. One of the most common characteristics of the selected class is that its members all have the same immigrant origin. The affinity of the class can also be based on a similar religion, national origin, language or economic level. One study of the subject from Harvard University is entitled, "Determinants of Immigrant Clustering: Evidence from New York City in the late 20th century" by Kris Bartkus.

This is an instance of a powerful analytic tool that is rarely employed to solve genealogical problems. As the study undertaken by Kris Bartkus indicates:
This paper will explore how a number of macro-level indicators impact immigrant clustering rates of immigrants from various countries. Some of these indicators would appear to be cultural, but we can use basic economic theory to predict how each of these indicators would impact clustering rates. We then test these predictions against the data to see which of these indicators actually have a significant impact on clustering rates. Unlike other studies, which either study clustering on the state rather than neighborhood level or look at only the characteristics of the neighborhoods themselves when studying determinants of clustering (which raises endogeneity concerns), this study will look at clustering on a micro level, which is more accurate since clustering occurs on a neighborhood scale, while using macro level panel data to explain variation in clustering. The paper will also comment on differences in clustering between immigrants and first-generation Americans. 
A simple way of examining clusters is to plot the addresses of groups with similar countries of origin who live in close proximity, as indicated by the U.S. Federal Census returns. It is helpful to use a spreadsheet, such as Microsoft Excel, to enter the data. To do this, you have to examine the original U.S. Census enumerators' maps. For example, has the United States Enumeration District Maps for the Twelfth through the Sixteenth US Censuses, 1900-1940, in its digitized Historical Record Collections. Older enumeration maps are also available but are more difficult to locate. Genealogist Michael Hait, CG has compiled a valuable reference entitled "United States Federal Census Pathfinder."
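For those who prefer a script to a spreadsheet, the same tally can be sketched in a few lines of Python. This is only a rough illustration: the rows, the column layout (enumeration district, surname, birthplace) and the 50 percent threshold are all my own assumptions, not anything taken from an actual census return.

```python
from collections import Counter, defaultdict

# Hypothetical rows transcribed by hand from census returns:
# (enumeration district, head-of-household surname, birthplace)
ROWS = [
    ("ED 12", "Household 1", "Denmark"),
    ("ED 12", "Household 2", "Denmark"),
    ("ED 12", "Household 3", "Denmark"),
    ("ED 12", "Household 4", "Ireland"),
    ("ED 13", "Household 5", "Ireland"),
    ("ED 13", "Household 6", "Germany"),
    ("ED 13", "Household 7", "Denmark"),
]


def birthplace_counts_by_district(rows):
    """Count households per birthplace within each enumeration district."""
    counts = defaultdict(Counter)
    for district, _surname, birthplace in rows:
        counts[district][birthplace] += 1
    return counts


def dominant_birthplaces(rows, min_share=0.5):
    """Return {district: (birthplace, share)} for districts where one
    birthplace accounts for at least min_share of the households."""
    result = {}
    for district, counter in birthplace_counts_by_district(rows).items():
        birthplace, n = counter.most_common(1)[0]
        share = n / sum(counter.values())
        if share >= min_share:
            result[district] = (birthplace, share)
    return result


if __name__ == "__main__":
    for district, (birthplace, share) in dominant_birthplaces(ROWS).items():
        print(f"{district}: {birthplace} ({share:.0%} of households)")
```

A district that shows up in the result is a candidate cluster worth examining household by household. The threshold is arbitrary and would need adjusting for the area being studied.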

Clusters can also be created by using alternative search terms in programs such as, and In some instances the search results can be copied or exported to a spreadsheet for further analysis. I recently employed a modified form of cluster research with searches in the program to break through an end-of-line problem in England that had baffled researchers for over 100 years. In this case, I searched for all of the individuals with the target surname in an entire county and then reviewed the list by locality within the parish.
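Once search results have been exported, the review by locality described above might look something like the following Python sketch. The records, field names and spelling variants are hypothetical placeholders that I made up for illustration; a real export will have its own columns.

```python
from collections import defaultdict

# Hypothetical exported search results; the field names are assumptions.
RECORDS = [
    {"surname": "Tanner", "given": "John", "parish": "Parish A", "year": 1742},
    {"surname": "Tannar", "given": "Mary", "parish": "Parish A", "year": 1739},
    {"surname": "Smith", "given": "Ann", "parish": "Parish A", "year": 1740},
    {"surname": "Tanner", "given": "William", "parish": "Parish B", "year": 1751},
]


def group_by_locality(records, surname_variants):
    """Keep records whose surname matches any variant (case-insensitive),
    grouped by parish and sorted by year within each parish."""
    wanted = {s.lower() for s in surname_variants}
    groups = defaultdict(list)
    for rec in records:
        if rec["surname"].lower() in wanted:
            groups[rec["parish"]].append(rec)
    for people in groups.values():
        people.sort(key=lambda r: r["year"])
    return dict(groups)


if __name__ == "__main__":
    groups = group_by_locality(RECORDS, ["Tanner", "Tannar"])
    for parish in sorted(groups):
        print(parish)
        for rec in groups[parish]:
            print(f"  {rec['year']}  {rec['given']} {rec['surname']}")
```

Listing the matches parish by parish makes it easier to see where the surname concentrates and which localities deserve a closer look.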

Each cluster search will entail its own unique parameters. Generally speaking, however, the concept is to extend the research to a defined geographic area surrounding the target family and then examine the resulting population for similarities that may help to further delineate the origin or even the identity of the target family.
