Some people eat, sleep and chew gum, I do genealogy and write...

Wednesday, November 2, 2011

More on managing huge data files

Once again I must warn the faint of heart and weak of constitution: this post deals with managing files of unusual size (not related to the rodents of the same name). What do you do with files of 20,000 or more individuals? You don't put the documents in pretty colored folders. Do not be offended by my comments. There are considerations of scale that make most of the usual organizational strategies unworkable and ponderous. How long would it take me to file 73,000 documents and color code them? How much would it cost just to buy file folders, and where would I keep them? My house is already full of boxes. I have no intention of printing out my file, not even one copy of the pedigree going back 19 generations or so.

Years ago, when data storage was a major issue, many people managed their larger files by splitting them into various lines, either four or eight or even more. Managing these somewhat smaller, but split, files is a nightmare. One major problem with split files is pedigree collapse: the fact that some of your ancestors married relatives, so lines cross. With even two files, assuring that you are not creating duplicate entries becomes an overwhelming challenge. More files increase the challenge. Oh, you say, what are a few duplicates? Essentially, breaking a pedigree into files corresponding to maternal and paternal lines is a way of trying to duplicate a paper system, colored folders and all that. However, I might point out that most couples, married or otherwise, are not related to each other. It is perfectly acceptable to have two different files for people who are, initially at least, unrelated.


Back in the pre-computer days, my Great-grandmother worked for years on her own lines and those of my Great-grandfather. In going back through her surviving research, it is apparent that she researched some of the lines two and even three times, duplicating her efforts over the years. The reason is simple: even with a meticulous filing system, when you get over 10,000 or so names, you cannot possibly remember who is who. With computers this cross-checking becomes manageable, but I would venture that anyone approaching 20,000 names has hundreds of duplicates with no way of eliminating them. How do you know if one "Mary in 1730" is the same as another "Mary in 1740"?
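To make the cross-checking idea concrete, here is a minimal sketch of how a program might flag possible duplicates for a human to review. Everything in it is an assumption made for illustration: the record layout, the surnames, the sample people, and the ten-year window are all made up, and this is not how any particular genealogy program works. It only shows the general idea of grouping records by name and comparing birth years.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record id, given name, surname, birth year).
# The names and years are invented for this example.
people = [
    (1, "Mary", "Smith", 1730),
    (2, "Mary", "Smith", 1740),
    (3, "Mary", "Smith", 1795),
    (4, "John", "Jones", 1730),
]


def possible_duplicates(records, max_year_gap=10):
    """Return pairs of record ids that share a name and have close birth years."""
    # Group records by a normalized (given name, surname) key.
    by_name = defaultdict(list)
    for record_id, given, surname, birth_year in records:
        key = (given.strip().lower(), surname.strip().lower())
        by_name[key].append((record_id, birth_year))

    # Within each name group, flag pairs born within max_year_gap years.
    pairs = []
    for same_name in by_name.values():
        for (id_a, year_a), (id_b, year_b) in combinations(same_name, 2):
            if abs(year_a - year_b) <= max_year_gap:
                pairs.append((id_a, id_b))
    return pairs


print(possible_duplicates(people))
```

A list like this only tells you where to look. Whether the two Marys really are the same person still has to be settled by the documents.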


Enough editorializing. What do you do about the large files? First step: Make sure you know what you do know and what you do not know. Sounds simple? Not really. How confident are you in each of the ancestral levels? Who is the first person in your huge file that you do not personally know to be properly documented? Assign a percentage reliability factor to each of the generations in your pedigree, and either write it down or keep it in your head. Are you 100% sure you are correct, or some lesser number? Think it through. How firm are your names, places and dates? If I wanted to do so, I could add thousands upon thousands of names to my pedigree by simply copying work online or in books. Is this doing genealogy? Most of today's database programs give you a way to express your degree of reliability. Where does your knowledge end? That is the limit of your research. Any steps you take to extend your level of confidence through actual research and documents increase your family's pedigree. Everything you document can then become a stepping stone for future researchers, who will not have to go through the same process you did to gain a degree of certitude and reliability.
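If you prefer to write the reliability factors down rather than keep them in your head, the exercise can be sketched in a few lines. The percentages and the 90% cutoff below are made up for illustration, and the function name is hypothetical; the point is only to show one way of finding the generation where your confident knowledge ends.

```python
# Hypothetical confidence estimates (percent), one per generation,
# starting with generation 1 (yourself). The numbers are made up.
confidence_by_generation = [100, 100, 95, 90, 80, 60, 40, 20]


def research_limit(confidences, threshold=90):
    """Return the last generation (1-based) before confidence first drops below threshold."""
    for generation, percent in enumerate(confidences, start=1):
        if percent < threshold:
            return generation - 1
    # Confidence never dropped below the threshold.
    return len(confidences)


print(research_limit(confidence_by_generation))
```

The generation this returns is the last one you trust at the chosen cutoff, so the next one out is a reasonable place to aim your research.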


If you have a huge file you can choose to increase the documentation at the levels where it is possible to do so, or ignore the known and focus on the unknown. That is your decision as a researcher. For years I have chosen to document existing lines for the simple reason that no such documentation has existed in my family. Yours might be different. But in each case we only focus on the known and move one step at a time into the unknown. That is the key to managing a huge file. It doesn't matter how you mark the individuals and families you know are correct; it is sufficient that you have documented them and can then move on. Any system that allows you to do this is a good system.


Keep working and you will get a lot done. Leave a clear trail so you can be followed.
