Some people eat, sleep and chew gum, I do genealogy and write...

Monday, May 6, 2013

Why not to rely too heavily on search engines for genealogists

I recently wrote a post comparing the search engines used by Ancestry.com, FamilySearch.org and MyHeritage.com. That short exercise pointed out one very important issue: do not rely too heavily on the accuracy or completeness of search engines. This applies at every level and to every subject, but it is especially true for those of us spending time looking for ancestors online.

It is important to note some really basic rules about search engines, including Google and all the rest down to the little search box on almost every website. First and foremost, you can only get out what you put in. In other words, if you aren't searching for the right subject, you will never find what you were intending to find. Search engines rely on what is called a "matching string search." That means that what you put into the search engine determines what it looks for. Most commonly, the search engine will try to match the characters you enter to something in its database. But let's think about that for a minute.
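To make "matching string search" concrete, here is a minimal sketch in Python. The records and the string_search function are made up for illustration, and real search engines do a good deal more than this, but the basic behavior is the same: the program can only match the characters you give it.

```python
# Hypothetical records, invented for this illustration.
records = [
    "John McCormick, died 1912",
    "Mary Whitfield, died 1920",
    "Henry Alder, died 1931",
]

def string_search(query, records):
    """Return every record that contains the query characters (ignoring case)."""
    q = query.lower()
    return [r for r in records if q in r.lower()]

# The characters match, so the record is found.
print(string_search("McCormick", records))

# One different spelling and the search comes back empty.
print(string_search("MacCormick", records))
```

The second search finds nothing even though a human reader would spot the likely match right away.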

Let me go through a hypothetical search. Someone accumulates a list of records. For example, they digitize a huge pile of death certificates. What they have is a whole lot of digitized images that look more or less exactly the same. At this point, the digitized pile is roughly equivalent to a paper pile. A computer, no matter how wonderfully fast and accurate, cannot tell the difference between any one image and any other image without someone giving it instructions on how to do that.

As a side note, what if you decided to use some sort of technology, such as OCR (optical character recognition)? Would this help? Maybe. How accurate is your OCR program? What if the images are really bad and difficult to read? What about image matching technology? Well, it might be able to find another image that looks like a death certificate, but with a pile of death certificates, that isn't any help at all. We aren't interested in finding any old death certificate; we are interested in the information on the form.

If you want to find any one of the names on any one death certificate, you will have to go through the pile, one by one, and look at each image to see if that image is the one you are looking for. Even if you do this, you can't be absolutely sure you didn't go brain dead at some point and simply miss the right image. If you want to improve on the process, you might resort to extracting the names of the deceased from the records and putting the names in some sort of order in an index. How about alphanumeric, commonly called alphabetic? Hmm. That raises another basic issue: did you decipher or record the names of the deceased individuals on the records correctly? Was the information on the record correct in the first place? Did the person entering the information into the death certificate even know what he or she was doing? This is the subject of another post.

You could do a lot of things at this point to improve your accuracy. You could have two people go through the pile and extract the names, and then have a third person look at the lists compiled by the two extractors and arbitrate any differences. Or you could just accept whatever was extracted and leave it at that. But wait, there are lots of different pieces of information on a death certificate. Are you going to extract only the name of the deceased? What about the death date? What about cause of death? What about the name of the person supplying the information? What about the place of birth, place of death, etc.? Ideally, you would want to extract every last piece of information from the form and put it into your index.
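The two-extractors-plus-arbitrator idea is easy to sketch. What follows is only an illustration under assumptions of my own: the certificate numbers, the names and the compare_extractions function are all hypothetical, and a real indexing project would compare many more fields than the surname.

```python
def compare_extractions(first, second):
    """Split certificate ids into agreed values and conflicts for a third person."""
    agreed, conflicts = {}, {}
    for cert_id in sorted(set(first) | set(second)):
        a, b = first.get(cert_id), second.get(cert_id)
        if a == b:
            agreed[cert_id] = a
        else:
            # The two extractors disagree (or one skipped the certificate),
            # so the arbitrator has to look at the image.
            conflicts[cert_id] = (a, b)
    return agreed, conflicts

# Hypothetical results from two people reading the same three certificates.
extractor_a = {1: "McCormick", 2: "Whitfield", 3: "Alder"}
extractor_b = {1: "McCormich", 2: "Whitfield", 3: "Alder"}

agreed, conflicts = compare_extractions(extractor_a, extractor_b)
print(agreed)
print(conflicts)
```

Notice that this only catches the errors where the two extractors disagree. If both of them misread the same handwriting the same way, the mistake goes straight into the index.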

OK, so now you have your index. The computer, with the right software, can search a very large index in less time than it takes you to type in your search terms. But guess what? When the extractors looked at your particular death certificate, they saw McCormich, and your ancestor was named McCormick. So you use your search engine (the program that does the actual search) and it does not find the match. Try as you might, you never find a McCormick in the file that was created. Does that mean the record you are looking for is not there? Have you moved beyond looking at each record individually and seeing if you can find the right one?
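Here is the McCormich problem as a small Python sketch. The certificate numbers and the build_index and lookup functions are hypothetical, invented for this example; the point is only that the index answers for what was typed into it, not for what was written on the certificate.

```python
from collections import defaultdict

def build_index(extracted):
    """Map each extracted surname (lowercased) to the certificate ids it appears on."""
    index = defaultdict(list)
    for cert_id, surname in extracted:
        index[surname.lower()].append(cert_id)
    return index

def lookup(index, surname):
    """Exact lookup: return the certificate ids filed under this spelling."""
    return index.get(surname.lower(), [])

# Hypothetical extraction results; certificate 101 was misread as "McCormich".
extracted = [(101, "McCormich"), (102, "Whitfield"), (103, "Alder")]
index = build_index(extracted)

print(lookup(index, "McCormick"))  # the spelling you are searching for
print(lookup(index, "McCormich"))  # the spelling the extractor recorded
```

The lookup is instant either way, but the first search returns nothing and the second returns certificate 101. The record is there. The index just doesn't know it under the name you typed.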

Oh, you say, I will make my search program smarter. I will tell it to treat characters that might be confused with each other, such as "h" and "k," as the same, and then it will work. Guess what: a system called Soundex, patented nearly a hundred years ago and later used to index U.S. census records, already tried something like this by grouping letters that sound alike. You can look it up or go to the FamilySearch Research Wiki for that subject. No, I am not going to give you a link; you will have to use your own search engine this time to find the answer (I am trying to make a point).
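For those who have already done their homework, here is a rough sketch of the commonly described American Soundex rules in Python. Treat it as my own illustration rather than an official implementation; the soundex function and the test names are assumptions made for this example. Soundex keeps the first letter of the surname and codes the following consonants into three digits, so that names that sound alike end up with the same code.

```python
# Letter groups that share a Soundex digit. Vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return a four-character Soundex code, such as M265, for a surname."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad with zeros, keep letter plus three digits

print(soundex("McCormick"), soundex("McCormich"))
print(soundex("Cormick"), soundex("Kormick"))
```

With these rules, McCormick and McCormich get the same code, so a Soundex search would have found the misread certificate. But Cormick and Kormick get different codes, because the first letter is never coded. The smarter search fixes one kind of mistake and is completely blind to another.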

So, does all this mean that we are going to have to do manual searches of all the records to be sure that the information we are looking for is not there? Well, yes, sometimes that is necessary. We used to do this all the time with microfilm. But now we rely on a very unreliable, incomplete, kludged, sort-of, highly specialized type of program we call a search engine to do all of this for us. Don't you think you might want to think this through a little bit before claiming you can't find what you are looking for online?
