Some people eat, sleep and chew gum, I do genealogy and write...

Thursday, April 9, 2026

The Main Challenges of Full-text Search Part One

 

Three of the major online family tree and database websites have implemented AI-based full-text search and, to some degree, handwriting recognition in the last three or so years. FamilySearch.org's offering is called "Full Text Search" and includes handwriting recognition; it is available for free to all users. MyHeritage.com introduced a similar program called Scribe AI. Ancestry.com's contribution is confined to OCR and lacks handwriting recognition. All of these efforts lag behind the capabilities of Google Gemini with NotebookLM and some of the other AI websites. Of course, none of the genealogy companies has the resources of Google, OpenAI, and the others.

As far as the genealogical community is concerned, handwriting recognition, document translation, and full-text search are revolutionizing the way we do research. I can put hundreds of documents into NotebookLM and then have a discussion with Gemini about the contents of all the documents at once, and the conversation quickly evolves into an examination of what records need to be found to resolve serious historical issues. But working with full-text search opens a whole new series of challenges.

The first and major challenge is the issue of search terms. In the old way of searching, using Google for example, you formulate search terms based on categories. If I want to find some probate records, I search for probate records in the state or country where my ancestors lived. I don't expect Google to find the actual records of my ancestors. That is left to me: I search through the records individually for the people I am interested in finding. Of course, I could go back to the time when I was forced to use microfilm and read through the entire collection of records to find one name. The time and effort spent in searching record collections severely limited the amount of information I was able to gather in any reasonable time period. In addition, of course, I was faced with the issue of deciphering all of the handwritten records. The result was that very little research was produced for the time spent.

Full-text search opens up the ability to review hundreds, perhaps thousands, of records in a matter of minutes, but the catch is that you need to guess the search terms that will open up the information. Let's say your ancestor's name was John. Here is a list, generated by Google Gemini, of the variations of the name John.

The name John is one of the most enduring and prolific names in history. It originates from the Hebrew name Yohanan (יוֹחָנָן), meaning "Yahweh is gracious."

Because of its biblical significance—associated primarily with John the Baptist and John the Apostle—the name spread through the Byzantine Empire, into the Latin-speaking West, and eventually across the globe through various linguistic shifts.


Linguistic Evolution

The path of the name from its ancient roots to modern English follows a clear phonological chain:

  1. Hebrew: Yohanan

  2. Greek: Ioannes (Ἰωάννης)

  3. Latin: Iohannes / Johannes

  4. Old French: Jehan

  5. Middle English: Jan / John


Variations by Region and Language

Western & Northern Europe

These variations reflect the Germanic and Romance linguistic families where the name has remained a top choice for centuries.

  • English: John, Jack, Johnny, Jon

  • French: Jean, Yannick (Breton)

  • Spanish: Juan, Juanito

  • Portuguese: João

  • Italian: Giovanni, Gianni, Nino, Vanni

  • German: Johannes, Hans, Jan, Johann

  • Dutch: Jan, Johan, Hanne, Hans

  • Scandinavian (Danish/Swedish/Norwegian): Johan, Jan, Jens, Hans

The British Isles (Celtic Variations)

The Celtic languages developed distinct versions of the name that have since become globally popular in their own right.

  • Irish: Seán, Shane, Shaun, Shawn

  • Scottish Gaelic: Iain, Ian

  • Welsh: Evan, Ioan, Ieuan, Siôn

  • Manx: Juan

Eastern Europe & Eurasia

In Slavic and Orthodox traditions, the name often maintains the "I" or "Y" sound from the original Greek Ioannes.

  • Russian: Ivan, Vanya

  • Polish: Jan, Janusz

  • Czech/Slovak: Jan, Ján, Janko

  • Hungarian: János, Jancsi

  • Romanian: Ion, Ioan, Ionuț, Nelu

  • Bulgarian/Serbian: Ivan, Jovan

  • Greek: Ioannis, Giannis, Yannis

Middle East & Africa

These versions often stem directly from the Hebrew original or the Islamic tradition.

  • Arabic: Yahya (يحيى), Yuhanna (يوحنا)

  • Hebrew: Yohanan (modern: Yochanan)

  • Amharic (Ethiopia): Yohannes

  • Turkish: Yahya

Asia & Pacific

In these regions, the name is often adopted through religious conversion or phonological adaptation of Western names.

  • Chinese: Yuēhàn (約翰)

  • Japanese: Yohane (ヨハネ - Biblical), Jon (ジョン)

  • Korean: Yohan (요한)

  • Hawaiian: Keoni


Diminutives and Medieval Short Forms

Historically, many surnames were created from pet names or shortened versions of John.

  • Hank: Derived from the Dutch Hanne.

  • Jan: Common in Northern Europe; used as a root for many surnames.

  • Jenkin: A medieval English diminutive ("Little John").

  • Hick/Hitch: Obsolete medieval English rhyming nicknames for John.


Summary Table of Major Forms

Language    Primary Form    Common Diminutive
English     John            Jack
Spanish     Juan            Juanito
Russian     Ivan            Vanya
German      Johannes        Hans
Italian     Giovanni        Gianni
Irish       Seán            Shane
Scottish    Ian             Iain
Finnish     Jukka           Jani

Which one of all of these forms was the one used by your ancestor named John? Did he use the name John at all, or did he use some other name, such as Bubba or Kid or J.T.? So when you are faced with a search field such as this one from FamilySearch, what are you going to use for the search terms?


If you assume that the person's name was John, what are your chances of finding him if he went by one of the other names? For example, my great-grandfather's official name was Henry Martin Tanner, but when he signed legal documents, such as deeds, he always used Henry M. Tanner. Full-text searches are rather literal, and if I search for Henry Tanner, I may not find Henry M. Tanner. I can use all sorts of Boolean operators, but I still face the same problem of determining which search terms will find the specific piece of information I am looking for. Another example: one of my relatives is named Joseph Christiansen. His grave marker says Joe Christiansen. He apparently did not like to be called Joseph. How am I supposed to know this?
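To make the problem concrete, here is a minimal sketch in Python of how the written forms of a single name might be listed before searching. The function names (name_variants, or_query) are my own inventions for illustration, and the quoted-phrase OR syntax is an assumption; check what operators a particular website actually accepts before relying on it.

```python
from itertools import product

def name_variants(given, middle, surname, nicknames=()):
    """Return a sorted list of plausible written forms of one person's name."""
    # Given name, its initial, and any known nicknames (hypothetical inputs).
    givens = [given, given[0] + "."] + list(nicknames)
    # Middle name omitted, spelled out, as an initial with a period, or without one.
    middles = ["", middle, middle[0] + ".", middle[0]] if middle else [""]
    variants = set()
    for g, m in product(givens, middles):
        parts = [p for p in (g, m, surname) if p]
        variants.add(" ".join(parts))
    return sorted(variants)

def or_query(variants):
    """Join exact-phrase variants with OR (assumed syntax, not any site's documented one)."""
    return " OR ".join(f'"{v}"' for v in variants)

variants = name_variants("Henry", "Martin", "Tanner")
print(len(variants), "forms to try")
print(or_query(variants))
```

Even this small example produces both "Henry M. Tanner" and "Henry M Tanner" as separate strings, and it does not touch spelling variants of the surname at all. Adding a nickname, as in name_variants("Joseph", "", "Christiansen", nicknames=("Joe",)), only works if you already know the nickname exists.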

The people programming full-text search could add the variations for all the names and all the places and practically everything else into their programs. They might even implement AI to recognize all of the variations. What happens in that circumstance is that the number of documents discovered by the AI can run into the millions.

Do I have a solution for this? No, but I have a methodology I use to attempt to narrow down the number of possible variations. This primarily includes carefully reviewing the documents that I do find to discover the limited set of name variations actually used by the person I am searching for. This whole process also applies to place names and, to some extent, to dates, particularly when you think about calendar changes.
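As a rough illustration of that review step, the following Python sketch tallies the forms of a name that appear in transcribed text from documents already found. The function name observed_forms and the sample snippets are made up for this example; they are not real records. The pattern is also an assumption on my part: it simply collects up to three capitalized words or initials that come directly before the surname.

```python
import re
from collections import Counter

def observed_forms(texts, surname):
    """Count each distinct run of capitalized words or initials preceding the surname."""
    # One to three tokens such as "Henry", "M." or "H", followed by the surname.
    # Limitation: a capitalized word right before the name (for example at the
    # start of a sentence) is swept into the match as well.
    pattern = re.compile(
        r"\b((?:[A-Z][a-z]*\.?\s+){1,3})" + re.escape(surname) + r"\b"
    )
    counts = Counter()
    for text in texts:
        for match in pattern.finditer(text):
            given = " ".join(match.group(1).split())
            counts[f"{given} {surname}"] += 1
    return counts

# Invented sample snippets, for illustration only.
samples = [
    "Deed signed by Henry M. Tanner and witnessed this day.",
    "The estate of Henry Martin Tanner, deceased.",
    "Sold to Henry M. Tanner forty acres.",
]
counts = observed_forms(samples, "Tanner")
for form, n in counts.most_common():
    print(n, form)
```

The forms that show up most often in the documents you already have are the ones worth trying first as search terms. The tally only reflects what has been found so far, so it narrows the guessing without ending it.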

This is part one of this particular series of articles, and hopefully you will stay with me and read the rest of the series as it comes out over the next few weeks.
