Wednesday, November 17, 2021

FamilySearch using computer-assisted indexing for digitized records

 

https://www.thechurchnews.com/members/2021-10-29/computer-assisted-indexing-familysearch-records-231067

This article is interesting in several respects. First, it contains some updated statistics. Here is a quote from the article linked above to start with.

In September, FamilySearch announced a milestone 83 years in the making — the completion of digitizing its collection of more than 2.4 million rolls of microfilm. 

The digital archive containing information on more than 11.5 billion people represents over 200 countries and principalities and more than 100 languages. 

The reference to "83 years in the making" is obscure. It refers to the first microfilming efforts in 1938, hence 83 years as of 2021. FamilySearch has not been digitizing for nearly that long. It is the case, however, that some of the early microfilms have been indexed. The next statement in the news article has to be the genealogical understatement of the year.

While images of these records are available to view online, several records still need to be indexed so FamilySearch users can search for and find them. Many of those languages, however, are difficult for people to index.  

I think the writer of the article left out the word "billions" after the word "several." For some time, I have been watching to see how many of the billions of records on the FamilySearch.org website have been digitized. The number of digital images (as opposed to the people represented in those images) is about 4.6 billion, according to the Company Facts section of the FamilySearch.org website. Now to the share of those records that has been indexed so far, again quoting from the article above.

Only 20% of FamilySearch’s online historical records are currently indexed, and FamilySearch hopes computer-assisted indexing can increase that percentage at an accelerated pace. 

My most recent estimate was about 30%, a figure I have heard several times from FamilySearch. More recently, however, FamilySearch has been uploading raw digitized images to the Image Section of the website, which adds to the unindexed total and may account for the lower figure.

Optical Character Recognition (OCR) has been available for years and is very sophisticated. I have always wondered why FamilySearch did not use this existing technology to assist in indexing. It appears that they may now have started to do so. Of course, they already use OCR on their online books collection, which presently stands at about 531,909 and is increasing weekly.

It isn't clear from the article exactly how FamilySearch is using artificial intelligence to assist in indexing, but my guess is that they are relying in part on research done by the Brigham Young University Family History Technology Lab. You might want to read the article for yourself and see what you think.

Here is another quote.

Records indexed by a computer are labeled with a box in the top right corner that reads “This record was indexed by a computer. If you find an error, click here to report it.” 

I have yet to run into any of these records. I am aware of the obituaries that were transcribed by OCR and that allowed corrections, but I have been asking for years why FamilySearch does not crowdsource its indexing online by letting users transcribe individual records as they search them. Here is the record shown in the article.



Now back to the issue of the images. It is apparent to me that the number of images that are unindexed is growing faster than the effort to index them. This is likely the incentive for FamilySearch's automated indexing efforts. 

2 comments:

  1. The obituaries you mentioned seeing are using some of the same technology mentioned in the article. There are a couple different concerns that we're using artificial intelligence technology to address:

    1. Reading text from an image (handwritten and/or typewritten). Most off-the-shelf OCR works only with printed type, and even that works much better with higher-quality images, which historical records often are not. Handwriting is obviously harder, but it is also being done with pretty good results.
    2. Natural language processing - even if you know what the text says, you need something to tell you what pieces are names/dates/places/relationships/etc. and how everything fits together to make indexed records.

    Many of the obituaries you've encountered have only used the technology from #2, although there are some from printed newspapers where OCR has also been done. In terms of handwriting recognition, the reason you likely haven't seen any is that we're focusing mostly on Spanish Catholic christening records at the moment, because the number of indexed records in other languages greatly pales in comparison to English. We've published nearly 67 million of these handwritten records this year and hope to get to about 115 million by the end of the year. More languages, record types and styles are coming as fast as we can develop the capabilities. We're hoping to eventually turn the tables like you mention so that capturing images becomes the constraining factor. We've already had some cases where indexed records were published within 24 hours of the images being uploaded for new image captures.

  2. Thanks for the update. I wish this pandemic were over and I could meet with everyone in person. I appreciate the Spanish-language records because I am helping with consultations in Spanish.
