
Tuesday, October 27, 2020

Nearly 900 Million OCR-Indexed Records Coming from FamilySearch

 

https://www.familysearch.org/blog/en/optical-character-recognition-indexing/

I have been keeping a journal for many years. Originally, it was handwritten on paper, but eventually I started keeping it on a computer. When I first started, because computer storage was so volatile, I printed off a copy of the journal periodically and kept it in three-ring binders. Finally, the inevitable happened. I lost the computer file for part of my cumulative journal. Fortunately, I had the print-out for the part I lost. Back then, Optical Character Recognition or OCR was just getting started. When I discovered that a file had been lost, I used an OCR program of the day to reconstruct the missing part of my journal. Since then, and because of other episodes of lost files, I have been compulsive about backing up my files.

Now, OCR is far from a "new" technology. I use it frequently when I need to make changes to a document stored in PDF format, although I very seldom use OCR on paper documents.

I have always been somewhat incredulous that genealogists have been so slow to incorporate OCR technology in their digitization efforts. I was interested to see the above blog post about OCR and machine learning. Some of the larger genealogy companies have been using OCR to index records for years. Billions of pages of newspapers from around the world are now fully searchable, as are millions of books. FamilySearch.org has about 491,000 digital books online according to the FamilySearch Facts of September 2020.

Digitization of historical records is immensely useful for historical and genealogical research. The fact that we can now access billions of records online has revolutionized how we do genealogical research. But searching through historical records, even in digital format, is time-consuming and inefficient. Indexed records speed up the research process. FamilySearch.org has an online counter that graphically shows the increasing number of images from "the world's largest collection of historical documents." The number is constantly changing, but at the time I wrote this down, it was 4,252,458,500 and counting. There is also a link to view the most recently added images:

https://www.familysearch.org/records/images/search-results?startDateAvailable=2020-10-13

Due to the current pandemic, we are excluded from many libraries, including the Brigham Young University Family History Library, where I serve as a Church Service Missionary for The Church of Jesus Christ of Latter-day Saints, and the famous Family History Library in Salt Lake City. Were it not for the huge online collections of documents, my own research and my ability to help others with their research would be dead in the water.

It is gratifying to see from the article linked at the beginning of this post that FamilySearch is finally making a major effort not only to digitize records but also to use long-standing technology to make them available with OCR and machine learning indexes. There will always be a component of this automated indexing that requires human review, but as the huge online indexes to newspapers and books have shown, OCR and machine learning can go a long way towards opening up billions of pages of records for searches. These tools are not going to put the indexing volunteers out of business, as the article explains, but the task of the human indexers will change over time.

Huge advances have also been made over the past few years in handwriting recognition. The blog post above makes an oblique reference to being able to read handwritten documents. I believe the technology is in place to make significant advances. One of the most direct ways to speed up the review process is to allow those who do research into handwritten historical documents to "correct" the entries found by the OCR process. This is allowed in a very limited number of documents on FamilySearch.org but has been generally in place on Ancestry.com for some time.

I look forward to continued advances but for now, I am glad to search through images. Looking at the original document is always a requirement for careful research. 
