Tuesday, February 9, 2016

Handwriting Recognition, OCR and Genealogy

Image credit: GJo (author assumed, based on copyright claims), own work (assumed, based on copyright claims), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2185642
One of the hot topics in genealogy today is the development of software that can adequately recognize handwriting. At the 17th Annual Brigham Young University Family History Technology Workshop in Provo, which I attended the day before going to Salt Lake for #RootsTech 2016, there were some extensive reviews of the progress being made toward this ultimate goal.

Optical Character Recognition (OCR) has come a long way from its sketchy beginnings nearly 200 years ago. The earliest developments, in the early 1800s, were systems to help the blind read. Here are some references on the history of OCR that you might find helpful.

Cheriet, M. Character Recognition Systems: A Guide for Students and Practitioners. Hoboken, N.J.: Wiley-Interscience, 2007.

International Conference and Exhibition on Multi-lingual Computing (Arabic and Roman Script), University of Durham, Centre for Middle Eastern and Islamic Studies, and Documentation Unit, eds. Proceedings of the 3rd International Conference and Exhibition on Multi-Lingual Computing (Arabic and Roman Script). [Durham]: Documentation Unit, CMEIS, University of Durham, 1992.

Netherlands Historical Data Archive, and Nijmeegs Instituut voor Cognitie en Informatie, eds. Optical Character Recognition in the Historical Discipline: Proceedings of an International Workshop. St. Katharinen: Max-Planck-Institut für Geschichte In Kommission bei Scripta Mercaturae Verlag, 1993.

Schantz, Herbert F. The History of OCR, Optical Character Recognition. [Manchester Center, Vt.]: Recognition Technologies Users Association, 1982.

You might notice that these references are to fairly old books and articles. OCR has been very slow to develop, and advances in the technology have been incremental rather than revolutionary. In the genealogical world of today, we are almost saturated with OCR-produced information in the form of newspapers, books and other printed documents. There are millions upon millions of documents online that have been fully scanned and read by OCR programs. Every time you look at one of the digital book or newspaper websites, you are benefiting from OCR. Many of our most common activities, such as sending a letter through the U.S. Post Office, are supported by OCR technology.

Despite all of the improvements in standard OCR technology, the goal of handwriting recognition has been elusive for many years. Here is a statement from the University of Southern California, in an article entitled "Optical Character Recognition" written several years ago:
The next hurdle for optical recognition is handwriting. Currently, OCR technology works at its optimum level with clean, standardized text documents (i.e., typewritten, first-generation). This is what allows the recognition mechanism to work best. Handwriting is another matter altogether — the individuality of handwriting makes it indecipherable by standard OCR software.
Here is an idea of the challenges facing those who are trying to implement handwritten OCR software:

"Puerto Rico, registros parroquiales, 1645-1969," images, FamilySearch (https://familysearch.org/pal:/MM9.3.1/TH-1-10481-5936-33?cc=1807092 : accessed 9 February 2016), Cataño > Nuestra Señora del Carmen > Bautismos 1779-1862 > image 4 of 333; paróquias Católicas, Puerto Rico (Catholic Church parishes, Puerto Rico).
If you would like to get a perspective on what is currently being done, you can review the Archive on the Brigham Young University Family History Technology Workshop. The papers from 2016 will likely be published in the Archive soon, but meanwhile you can read the papers under the Program tab. I was particularly fascinated by the progress being made in the area of handwriting recognition.


3 comments:

  1. Hello, James! And thank you for your interesting article!
    As you have already said, the most difficult problem is handwriting recognition. Optical character recognition is developing, although not as fast as we would like. More than 30 years ago, OCR was available in only a few programs. Now anyone can find an OCR handwriting recognition service. What awaits us after another 30 years? It seems to me that machine learning should significantly speed up the process of recognizing ancient handwritten texts. Computers that can learn to recognize and memorize text themselves will help to bring optical recognition to a new level.

  2. If you have the right to computerize your complete work, do so, because that makes it much easier for all of us to overcome the problems and
    type a text for the improvements you have in your mind.
