Saturday, September 7, 2013

A Look at Optical Character Recognition (OCR) for Genealogists

Depending on how you do your research, you may well have documents whose text you wish could be entered into a computer. An optical character recognition (OCR) program can take the image from any scanner that produces a readable-quality image of the text and turn it into a word processing or text file. Most flatbed scanners sold today come with some sort of OCR software. There are dozens of OCR programs on the market. Recently, the standard programs that come with a scanner purchase have been getting more and more accurate, but there are still limitations.

There are many free OCR programs and a fairly large number of commercial ones. If you have Microsoft Office, you may already have OCR capabilities in the Microsoft OneNote software sold with Office. In addition, you may also have Microsoft Office Document Imaging. Search for "OCR" and "free" on Google and you will probably find dozens of programs to look at and evaluate. But before you do this, check to see whether you already have a program on your computer or one that came with a scanner purchase. Another option is Adobe Acrobat.

It may not be obvious at first, but the quality of the original makes a huge difference in the final product produced by an OCR program. A multi-generation photocopy will almost always produce unsatisfactory results. Clean, crisp text on contrasting paper gives the best results, but many of the available programs are sophisticated enough to produce acceptable copy from less-than-ideal originals. Any OCR output should be carefully proofread for strange mistakes. The most common problems are numbers read as letters and letters read as numbers. Double letters and other sets of similar-looking characters in the original can cause a major increase in the error rate.
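The number-for-letter mistakes described above lend themselves to a quick automated first pass before proofreading by eye. What follows is only a minimal sketch in Python, not a feature of any OCR product: the function name flag_suspicious_words is made up for illustration, and it assumes the OCR output has already been saved as plain text. It simply lists every word that mixes letters and digits, since those are the likeliest places where a 0 was read for an O or an l for a 1.

```python
import re

# A word is "suspicious" if it contains at least one letter AND at least one digit.
# This rule is an assumption for illustration; real OCR errors can take other forms.
MIXED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")


def flag_suspicious_words(text):
    """Return (line_number, word) pairs for words that mix letters and digits."""
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Split the line into runs of letters and digits, ignoring punctuation.
        for word in re.findall(r"[A-Za-z0-9]+", line):
            if MIXED.match(word):
                flagged.append((line_no, word))
    return flagged


if __name__ == "__main__":
    sample = "J0hn Smith was born in 1852.\nHe married Mary in l861."
    for line_no, word in flag_suspicious_words(sample):
        print(f"line {line_no}: check '{word}'")
```

A check like this will also flag legitimate words such as "1st" or "2nd", and it cannot catch an error that produces a real word or a plausible number, so it only narrows down where to look. It does not replace careful proofreading.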

I use OCR for long text documents such as stories and journals. I once lost the file of my personal journal, but fortunately had a printed version, so I used OCR software to read the entire journal back into the computer. Unfortunately, the accuracy of OCR on handwritten documents is extremely low, and it will still take a while to develop an effective system. In any event, it is very unlikely, at least with the present technology, that software able to read old handwritten documents will be developed, although quite a number of people are working on the problem. There is a development in OCR software called Intelligent Character Recognition (ICR), which is aimed at handwriting and does much better with it than conventional OCR.

One of the leading software programs that includes some ICR as well as OCR capabilities is ABBYY FineReader, now in Version 11. I have used some of the older versions of the program and was quite impressed with its capabilities. I do have some OCR projects piling up, so I may be back in the market shortly.



2 comments:

  1. I'm embarrassed to admit that I had no idea that OCR software would be available so easily. For some reason I thought it would be beyond the reach of mere mortals and only available for big institutions like libraries or companies. At the moment I am struggling to think how I would use it. I only have handwritten documents that I might want translated to computer as it were and we're not quite there yet as you say.

  2. I found a free online OCR tool; it can recognize text from JPG, PNG, TIFF, BMP, and GIF images.
