Chris Tensmeyer at BYU Family History Technology Workshop
Enabling Efficient Chinese Jiapu Information Extraction by Stephen Liddle et al. of BYU
My reporting and impressions:
There is a real challenge in using OCR technology to transcribe Chinese characters for Jiapu. They are having some of the same problems that I have experienced over the years with OCR software in general. One of the problems is integrating the relationship charts with the extracted characters. See download.
CONFIRM - Clustering of Noisy Form Images Using Robust Metrics by Chris Tensmeyer et al.
See this link for a summary. This is a proposal to resolve the problem of extracting data from historic forms. Indexing issues need to be addressed when forms vary in the form type. For example, two census forms may contain different fields in different positions. CONFIRM is a method of extracting the data from different form structures by creating templates that correspond to the forms. The idea here is to extract the useful information with OCR-type programming and not miss information because of changes in forms. One process is to extract the lines, either visible or virtual, from the form. Actually, in doing graphic design for the past 30 years or so, I have encountered some similar problems in matching forms.
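To make the idea of grouping forms by their line structure concrete, here is a minimal sketch. It is not the CONFIRM algorithm from the paper, only an illustration under my own assumptions: each form is reduced to a list of horizontal line positions (as fractions of page height), and forms whose lines nearly coincide are placed in the same group. The function names, the distance measure, and the threshold are all hypothetical.

```python
from typing import List, Sequence


def layout_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Symmetric distance between two sets of line positions.

    For each line in one form, find the nearest line in the other form,
    average those gaps, and take the worse of the two directions.
    (An assumed, simplified metric, not the one used in the paper.)
    """
    if not a or not b:
        return 0.0 if not a and not b else float("inf")

    def one_way(xs: Sequence[float], ys: Sequence[float]) -> float:
        return sum(min(abs(x - y) for y in ys) for x in xs) / len(xs)

    return max(one_way(a, b), one_way(b, a))


def cluster_forms(forms: List[Sequence[float]], threshold: float = 0.03) -> List[List[int]]:
    """Greedy grouping: a form joins the first cluster whose first member
    (used as the template) is within `threshold`, else it starts a new cluster."""
    clusters: List[List[int]] = []
    for i, form in enumerate(forms):
        for cluster in clusters:
            if layout_distance(forms[cluster[0]], form) <= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up line positions for four scanned forms of two layouts.
forms = [
    [0.10, 0.30, 0.50, 0.70],  # layout A
    [0.11, 0.29, 0.51, 0.70],  # layout A, slightly shifted scan
    [0.20, 0.40, 0.80],        # layout B
    [0.21, 0.41, 0.79],        # layout B, slightly shifted scan
]
print(cluster_forms(forms))
```

Each resulting group could then share one template that says where the fields are, so a change in form layout produces a new group rather than missed data.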
Learning Alternative Name Spellings on Historical Records by Jeffrey Sukharev et al. from Ancestry.com
Quoting from the above article:
In this article we discuss the problem of finding alternative name spelling, an important component of name matching (part of the record linkage field). We started this project primarily because of real issues that we encountered while working on name matching for various Ancestry.com projects: tree node de-duplication (people in family trees) and search query reformulation. - See more at: http://blogs.ancestry.com/techroots/learning-alternative-name-spellings-technical-report/#sthash.VYugz0i2.dpuf
Ancestry.com has 16 billion records. This is an interesting number. Name matching is a key operation in genealogy searches. As I see it, there are really two or more issues: one is alternative spellings of the name of one individual, and the other is alternative spellings of the names of different individuals.
It is one thing if the name is spelled differently in the two different cases. This is an area that very quickly gets into statistical analysis and probability. This is also a case where alignment models are used, with their parameters estimated by expectation maximization. See Alignment Models and Algorithms for Statistical Machine Translation and Wikipedia: Expectation–maximization algorithm.
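As a much simpler point of comparison, here is a sketch of scoring two name spellings by edit distance. This is not the alignment-model and expectation-maximization approach described in the Ancestry.com report, just a common baseline for spotting likely spelling variants. The function names and the similarity formula are my own assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical spellings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


for pair in [("Smith", "Smyth"), ("Tanner", "Tannar"), ("Tanner", "Jones")]:
    print(pair, round(name_similarity(*pair), 2))
```

A score like this treats every letter change the same, which is exactly where a learned model can do better: it can discover that some substitutions (such as "i" for "y") are far more likely in real records than others.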
Intelligent Pen: A Least-Cost Search Approach to Historical Document Image Segmentation and Stroke Extraction by Kevin Bauer and William Barrett, Brigham Young University.
From the above document:
A subject of increasing interest in document processing and analysis is the movement to index the handwriting contained in scanned images of documents. Many of these documents are historical in nature such as census records, birth and death records, parish and church records, journals, marriage certificates, and lists of passengers on ships.
The object of this inquiry is the development of accurate handwriting recognition software. Rather than paraphrase what was said, I suggest reading the linked article above. It is hard to understand without looking at the images they are examining.