Some people eat, sleep and chew gum, I do genealogy and write...

Monday, March 2, 2020

MyHeritage Adds Huge Collection of Historical U.S. City Directories

https://blog.myheritage.com/2020/02/myheritage-adds-huge-collection-of-historical-u-s-city-directories/
Quoting from the MyHeritage blog post,
We are pleased to announce the publication of a huge collection of historical U.S. city directories — an effort that has been two years in the making. The collection was produced exclusively by MyHeritage from 25,000 public U.S. city directories published between 1860 and 1960. It comprises 545 million aggregated records that have been consolidated from 1.3 billion records, many of which included similar entries for the same individual. This addition brings the total number of historical records on MyHeritage to 11.9 billion records.
This huge collection of new online records was announced this past week at the annual RootsTech Conference in Salt Lake City, Utah. You need to read the blog post linked above to appreciate the depth of this mammoth undertaking. Here is an excerpt to help you begin to understand this valuable resource.
The city directories in this collection were published by thousands of cities and towns all over the U.S., and each directory is formatted differently. The huge amount of content and its variety made the project more challenging and required the development of special technology to process the city directories. 
We first used Optical Character Recognition (OCR) to convert the scanned images of the directories into text. This process can result in errors in the output, and we created algorithms to detect and correct some of these errors. 
Then, we needed to parse the records to identify the different fields in each record: names, occupations, addresses, and more. The differences in formatting between the books presented an additional challenge. Our team employed methods such as Named Entity Recognition (NER) and Conditional Random Fields (CRF) to train an algorithm using a per-book model — meaning that for each of the 25,000 books, we manually labeled a sample of the records and used it to train the algorithm to parse that directory. Using this model, the algorithm was able to parse the entire book into a structured index of valuable historical information.
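MyHeritage has not published the details of its error-correction algorithms, but the general idea of dictionary-based OCR cleanup is easy to sketch. The following Python fragment is a minimal illustration, assuming a tiny lexicon of expected directory words; the lexicon, the cutoff value, and the function name are hypothetical, not MyHeritage's code.

    import difflib

    # A tiny lexicon of words expected in directory entries
    # (a hypothetical stand-in for a full dictionary).
    LEXICON = ["carpenter", "grocer", "teacher", "street", "avenue", "widow"]

    def correct_token(token, cutoff=0.8):
        """Return the closest lexicon word when the OCR output is a near miss."""
        if token.lower() in LEXICON:
            return token
        match = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=cutoff)
        return match[0] if match else token

    # "carpcnter" is a typical OCR confusion of "e" and "c"
    print(correct_token("carpcnter"))   # -> carpenter
    print(correct_token("Smith"))       # unknown tokens pass through unchanged

A real pipeline would also use character-confusion statistics (for example, "rn" misread as "m") and surrounding context, but the dictionary lookup above captures the core idea.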
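For the parsing step, a per-book sequence-labeling model along the lines described above might be trained with a CRF library such as sklearn-crfsuite. The sketch below is purely illustrative: the feature set, the NAME/OCC/ADDR labels, and the two hand-labeled sample entries are assumptions, not MyHeritage's actual model or schema.

    import sklearn_crfsuite

    def token_features(tokens, i):
        """Simple per-token features for sequence labeling."""
        tok = tokens[i]
        return {
            "lower": tok.lower(),
            "is_title": tok.istitle(),
            "is_digit": tok.isdigit(),
            "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    # Hypothetical hand-labeled sample from one directory (the "per-book" step)
    sample = [
        (["Smith", "John", "carpenter", "42", "Main", "st"],
         ["NAME", "NAME", "OCC", "ADDR", "ADDR", "ADDR"]),
        (["Jones", "Mary", "widow", "7", "Oak", "av"],
         ["NAME", "NAME", "OCC", "ADDR", "ADDR", "ADDR"]),
    ]

    X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in sample]
    y = [labels for _, labels in sample]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)

    # Parse an unseen entry from the same book with the per-book model
    entry = ["Brown", "Wm", "grocer", "15", "Elm", "st"]
    print(crf.predict_single([token_features(entry, i) for i in range(len(entry))]))

Training one small model per book, as the post describes, sidesteps the formatting differences between directories: each model only has to learn one book's layout.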
One major development implemented in this collection is the consolidation of entries across successive years of a directory's publication. Here is a further explanation of this process.
After all the information was parsed, we consolidated the records in an unprecedented way. We identified records thought to describe the same individual who lived at one particular address over several years, as published in multiple editions of the city directories. We then consolidated all of those entries into one aggregated record that covers a span of years. This reduced “search engine pollution,” wherein a search for a person would have returned multiple, very similar entries from successive years, obscuring other records. The aggregation makes it easier to spot career changes, approximate marriage dates, re-marriages, and plausible death dates. To our knowledge, the algorithmic deduction of marriage and death events from city directories is unique to MyHeritage. 
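Conceptually, this consolidation is a group-and-aggregate pass over the parsed entries. Here is a minimal sketch in Python, assuming invented field names and a simple (name, address) grouping key rather than MyHeritage's actual matching logic.

    from collections import defaultdict

    # Hypothetical parsed entries, one per directory edition
    entries = [
        {"name": "John Smith", "address": "42 Main St", "occupation": "carpenter", "year": 1901},
        {"name": "John Smith", "address": "42 Main St", "occupation": "carpenter", "year": 1902},
        {"name": "John Smith", "address": "42 Main St", "occupation": "foreman",   "year": 1905},
    ]

    def consolidate(entries):
        """Merge per-year entries for the same person at the same address
        into one aggregated record covering a span of years."""
        groups = defaultdict(list)
        for e in entries:
            groups[(e["name"], e["address"])].append(e)
        records = []
        for (name, address), group in groups.items():
            years = sorted(e["year"] for e in group)
            records.append({
                "name": name,
                "address": address,
                "years": f"{years[0]}-{years[-1]}",
                # occupation changes across editions hint at career changes
                "occupations": sorted({e["occupation"] for e in group}),
            })
        return records

    for record in consolidate(entries):
        print(record)

Instead of three near-identical hits for John Smith, a search would return one aggregated record spanning 1901-1905, which is exactly the "search engine pollution" reduction the post describes.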
