Sunday, August 4, 2013

Who benefits from genealogical data standards?

In an insightful post on Justin's Genealogy Blog entitled "Everyone Benefits from Data Portability," Justin makes some interesting comments concerning the standardization of genealogical data interchange. As I understand it, his argument is that everyone, including the developers of genealogical software, would benefit from a common information exchange standard. But he also admits the following:
I have never paid for a genealogy product precisely because the world I just described doesn't exist. The pain of keeping multiple trees in sync is greater than the benefit of features which products offer (at least for me). If it were trivial to keep all desktop products and online trees in sync, I would start buying.
So part of his argument, at least, is that some genealogists do not make purchases due to data incompatibility.  I believe that this is the first time I have heard that particular view expressed.

I think that perhaps I have not been clear enough in my earlier posts as to exactly what I am talking about. To explain, I will resort to my usual use of hypothetical situations. But before getting into hypotheticals, let's review a little history and go back before the beginning of genealogical software development. In the early days of personal computers, there were dozens of manufacturers of different computer systems using a variety of computer processors. I remember both the Altair 8800b and the IMSAI 8080 computers, although I only owned an IMSAI (and never actually used it). The first personal computer I ever spent any time working on was the TRS-80 from Tandy Corporation (Radio Shack). At that time, the three big names in computers were Apple, Tandy and Commodore. In 1979, the TRS-80 had the largest available selection of software in the microcomputer market.

Now, in 1979, or even into the 1980s when I started using Apple II computers, there was not even a concept of data exchange. Any existing genealogy programs were rudimentary, text-based and not very useful. I was talking with a friend yesterday who related how her son wrote her a genealogy program back then, which, of course, displayed only capital letters and could not print its output. Just as there were a variety of computer platforms, such as Atari, Texas Instruments (TI) and, in 1981, the IBM PC, there were different operating systems and different programming languages. There was no way to connect two different computers together, and no one was really concerned about doing that anyway.

Along came some genealogy programs and it was almost a miracle just to have your genealogical data in a computer file where you could search for duplicate names and find the information you had entered. Of course, I could have shared a file with someone else, had I known anyone who was interested and had the same computer brand and software program that I did. The point of this review is to show that there were no computer standards from the very beginning of the personal computer revolution. Computer programs were written for a specific computer with a specific operating system.

Was data file exchange an issue back then? Yes, it certainly was. Was it in the interests of the developers and manufacturers to make their data files compatible? No. Can you imagine Apple and IBM getting together to formulate a data standard?

Fast forward to the present. Exactly the same situation exists today. We have dozens of different computer manufacturers and still have incompatible operating systems. When was the last time you tried to open a data file or document and found that you could not, because you did not have that particular program on your computer? In a perfect world, with no economic competition, maybe someone could dictate absolute file compatibility. But even then, with the changes in technology and the development of new processors, data incompatibility is inevitable.

Can programs be written to "translate" the data from one program to another? Yes, sometimes, and with the cooperation of the various manufacturers. I can presently run Windows programs on my OS X Macintosh computer with virtualization software. But even that level of integration does not make the data files compatible.

Now, a hypothetical. Suppose I am a developer of genealogical software. If I am going to spend my money and my time writing a program, I might like to make a profit. Do I start out with the idea of making my program as compatible as possible with every other program on the market? Not if I can help it. I make my program as unique as possible so that I can differentiate it from all of the others already being sold. Ultimately, I would hope that my program becomes so popular that it becomes the de facto standard for genealogy programs.

But, you say, you are confusing operating systems, file formats and data. Yes, I am. At each level there is a challenge in exchanging information. Yes, people do write translator programs to move information from one program to another. For example, there are dozens of different file formats for images, such as .jpeg, .tiff, .png, .CR2, etc. Most imaging programs can read some of the more common file formats and let you see the photo or image, but there is no "standard." I use camera raw files from a Canon camera and store them as .dng files, that is, Adobe Digital Negatives. Very few of the popular programs can read my files. Is this a problem? Yes. Do I worry about the format and file type? Yes. Is there any movement to make a single image file standard? There are such efforts, but they will probably not be effective.
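To make the incompatibility concrete: each of those formats announces itself with different leading bytes, and a program that does not recognize a signature simply cannot open the file. Here is a minimal sketch in Python (my own illustration, not taken from any imaging library) that distinguishes a few formats by those "magic numbers." Note that Canon's CR2 and Adobe's DNG are both built on the TIFF container, which is why a signature check alone cannot fully tell them apart.

```python
# Minimal sketch: identify an image file by its leading "magic" bytes.
# A program that does not recognize a signature has no way to open
# the file -- which is exactly the incompatibility described above.

def sniff_image_format(path):
    with open(path, "rb") as f:
        head = f.read(12)
    if head.startswith(b"\xff\xd8\xff"):          # JPEG
        return "JPEG"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):     # PNG
        return "PNG"
    if head.startswith(b"II*\x00") or head.startswith(b"MM\x00*"):
        # TIFF container. Canon CR2 marks itself with "CR" at offset 8;
        # Adobe DNG is also TIFF-based, so the signature alone cannot
        # distinguish it from a plain TIFF.
        if head[8:10] == b"CR":
            return "Canon CR2 (TIFF-based raw)"
        return "TIFF family (plain TIFF or DNG)"
    return "unknown"

if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, "->", sniff_image_format(name))
```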

Given the history of computers and given the history of computer programming, is it likely that all of the existing genealogical program developers will suddenly decide that everyone will benefit from a common standard? Not at all likely.

The next question is, would everyone benefit from a common data exchange standard, assuming it was possible to design one and it became universally adopted? Maybe and maybe not. Do we really want to stop program development and freeze it at some arbitrary level? Oh, but you say, standards can be revised and updated. If that happens, the standard is following the market, not being imposed on the market.

Will data become easier to move from one program to another? Yes, certainly. GEDCOM, with all its present limitations, is a good example of a way to move information between programs without impinging on their internal file structures. So a standard in genealogy has to be semi-independent of the programs. It needs to be useful and relatively easy to apply, but it must not be tied to any one program's internals.
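For readers who have never looked inside a GEDCOM file, the format itself is deliberately simple: each line carries a level number, an optional cross-reference ID, a tag, and an optional value, and the level numbers alone encode the nesting. Here is a minimal sketch of reading that structure into a tree (the dictionary field names are my own choice, not part of the GEDCOM specification):

```python
# Minimal sketch (mine, not from any GEDCOM library) of the
# lineage-linked text syntax: each line is "LEVEL [@XREF@] TAG [VALUE]"
# and the level numbers alone encode the nesting.

SAMPLE = """\
0 @I1@ INDI
1 NAME John /Doe/
1 BIRT
2 DATE 1 JAN 1850
2 PLAC Mesa, Arizona
"""

def parse_line(raw):
    level_str, _, rest = raw.partition(" ")
    xref = None
    if rest.startswith("@"):                  # optional cross-reference ID
        xref, _, rest = rest.partition(" ")
    tag, _, value = rest.partition(" ")
    return int(level_str), xref, tag, value

def parse_gedcom(text):
    """Build a list of top-level records as nested dicts."""
    roots, stack = [], []                     # stack[i] = open node at level i
    for raw in text.splitlines():
        if not raw.strip():
            continue
        level, xref, tag, value = parse_line(raw)
        node = {"tag": tag, "xref": xref, "value": value, "children": []}
        if level == 0:
            roots.append(node)
            stack = [node]
        else:
            del stack[level:]                 # pop back to this node's parent
            stack[-1]["children"].append(node)
            stack.append(node)
    return roots

if __name__ == "__main__":
    for record in parse_gedcom(SAMPLE):
        print(record["tag"], record["xref"], "-",
              len(record["children"]), "sub-records")
```

Notice that nothing here depends on how any particular program stores its data internally; the syntax only describes what travels between programs, which is what lets it remain semi-independent of them.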

More on this, I am sure.

9 comments:

  1. "Do we really want to stop program development and freeze it at some arbitrary level?"

    But that wouldn't happen. "Software" contains (at least) two components - data and algorithms. Making the data exchangeable allows plenty of scope for the algorithms to be wholly different. The latter gives an opportunity for a Unique Selling Point. Further, most of us would be quite satisfied with interchanging a sub-set of the data, i.e. the genealogically relevant part. Other stuff, like research plans, could stay in one place. (OK - I'm skating over a lot of stuff there about how the research stuff points to the research subjects, i.e. the genealogy, but it's do-able.)

    To take one example - I uploaded my tree as a GEDCOM into Ancestry once. The Ancestry hints have been surprisingly helpful (being flippant about it, the score is about Ancestry 20, I-can-do-it-all-myself 0). If I could transfer a GEDCOM without loss from my desktop software into FTM and vice versa, then I'd buy a copy of FTM solely to sync my tree with Ancestry. As it is, I have no intention of buying FTM and only update major people on my Ancestry tree.

    The flaw of course in all this is what you alluded to earlier - the demand from most people is nothing like this sophisticated.

    Replies
    1. I agree about the problem with sophistication. Transferring data is interesting, to say the least. I do use FTM to sync my data with my local computer.

  2. The biggest question here, James, IMHO, is that of "scope". For people with a strong software background, like myself, it goes without saying that a representation can be conceived that would make our data transportable between different products, on different machines, and in different locales. This has happened in many industry sectors previously.

    So, other than misguided fears that a vendor may be losing market share by opening up their data, what arguments are there against a comprehensive new standard?

    Well, the extra effort involved could be one argument. GEDCOM is relatively simple, although its model is also rather simplistic. Where would we draw the line, though, with enhanced representations? There are many useful features that our software products could implement, but without a more comprehensive standard the associated data could not be exchanged with other products. Should we expect the scope of our products to remain fixed in a GEDCOM-inspired world for another decade?

    Replies
    1. This leads to an interesting question raised by other commentators as well: to what extent does having a standard impede the development of new features in software programs?

    2. I feel sure you'll be writing separately about this soon James ;-)

      It's a good question that deserves some analysis. I personally feel that GEDCOM is extremely limiting in its scope, but its concepts and its data model(*) have influenced the design of most existing software products. Maybe this is because they have to keep some level of compatibility for effective sharing - I don't really know.

      When I joined FHISO, I really wanted to see the industry adopt a more comprehensive data model. However, after talking to a number of people - including end-users - I can now see good arguments for also fixing GEDCOM (as explained elsewhere), in addition to that bigger goal. FHISO is looking at several possibilities here.

      I've heard a number of arguments against a comprehensive data model, including: 'it's not possible to devise a one-size-fits-all' (I believe 95% is definitely possible), 'why should I change my product to use this new model?' (you don't - it's for data exchange rather than replacing any internal data model), 'it would become too complex for small vendors to implement' (it's merely a data representation). Obviously there would be some effect on the internals of a product for it to be capable of importing such data, but that's the price of progress.

      My own personal take is that any new data model will need periodic revisions of its standard - for instance, in order to adopt some new technology such as DNA profiles. This is normal, but we don't yet have a good starting point. We've had a GEDCOM fixation for far too long, and that has resulted in a very fragmented industry (because vendors have implemented things that don't fit smoothly into GEDCOM) and a functional void (i.e. we cannot share anything now that we couldn't share over 20 years ago). In effect, when a comprehensive new model arrives (either through collaboration via FHISO, or by being forced upon us by some commercial source), that first adoption stage will be the most painful. Periodic revisions will be easier by comparison.

      For people reading this who may not be sure why I use the term 'data model', I'd just like to clarify that a 'data model' describes the general structure/shape of the data without specifying a physical representation, or file format. As an example, it's quite easy (even 'mechanical') to convert GEDCOM files to any number of XML files (see the sketch following these comments). They're all physically different file formats but they'd still be using the same data model.

    3. Up until now, the genealogical data model used by nearly all of the programs has been based on a Western European model with a Northern European bias towards English-speaking people. Do we include Spanish-based naming systems, Chinese ones, etc.?

      What is the data standard we use?

    4. A data model needs to take a "step back" in order to accommodate a generic picture, James. Assuming the whole world has the same conventions as the English-speaking world is obviously wrong, but then trying to address every single case as a separately-supported mechanism is totally impractical.

      I did make a suggestion for such a generic scheme to FHISO's call-for-papers at: http://fhiso.org/files/cfp/cfps21.pdf. There may be other generic possibilities but people need to propose them if we expect to compare-and-contrast as part of the design for a new standard.

    5. I agree. But one limitation is the lack of general knowledge about the different social and cultural conventions.

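As an illustration of the "mechanical" GEDCOM-to-XML conversion mentioned in the comments above: the level numbers in a GEDCOM file already describe a tree, so emitting that same tree as nested XML elements changes the physical file format while preserving the data model. A rough sketch (the element and attribute names are illustrative only, not from any published standard):

```python
# Rough sketch of a 'mechanical' GEDCOM-to-XML conversion: same data
# model, different physical format. Element/attribute names are
# illustrative, not taken from any published standard.

import xml.etree.ElementTree as ET

SAMPLE = """\
0 @I1@ INDI
1 NAME John /Doe/
1 BIRT
2 DATE 1 JAN 1850
"""

def gedcom_to_xml(text):
    root = ET.Element("gedcom")
    stack = [root]                      # stack[level] = parent for a line at that level
    for raw in text.splitlines():
        if not raw.strip():
            continue
        level_str, _, rest = raw.partition(" ")
        level = int(level_str)
        xref = None
        if rest.startswith("@"):        # optional cross-reference ID
            xref, _, rest = rest.partition(" ")
        tag, _, value = rest.partition(" ")
        elem = ET.Element(tag.lower())
        if xref:
            elem.set("id", xref.strip("@"))
        if value:
            elem.text = value
        del stack[level + 1:]           # pop back to this line's parent
        stack[level].append(elem)
        stack.append(elem)
    return root

if __name__ == "__main__":
    print(ET.tostring(gedcom_to_xml(SAMPLE), encoding="unicode"))
```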