How I think about changes to GEDCOM
© 2024-07-18 Luther Tychonievich
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
Reflections from 5 years as an editor of the GEDCOM standard.

It’s been about 12 years since I started working on family history data formats and 5 since I began earnest work on the GEDCOM standard. (I’m just one voice of many in the GEDCOM steering committee; even when I was authoring almost the entire text of version 7.0.0, I was doing so with frequent input from many others.) That said, I want to share some of the things I think about in discussing and working on the GEDCOM standard.

No standard has 100% compliance

When I write HTML, I always consult tables of the quirks of different browsers. When I write SQL, I often need to know what back-end RDBMS I’m writing for. When I parse JPEGs, I have code that handles dozens of inconsistent ways metadata is created and consumed. When I write C, I often need to consider the specific compiler that will consume it. When I’m given a markdown document, I routinely have to edit it to match the nuances of my markdown processors.

There may be a standard somewhere that has 100% compliance, but I have not encountered it yet. 100% compliance is not the natural state of standards. That said, some standards have much higher compliance than others, and part of my goal on the GEDCOM steering committee is to steer it in directions that are more likely to achieve high compliance.

Over my years working in standards I have noticed many forces that seem to me to influence adoption rates. These observations form the subjects of the remaining sections of this post.

Expressive vs. Simple

I want my app to have features that other apps do not. I want to store the data created by those features in the files I save, and want to be able to share that data with other apps that do have comparable features. Thus, I want GEDCOM to be very expressive, with support for everything I might care to do.

I want my app to import GEDCOM files created by apps with a very different focus than mine. I don’t want to need a lot of code to handle the GEDCOM they export that I don’t care about. Thus, I want GEDCOM to be very simple, with no need to think about parts my app doesn’t handle.

The easiest way to create simplicity is to allow apps to ignore parts they don’t care about or understand. But not all data can be ignored without invalidating other data. At least three kinds of errors can be created by ignoring data:

In practice, this means that sometimes I argue against new expressiveness, even expressiveness that I personally want to see in GEDCOM, because it is the kind of data that is problematic to omit.
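As a rough illustration of the "ignore what you don't understand" approach, here is a minimal Python sketch that drops every structure whose tag an app does not recognize, along with all of that structure's substructures. The function names, the `KNOWN` tag set, and the `_DNA` extension tag are hypothetical and chosen only for this example; the line parsing is deliberately simplified and is not a full GEDCOM parser.

```python
def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value). Simplified."""
    parts = line.split(" ", 2)
    level = int(parts[0])
    if parts[1].startswith("@"):
        # Record line with a cross-reference identifier, e.g. "0 @I1@ INDI"
        xref = parts[1]
        rest = parts[2].split(" ", 1)
        tag = rest[0]
        value = rest[1] if len(rest) > 1 else ""
    else:
        xref = None
        tag = parts[1]
        value = parts[2] if len(parts) > 2 else ""
    return level, xref, tag, value


def drop_unknown(lines, known_tags):
    """Keep only structures with known tags; an unknown structure is
    dropped together with everything nested beneath it."""
    kept = []
    skip_level = None  # level of the structure currently being skipped
    for line in lines:
        level, _xref, tag, _value = parse_line(line)
        if skip_level is not None:
            if level > skip_level:
                continue  # substructure of a dropped structure
            skip_level = None  # back at or above the dropped structure's level
        if tag not in known_tags:
            skip_level = level
            continue
        kept.append(line)
    return kept


# Hypothetical input: "_DNA" is an invented extension tag for this example.
SAMPLE = [
    "0 HEAD",
    "1 GEDC",
    "2 VERS 7.0",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "1 _DNA haplogroup R1b",
    "2 NOTE from a hypothetical test",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "0 TRLR",
]
KNOWN = {"HEAD", "GEDC", "VERS", "INDI", "NAME", "NOTE", "BIRT", "DATE", "TRLR"}

kept = drop_unknown(SAMPLE, KNOWN)
for line in kept:
    print(line)
```

Note that the `NOTE` line is dropped even though `NOTE` is a known tag, because its meaning depends on the parent structure that was ignored. The sketch shows only the easy case: it has no way of telling whether what it dropped changes the meaning of what it kept, which is exactly the kind of omission that worries me.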

There are other forms of simplicity to be considered, too. It’s simpler to implement a standard if a small number of organizational principles are applied across the standard. It’s simpler to work with a small standard than a large one. Both of these can sometimes be reasons not to implement some new feature, or to implement it in a different way than is originally proposed.

Simplicity also has conflicting pulls on the text of the specification itself: short text in a small vocabulary is simpler to read, share with users, and translate into other languages; but more detailed and technical text is simpler to agree upon, share with developers, and use to inform automated tooling. Fortunately, the wording topic rarely results in disagreements or frustrations in the community: it’s a challenge to specification authors, but as long as it’s handled reasonably by them then people generally roll with it.

Backwards-compatible vs. Clean and welcoming

Family history is a space with a few large well-established companies and many small and newer companies. In general, these want different things from the standard.

Given a large, well-tested, broadly-deployed code base, change to the standard is problematic. Change creates work for developers, implies inserting new less-tested code into the app, and requires explaining the change to users. The more mature the software, the more likely that even a simple-seeming change will interact in complicated ways with some part of a large, interconnected software system.

Given a clean-slate new app development project, change to the standard is desired. Because it has evolved gradually over 40 years, GEDCOM does not look as though someone with a consistent vision wrote the whole thing. It could be redesigned to be more consistent, it could have features changed to be more expressive, it could be refactored to leverage currently-popular technologies. To a new app developer, all of these are desirable, and will remain fairly desirable for several years after the app is delivered because the code base is still fairly simple and easy to change.

As a standards committee, we need to consider both perspectives. Standards are only standard if the community agrees to use them, and our community includes both perspectives.

In practice, this means that sometimes I argue against changes that I personally think are really good ideas: a technically elegant, sound idea is not always a step towards greater standardization and interoperability. It also means that sometimes I argue for a sweeping change that I personally dread implementing: some barriers to improvement require major changes. Because the balance of voices shared in the community skews towards small companies and change champions, I more often find myself voicing resistance and caution, but a balance of both is important.

Leading vs. Following Change

GEDCOM 7.0 introduced GEDZIP, negative assertions, and documented extensions even though few if any current applications were doing these things. But many other ideas have been deferred from inclusion until such time as applications are known to implement them. Why?

A standard can lead change. We can add a new feature and use its presence in the spec as an incentive for spec-following applications to implement that feature. Once they implement it and start exporting files, parsing those files will become an incentive for less spec-minded applications to implement it too. This is an exciting power for improving the quality of family history software! However, it cannot be used too often: if each new version of the standard is seen as the standards committee telling us what we should do to our software, then the standard will lose credibility and stop being effective as a standard, let alone as a change leader.

However, a standard must also follow change. Once several applications support some feature, it should be added to the spec; if it is not, a variety of bespoke methods of communicating that feature between supporting applications will emerge, often with conflicting definitions that will make later standardization challenging.

Specific vs. General

Should GEDCOM have a general way of recording a marriage-like union-related event, to be used for engagements and marriages and marriage contracts and honeymoons and so on; or should it have a separate structure for each of these?

This question does not have a simple answer. General categories tend to have hard-to-define boundaries (is a prenuptial contract, which specifies before marriage how property will be divided upon divorce, a marriage-type event or a divorce-type event?), and what seems to deserve its own event type depends on the focus of the app. Specific categories can yield dauntingly large lists (well over 200 event types have been proposed on GEDCOM’s issue tracker) and also have hard-to-define boundaries (is ondertrouw the Dutch word for marriage announcement, or does the Dutch church’s tradition of when and how this announcement is given mean it’s a separate structure?).

Although GEDCOM has not yet adopted a way of handling these conflicting goals, there are several partial solutions to it in other fields:

In addition to having multiple options (difficult to choose between in part because of the other items in this post), we also have other challenges that these do not address. We still need to decide if related ideas are related enough to be the same or not (or pick their supertypes or set of tags). We also have push-back against merging ideas where some should admit particular qualifiers and others should not: engagements don’t have officiators, so is it OK to represent them with the same type of structure as weddings, which do? Many people say they don’t care, but some apps do care, and care in opposite directions.

Striving for One Way

There should be one way to record one idea in the standard, not several. If there are several then it is likely that two apps or two users will pick different ways to support and then have trouble communicating with one another.

Having one way to do things is a surprisingly hard goal to achieve. There are apps that wish to store information that others wish to ignore, resulting in things like the ALIA structure, which for a few apps represents research but for most is treated as a confusing way of splitting a person into many parts. Any fuzzy boundary between concepts results in things on those boundaries that have ambiguous encoding. Some concepts have long-established large overlaps in meaning, such as address vs. place, which mean that any clarification on how to encode specific cases would contradict many extant examples. Other concepts, like DNA-related genealogy, have not yet come to consensus on what the right way to handle them is.

Much of the time I spend with GEDCOM goes to looking at proposals for additions or changes and asking myself: Is there already a way to do this? Is there a way to do something similar enough to work? Is it awkward enough to justify adding a new way? How can the specification and associated resources better steer people towards that existing way?

Should vs. Must

One of the most widely-cited technical standards is RFC 2119 which defines 10 words and phrases to use in distinguishing between things that those following a standard must do and things they should do. Every GEDCOM file must have a header indicating the GEDCOM version it conforms to. Each GEDCOM file should document individuals, families, and sources. It is expected that apps will be unable to read a file without a proper header. It doesn’t make much sense to have a file with no contents, but such a file can still be read and understood.
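To make the distinction concrete, here is a minimal Python sketch that treats the must-type rule above as an error and the should-type rule as a warning. The `check_file` function and its message strings are hypothetical, the mapping of must to errors and should to warnings is only one possible reading, and a real validator would need to check far more than these two rules. The sketch assumes the version appears as a `VERS` structure under `GEDC` under `HEAD`, as in GEDCOM 7.0 headers.

```python
def check_file(lines):
    """Return (errors, warnings) for a list of GEDCOM lines.

    errors   -> violations of a must-type rule (file cannot be trusted)
    warnings -> violations of a should-type rule (file is still readable)
    """
    errors, warnings = [], []
    path = []         # tags of the enclosing structures, outermost first
    record_tags = []  # tags of all level-0 records, in file order
    has_version = False

    for line in lines:
        parts = line.split(" ")
        level = int(parts[0])
        # Lines such as "0 @I1@ INDI" carry a cross-reference before the tag.
        tag = parts[2] if parts[1].startswith("@") else parts[1]
        del path[level:]  # leave any structures deeper than this line
        path.append(tag)
        if level == 0:
            record_tags.append(tag)
        if path == ["HEAD", "GEDC", "VERS"] and len(parts) > 2:
            has_version = True

    # Must-type rules: a header that states the GEDCOM version.
    if not record_tags or record_tags[0] != "HEAD":
        errors.append("file must begin with a HEAD record")
    if not has_version:
        errors.append("header must state the GEDCOM version")

    # Should-type rule: some actual genealogical content.
    if not {"INDI", "FAM", "SOUR"} & set(record_tags):
        warnings.append("file should document individuals, families, or sources")

    return errors, warnings


empty_but_valid = ["0 HEAD", "1 GEDC", "2 VERS 7.0", "0 TRLR"]
no_header = ["0 @I1@ INDI", "1 NAME John /Doe/", "0 TRLR"]

print(check_file(empty_but_valid))
print(check_file(no_header))
```

The first file has no contents, so it earns a warning but no errors; the second has contents but no header, so it earns two errors and no warnings.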

The current GEDCOM specification does not reference RFC 2119, in part because so many of the must/should conversations are complicated by being human-facing. For example, we define RELI as storing a religious denomination, which is intended to be a must-like definition: if you use it to instead store a pseudonym or a favorite poem you are not following the GEDCOM specification. But this field is user-entered, and how is an app to know whether “shave and a haircut, two bits” is a valid religious denomination or not? Since an app can’t tell if its users are following the rules, a fully correct app could both produce and parse data that breaks the rules, so these rules aren’t really must-type rules. But they’re also not really should-type rules because breaking them isn’t just inadvisable, it breaks the ability of GEDCOM to carry meaning and serve its purpose.

There are apps that say they support GEDCOM but don’t implement it correctly. There are also GEDCOM files that don’t follow some parts of GEDCOM that their generating app does follow because the users didn’t follow the rules; that might be partly the app’s fault for not signaling what the rules are, but it might also be users who know them and still break them or who don’t pay attention to the app’s instructions. Either way, the existence of these not-right GEDCOM files causes angst and calls for changes to the standard to close loopholes, require more, and otherwise make it harder for bad files to be created. Their existence also causes pessimism that changes can fix things: if so many of the problem cases now are already in violation of the standard, can any change to the standard fix them?

Ideas vs. Proposals

Here’s an idea: we should be able to record hypotheses in GEDCOM, not just conclusions. It’s a great idea. It’s an idea we should definitely add to a future version of GEDCOM. But it’s not an actionable proposal.

A proposal is much more concrete and actionable than an idea. A proposal includes a draft of how it might be implemented. It includes analysis of some of the challenges noted in other sections of this post. It may still have parts that aren’t fully worked out, but it’s moving toward a specific solution.

It can take many years for an idea to turn into a workable proposal. For example, I saw almost a dozen separate incomplete or impractical proposals for more structured citations (including a few I worked on myself) before I saw one that seemed both to integrate into GEDCOM well and to meet other criteria, and that one’s still not quite finished. This process of finding the right proposal for an idea can be quite frustrating, with failed ideas and challenges faced by those working on it coupled with confused impatience at the delay by those not working on it.