It’s been about 12 years since I started working on family history data formats and 5 since I began earnest work on the GEDCOM standard (gedcom.io). I’m just one voice of many in the GEDCOM steering committee – even when I was authoring almost the entire text of version 7.0.0, I was doing so with frequent input from many others. That said, I want to share some of the things I think about in discussing and working on the GEDCOM standard.
When I write HTML, I always consult tables of the quirks of different browsers. When I write SQL, I often need to know what back-end RDBMS I’m writing for. When I parse JPEGs, I have code that handles dozens of inconsistent ways metadata is created and consumed. When I write C, I often need to consider the specific compiler that will consume it. When I’m given a markdown document, I routinely have to edit it to match the nuances of my markdown processors.
There may be somewhere a standard that has 100% compliance, but I have not encountered it yet. 100% compliance is not the natural state of standards. That said, some standards have much higher compliance than others, and part of my goal on the GEDCOM steering committee is to steer it in directions that are more likely to achieve high compliance.
Over my years working in standards I have noticed many forces that seem to me to influence adoption rates. These observations form the subjects of the remaining sections of this post.
I want my app to have features that other apps do not. I want to store the data created by those features in the files I save, and want to be able to share that data with other apps that do have comparable features. Thus, I want GEDCOM to be very expressive, with support for everything I might care to do.
I want my app to import GEDCOM files created by apps with a very different focus than mine. I don’t want to need a lot of code to handle the GEDCOM they export that I don’t care about. Thus, I want GEDCOM to be very simple, with no need to think about parts my app doesn’t handle.
The easiest way to create simplicity is to allow apps to ignore parts they don’t care about or understand. But not all data can be ignored without invalidating other data. At least three kinds of errors can be created by ignoring data:
Omitting clarifying or contextualizing data can make the data it clarified or contextualized become incorrect. For example,
1 NO EMIG
2 DATE TO 1880
means “did not emigrate before 1880”; ignoring or removing the second line makes it mean “did not emigrate”, a broader assertion that may not be compatible with the original.
Ignoring and not re-exporting can be mistaken for removing. If I send you a file, you edit it and send it back, and the returned file is missing something, how can I tell if you removed it because you disagreed with it or your app removed it because it didn’t understand it?
Ignoring but still re-exporting can create nonsense. If my app imports a file but only shows me part of it and I add something to it, the things I add may be inconsistent with the parts I can’t see.
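The first of these errors can be made concrete with a small sketch. It assumes a heavily simplified line grammar (level, tag, optional payload, with no cross-reference identifiers or continuation lines), and the names `Node`, `parse`, `serialize`, and `drop_unknown` are hypothetical, not taken from any GEDCOM library. It contrasts an app that keeps substructures it does not understand with one that silently drops them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    tag: str
    payload: str = ""
    children: list = field(default_factory=list)


def parse(text):
    """Parse GEDCOM-style lines into a list of top-level nodes.

    Simplified assumption: no cross-reference identifiers, no continuations.
    """
    roots, stack = [], []
    for line in text.strip().splitlines():
        parts = line.strip().split(" ", 2)
        node = Node(int(parts[0]), parts[1], parts[2] if len(parts) > 2 else "")
        # Pop back to the nearest earlier line with a smaller level: the parent.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def serialize(nodes):
    """Turn a list of nodes back into GEDCOM-style lines."""
    lines = []
    for n in nodes:
        lines.append(f"{n.level} {n.tag} {n.payload}".rstrip())
        lines.extend(serialize(n.children))
    return lines


def drop_unknown(nodes, known_tags):
    """Model an app that discards every structure whose tag it does not know."""
    return [
        Node(n.level, n.tag, n.payload, drop_unknown(n.children, known_tags))
        for n in nodes
        if n.tag in known_tags
    ]


source = "1 NO EMIG\n2 DATE TO 1880"
tree = parse(source)
preserved = serialize(tree)                    # keeps the DATE qualifier
lossy = serialize(drop_unknown(tree, {"NO"}))  # app knows NO but not DATE
```

Here `lossy` contains only `1 NO EMIG`: every line that survives is valid, but the assertion has silently widened from “not before 1880” to “never”.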
In practice, this means that sometimes I argue against new expressiveness, even expressiveness that I personally want to see in GEDCOM, because it is the kind of data that is problematic to omit.
There are other forms of simplicity to be considered, too. It’s simpler to implement a standard if a small number of organizational principles are applied across the standard. It’s simpler to work with a small standard than a large one. Both of these can sometimes be reasons not to implement some new feature, or to implement it in a different way than is originally proposed.
Simplicity also has conflicting pulls on the text of the specification itself: short text in a small vocabulary is simpler to read, share with users, and translate into other languages; but more detailed and technical text is simpler to agree upon, share with developers, and use to inform automated tooling. Fortunately, the wording topic rarely results in disagreements or frustrations in the community: it’s a challenge to specification authors, but as long as it’s handled reasonably by them then people generally roll with it.
Family history is a space with a few large well-established companies and many small and newer companies. In general, these want different things from the standard.
Given a large, well-tested, broadly-deployed code base, change to the standard is problematic. Change creates work for developers, implies inserting new less-tested code into the app, and requires explaining the change to users. The more mature the software, the more likely that even a simple-seeming change will interact in complicated ways with some part of a large, interconnected software system.
Given a clean-slate new app development project, change to the standard is desired. Because it has evolved gradually over 40 years, GEDCOM does not look as though someone with a consistent vision wrote the whole thing. It could be redesigned to be more consistent, it could have features changed to be more expressive, it could be refactored to leverage currently-popular technologies. To a new app developer, all of these are desirable, and will remain fairly desirable for several years after the app is delivered because the code base is still fairly simple and easy to change.
As a standards committee, we need to consider both perspectives. Standards are only standard if the community agrees to use them, and our community includes both perspectives.
In practice, this means that sometimes I argue against changes that I personally think are really good ideas: a technically elegant, sound idea is not always a step towards greater standardization and interoperability. It also means that sometimes I argue for a sweeping change that I personally dread implementing: some barriers to improvement require major changes. Because the balance of voices shared in the community skews towards small companies and change champions, I more often find myself voicing resistance and caution, but a balance of both is important.
GEDCOM 7.0 introduced GEDZIP, negative assertions, and documented extensions even though few if any current applications were doing these things. But many other ideas have been deferred from inclusion until such time as applications are known to implement them. Why?
A standard can lead change. We can add a new feature and use its presence in the spec as an incentive for spec-following applications to implement that feature. Once they implement it and start exporting files, parsing those files will become an incentive for less spec-minded applications to implement it too. This is an exciting power for improving the quality of family history software! However, it cannot be used too often: if each new version of the standard is seen as “the standard committee telling us what we should do to our software” then the standard will lose credibility and stop being effective in being a standard, let alone a change leader.
However, a standard must also follow change. Once several applications support some feature, it should be added to the spec; if it is not, a variety of bespoke methods of communicating that feature between supporting applications will emerge, often with conflicting definitions that will make later standardization challenging.
Should GEDCOM have a general way of recording a marriage-like union-related event, to be used for engagements and marriages and marriage contracts and honeymoons and so on; or should it have a separate structure for each of these?
This question does not have a simple answer. General categories tend to have hard-to-define boundaries (is a prenuptial contract, which specifies before marriage how property will be divided upon divorce, a marriage-type event or a divorce-type event?), and what seems to deserve its own event type depends on the focus of the app. Specific categories can yield dauntingly large lists (well over 200 event types have been proposed on GEDCOM’s issue tracker) and also have hard-to-define boundaries (is ondertrouw the Dutch word for marriage announcement, or does the Dutch church’s tradition of when and how this announcement is given mean it’s a separate structure?).
Although GEDCOM has not yet adopted a way of handling these conflicting goals, there are several partial solutions to it in other fields:
Single inheritance. A full set of specific structures is present, but each is flagged as a more specific subtype of one other structure. An ondertrouw is a type of marriage announcement, which is a type of pre-marriage event, which is a type of marriage event, which is a type of partnership event, which is a type of event. This is used by ontologies like schema.org, directory systems like the file organization on disks, and many object-oriented programming languages.
Multiple inheritance. This is like single inheritance, except a specific structure may be a subtype of two otherwise-unrelated other structures. A prenuptial agreement is both a pre-marriage event → marriage event → partnership event → event; and a family contract → contract → legal document → official record → artifact. This is used by some object-oriented programming languages, human self-identification in groups, and file systems with shortcuts and hard links.
Two-layer model. This is a limited form of single inheritance where there are specific types and general types and only those two. An engagement is a partnership event. A well-known example of the two-layer model is media types. Some, like the image/ media type group, require specific types: you can have image/jpeg and image/png but no file can have just a generic image type. Others, like the text/ media type group, have a generic text/plain that can be used in place of any specific subtype with only partial loss of functionality.
Type-and-tag model. This has a small set of types determined by functionality, but each value can have any number of additional tags clarifying its meaning and purpose. An engagement is a partnership event with tags such as “declaration of intent”, “announcement”, “advance scheduling”, and/or others a user finds applicable. Organizing collections of creative content is a common example: a blog post is a separate type from a video, but both can have as many tags as we see fit.
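To make the difference between these models a little more tangible, here is a minimal sketch of two of them. The type names come from the examples above; the `PARENTS` table, the `supertypes` and `is_a` helpers, and the `TaggedEvent` class are hypothetical illustrations, not anything GEDCOM defines. When every type lists exactly one parent the table is single inheritance; giving the prenuptial agreement two parents makes it multiple inheritance.

```python
from dataclasses import dataclass, field

# Each type lists its direct supertypes (illustrative data only).
PARENTS = {
    "ondertrouw": ["marriage announcement"],
    "marriage announcement": ["pre-marriage event"],
    "prenuptial agreement": ["pre-marriage event", "family contract"],
    "pre-marriage event": ["marriage event"],
    "marriage event": ["partnership event"],
    "partnership event": ["event"],
    "family contract": ["contract"],
    "contract": ["legal document"],
    "legal document": ["official record"],
    "official record": ["artifact"],
}


def supertypes(kind):
    """Return every type reachable by following parent links from kind."""
    seen = set()
    todo = list(PARENTS.get(kind, []))
    while todo:
        parent = todo.pop()
        if parent not in seen:
            seen.add(parent)
            todo.extend(PARENTS.get(parent, []))
    return seen


def is_a(kind, general):
    """True if kind is general itself or one of its subtypes."""
    return kind == general or general in supertypes(kind)


# Type-and-tag: a small fixed set of types plus free-form clarifying tags.
@dataclass
class TaggedEvent:
    kind: str
    tags: set = field(default_factory=set)


engagement = TaggedEvent(
    "partnership event", {"declaration of intent", "announcement"}
)
```

An app that only understands partnership events could use `is_a("ondertrouw", "partnership event")` to fall back to generic handling under the inheritance models; under type-and-tag it would read `engagement.kind` and could ignore the tags entirely.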
In addition to having multiple options (difficult to choose between in part because of the other items in this post), we also have other challenges that these do not address. We still need to decide if related ideas are related enough to be the same or not (or pick their supertypes or set of tags). We also have push-back against merging ideas where some should admit particular qualifiers and others should not: engagements don’t have officiators so is it OK to represent them with the same type of structure as weddings which do? Many people say they don’t care, but some apps do care, and care in opposite directions.
There should be one way to record one idea in the standard, not several. If there are several then it is likely that two apps or two users will pick different ways to support and then have trouble communicating with one another.
Having one way to do things is a surprisingly hard goal to achieve. There are apps that wish to store information that others wish to ignore, resulting in things like ALIA, which a few apps use to represent research but most treat as a confusing way of splitting a person into many parts. Any fuzzy boundary between concepts results in things on those boundaries that have ambiguous encoding. Some concepts have long-established large overlaps in meaning, such as address vs. place, which mean that any clarification on how to encode specific cases would contradict many extant examples. Other concepts, like DNA-related genealogy, have not yet come to consensus on what the “right” way to handle them is.
Quite a bit of the time I spend with GEDCOM goes to looking at proposals for additions or changes and asking myself: Is there already a way to do this? Is there a way to do something similar enough to work? Is it awkward enough to justify adding a new way? How can the specification and associated resources better steer people towards that existing way?
One of the most widely-cited technical standards is RFC 2119 which defines 10 words and phrases to use in distinguishing between things that those following a standard must do and things they should do. Every GEDCOM file must have a header indicating the GEDCOM version it conforms to. Each GEDCOM file should document individuals, families, and sources. It is expected that apps will be unable to read a file without a proper header. It doesn’t make much sense to have a file with no contents, but such a file can still be read and understood.
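The header and empty-file examples can be sketched as a checker that reports must-type violations as errors and should-type violations as warnings. This is a rough illustration under simplifying assumptions, not a real validator: it only looks for a `HEAD` record containing `GEDC` and `VERS` lines and for at least one `INDI`, `FAM`, or `SOUR` record, and the function names `header_version` and `check` are hypothetical.

```python
RECORD_TAGS = {"INDI", "FAM", "SOUR"}


def header_version(lines):
    """Return the version under HEAD > GEDC > VERS, or None if absent."""
    if not lines or lines[0] != "0 HEAD":
        return None
    in_gedc = False
    for line in lines[1:]:
        level, _, rest = line.partition(" ")
        if level == "0":
            break  # the header record has ended
        if level == "1":
            in_gedc = rest == "GEDC"
        elif level == "2" and in_gedc and rest.startswith("VERS "):
            return rest[len("VERS "):]
    return None


def check(text):
    """Return (errors, warnings): must-type and should-type problems."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    errors, warnings = [], []
    if header_version(lines) is None:
        errors.append("missing header with GEDCOM version")
    has_records = any(
        len(p) >= 3 and p[0] == "0" and p[2] in RECORD_TAGS
        for p in (ln.split(" ") for ln in lines)
    )
    if not has_records:
        warnings.append("no individuals, families, or sources")
    return errors, warnings


empty_file = "0 HEAD\n1 GEDC\n2 VERS 7.0\n0 TRLR"
no_header = "0 @I1@ INDI\n1 NAME John /Doe/\n0 TRLR"
```

Here `check(empty_file)` yields a warning but no error, because the file is pointless but readable, while `check(no_header)` yields an error, because an app cannot be expected to read it.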
The current GEDCOM specification does not reference RFC 2119, in part because so many of the must/should conversations are complicated by being human-facing. For example, we define RELI as storing a religious denomination, which is intended to be a must-like definition: if you use it to instead store a pseudonym or a favorite poem you are not following the GEDCOM specification. But this field is user-entered, and how is an app to know whether “shave and a haircut, two bits” is a valid religious denomination or not? Since an app can’t tell if its users are following the rules, a fully-correct app could both produce and parse data that breaks the rules, so these rules aren’t really must-type rules. But they’re also not really should-type rules because breaking them isn’t just inadvisable, it breaks the ability of GEDCOM to carry meaning and serve its purpose.
There are apps that say they support GEDCOM but don’t implement it correctly. There are also GEDCOM files that don’t follow some parts of GEDCOM that their generating app does follow because the users didn’t follow the rules; that might be partly the app’s fault for not signaling what the rules are, but it might also be users who know them and still break them or who don’t pay attention to the app’s instructions. Either way, the existence of these not-right GEDCOM files causes angst and calls for changes to the standard to close loopholes, require more, and otherwise make it harder for bad files to be created. Their existence also causes pessimism that changes can fix things: if so many of the problem cases now are already in violation of the standard, can any change to the standard fix them?
Here’s an idea: we should be able to record hypotheses in GEDCOM, not just conclusions. It’s a great idea. It’s an idea we should definitely add to a future version of GEDCOM. But it’s not an actionable proposal.
A proposal is much more concrete and actionable than an idea. A proposal includes a draft of how it might be implemented. It includes analysis of some of the challenges noted in other sections of this post. It may still have parts that aren’t fully worked out, but it’s moving toward a specific solution.
It can take many years for an idea to turn into a workable proposal. For example, I saw almost a dozen separate incomplete or impractical proposals for more structured citations (including a few I worked on myself) before I saw one that seemed to both integrate into GEDCOM well and meet other criteria, and that one’s still not quite finished. This process of finding the right proposal for an idea can be quite frustrating, with failed ideas and challenges faced by those working on it coupled with confused impatience at the delay by those not working on it.