by Michael Kay, 3 June 1999
- The appearance of this specification is very welcome. I believe it represents an important step
forward for interchange of genealogical information. I particularly welcome the fact that the
specification is simply a new encoding of the existing GEDCOM data model: if the model and
the encoding were changed at the same time, the new standard would have very little chance
of practical success.
- In the Introduction, the "purpose" section is about the intended readership.
It would be useful to have a section outlining the design goals of the standard, since this
would help readers and commentators in assessing the design choices that have been made.
- Chapter 1 page 7: specifying linkage. This refers to the XML mechanisms for specifying
cross-references without saying which mechanism is meant. There are two relevant mechanisms, the ID/IDREF
mechanism and the XLink/XPointer mechanism (the latter not yet ratified).
ID/IDREF is actually almost identical to the GEDCOM linkage model so it
is hard to see why it has not been used. Benefits in using ID and IDREF
attributes for linkage would include:
- Validating XML parsers will perform many of the validation checks
- Higher-level XML software (for example XSL stylesheets) recognises IDs and can exploit their uniqueness
The document mentions use of compound keys, which indeed ID/IDREF does not directly support, but I have not seen any use for compound keys in GEDCOM.
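To make the ID/IDREF point concrete, here is a minimal Python sketch of the two checks a validating parser would perform for attributes declared as ID and IDREF: every ID is unique, and every reference resolves. The parser used here does not validate, so the checks are done by hand. The element and attribute names (INDI, FAM, HUSB, CHIL, ID, REF) and the sample data are assumptions chosen for illustration, not taken from the draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical GEDCOM-XML fragment: ID marks a record, REF points at one.
DOC = """
<GEDCOM>
  <INDI ID="I1"><NAME>Michael Kay</NAME></INDI>
  <INDI ID="I2"><NAME>Elisabeth</NAME></INDI>
  <FAM ID="F1"><HUSB REF="I1"/><CHIL REF="I2"/><CHIL REF="I9"/></FAM>
</GEDCOM>
"""

def check_links(xml_text):
    """Return (duplicate_ids, dangling_refs) for the document."""
    root = ET.fromstring(xml_text)
    seen, duplicates = set(), []
    # First pass: collect IDs and note any that occur more than once.
    for elem in root.iter():
        ident = elem.get("ID")
        if ident is None:
            continue
        if ident in seen:
            duplicates.append(ident)
        seen.add(ident)
    # Second pass: every REF must name an ID collected above.
    dangling = [elem.get("REF") for elem in root.iter()
                if elem.get("REF") is not None and elem.get("REF") not in seen]
    return duplicates, dangling

duplicates, dangling = check_links(DOC)
```

With a DTD that declares these attributes as ID and IDREF, a validating parser would report the unresolved reference to I9 itself, and the application would not need this code at all.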
- Chapter 1 page 7: attributes vs. elements. In general the choice of using elements
rather than attributes is sound. The only place where I would definitely use attributes
in preference is in handling ID and IDREF linkages, because of the special semantics
attached to these in XML, and in one or two other cases where XML recognises attributes
with a special meaning, for example xml:lang (mentioned below). Some people recommend
using elements for information that is meaningful to the human reader, and attributes
for information that is meaningful only to software.
- Choice of tags. The specification appears to have left the existing GEDCOM tags unchanged.
I believe it would be better in some cases to change them:
- There is some overloading of tags: the same tag used in different ways depending on context.
For example the SUBM tag has a different structure and interpretation when used within HEAD
than when it is used at top level (within <GEDCOM>). This is best avoided in XML,
because the DTD cannot express the structure rules, and because it complicates software analysing the document.
- Some tags abbreviate a five-character word to four characters, which creates obscurity for no useful purpose.
(I have also seen cases where tags were misused because users had not read the spec properly.)
Since GEDCOM tags often end up in user interfaces to control import and export from software
packages, I would prefer to see tags expanded to be more meaningful.
- Web-based search engines are starting to appear which will recognise meaning associated
with XML tags. Again, this encourages the use of meaningful names.
- Chapter 1 page 8. Continuation lines. It is technically true that continuation lines are not needed in XML.
However, there is still a user need to represent line endings. There are two ways of achieving this in XML.
Either the white space in the element content (including line endings) is regarded as significant,
or it is regarded as insignificant, in which case special tags (such as the HTML <P> and <BR>)
are needed to format the text. My approach would be to make white space insignificant in GEDCOM text fields,
and adopt a selection of HTML tags to provide simple formatting. John Cowan has defined a suitable subset
of tags (called The Itsy Bitsy Teeny Weeny Simple Hypertext DTD).
It may be convenient to support a tag such as <PRE> for text that is preformatted into lines,
for ease of transition from current GEDCOM.
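A minimal Python sketch of the "white space insignificant, explicit line-break tag" option may help. It assumes a <BR/> element inside a NOTE, in the HTML manner; the draft defines no such element, and the sample text is invented. Runs of white space in the source, including line endings, collapse to a single space, and only <BR/> starts a new line.

```python
import xml.etree.ElementTree as ET

# Hypothetical text field: source line endings are layout only,
# <BR/> is the only thing that means "new line".
NOTE = """<NOTE>28 Westover
    Road<BR/>Fleet<BR/>
    Hampshire</NOTE>"""

def render_lines(xml_text):
    """Return the displayed lines of a text field."""
    root = ET.fromstring(xml_text)
    lines = []
    words = (root.text or "").split()
    for child in root:
        if child.tag == "BR":
            # Explicit break: finish the current line and start a new one.
            lines.append(" ".join(words))
            words = []
        else:
            # Any other inline element is flattened to its text here.
            words.extend("".join(child.itertext()).split())
        words.extend((child.tail or "").split())
    lines.append(" ".join(words))
    return lines

lines = render_lines(NOTE)
```

The three lines recovered are the street, town and county, however the XML source happens to be wrapped.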
- While on the subject of text fields, it would also be useful to allow semantic markup of text.
Example: a NOTE containing an abstract of a will, in which people’s names are tagged thus:
"And I bequeathe to my daughter <INDI_REF REF="I000134">Elisabeth</INDI_REF>
the sum of one hundred pounds". Such tags can be used for indexing and searching,
or to generate hyperlinks on display. (Note the use of an attribute here for
information that is not part of the textual content and is not intended for display).
Similarly place names and dates appearing in text can usefully be tagged.
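As a sketch of how such markup could be exploited, the Python fragment below takes the will abstract above and derives both an index entry and the plain display text from the same NOTE. Wrapping the sentence in a NOTE element is an assumption; the INDI_REF element and its REF attribute are as in the example.

```python
import xml.etree.ElementTree as ET

NOTE = ('<NOTE>And I bequeathe to my daughter '
        '<INDI_REF REF="I000134">Elisabeth</INDI_REF> '
        'the sum of one hundred pounds</NOTE>')

note = ET.fromstring(NOTE)

# Indexing: which individuals are mentioned, and under what name.
mentions = [(ref.get("REF"), ref.text) for ref in note.iter("INDI_REF")]

# Display: the text content with the markup stripped out.
plain_text = "".join(note.itertext())
```

The REF attribute never reaches the displayed text, which is the reason for carrying it as an attribute rather than as element content.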
- Chapter 1 page 8, Character sets. Yes, the character set specified in XML is Unicode.
It may be worth mentioning, however, that XML permits different encodings and subsets
of this character set: the encoding must be declared in the document prolog if it
is neither UTF-8 nor UTF-16. For example, it would even be legitimate to use an ANSEL encoding,
though an XML parser would not be obliged to recognise it! All XML parsers must
accept the UTF-8 and UTF-16 encodings, but many also recognise ISO 8859 or Microsoft’s extension
of ISO 8859 known as "Windows ANSI". The character encoding is not
visible to the application, which always sees Unicode characters.
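The following Python sketch illustrates the last point. The same hypothetical NAME element is stored once as UTF-8 with no declaration and once as ISO 8859-1 with an encoding declaration; the bytes differ, but the application receives the same Unicode string in both cases. The sample name is invented for the purpose.

```python
import xml.etree.ElementTree as ET

name = "Müller"  # sample value containing a non-ASCII character

# UTF-8 needs no encoding declaration.
utf8_doc = ("<NAME>" + name + "</NAME>").encode("utf-8")

# Any other encoding must be declared in the prolog.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?><NAME>'
              + name + "</NAME>").encode("iso-8859-1")

from_utf8 = ET.fromstring(utf8_doc).text
from_latin1 = ET.fromstring(latin1_doc).text
```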
- Chapter 2. Notation. I sympathise with the decision not to produce the specification
in the form of a DTD, though in fact a well-annotated DTD can be quite readable and has
the advantage of stating many of the rules unambiguously. I would certainly advocate
producing a DTD as an appendix. Meanwhile, I find the pseudo-entity notation used
slightly bizarre. A trivial point is that XML entity references are written with a terminating
semicolon, not exclamation mark. Use of XML parameter entities (starting with percent
sign rather than ampersand) would seem more natural, although I can see that in this
"specification by example" notation they don’t quite fit the bill either.
- One difference between the hierarchic structure of traditional GEDCOM and that of XML
is that in traditional GEDCOM, the text content of an element must precede any child
elements. You seem to be carrying this forward as a convention for GEDCOM-XML, but
you have not stated explicitly whether it is a requirement.
- The address example on page 9 reveals how careful you need to be with line endings:
a general rule is needed about the significance of white space.
- p13: The LANG tag should probably be replaced by the standard xml:lang attribute.
The advantage of using xml:lang is that search engines will recognise it.
- In the Multimedia record <FILE> element, and in other places where filenames are used,
I would mandate use of a valid absolute URL (or its generalisation, a URI). In the example given,
the URL should be prefixed "http://" to make it valid. (Omitting the protocol part of the
URL is a convenient shortcut supported by many browsers, but the short form is not a correct URL).
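A check of this kind is easy to apply on import. The Python sketch below is one rough way of doing it, using the standard library; the function name and the example addresses are hypothetical, and the test is deliberately strict in that it also rejects schemes with no host part.

```python
from urllib.parse import urlsplit

def is_absolute_url(value):
    """True if the value has both a scheme (e.g. http) and a host part."""
    parts = urlsplit(value)
    return bool(parts.scheme) and bool(parts.netloc)

# The browser-style short form has no scheme, so it is rejected.
short_form_ok = is_absolute_url("www.example.org/photos/grandad.jpg")
full_form_ok = is_absolute_url("http://www.example.org/photos/grandad.jpg")
```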
- Multimedia record. It is possible to include binary data embedded within the XML stream in
Base64 encoding, but most people discourage it. Alternative mechanisms are a link to a URL,
or an unparsed external entity. I prefer the URL mechanism myself, but to allow data
interchange and archiving there may be a need to define some concept of a
"GEDCOM package": e.g. a ZIP file containing a GEDCOM file and
all the binary files (attachments?) referenced in it.
- p20, structure for events and attributes. In my GedML proposal, I advocated using a single
tag <EVEN> (or perhaps <EVENT> and <ATTRIBUTE>) for all events and attributes,
with a <TYPE> or similar sub-tag or attribute to distinguish them. The primary reason for
this is extensibility: it creates a mechanism where new event types and attributes can be
introduced without a change to the DTD and therefore with less impact on existing software.
I think it would be better if GEDCOM had a completely general mechanism for recording events
and attributes, together with a register of preferred names for common events and attributes,
with this list having no structural significance. There could also be an open process
for registering additional event and attribute names in the list.
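The extensibility argument can be sketched in a few lines of Python. The fragment assumes the attribute form of the proposal, <EVEN TYPE="...">, and a small register of preferred names; the register contents, the FIRST_COMMUNION type and the dates are all invented for illustration. The point is that software written against the general mechanism handles an unregistered event type without any change.

```python
import xml.etree.ElementTree as ET

# Hypothetical register of preferred names; it has no structural role.
REGISTER = {"BIRT": "Birth", "DEAT": "Death"}

RECORD = """<INDI>
  <EVEN TYPE="BIRT"><DATE>1 JAN 1900</DATE></EVEN>
  <EVEN TYPE="FIRST_COMMUNION"><DATE>5 MAY 1908</DATE></EVEN>
  <EVEN TYPE="DEAT"><DATE>2 FEB 1980</DATE></EVEN>
</INDI>"""

def list_events(indi_xml):
    """Return (label, date) pairs for every event, registered or not."""
    indi = ET.fromstring(indi_xml)
    events = []
    for even in indi.findall("EVEN"):
        event_type = even.get("TYPE")
        # Unregistered types fall back to the raw type name.
        label = REGISTER.get(event_type, event_type)
        events.append((label, even.findtext("DATE")))
    return events

events = list_events(RECORD)
```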
- PLACE_STRUCTURE has been converted mechanically from its traditional GEDCOM encoding to XML.
I think it is possible to do better. Firstly, if the substructure of a place name is known
it should be represented by XML tags rather than commas. Secondly, the FORM element could
be replaced by attributes of the place name parts. So we could write
<PLAC><PPART NAME="STREET">28 Westover Road</PPART><PPART NAME="TOWN">Fleet</PPART><PPART NAME="COUNTY">Hampshire</PPART></PLAC>.
This gives a much more natural XML representation and is readily convertible to and from traditional GEDCOM.
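To show that the conversion really is mechanical in both directions, here is a minimal Python sketch using the PPART example above. The function names are hypothetical, and the sketch assumes that no place name part itself contains a comma.

```python
import xml.etree.ElementTree as ET

PLAC = ('<PLAC><PPART NAME="STREET">28 Westover Road</PPART>'
        '<PPART NAME="TOWN">Fleet</PPART>'
        '<PPART NAME="COUNTY">Hampshire</PPART></PLAC>')

def to_traditional(plac_xml):
    """Return (FORM value, PLAC value) as comma-separated strings."""
    parts = ET.fromstring(plac_xml).findall("PPART")
    form = ", ".join(part.get("NAME") for part in parts)
    # Collapse internal white space so line wrapping in the XML is harmless.
    place = ", ".join(" ".join((part.text or "").split()) for part in parts)
    return form, place

def to_xml(form, place):
    """Rebuild the tagged form from the two comma-separated strings."""
    plac = ET.Element("PLAC")
    for name, value in zip(form.split(","), place.split(",")):
        part = ET.SubElement(plac, "PPART", NAME=name.strip())
        part.text = value.strip()
    return ET.tostring(plac, encoding="unicode")

form, place = to_traditional(PLAC)
round_trip = to_xml(form, place)
```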
- DATE is an example (the hardest example) of an attribute that has a great deal of internal syntax,
and the question is how much of this should be represented using XML tagging. I wouldn’t go
as far as some people, who have recommended putting tags around the day, month, and year fields.
But I would certainly split it a little further than it is at present, for example using <FROM>
and <TO> tags in a date range, and probably using XML attributes to replace the strange-looking
DATE_CALENDAR_ESCAPE values.
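As an indication of how far I would split a date, the Python sketch below converts a traditional date period into <FROM> and <TO> elements and moves a calendar escape such as @#DJULIAN@ into an attribute. The attribute name CALENDAR and the function names are assumptions; only the FROM ... TO form is handled, and the day, month and year are deliberately left untagged.

```python
import re
import xml.etree.ElementTree as ET

PERIOD = re.compile(r"^FROM\s+(?P<start>.+?)\s+TO\s+(?P<end>.+)$")
ESCAPE = re.compile(r"^@#D(?P<calendar>[^@]+)@\s*(?P<rest>.*)$")

def add_endpoint(parent, tag, value):
    """Append <FROM> or <TO>, moving any calendar escape into an attribute."""
    elem = ET.SubElement(parent, tag)
    match = ESCAPE.match(value)
    if match:
        elem.set("CALENDAR", match.group("calendar"))
        value = match.group("rest")
    elem.text = value

def period_to_xml(value):
    """Convert 'FROM <date> TO <date>' into a tagged DATE element."""
    match = PERIOD.match(value.strip())
    if match is None:
        raise ValueError("not a FROM ... TO ... period: " + value)
    date = ET.Element("DATE")
    add_endpoint(date, "FROM", match.group("start"))
    add_endpoint(date, "TO", match.group("end"))
    return ET.tostring(date, encoding="unicode")

tagged = period_to_xml("FROM @#DJULIAN@ 1 JAN 1700 TO 31 DEC 1710")
```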
- NAME_PERSONAL: it is very simple to replace the /surname/ convention with an XML tag,
and I would strongly recommend doing this. There are many reasons: for example,
a stylesheet that displays the data does not need to parse out the slashes, and
the surnames can be readily indexed by standard XML search engines. I also think
it would be useful to allow other parts of the name (e.g. prefix and suffix) to
be tagged in-situ, as an alternative to the separate NAME_PIECE structure which
is rarely used and is much less natural in XML. So I could be represented as
<NAME><CHRISTIAN>Michael</CHRISTIAN><SURNAME>KAY</SURNAME></NAME>.
(In GedML I used <S> as the tag for surnames, for brevity).
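The conversion from the slash convention is indeed simple, as this minimal Python sketch suggests. It produces the <CHRISTIAN> and <SURNAME> tagging shown above; the <SUFFIX> tag for anything after the closing slash is a hypothetical name, and a value without slashes is left as plain text.

```python
import re
import xml.etree.ElementTree as ET

SLASHED = re.compile(r"^(?P<before>[^/]*)/(?P<surname>[^/]*)/(?P<after>.*)$")

def name_to_xml(value):
    """Convert a traditional 'Given /SURNAME/ suffix' value to tagged XML."""
    name = ET.Element("NAME")
    match = SLASHED.match(value)
    if match is None:
        # No slashes: the surname is not identified, so keep the text as is.
        name.text = value.strip()
        return ET.tostring(name, encoding="unicode")
    given = match.group("before").strip()
    suffix = match.group("after").strip()
    if given:
        ET.SubElement(name, "CHRISTIAN").text = given
    ET.SubElement(name, "SURNAME").text = match.group("surname").strip()
    if suffix:
        ET.SubElement(name, "SUFFIX").text = suffix  # hypothetical tag
    return ET.tostring(name, encoding="unicode")

tagged_name = name_to_xml("Michael /KAY/")
```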
- Page 43: Packaging. Some standard XML software (e.g. web servers and browsers) may require the
file extension ".xml". It may be better to recommend "filename.ged.xml" which
will work in most modern operating systems. It would also be desirable to register a MIME type,
e.g. application/gedcom, which is guaranteed unique and is operating-system independent.
- The facilities for splitting a file across multiple diskettes are not really appropriate.
No XML parser will recognise a file that is split this way. To achieve a physical split,
standard archivers such as ZIP will do the job. More useful would be a way of achieving
a logical split: a mechanism for maintaining a collection of related GEDCOM files
allowing cross-references between them. For putting large data sets on the web,
such a mechanism is essential, and I think it should be defined by this standard.
Once again, let me state my encouragement for this work. My comments are not intended
to denigrate what has been done, only to improve it further.
best regards,
Michael Kay
mhkay@iclweb.com