ELF Serialisation Format

This is an exploratory draft of the serialisation format for FHISO's proposed suite of Extended Legacy Format (ELF) standards. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.

Comments on this draft should be directed to the tsc-public@fhiso.org mailing list.

FHISO's Extended Legacy Format (or ELF) is a hierarchical serialisation format and genealogical data model that is fully compatible with GEDCOM, but with the addition of a structured extensibility mechanism. It also clarifies some ambiguities that were present in GEDCOM and documents best current practice.

The GEDCOM file format developed by The Church of Jesus Christ of Latter-day Saints is the de facto standard for the exchange of genealogical data between applications and data providers. Its most recent version is GEDCOM 5.5.1 which was produced in 1999, but despite many technological advances since then, GEDCOM has remained unchanged.

Strictly, [GEDCOM 5.5] was the last version to be publicly released, back in 1995. However, a draft of a proposed [GEDCOM 5.5.1], dated 2 October 1999, was made public; it is generally considered to have the status of a standard and has been widely implemented as such.

FHISO are undertaking a program of work to produce a modernised yet backward-compatible reformulation of GEDCOM under the name ELF, the new name having been chosen to avoid confusion with any other updates or extensions to GEDCOM, or any future use of the name by The Church of Jesus Christ of Latter-day Saints. This document is one of three that form the initial suite of ELF standards, known collectively as ELF 1.0.0:

Conventions used

Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].

An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.

Derived standards are not allowed to add or remove requirements or prohibitions on the facilities defined herein so as to preserve interoperability between applications. Data generated by one conformant application must always be acceptable to another conformant application, regardless of what additional standards each may conform to.

This standard depends on FHISO's Basic Concepts for Genealogical Standards standard. To be conformant with this standard, an application must also be conformant with the referenced parts of [Basic Concepts]. Concepts defined in that standard are used here without further definition.

In particular, the precise meanings of string, character, whitespace, whitespace normalisation and term are given in [Basic Concepts].

Certain facilities in this standard are described as deprecated, which is a warning that they are likely to be removed from a future version of this standard. This has no bearing on whether a conformant application must implement the facility: they may be required, recommended or optional as described in this standard.

Indented text in grey or coloured boxes does not form a normative part of this standard, and is labelled as either an example or a note.

Editorial notes, such as this, are used to record outstanding issues, or points where there is not yet consensus; they will be resolved and removed for the final standard. Examples and notes will be retained in the standard.

The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner, providing a warning is issued to the user, except where this standard says otherwise.

In this form of EBNF, whitespace is only permitted where it is explicitly stated in the grammar. It is not automatically permitted between arbitrary tokens in the grammar.

The grammar productions in this standard use the S and Char productions defined in §2 of [Basic Concepts] to match any non-empty sequence of whitespace characters or any valid character, respectively.

This standard uses the prefix notation, as defined in §4.3 of [Basic Concepts], when discussing specific terms. The following prefix bindings are assumed in this standard:

elf https://terms.fhiso.org/elf/
ex https://example.com/
Although prefix notation is included in this standard document (see §6.2.2), that is only in the context of serialised data. When used outside of a serialised example, prefix notation is simply a notational convenience to make the standard easier to read.
Review the previous note.

Overview

The ELF serialisation format is a structured, line-based text format for encoding data in a hierarchical manner that is both machine-readable and human-readable.

At a logical level, an ELF document is built from structures, the name ELF gives to the basic hierarchical data structures used to represent data. Each structure consists of:

  • a structure type identifier, which is a term;
  • optionally, a payload, which is either a string or a pointer to another structure; and
  • a collection of any number of substructures, each of which is itself a structure.

Each structure also has a superstructure: either another structure, the document, or the metadata.

GEDCOM includes a means for splitting a logical document into multiple physical documents, sometimes called volumes. This dates to an era when documents were commonly stored and shipped on floppy disks, and a large GEDCOM document might exceed the storage capacity of a single disk. This functionality is no longer necessary, is not widely implemented in present applications, and is not included in ELF.

A top-level structure which is not a substructure of any other structure is called a record. An ELF document or dataset can have arbitrarily many records.

This is either not strictly true or at least misleading, because HEAD and TRLR are not records. Probably.
The expressiveness of ELF is similar to that of XML. ELF's structures serve the same role as elements in XML, and nest similarly. But unlike XML, which has a single root-level element, an ELF dataset typically has multiple records.

The ELF serialisation format is a general purpose format that can be used to represent arbitrary data, depending on the type identifiers used in the dataset. A particular set of type identifiers, together with their meanings and restrictions on how they are to be used, is called a data model.

The ELF serialisation format is designed to be useable with various data models; however it is anticipated that most files using the ELF serialisation format will use the data model described in [ELF Data Model], which is based on and compatible with GEDCOM's lineage-linked form.

At a lexical level, a structure is encoded as a sequence of lines, each terminated with a line break. The first line encodes the type identifier and payload of the structure, while any substructures are encoded in order on subsequent lines. Each line consists of the following components, in order, separated by whitespace:

  • a level, a non-negative integer;
  • an optional structure identifier;
  • a tag; and
  • an optional payload.

0 HEAD
1 GEDC
2 VERS 5.5.1
2 ELF 1.0.0
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
0 INDI
1 NAME Charlemagne
0 TRLR

This ELF document has three lines with level 0 which mark the start of the three top-level structures or records. These records have, respectively, two, one and zero substructures, which are denoted by the lines with level 1. The structure represented by the line with a CHAR tag is a substructure of the HEAD record because the HEAD line is the closest preceding line with a level one less than 1 (i.e. 0); similarly, the structure represented by the NAME line naming Charlemagne is a substructure of the INDI record, as the INDI line is the closest preceding line with level 0. The TRLR record is an example of a record with no substructures.

Five of the lines in this example document have a payload. For example, the payload of the FORM line is the string "LINEAGE-LINKED", while the payload of the NAME line is the string "Charlemagne". None of the lines in this example has a payload which is a pointer, nor does any have a structure identifier.

ELF applications

A conformant application which parses the ELF serialisation format is called an ELF parser. A conformant application which outputs data in the ELF serialisation format is called an ELF writer.

Many applications will be both ELF parsers and ELF writers.

The input to an ELF parser and output of an ELF writer is an octet stream, which is a sequence of 8-bit bytes or octets each with a value between 0 and 255.

An octet stream is typically read from or written to a disk or the network. This standard does not define how these should be read, nor how the octets are represented in storage or in transit on a network.

This standard defines how an octet stream is parsed into a dataset, and how a dataset is serialised into an octet stream. Overviews of these processes can be found in §2.2 and §2.3, respectively. An octet stream which this standard requires an ELF parser to be able to read is called a conformant source.

An octet stream which an ELF parser must be able to read successfully, but can process in an implementation-defined manner is nonetheless a conformant source.

If the input to an ELF parser is not a conformant source, unless this standard says otherwise, the application must either terminate processing that octet stream or present a warning or error message to the user. If it continues processing, it does so in an implementation-defined manner.

This standard also recognises a class of application which reads data in the ELF serialisation format, applies a small number of changes to that data, and immediately produces output in the ELF serialisation format which is identical to the input, octet for octet, other than where the requested changes have been made. Such an application is called an ELF editor.

ELF editors are intended to be small programs or scripts that apply simple modifications to datasets, typically with little or no human interaction. For example, a script which replaces some particular deprecated feature in the dataset with an equivalent would be an ELF editor. This definition of an ELF editor is not intended to include large, feature-rich applications which read ELF into an internal database, allow users to view and modify most aspects of the data, and later export it as ELF.

ELF editors are not required to conform to the full requirements of an ELF parser or ELF writer. The only requirement this standard places on ELF editors is that, when acting on a conformant source, they must either generate output which is a conformant source, or present a warning or error message to the user, or terminate.

This is a considerably weaker requirement than that placed on ELF parsers and ELF writers. In particular, there is no requirement for an ELF editor to detect invalid input, as an ELF parser is generally required to; nor do the stricter requirements on the output allowed from ELF writers apply. These relaxations allow ELF editors to edit an octet stream in place, without fully parsing those parts of their input which are not going to be changed.

Parsing

The parsing process can be summarised as follows:

  1. An octet stream is converted to a sequence of line strings by

    1. determining its character encoding by

      1. detecting a character encoding per §3.1, and
      2. using that detected character encoding to look for a specified character encoding in the serialisation metadata per §3.2;
    2. converting octets to characters using that character encoding; and
    3. splitting on line breaks per §3.4.

  2. Line strings are parsed as lines by

    • parsing the level, tag, xref_id, and payload of each line;
    • creating an error line if that fails.
  3. Lines are parsed into xref structures by

    • re-merging CONC and CONT-split payloads; violations of splitting rules are ignored
    • using levels to properly nest xref structures
    • converting to error lines before parsing if the levels are inconsistent
  4. xref structures are parsed into tagged structures by simultaneously

    • converting xrefs to pointers, with a special "point to null" if this fails
    • unescaping @ characters
    • preserving valid escapes and removing others
    • converting Unicode escapes into their represented characters
  5. the tagged structures that represent the schema are parsed

  6. tags are converted into structure type identifiers using the schema and the resulting structures placed in the metadata or document as appropriate. Tags with no corresponding structure type identifier are converted into appropriate undefined tag identifiers.

Serialisation

The semantics of serialisation are defined by the following procedural outline.

  1. Each structure is assigned a tag based on its structure type identifier, superstructure type identifier, and a schema which may be augmented during serialisation to allow all structures to have a tag.

  2. The tagged structures are ordered and additional tagged structures created to represent serialisation metadata.

    This step cannot happen before tagging because tagging may generate serialisation metadata that needs to be included in the tagged structures.

  3. Payloads are converted to create xref structures by simultaneously

    • assigning xref_ids and replacing pointer-valued payloads with string-valued xrefs
    • escaping @ characters
    • preserving valid escapes
    • escaping unrepresentable characters

    Semantically, these actions must happen concurrently because none of them should be applied to the others' results.

    This step cannot happen before tagging because tags are needed to determine the set of valid escapes. This step cannot happen before adding serialisation metadata because it is applied to the serialisation metadata as well.

  4. The dataset is converted to a sequence of lines by

    • assigning levels
    • splitting payloads, if needed, using CONT and CONC
    • ordering substructures in a preorder traversal of the tagged structures

    This step cannot happen before payload conversion because valid split points are dependent on proper escaping. This step must happen before encoding as octets because valid split points are determined by character, not octet.

  5. The sequence of lines is converted to an octet stream by

    • concatenating the lines with line-break terminators
    • converting strings to octets using the character encoding

Constructs

This document uses five externally-visible constructs: dataset, metadata, document, structure, and octet stream. For clarity of presentation, it also uses several intermediate constructs internally: line, xref structure, and tagged structure. Each is defined in §2.5.

Glossary

Line, octet, octet stream, record and structure are now defined in §2, while character encoding is defined in §3, and line break is defined in §3.4. The notion of a delimiter is being removed. Dataset and document are very nearly defined in §2 too, but we don't currently discuss metadata there — this is an issue which needs resolving.
Character encoding
The scheme used to map between an octet stream and a string of characters.
Dataset
Metadata and a document.
Delimiter

A sequence of one or more space or tabulation characters.

Delim ::= [#x20#x9]+

During serialisation, a single space (U+0020) SHOULD be used each place a delimiter is expected.

Document
An unordered set of structures.
Line
  1. A level, a non-negative integer
  2. An optional xref_id
  3. A tag, a string matching production Tag
  4. An optional payload, which is a string containing any number of characters, but which must not contain a line-break.
Line break

A sequence of one or more newline and/or carriage return characters.

LB ::= [#xA#xD]+

During serialisation, each line break MUST be one of

  • a single newline (U+000A)
  • a single carriage return (U+000D)
  • a single carriage return followed by a single newline (U+000D U+000A)

The same string SHOULD be used each place a line break is expected.

Metadata

A collection of structures intended to describe information about the dataset as a whole.

The relative order of structures with the same structure type identifier SHALL be preserved within this collection; the relative order of structures with distinct structure type identifiers is not defined by this specification.

Octet
One of 256 values, often represented as the numbers 0 through 255. Also called a "byte."
Octet Stream
A sequence of octets.
Record
A structure, tagged structure, or xref structure whose superstructure is the document.
ELF Schema
Information needed to correctly parse tagged structures into structures: a mapping between structure type identifiers and tags, together with metadata relating to valid escapes and prefixes.
Serialisation Metadata
Tagged structures inserted during serialisation and removed (with all their substructures) during parsing. They are used to serialise the character encoding and ELF schema as well as to separate the metadata and the document.
Structure
  • A structure type identifier, which is a term.
  • Optionally, a payload which is one of
    • A pointer to another structure, which must be a record within the same dataset.
    • A string or subtype thereof.
  • One superstructure, which is one of
    • Another structure; superstructure links MUST be acyclic.
    • The document.
    • The metadata.
  • A collection of any number of substructures, which are structures.

    The relative order of structures with the same structure type identifier SHALL be preserved within this collection; the relative order of structures with distinct structure type identifiers is not defined by this specification.

Superstructure type identifier
A term identifying the type of the superstructure of a structure. If the superstructure is the document, this is elf:Document. If the superstructure is the metadata, this is elf:Metadata. Otherwise, this is the structure type identifier of the structure's superstructure.
Superstructure type identifier is not transitive, applying only to the immediate superstructure.

Suppose an elf:INDIVIDUAL_RECORD is the superstructure of an elf:GRADUATION and the elf:GRADUATION is the superstructure of an elf:AGE_AT_EVENT. The superstructure type identifier of the elf:AGE_AT_EVENT is elf:GRADUATION, not elf:INDIVIDUAL_RECORD.

Tagged Structure

Like a structure, except

  • it has a tag instead of a structure type identifier.
  • its substructures are stored in a sequence with defined order, not in a partially-ordered collection.
Undefined tag identifier
A term containing a single # (U+0023) with elf:Undefined before it and a string matching production Tag after it. The string after the U+0023 is called the tag of the undefined tag identifier.
Xref Structure

Like a tagged structure, except

  • it may have an optional xref_id.
  • its payload, if present, is always a string, not a pointer.

Parsing and serialising line strings

In order to parse an ELF document, an ELF parser shall first convert the octet stream into a sequence of line strings, which are strings containing the unparsed lexical representations of lines.

The way in which octets are mapped to characters is called the character encoding of the document. ELF supports several different character encodings. Determining which is used is a two-stage process, with the first stage being to determine the detected character encoding of the octet stream per §3.1. Frequently there will be no detected character encoding.

The purpose of this step is twofold: first, it allows non-ASCII-compatible character encodings like UTF-16 to be supported; and secondly, it removes any byte-order mark that might be present in the octet stream.

Next, the initial portion of the octet stream is converted to characters using the detected character encoding or, if there is none, in an ASCII-compatible manner. This character sequence is then scanned for a CHAR line whose payload identifies the specified character encoding; this process is described in §3.2. If there is a specified character encoding, it is used as the character encoding for the ELF document; otherwise the detected character encoding is used, and if there is neither, the character encoding defaults to ANSEL. Considerations for reading specific character encodings can be found in §3.3.

Once the character encoding is determined, the octet stream can be converted into a sequence of characters which are assembled into line strings as described in §3.4. The process of serialising a line string back into an octet stream is far simpler as the intended character encoding is already known; this process is described in §3.5.

Detecting a character encoding

For applications that choose not to support the optional UTF-16 character encoding, the process described in this section can be as simple as skipping over a UTF-8 byte-order mark, and determining the detected character encoding to be UTF-8 if a byte-order mark was present.

If a character encoding is specified via any supported external means, such as an HTTP Content-Type header, this should be taken to be the detected character encoding.

Suppose the ELF file was downloaded using HTTP and the response included this header:

Content-Type: text/plain; charset=UTF-8

If an application supports taking the detected character encoding from an HTTP Content-Type header, the detected character encoding should be UTF-8.

Note that the use of the MIME type text/plain is not recommended for ELF. It is used here purely as an example.

Otherwise, if the octet stream begins with a byte-order mark (U+FEFF) encoded in UTF-8, the detected character encoding shall be UTF-8; or if the application supports the optional UTF-16 encoding and the octet steam begins with a byte-order mark encoded in UTF-16 of either endianness, the detected character encoding shall be UTF-16 of the appropriate endianness. The byte-order mark shall be removed from the octet stream before further processing.

Otherwise, if the application supports the optional UTF-16 encoding and the octet stream begins with any ASCII character (U+0001 to U+007F) encoded in UTF-16 of either endianness, this encoding shall be the detected character encoding.

ELF files typically begin with the character "0". In the big endian form of UTF-16, sometimes called UTF-16BE, this is encoded with the hexadecimal octets 00 30. These two octets will be detected as an ASCII character encoded in UTF-16, and the detected character encoding will be determined to be UTF-16BE.

Otherwise, applications may try to detect other encodings by examining the octet stream in an implementation-defined manner, but this is not recommended.

One situation where it might be necessary to try to detect another encoding is if the application needs to support (as an extension) a character encoding like EBCDIC or UTF-32 which is not compatible with ASCII.

Otherwise, there is no detected character encoding.

In this case, for the octet stream to be understood, it must use a 7- or 8-bit character encoding that is sufficiently compatible with ASCII that the CHAR line can be read. The only 7- or 8-bit character encodings defined in this standard are ASCII, ANSEL and UTF-8, which encode ASCII characters identically. These will all be understood correctly if there is no detected character encoding.

Some character encodings with minor differences from ASCII can also be understood correctly. An example is the Japanese Shift-JIS character encoding which uses the octets 5C and 7E to encode the yen currency sign (U+00A5) and overline character (U+203E) where ASCII has a backslash (U+005C) and tilde (U+007E). An application does not need to understand these characters in order to scan for a CHAR line.

These cases can be summarised as follows, where xx denotes any octet with a hexadecimal value between 01 and 7F, inclusive:

Initial octets Detected character encoding
EF BB BF UTF-8, with byte-order mark
FF FE UTF-16, little endian, with byte-order mark
FE FF UTF-16, big endian, with byte-order mark
xx 00 UTF-16, little endian, without byte-order mark
00 xx UTF-16, big endian, without byte-order mark
Otherwise None
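
The table above can be summarised in code. The following is a non-normative sketch in Python; the function name and return convention are illustrative only, and it assumes an application that supports the optional UTF-16 encoding and that takes no character encoding from external means.

def detect_character_encoding(octets):
    """Return (detected encoding or None, octets with any byte-order mark
    removed), following the table above."""
    if octets[:3] == b'\xEF\xBB\xBF':
        return 'UTF-8', octets[3:]             # UTF-8 byte-order mark, removed
    if octets[:2] == b'\xFF\xFE':
        return 'UTF-16LE', octets[2:]          # UTF-16 little endian byte-order mark
    if octets[:2] == b'\xFE\xFF':
        return 'UTF-16BE', octets[2:]          # UTF-16 big endian byte-order mark
    if len(octets) >= 2 and 0x01 <= octets[0] <= 0x7F and octets[1] == 0x00:
        return 'UTF-16LE', octets              # ASCII character encoded in UTF-16LE
    if len(octets) >= 2 and octets[0] == 0x00 and 0x01 <= octets[1] <= 0x7F:
        return 'UTF-16BE', octets              # ASCII character encoded in UTF-16BE
    return None, octets                        # no detected character encoding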

Specified character encodings

To determine the specified character encoding, the initial portion of the octet stream shall temporarily be converted to characters using the detected character encoding.

If there is no detected character encoding, the application shall convert each octet to the character whose code point is the value of the octet. An application may issue an error if the null octet 00 is encountered.

This is equivalent to using the ISO-8859-1 character encoding if there is no detected character encoding. As defined in §2 of [Basic Concepts], code point U+0000 does not correspond to a valid character. In principle, the octet 00 might occur in the representation of a valid character in some character encoding, but most character encodings avoid this and it cannot happen in the ASCII, ANSEL or UTF-8 character encodings. The same is true of octets corresponding to those control characters matching the RestrictedChar production in §2 of [Basic Concepts], which [Basic Concepts] allows an application to reject.
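
A minimal non-normative sketch of this provisional conversion, in Python. Decoding as ISO-8859-1 maps each octet to the character with the same code point, as the note above observes; the decision to treat the null octet as an error is a choice this sketch makes, not a requirement.

def provisional_decode(octets):
    """Convert octets to characters when there is no detected character
    encoding: each octet becomes the character with that code point."""
    if 0x00 in octets:
        # An application may issue an error on a null octet, as done here.
        raise ValueError('null octet in input')
    return octets.decode('iso-8859-1')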

Characters from the initial portion of the octet stream are parsed into line strings as described in §3.4, each line string is whitespace normalised as described in §2 of [Basic Concepts], and all lowercase ASCII characters (U+0061 to U+007A) are converted to the corresponding uppercase characters (U+0041 to U+005A).

Whitespace normalisation and conversion to uppercase apply only for the purpose of determining the specified character encoding; neither process is otherwise applied to line strings. It is done here to simplify scanning for the specified character encoding without requiring full parsing of line strings into lines, which might result in errors if the actual character encoding is different from the one being used provisionally while scanning for the specified character encoding.
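
As a non-normative illustration, the normalisation applied while scanning might look like the following Python sketch; whitespace here means spaces and tabs, since line breaks have already been removed.

def normalise_for_scanning(line_string):
    """Whitespace-normalise a line string and uppercase ASCII letters, solely
    for the purpose of scanning for the specified character encoding."""
    collapsed = ' '.join(line_string.split())          # whitespace normalisation
    return ''.join(c.upper() if 'a' <= c <= 'z' else c
                   for c in collapsed)                 # ASCII lowercase to uppercase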

Once normalised in this manner, the first line string of the file must be exactly "0 HEAD"; otherwise the application must issue an error and cease parsing the octet stream as ELF. If the application encounters a subsequent normalised line string beginning with a 0 digit (U+0030) followed by a space character (U+0020), the application shall stop scanning for a specified character encoding.

A line string beginning with a "0" encodes the start of the next record, and therefore the end of the HEAD record. The specified character encoding is given in a CHAR line in the HEAD record; a CHAR line found elsewhere in the file must not be used to supply the specified character encoding.

If the application encounters a line string beginning with "1 CHAR" followed by a space character (U+0020) while scanning for the specified character encoding, then the remainder of the line string shall be used to determine the specified character encoding.

If the remainder of the line string is exactly "ASCII", "ANSEL" or "UTF-8", then the specified character encoding shall be ASCII, ANSEL or UTF-8, respectively.

It is recommended that all ELF documents use UTF-8 and record this using a CHAR line as follows:

0 HEAD
1 CHAR UTF-8

This CHAR line string will be found while scanning for the specified character encoding. The line string begins with "1 CHAR" followed by a space character; the remainder of the line string is "UTF-8" so the specified character encoding is recognised as UTF-8.

Otherwise, if the remainder of the line string is exactly "UNICODE" and the detected character encoding is UTF-16 in either endianness, the specified character encoding shall be the UTF-16 in that endianness.

[GEDCOM 5.5.1] says that the string "UNICODE" is used to specify the UTF-16 encoding, though without naming the encoding as such, and without specifying which endianness is meant. If the octet stream is a valid ELF document encoded in UTF-16 and the application supports UTF-16, then the detected character encoding will have been determined accordingly.

Otherwise, the application may determine the specified character encoding from the remainder of the line string and the detected character encoding in an implementation-defined way. The application may read one further line string, and if it begins with "2 VERS" followed by a space character (U+0020), the application may also use the remainder of that line string in determining the specified character encoding.

It is fairly common to find "ANSI" on the CHAR line, though this has never been a legal option in any version of GEDCOM. It typically refers to one of several Windows code pages, most frequently CP-1252 which was the Windows default code page for English language installations and for several other Western European languages. However other code pages exist, and an application localised for, say, Hungarian might encode the file using CP-1250. In principle a VERS line could contain information to specify the particular code page used, as in the following ELF fragment, but in practice this is rare.

0 HEAD
1 CHAR ANSI
2 VERS 1250

Otherwise, there is no specified character encoding.

If there is a specified character encoding, it shall be used as the character encoding of the octet stream. Otherwise, if there is a detected character encoding, it shall be used as the character encoding of the octet stream. Otherwise, the character encoding shall default to be ANSEL.

ANSEL is the default character encoding for compatibility with GEDCOM, despite being deprecated in ELF.

If the character encoding is one which the application does not support, the application shall issue an error and stop reading the file.
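
The scanning process described in this section might be implemented along the following lines. This is a non-normative Python sketch: it assumes the line strings have already been normalised and uppercased as described above, and it omits the implementation-defined handling of other values such as "ANSI" and of any VERS sub-line.

def specified_character_encoding(line_strings, detected):
    """Scan normalised, uppercased line strings for a CHAR line in the HEAD
    record and return the specified character encoding, or None."""
    if not line_strings or line_strings[0] != '0 HEAD':
        raise ValueError('first line string must be "0 HEAD"')
    for line in line_strings[1:]:
        if line.startswith('0 '):                  # next record: stop scanning
            break
        if line.startswith('1 CHAR '):
            rest = line[len('1 CHAR '):]
            if rest in ('ASCII', 'ANSEL', 'UTF-8'):
                return rest
            if rest == 'UNICODE' and detected in ('UTF-16LE', 'UTF-16BE'):
                return detected
            return None                            # implementation-defined otherwise
    return None

If this returns None, the detected character encoding is used, failing which the ANSEL default applies, as described above.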

Character encodings

ELF parsers are required to support reading the ASCII, ANSEL and UTF-8 character encodings. ELF writers are only required to support the UTF-8 character encoding. Support for the UTF-16 character encoding is optional, and applications may support it in either its big or little endian forms, both, or neither. The ANSEL and UTF-16 character encodings are both deprecated.

We considered making support for ANSEL optional, but after researching how frequently current GEDCOM files were encoded using ANSEL (as opposed to claiming to be ANSEL but actually using the ASCII subset of ANSEL), the TSC felt it had to be required.

The UTF-8 and UTF-16 character encodings are the Unicode encoding forms defined in §9.2 of [ISO 10646], and the specifics of the big and little endian forms of UTF-16 are defined in §9.3 of [ISO 10646].

Work out whether we're going to cite ISO 10646 or the Unicode standard, and check the section numbers.
UTF-8 is a variable-width character encoding that uses between one and four octets to encode a character. It is backwards compatible with ASCII, so ASCII characters are encoded as a single octet and other characters require more. For example, the Czech given name "Miloš" is encoded using the octet sequence 4D 69 6C 6F C5 A1, where the last two octets encode the character "š". Only characters outside Unicode's Basic Multilingual Plane, that is, characters with a code point of U+10000 or higher, are encoded with four octets. An example is the ancient Chinese character "𠀡" which is encoded using the octets F0 A0 80 A1.

Such characters can occasionally be found encoded using six octets (e.g. ED A1 80 ED B0 A1 for "𠀡"). This form, which is called CESU-8 and is not valid UTF-8, typically results from an incorrect serialisation of UTF-16 data as UTF-8. Input containing CESU-8 forms but purporting to be UTF-8 is not a conformant source; however, ELF parsers may read it providing they issue a warning to the user. ELF writers must not generate CESU-8 when serialising data as UTF-8.
UTF-16 is also a variable-width character encoding which normally uses two octets to encode a character, but uses four octets for characters outside the Basic Multilingual Plane. When only two octets are used, UTF-16 is identical to an earlier fixed-width character encoding called UCS-2 which was unable to encode characters outside the Basic Multilingual Plane. Conformant applications are required by §2 of [Basic Concepts] to support characters outside the Basic Multilingual Plane, and therefore applications which opt to support UTF-16 must ensure they do not implement support for only UCS-2.
As UTF-8 and UTF-16 are encodings of Unicode, they naturally decode into a sequence of Unicode characters without requiring conversion between character sets.

The character encoding referred to as ASCII in this standard is the US version of ASCII which, for the purpose of this standard, is defined as the subset of UTF-8 which uses only Unicode characters U+0001 to U+007F.

The US ASCII character encoding is normally defined in [ASCII], but this standard defines it in terms of [ISO 10646]. This is partly to avoid uncertainty over which of several incompatible definitions of ASCII is meant, partly because the Unicode standard is much more readily available than the ASCII one, and partly because ASCII allows certain punctuation marks to be used as combining diacritics when they follow the backspace character (U+0008). This use of ASCII combining diacritics is not included in [ISO 10646], and is forbidden in both GEDCOM and ELF as the backspace character must not occur. Unicode provides a separate set of combining diacritics which are permitted in ELF.

ANSEL refers to the Extended Latin Alphabet Coded Character Set for Bibliographic Use defined in [ANSEL]. If an ELF file is determined to use the ANSEL character encoding it must be converted into a sequence of Unicode characters before it can be processed further. This is discussed in §3.3.1.

If other character encodings are supported, they too must be converted into a sequence of Unicode characters for further processing.

This standard makes no recommendation on how applications should represent sequences of Unicode characters internally, and the UTF-8, UTF-16 and UTF-32 character encodings each have advantages.
This standard currently makes no distinction between a character set and a character encoding, but arguably it would be cleaner to make this distinction. Then UTF-16 and UTF-8 are different character encodings of the same Unicode character set, and ASCII may be regarded as such too for our purpose; but ANSEL is a different character set and requires conversion to Unicode. [ISO 10646] makes a further distinction between encoding forms like UTF-8 and UTF-16, and encoding schemes like UTF-16BE and UTF-16LE.

Converting ANSEL to Unicode

Add material from ansel-to-unicode.md.

Line strings

Before characters from the octet stream can be parsed into lines, they must be assembled into line strings. This is done by appending characters to the line string until a line break is encountered, at which point the character or characters forming the line break are discarded and a new line string is begun.

A line break is defined as a line feed (U+000A), or carriage return (U+000D) followed by an optional line feed (U+000A). It matches the following LB production:

LB  ::=  #xD #xA? | #xA
This definition of a line break matches the form of line endings used on Unix, Linux and modern Mac OS (U+000A), the traditional Mac OS form (U+000D), and Windows line endings (U+000D U+000A). Unlike the equivalent production in [GEDCOM 5.5.1], this does not match a line feed followed by a carriage return (U+000A U+000D) which was used as a line ending on BBC and Acorn computers in some specific contexts. In ELF, this sequence is parsed as two line breaks with an intervening blank line string which gets ignored.

Any leading whitespace shall be removed from the line string, and any trailing whitespace should also be removed. If this results in a line string which is an empty string, the empty line string is discarded.

These operations resolve ambiguities in [GEDCOM 5.5.1], and might therefore be a change from some current implementations' interpretation of the GEDCOM standard. On the one hand, §1 of [GEDCOM 5.5.1] says that leading whitespace, including extra line terminators, should be allowed and ignored when reading; on the other hand, the relevant grammar production does not permit any such leading whitespace. For maximal compatibility with existing data, a conformant ELF application must accept and ignore leading whitespace and blank lines, but must not generate them.

For trailing whitespace, [GEDCOM 5.5.1] is even less clear. Twice, once in §2 and again in Appendix A, it states that applications sometimes remove trailing whitespace, but without saying whether this behaviour is legal; certainly it implies it is not required. There is little consistency in the behaviour of current applications, so any resolution to this will result in an incompatibility with some applications. In ELF, the removal of trailing whitespace is recommended, and a future version of ELF is likely to make this required. It is explicitly not required in this standard so that applications may decide whether to remove trailing whitespace based on the identity of the application that produced the file. Applications should not leave trailing whitespace on line strings unconditionally.

The Unicode escape mechanism defined in §5.3 provides ELF applications with a way of serialising a value which legitimately ends in whitespace without it being removed.

FHISO's general philosophy regarding whitespace is that, where possible, ELF should avoid placing any significance on invisible differences in whitespace. This includes the difference between spaces and tabs, and trailing whitespace on lines.
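
The assembly of line strings can be sketched as follows. This is non-normative Python; the regular expression mirrors the LB production, and whitespace here means spaces and tabs.

import re

def line_strings(characters):
    """Split a sequence of characters into line strings, removing leading and
    trailing whitespace and discarding empty line strings."""
    result = []
    for raw in re.split(r'\r\n?|\n', characters):     # LB ::= #xD #xA? | #xA
        stripped = raw.strip(' \t')                   # leading and trailing whitespace
        if stripped:                                  # empty line strings are discarded
            result.append(stripped)
    return result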

Serialising line strings

Line strings are serialised by concatenating them together to form a single string, inserting a line break between each line string and after the last one. All the inserted line breaks must have identical lexical forms matching the LB production in §3.4.

Applications can choose whether to use Windows line endings (U+000D U+000A), traditional Mac OS line endings (U+000D), or the line endings used on Unix, Linux and modern Mac OS (U+000A), but must not mix these in the same file.

Finally, the resulting string is encoded into an octet stream using the character encoding that was documented in the serialisation metadata tagged structure with tag "CHAR" (see §6.1). ELF writers are only required to support the UTF-8 character encoding, and this should be the default in applications supporting additional character encodings.

Check the above paragraph. We probably want a later section to define an output encoding.

If the character encoding is one which allows a byte-order mark (U+FEFF) to be encoded, an ELF writer may prepend one to the octet stream. This is recommended when serialising to UTF-16, but is not recommended when serialising to UTF-8.

This follows the advice in §2.6 of [Unicode] that "Use of a BOM is neither required nor recommended for UTF-8".
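
A minimal non-normative sketch of this serialisation step, in Python; the defaults shown (UTF-8, U+000A line endings, no byte-order mark) are purely illustrative.

def serialise_line_strings(line_strings, encoding='UTF-8', line_break='\n'):
    """Concatenate line strings, inserting the same line break between each
    and after the last, then encode using the chosen character encoding."""
    return ''.join(line + line_break for line in line_strings).encode(encoding)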

Parsing and serialising structures

Parsing lines

For a line string to be parsed into a line, it must match the following Line production:

Line ::= Number S (XRefID S)? Tag (S Pointer | S String)?

The Line production does not allow leading or trailing whitespace because this has already been removed in the process of creating line strings. The S production is defined in §2 of [Basic Concepts] and matches any non-empty sequence of whitespace characters, though because carriage returns and line feeds are always treated as line breaks which delimit line strings, in practice the S production can only match space or horizontal tab characters. Allowing tabs or multiple space characters is a departure from [GEDCOM 5.5.1], but one that is commonly implemented in current applications.

Whitespace is required between each of the four components of the line. This is a change from [GEDCOM 5.5.1] where the grammar says that the delimiter is an optional space character. Documents written using very early versions of GEDCOM frequently took advantage of this to merge the level, structure identifier and tag together, as in "0@I1@INDI". This is not permitted in ELF.

It would be simple enough to modify the grammar so that "0@I1@INDI" would still be supported, and this could make ELF Serialisation backwards compatible with GEDCOM 1.0. However the facility is almost never found in GEDCOM 5.5 or 5.5.1 files, and is often not supported in applications. It may even be an error that it is still permitted in the GEDCOM grammar. Files created for earlier versions of GEDCOM are only very rarely encountered and their data model is incompatible with [ELF Data Model]. There seems to be little benefit to supporting earlier versions of GEDCOM in the serialisation layer but not in the data model.

The Line production contains an ambiguity as any string which matches the Pointer production necessarily also matches the String production. ELF parsers must treat the payload as a pointer if it matches the Pointer production, and only as a string if it does not.

An earlier draft of this standard used the following PayloadString production in place of the general-purpose String production. This ensured that only strings with correctly escaped "at" signs (U+0040) were allowed in a payload, and removed this ambiguity from the grammar by ensuring pointers did not match the PayloadString production.

PayloadString ::= PayloadItem*
PayloadItem   ::= [^#x40] | "@@" | Escape
Escape        ::= "@#" [^#x40#xA#xD]* "@"

This draft does not do this because it would require all "at" signs to be correctly escaped. In practice, unescaped "at" signs are fairly commonly found in GEDCOM files, particularly in the payload of EMAIL lines. It is fairly easy to specify ELF so that these can be accommodated and this draft does so at the cost of introducing this ambiguity into the grammar. In practice it is not anticipated that the ambiguity will cause implementers difficulties and many current products appear to allow unescaped "at" signs in the manner proposed here.

The Line production contains a second ambiguity: when there are multiple characters of whitespace following the tag on a line whose payload is a string, the second and subsequent whitespace characters can be parsed by the S production or the String production. Applications should resolve this ambiguity by matching the S production to the shortest possible non-empty sequence of whitespace so that any additional whitespace is considered to be part of the payload.

This ambiguity is inherited from [GEDCOM 5.5.1]. On the one hand, the grammar in §1 of [GEDCOM 5.5.1] allows at most one space character and any further space characters are therefore considered part of the payload. On the other hand, Appendix A says that some applications look for the first non-space character to denote the start of the payload, and recommends against doing this with the payload of the CONT tag.

This ambiguity is most problematic on CONC lines where it can result in unwanted whitespace being inserted in the middle of a word, or necessary whitespace being lost between words; it can also cause problems with any payload which has leading whitespace which needs to be preserved. To avoid this, using the CONC mechanism to split the payload next to whitespace is not recommended, and the Unicode escape in §5.3 provides a means of preventing important leading whitespace from being lost. These allow ELF applications to avoid depending on this undefined behaviour.

This standard could have required leading whitespace to be preserved in payloads. This standard doesn't do that in part to discourage applications from relying on it instead of avoiding the problem. In general, we would rather the standard did not attach significance to differences in whitespace, and it is possible FHISO might want to remove the ambiguity in the future by requiring the payload to begin with a non-whitespace character. Ensuring valid ELF files do not have leading whitespace in payloads should help make this possible.

The Number, XRefID and Tag productions encode the level, the structure identifier and the tag of the line, respectively. The String and Pointer productions encode the payload of the line, depending on whether the payload is a string or a pointer, respectively. The structure identifier and payload are both optional.

Number  ::= "0" | [1-9] [0-9]*
XRefID  ::= "@" [a-zA-Z0-9_] [^@:!]* "@"
Tag     ::= [0-9a-zA-Z_]+
String  ::= Char+
Pointer ::= "@" [a-zA-Z0-9_] [^@]* "@"
The current working draft of [Basic Concepts] has a String production which is defined as Char*. The String production here only matches a non-empty string. This is to avoid any difference arising between a line with no payload and a line whose payload is an empty string. However it may be better to remove this potential ambiguity in a different way.

0 @I1@ INDI
1 NAME Cleopatra
1 FAMC @F2@

This ELF fragment contains three lines. The first line has a level of 0, a structure identifier of @I1@, and a tag of INDI; it has no payload. Neither the second nor the third line has a structure identifier, and both have a payload: on the second line the payload is the string "Cleopatra", while the payload of the third line is a pointer, @F2@.
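
One possible way of parsing a line string against the Line production is sketched below. This is non-normative Python: the names are illustrative, the character classes only approximate the Char production, and the single whitespace character before the payload implements the "shortest possible S" resolution of the ambiguity discussed above.

import re
from typing import NamedTuple, Optional

LINE_RE = re.compile(
    r'(0|[1-9][0-9]*)'              # Number
    r'[ \t]+'                       # S
    r'(?:(@[a-zA-Z0-9_][^@:!]*@)'   # optional XRefID ...
    r'[ \t]+)?'                     # ... followed by S
    r'([0-9a-zA-Z_]+)'              # Tag
    r'(?:[ \t](.+))?'               # shortest S, then optional payload
    r'$')

POINTER_RE = re.compile(r'@[a-zA-Z0-9_][^@]*@$')

class Line(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    payload: Optional[str]
    is_pointer: bool                # True when the payload matches Pointer

def parse_line(line_string):
    """Parse a line string into a line, or return None if it is unparsable."""
    m = LINE_RE.match(line_string)
    if m is None:
        return None                 # an unparsable line string
    level, xref_id, tag, payload = m.groups()
    is_pointer = payload is not None and POINTER_RE.match(payload) is not None
    return Line(int(level), xref_id, tag, payload, is_pointer)

Applied to the fragment above, parse_line("1 FAMC @F2@") yields a line with level 1, no structure identifier, tag FAMC and a pointer payload, while "1 NAME Cleopatra" yields a string payload.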

A line string which does not match the Line production is called an unparsable line string. An ELF parser may issue an error on encountering an unparsable line string and stop parsing the input stream. If the application continues processing, the unparsable line string shall be converted into an error line as described in §4.1.1.

Empty line strings or line strings consisting only of whitespace are never unparsable line strings because they have already been removed from the input stream.

The previous level of a line or unparsable line string is defined as the level of the closest preceding line with a tag other than CONT, CONC or ERROR. The first line in the input stream has no previous level.

0 INDI
1 NOTE The 16th President of the United States.
2 CONT Assassinated by John Wilkes Booth.
0 TRLR

In this example, the previous level of the TRLR line is 1, which is the level of the NOTE line, because the line immediately preceding the TRLR line has a CONT tag.

A too-deep line is a line that has a level more than one greater than its previous level. If there is no previous level then the line is a too-deep line unless its level is 0. If an application parses a too-deep line, it may issue an error and stop parsing the input stream.

0 @S1@ SOUR
2 NOTE text
0 @N1@ NOTE This is text
1 CONT more text
2 CONT still more text

The second and fifth lines of this example are both too-deep lines. In both cases, they have a level of 2 and a previous level of 0. In the case of the fifth line, this is because the previous line has a CONT tag and so is ignored when determining the previous level.

ELF parsers are required to check that the first line string is exactly "0 HEAD" while determining the specified character encoding per §3.2, so the first line cannot be an unparsable line string or a too-deep line, which in turn guarantees that all subsequent lines have a previous level.

Error lines

An error line is an optional facility used to encode malformed data and continue reading the input file. It is a line which encodes the data found in an unparsable line string or lines that are erroneous in certain specific ways. It has a tag of ERROR, and a level that is one greater than its previous level.

If the error line is being generated from an unparsable line string, it shall have no structure identifier, and a payload which is a string consisting of the entire unparsable line string.

0 HEAD
unexpected content
0 TRLR

In this example, the second line string does not match the Line production and is therefore an unparsable line string. The previous level of this unparsable line string is 0, the level of the HEAD line. If the application does not stop processing on encountering it, the unparsable line string is converted into an error line with level 1, an ERROR tag, and a payload of "unexpected content".

0 HEAD
1 ERROR unexpected content
0 TRLR

If an ELF parser reads a line with a tag of ERROR, this is also considered to be an error line, and the application may issue an error and stop parsing the input stream.

This can occur when an ELF document containing errors is read and the errors are converted into error lines which are later exported.

Assembling tagged structures

Once line strings have been parsed into lines, the sequence of lines is converted into a hierarchy of tagged structures. A tagged structure is an incompletely processed form of a structure, and consists of a tag, an optional payload which is either a string or a pointer, and a sequence of zero or more child tagged structures known as its substructures.

The definition of a tagged structure is identical to the definition of a structure, except that a tagged structure has a tag where a structure has a type identifier. Tags are converted to type identifiers at a later stage of parsing.

The conversion of lines into tagged structures is defined recursively. To read a tagged structure, the parser starts by reading its first line, and creating a tagged structure using the tag and payload of the line as the tag and payload of the new tagged structure.

The parser then repeatedly inspects the next line to determine whether it represents the start of a substructure of the tagged structure being read. When the next line has a level less than or equal to the level of the first line of the tagged structure, then there are no further substructures and the application has finished reading the tagged structure.

1 DEAT Y
0 TRLR

In the above ELF fragment, the parser reads the first line and creates a tagged structure with a DEAT tag and a payload of "Y". It then inspects the following line, but because the following line has a level of 0, which is less than the level of the first line of the DEAT structure, this indicates that the DEAT structure has no substructures.

If the next line is not a too-deep line and has a level exactly one greater than the level of the first line of the tagged structure being read, the parser shall recursively parse this line as the first line of a new tagged structure, and append the resulting new tagged structure to the list of substructures being read. Parsing continues by inspecting the following line to see if it is the start of another substructure, as described above.

0 @I1@ INDI
1 NAME Elizabeth
1 BIRT
2 DATE 21 APR 1926
0 TRLR

In this fragment, an application reads the first line and creates an INDI structure. The next line has a level one greater than the level of the INDI line, so is parsed as the start of a substructure. The parser creates a NAME structure, and as the level of the following line is no greater than the level of the NAME line, the NAME structure has no substructures. The NAME structure is appended as a substructure of the INDI structure.

The parser then repeats the process, looking for further substructures of the INDI tagged structure. The BIRT line is also one greater than the level of the INDI line, so is also parsed as the start of a substructure, but this time it has a substructure of its own, namely the DATE structure. The TRLR line has a level of 0 which tells the parser there are no further substructures of the INDI structure.

The result is an INDI structure with two substructures with tags NAME and BIRT, respectively, the latter of which has a substructure of its own with tag DATE.
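
The recursive reading process just described can be sketched as follows. This is non-normative Python operating on already-parsed lines (objects with level, tag and payload attributes, as in the earlier sketch); handling of CONC, CONT and too-deep lines is omitted here.

class TaggedStructure:
    """A tag, an optional payload, and an ordered sequence of substructures."""
    def __init__(self, tag, payload=None):
        self.tag = tag
        self.payload = payload
        self.substructures = []

def read_tagged_structure(lines, i):
    """Read the tagged structure whose first line is lines[i]; return it
    together with the index of the first line that was not consumed."""
    first = lines[i]
    structure = TaggedStructure(first.tag, first.payload)
    i += 1
    while i < len(lines) and lines[i].level == first.level + 1:
        substructure, i = read_tagged_structure(lines, i)
        structure.substructures.append(substructure)
    return structure, i

Applied to the example above starting at the INDI line, read_tagged_structure consumes the NAME, BIRT and DATE lines and stops at the TRLR line, whose level of 0 ends the record.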

If the next line is a too-deep line, an ELF parser shall recursively parse the too-deep line as the first line of a new tagged structure. Once this new tagged structure has been read, its tag is replaced with a tag of ERROR, and its payload shall be replaced with a line string generated from the too-deep line by serialising it per §4.4. The resulting tagged structure shall be appended to the list of substructures being read, and parsing continues from the next line.

The following ELF fragment has a missing line.

0 @I1@ INDI
2 PLAC Москва
3 ROMN Moscow
1 NAME Ivan IV
0 TRLR

The parser successfully reads the first line and then starts looking for substructures. Because the next line has a level of 2, which is more than one greater than 0, it is a too-deep line. The parser must now read the level-2 PLAC structure with its level-3 ROMN substructure. The application then re-serialises the too-deep line back into a line string, "2 PLAC Москва", which is used to overwrite the payload of the PLAC structure. The tag of this structure is replaced with ERROR, and it is added as a substructure of the INDI structure. Parsing then continues, and recognises the NAME line as a well-formed substructure of the INDI structure. The result is as if the application had read the following fragment:

0 @I1@ INDI
1 ERROR 2 PLAC Москва
2 ROMN Moscow
1 NAME Ivan IV
0 TRLR

A tagged structure with a tag of ERROR is called an error structure, and support for them is optional. Applications not supporting error structures must terminate processing on encountering an unparsable line string, too-deep line or error line.

Error lines

If the error line is being generated from a too-deep line, it shall have the same structure identifier as the too-deep line, and a payload which is a string consisting of the entire too-deep line serialised as a line string per §4.4.

To do: pick one of the following:

The following invalid input

0 @N1@ NOTE This is text
1 SOUR @S1@
1 CONT attached to nothing

One other line will be identified as elf:Undefined:

Serialising lines

The payload needs escaping, either here or in the next section.

Each line shall be converted to a line string by concatenating together the level, structure identifier, tag and payload as described by the Line production given in §4.1.

If most of the lines being serialised originated from the same ELF file and are being re-serialised unaltered, a conformant application may preserve the original syntactic form of these line strings. Otherwise, the application must serialise all line strings with a single space character (U+0020) for each S production in the Line production.

This standard allows applications to preserve the syntactic form of line strings so that applications which carry out small in-place edits to an ELF file are conformant. Otherwise, this is not recommended.

Serialising

Each xref structure is encoded as a sequence of one or more lines.

These are of three kinds, in order:

  1. The first line of the xref structure
  2. Zero or more additional lines of the xref structure
  3. The lines that encode each of the xref structure's substructures (if any)
The constraint that additional lines come before the lines of substructures is never mentioned by [GEDCOM 5.5.1]. ELF includes it because it appears to have been universally applied by GEDCOM implementations, and some may depend upon it.

The level of each line is a non-negative integer. The level of a first line is 0 if the xref structure is a record or one of the serialisation metadata tagged structures with tag "HEAD" or "TRLR"; otherwise it is one greater than the level of the first line of its superstructure. The level of an additional line is one greater than the level of its xref structure's first line.

Each first line has the same xref_id (if any) and tag as its corresponding xref structure. Each additional line has no xref_id and either "CONT" or "CONC" as its tag.

Because an xref structure must not have either "CONC" or "CONT" as its tag (see §7.2), it is unambiguous which lines are additional lines and which first line they correspond to.

The payload of the xref structure is the concatenation of the payloads of the first line and all additional lines, with a line break inserted before the payload of each additional line with tag "CONT". Because the payload of a line must not contain a line-break, there must be exactly one "CONT"-tagged additional line per line-break in the xref structure's payload. The number of "CONC"-tagged additional lines may be picked arbitrarily, subject to the following:

GEDCOM required that lines not exceed 255 characters; this does not seem to be a real restriction in most current applications, and hence has been reduced to recommended status. We recommend bytes instead of characters because the implied purpose of this limit (enabling code to use fixed-width buffers) would limit by bytes, not characters.

[GEDCOM 5.5.1] is inconsistent in its discussion of leading and trailing whitespace.

The RECOMMENDATIONS above are compatible with the most restrictive of these, while the REQUIREMENTS are compatible with the most permissive of them.

Suppose an xref structure's tag is "NOTE", its payload is "This is a test\nwith one line break", and its superstructure's superstructure is a record. This xref structure requires at least two lines (because it contains one line break) and may use more. It could be serialised in many ways, such as

2 NOTE This is a test
3 CONT with one line break

or

2 NOTE This i
3 CONC s a test
3 CONT with on
3 CONC e line break
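A minimal non-normative sketch of this splitting, assuming a maximum segment length measured in characters for simplicity (the recommendation above is expressed in bytes); a fuller implementation would also avoid splitting pairs of "@" or escapes, as noted below.

def split_payload(level, tag, payload, width=80):
    """Return (level, tag, payload) triples for the first and additional lines."""
    lines = []
    for i, segment in enumerate(payload.split("\n")):
        # The first segment stays on the first line; each later one needs a CONT.
        line_tag = tag if i == 0 else "CONT"
        line_level = level if i == 0 else level + 1
        lines.append((line_level, line_tag, segment[:width]))
        rest = segment[width:]
        while rest:                             # over-long segments use CONC
            lines.append((level + 1, "CONC", rest[:width]))
            rest = rest[width:]
    return lines

# split_payload(2, "NOTE", "This is a test\nwith one line break")
#     ==  [(2, "NOTE", "This is a test"), (3, "CONT", "with one line break")]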
[GEDCOM 5.5.1] gives no guidance on how to handle an unmatched "@", but they are relatively common in GEDCOM files. The above policy is intended to resolve common invalid files in an intuitive way.

Given the following non-conformant data

1 EMAI name@example.com
2 DATE @#DGREG
3 CONC ORIAN@ 2 JAN 2019

a conformant application will concatenate these lines normally during parsing

1 EMAI name@example.com
2 DATE @#DGREGORIAN@ 2 JAN 2019

creating a valid date escape in the DATE-tagged extended line. The unmatched @ in the EMAI-tagged line is left unchanged during parsing.

Upon re-serialisation, the unmatched @ in the "EMAI"-tagged line will be doubled when converting to an xref structure, but the date escape will not be modified:

1 EMAI name@@example.com
2 DATE @#DGREGORIAN@ 2 JAN 2019

If the serialisation decides to split either extended line with CONCs, it must not do so in a way that splits up the pairs of "@"s.

The following example needs moving somewhere more sensible.

0 @I1@ INDI
2 DATE 3 JUN 2019 
3 TIME 01:15:00
0 TRLR

In this example, the second line has a level which is two greater than the level of the previous line; it is therefore considered an unparsable line string. The third line is also considered an unparsable line string because the previous line is an unparsable line string, meaning that the closest preceding line which is not an unparsable line string is the first line, which has level 0. The fourth line is not an unparsable line string because it has the same level as its closest preceding line which is not an unparsable line string.
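The following non-normative Python sketch flags unparsable line strings under the rule illustrated above, assuming each line's level has already been extracted as an integer; a line is flagged when its level is more than one greater than that of the closest preceding line which is not an unparsable line string.

def flag_unparsable(levels):
    """Return a parallel list of booleans marking unparsable line strings."""
    flags = []
    last_good = None               # level of the closest preceding good line
    for level in levels:
        bad = last_good is not None and level > last_good + 1
        # (A first line whose level is not 0 would also be unparsable;
        #  that case is omitted here for brevity.)
        flags.append(bad)
        if not bad:
            last_good = level
    return flags

# flag_unparsable([0, 2, 3, 0])  ==  [False, True, True, False]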

Encoding with @

ELF uses the character U+0040 (commercial at, @) to encode several special cases when encoding a tagged structure as an xref structure. In particular,

All of these steps involve @s, and must not be applied to one another's @s; semantically, they are applied concurrently.

During parsing, there is an inherent ambiguity when there are several contiguous @ in the payload. These shall be resolved in an earliest-match-first order.

The following xref structure payloads are split into sequences as indicated:

Payload of xref structure     Decomposed as
"name@example.com"            "name", "@", "example.com"
"name@@example.com"           "name", "@@", "example.com"
"name@@@example.com"          "name", "@@", "@", "example.com"
"name@@@@example.com"         "name", "@@", "@@", "example.com"
"some@#XYZ@ thing"            "some", "@#XYZ@ ", "thing"
"some@@#XYZ@ thing"           "some", "@@", "#XYZ", "@", " thing"
"some@@@#XYZ@ thing"          "some", "@@", "@#XYZ@ ", "thing"

Pointer conversion

If a tagged structure is pointed to by the pointer-valued payload of another tagged structure, the pointed-to tagged structure's corresponding xref structure shall be given an xref_id, a string matching production XrefID.

XrefID  ::= "@" ID "@"
ID      ::= [0-9A-Z_a-z] [#x20-#x3F#x41-#x7E]*

Two different xref structures must not be given the same xref_id. Conformant implementations must not attach semantic importance to the contents of an xref_id.

It is recommended that an xref_id be no more than 22 characters (20 characters plus the leading and trailing U+0040).

[GEDCOM 5.5.1] required that xref_id be no more than 22 characters. ELF weakens this to a RECOMMENDATION.

Each record should be given an xref_id; each non-record structure should not; and each serialisation metadata tagged structure must not be given an xref_id.

Since a pointed-to structure shall have an xref_id and a non-record should not, implicitly a structure should not point to a non-record. We should probably either make that explicit or remove it; the latter may make more sense, as what is pointed to seems to be more a data model decision than a serialisation decision. However, GEDCOM is fairly clear that pointers to non-records might in the future be enabled with a non-standard xref_id syntax.

The xref structure that corresponds to a tagged structure with a pointer-valued payload has, as its payload, an xref: a string identical to the xref_id of the xref structure corresponding to the pointed-to tagged structure.

When parsing, if an xref payload is encountered that does not correspond to exactly one xref structure's xref_id, that payload shall be converted to a pointer to a record with tag "UNDEF", which shall have neither a payload nor substructures. It is recommended that one such "UNDEF" tagged structure be inserted for each distinct xref.

The undefined pointer rule is designed to minimise the information loss in the event of a bad serialised input.
This rule does not handle a pointer to the wrong type of structure; the information needed to determine that is not known by the serialisation and thus must be handled by the data model instead.
We could also allow pointer-to-nothing or pointer-to-multiple-things to be dropped from the dataset, and/or provide disambiguation heuristics for pointer-to-multiple-things situations. This draft does not do so as it is not obvious that the benefit is worth the complexity.
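A non-normative sketch of this fallback, assuming records are held as simple dictionaries and that records_by_id maps each xref_id to the list of xref structures carrying it; the names are illustrative only.

def resolve_pointer(xref, records_by_id, undef_records):
    """Return the record an xref points to, or an UNDEF record when the xref
    does not correspond to exactly one xref structure's xref_id."""
    targets = records_by_id.get(xref, [])
    if len(targets) == 1:
        return targets[0]
    # One UNDEF record per distinct unresolved xref, as recommended above.
    if xref not in undef_records:
        undef_records[xref] = {"tag": "UNDEF", "payload": None, "substructures": []}
    return undef_records[xref]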

Escape preservation and removal

An escape is a substring of a string-valued payload of either a tagged structure or xref structure which matches production Escape. Its escape type is the portion of the escape that was matched by EscType.

Escape   ::= "@#" EscType EscText "@ "
EscType  ::= [A-Z]
EscText  ::= [^#xA#xD#x40]*

If the escape type is U (U+0055), the escape is a unicode escape and its handling is discussed in §5.3; otherwise, it is handled according to this section.
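As a non-normative aid, the Escape production can be approximated by the following regular expression, which also extracts the escape type used to decide between the handling in this section and that of §5.3.

import re

ESCAPE = re.compile(r'@#([A-Z])([^\r\n@]*)@ ')   # groups: EscType, EscText

def escape_type(escape):
    m = ESCAPE.fullmatch(escape)
    return m.group(1) if m else None

# escape_type("@#DJULIAN@ ")  ==  "D"   (handled by this section)
# escape_type("@#U00E3@ ")    ==  "U"   (a unicode escape, see §5.3)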

Serialisation

If an escape is in the payload of a tagged structure whose tag is an escape preserving tag, and if the escape's escape type is in the tag's set of preserved escape types, then the escape shall be preserved unmodified in the corresponding xref structure's payload.

If a "DATE" tagged structure has payload "ABT @#DJULIAN@ 1540", its corresponding xref structure's payload is also "ABT @#DJULIAN@ 1540".

Otherwise, a modification of the escape shall be placed in the xref structure's payload which is identical to the original escape except that each of the two @ characters shall be replaced with a pair of consecutive U+0040 (@@).

If a "NOTE" tagged structure has payload "ABT @#DJULIAN@ 1540", its corresponding xref structure's payload is "ABT @@#DJULIAN@@ 1540".

Parsing

If an escape is in the payload of an xref structure whose tag is an escape preserving tag, and the escape's escape type is in the tag's set of preserved escape types, the escape shall be preserved unmodified in the corresponding tagged structure's payload.

If a "DATE" xref structure has payload "ABT @#DJULIAN@ 1540", its corresponding tagged structure's payload is also "ABT @#DJULIAN@ 1540".

Otherwise, the escape shall be omitted from the corresponding tagged structure's payload.

If a "NOTE" xref structure has payload "ABT @#DJULIAN@ 1540", its corresponding tagged structure's payload is "ABT 1540".
The decision to remove most escapes is motivated in part because [GEDCOM 5.5.1] does not provide any meaning for an escape other than a date escape. This caused some ambiguity in how such escapes were handled, which ELF seeks to remove. Lacking a semantics to assign to these escapes, ELF chooses to simply remove them. Implementations that had assigned semantics to them were actually imposing non-standard semantics on those payloads; that situation is more accurately handled by using an alternative ELF schema to map those tags to different structure type identifiers with those semantics documented.
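The parsing behaviour above might be sketched as follows, reusing the ESCAPE pattern given earlier and assuming a mapping from each escape-preserving tag to its set of preserved escape types (see the escape preservation structures defined later); unicode escapes are assumed to have been handled separately.

def parse_escapes(tag, payload, preserved_types):
    """Keep escapes whose type is preserved for this tag; omit the rest."""
    def replace(match):
        if match.group(1) in preserved_types.get(tag, set()):
            return match.group(0)        # preserved unmodified
        return ""                        # otherwise omitted from the payload
    return ESCAPE.sub(replace, payload)

# parse_escapes("DATE", "ABT @#DJULIAN@ 1540", {"DATE": {"D"}})
#     ==  "ABT @#DJULIAN@ 1540"
# parse_escapes("NOTE", "ABT @#DJULIAN@ 1540", {"DATE": {"D"}})
#     ==  "ABT 1540"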

Unicode escapes

[GEDCOM 5.5.1] has neither a notion of unicode escape nor any other feature for achieving the same end. Unicode escapes are designed to provide a means for encoding any character in any character encoding in a way that is maximally backwards-compatible with [GEDCOM 5.5.1].

Any character may be represented with a unicode escape consisting of:

  1. The three characters U+0040, U+0023, and U+0055 (i.e., "@#U")
  2. A hexadecimal encoding of the character's code point
  3. The two characters U+0040 and U+0020 (i.e., "@ ")

A unicode escape must be used for each character that cannot be encoded in the target character encoding; and should not be used otherwise.

Earlier drafts of this specification suggested using @#U20@ in place of U+0020 when a line's payload begins or ends with a space. Given the inherent ambiguity in the handling of delimiters at the ends of a line's payloads, it is not clear if that idea was better than simply clarifying that ambiguity.
If a tagged structure's payload is "João" and the character encoding is ASCII, the xref structure's payload is "Jo@#UE3@ o" (or "Joa@#U303@ o" if the original used a combining diacritic).
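A non-normative sketch of producing unicode escapes during serialisation, relying on Python's codecs to decide whether the target encoding can represent each character (ANSEL would require a custom codec, not shown).

def unicode_escape_payload(payload, encoding):
    """Replace each character the target encoding cannot represent with a
    unicode escape of the form "@#U<hex>@ "."""
    out = []
    for ch in payload:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("@#U%X@ " % ord(ch))
    return "".join(out)

# unicode_escape_payload("João", "ascii")  ==  "Jo@#UE3@ o"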

Encoding @s

It might be worthwhile to restrict this entire section to non-escape preserving tags; without that we have a (somewhat obscure) problem with the current system:

Consider the escape-preserving tag DATE. A serialisation/parsing sequence applied to the string "@@#Dx@@ yz" yields

  1. encoded "@@#Dx@@ yz"
  2. decoded "@#Dx@ yz"
  3. encoded "@#Dx@ yz" -- not with @@ because it matches a date escape

During serialisation, each U+0040 (@) that is not part of an escape shall be encoded as two consecutive U+0040 (@@).

The tagged structure payload "name@example.com" is serialised as the xref structure payload "name@@example.com"

During parsing, each consecutive pair of U+0040 (@@) shall be parsed as a single U+0040 (@).

The xref structure payload "name@@example.com" is parsed as the tagged structure payload "name@example.com"

During parsing, a lone U+0040 is left unmodified.

If an xref structure's payload is "name@example.com", it is parsed as the tagged structure payload "name@example.com"; that in turn will be re-serialised as "name@@example.com".
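As a non-normative illustration, and considering this rule in isolation from the escape handling above, the doubling and undoubling might be sketched as follows, reusing the ESCAPE and AT_SEGMENT patterns given earlier.

def serialise_ats(payload):
    """Double every "@" that is not part of an escape."""
    parts, pos = [], 0
    for m in ESCAPE.finditer(payload):
        parts.append(payload[pos:m.start()].replace("@", "@@"))
        parts.append(m.group(0))         # the escape's own "@"s are handled
        pos = m.end()                    # by the rules in the previous section
    parts.append(payload[pos:].replace("@", "@@"))
    return "".join(parts)

def parse_ats(payload):
    """Undo the doubling: "@@" becomes "@"; escapes and lone "@"s pass through."""
    return "".join("@" if seg == "@@" else seg
                   for seg in AT_SEGMENT.findall(payload))

# serialise_ats("name@example.com")   ==  "name@@example.com"
# parse_ats("name@@example.com")      ==  "name@example.com"
# parse_ats("name@example.com")       ==  "name@example.com"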

Serialisation metadata

The tagged structures representing the dataset are ordered as follows:

  1. A serialisation metadata tagged structure with tag "HEAD" and the following substructures:

    • A serialisation metadata tagged structure with tag "CHAR" and payload identifying the character encoding used; see §6.1 for details.

    • A serialisation metadata tagged structure with tag "SCHMA" and no payload, with substructures encoding the ELF Schema; see §6.2 for details.

    • Each tagged structure with the superstructure type identifier elf:Metadata, in an order consistent with the partial order of structures present in the metadata.

  2. Each tagged structure with the superstructure type identifier elf:Document, in arbitrary order.

  3. A serialisation metadata tagged structure with tag "TRLR" and no payload or substructures.

Character encoding names

The character encoding shall be serialised in the "CHAR" tagged structure's payload using the encoding name given in the following table:

Encoding   Description
ASCII      The US version of ASCII defined in [ASCII].
ANSEL      The extended Latin character set for bibliographic use defined in [ANSEL].
UNICODE    Either the UTF-16LE or the UTF-16BE encoding of Unicode defined in [ISO 10646].
UTF-8      The UTF-8 encoding of Unicode defined in [ISO 10646].
This value is read as the specified character encoding per §3.2.

It is required that the encoding used be able to represent all code points within the string; unicode escapes (see §5.3) allow this to be achieved for any supported encoding. It is recommended that UTF-8 be used for all datasets.
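For illustration only, a possible mapping from "CHAR" payloads to Python codec names might look as follows; ANSEL has no codec in the Python standard library and would need a custom decoder.

# Assumed, illustrative mapping; not part of this standard.
CHAR_TO_CODEC = {
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",   # Python's utf-16 codec honours a byte order mark;
                           # without one the byte order must be determined
                           # some other way.
    # "ANSEL": a custom codec is required.
}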

ELF Schema

The ELF Schema is a serialisation metadata tagged structure with tag "SCHMA" and no payload; it may contain as substructures any number of external schema structures, prefix abbreviation structures, IRI definition structures, and escape preservation structures.

If, during parsing, no ELF Schema is found, the default ELF schema defined in §9 shall be used.

Do we need to make the default dependent on the GEDC metadata?

If multiple ELF Schemas are found, they shall be treated as if all of their substructures were part of the same ELF schema.

During serialisation, exactly one ELF Schema should be included.

External schema structure

An external schema structure is a tagged structure with an ELF Schema as its superstructure, tag SCHMA, no substructures, and an IRI as its payload. The IRI should use the http or https scheme, and an HTTP GET request sent to it with an Accept header of application/x-fhiso-elf1-schema should return a dataset, serialised in accordance with this specification, containing an ELF Schema defining the full data model in structure type descriptions.

Is application/x-fhiso-elf1-schema a MIME-type we are happy with?

When using the [ELF Data Model] version 1.0.0, the serialisation schema could be serialised as

0 HEAD
1 SCHMA
2 SCHMA https://fhiso.org/TR/elf-data-model/v1.0.0
An HTTP GET request with an Accept header of application/x-fhiso-elf1-schema sent to https://fhiso.org/TR/elf-data-model/v1.0.0 will return the contents of §9 or the equivalent.

When retrieving a serialised dataset via an HTTP GET request to the IRI of an external schema structure, all contents of that dataset except ELF Schemas shall be ignored. Additional external schema structures should not be present within that ELF Schema; if they are, they may be ignored.

The recommendation against external schema structures inside other external schema structures is designed to simplify parsing.

Prefix abbreviation structure

Should this section cite §4.3 of [Basic Concepts] instead of its current text?

A prefix abbreviation structure is a tagged structure with an ELF Schema as its superstructure, tag PRFX, and no substructures. Its payload consists of two whitespace-separated tokens: the first is a prefix and the second is that prefix's corresponding IRI.

To prefix expand a string, if that string begins with a defined prefix followed by a colon (U+003A :) then replace that prefix and colon with the prefix's corresponding IRI. To prefix shorten a string, replace it with a string that prefix expansion would convert to the original string.

Given a PRFX

2 PRFX elf https://fhiso.org/elf/

the IRI https://fhiso.org/elf/ADDRESS may be abbreviated as elf:ADDRESS.
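A non-normative sketch of prefix expansion and shortening, assuming the prefixes declared by PRFX structures have been collected into a dictionary.

def prefix_expand(value, prefixes):
    """Replace a leading "prefix:" with the prefix's corresponding IRI."""
    name, sep, rest = value.partition(":")
    if sep and name in prefixes:
        return prefixes[name] + rest
    return value

def prefix_shorten(iri, prefixes):
    """Replace the longest matching IRI prefix with "prefix:"."""
    for name, base in sorted(prefixes.items(), key=lambda p: -len(p[1])):
        if iri.startswith(base):
            return name + ":" + iri[len(base):]
    return iri

# prefixes = {"elf": "https://fhiso.org/elf/"}
# prefix_expand("elf:ADDRESS", prefixes)  ==  "https://fhiso.org/elf/ADDRESS"
# prefix_shorten("https://fhiso.org/elf/ADDRESS", prefixes)  ==  "elf:ADDRESS"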

IRI definition structure

An IRI definition structure is a tagged structure with an ELF Schema as its superstructure and tag "IRI". Its payload is an IRI, which may be prefix shortened during serialisation and must be prefix expanded during parsing. The remainder of this section calls this prefix expanded payload I. An IRI definition structure may have, as substructures, any number of supertype definition structures and tag definition structures.

A supertype definition structure is a tagged structure with an IRI definition structure as its superstructure, tag "ISA", and no substructures. Its payload is a structure type identifier which may be prefix shortened during serialisation and must be prefix expanded during parsing. The remainder of this section calls this prefix expanded payload P. Each supertype definition structure encodes a single supertype definition, specifying that P is a supertype of I.

That elf:ParentPointer is a supertype of elf:PARENT1_POINTER can be encoded in a supertype definition structure as

2 IRI elf:PARENT1_POINTER
3 ISA elf:ParentPointer

A tag definition structure is a tagged structure with an IRI definition structure as its superstructure, tag "TAG", and no substructures. Its payload is a whitespace-separated list of two or more tokens. The first token T must match production Tag; each remaining token S is an IRI, which may be prefix shortened during serialisation and must be prefix expanded during parsing. Each such S encodes a tag definition between structure type identifier I and (tag, superstructure type identifier) pair (T, S).

The following tag definitions

can be encoded in tag definition structures as

0 HEAD
1 SCHMA
2 PRFX elf https://fhiso.org/elf/
2 IRI elf:PARENT1_POINTER
3 TAG HUSB elf:FAM_RECORD
2 IRI elf:Parent1Age
3 TAG HUSB elf:FamilyEvent
2 IRI elf:MULTIMEDIA_FORMAT
3 TAG FORM elf:MULTIMEDIA_FILE_REFERENCE elf:MULTIMEDIA_RECORD
2 IRI elf:ADDRESS_EMAIL
3 TAG EMAIL elf:Agent
3 TAG EMAI elf:Agent

Escape-preserving tags

This entire section, and all of the related functionality, is present to help cope with the idiosyncratic behaviour of date escapes in [GEDCOM 5.5.1]. Escapes in previous editions of GEDCOM were serialisation-specific and if encountered in ELF should generally be ignored, but date escapes are instead part of a microformat. While escape-preserving tags are not elegant, they are adequate to handle this idiosyncrasy.
I wrote the above note from somewhat fuzzy memory. It might be good to review and summarise all the uses of escapes in various GEDCOM releases...

Some tags may be defined as escape-preserving tags, each with a list of single-character preserved escape types each of which must match production UserEscType.

UserEscType ::= [A-TV-Z]

An escape preservation structure is a tagged structure with an ELF schema as its superstructure, tag "ESC", and no substructures. Its payload is composed of two whitespace-separated tokens; the first is the escape-preserving tag and the second is a concatenation of all preserved escape types of that tag; each preserved escape type should be included in the second token only once.

Two escape preservation structures must not specify different sets of preserved escape types for the same tag.

Escape-preserving tags are included for backwards compatibility, and must not be used for new extensions.

The only known escape-preserving tag is "DATE", with the preserved escape type of "D"

The following is the only escape preservation structure in ELF 1.0.0:

0 HEAD
1 SCHMA
2 ESC DATE D

The following defines tag _OLD_EXTENSION to preserve G and Q escapes:

0 HEAD
1 SCHMA
2 ESC _OLD_EXTENSION QG

The ESC could have equivalently been written as

2 ESC _OLD_EXTENSION GQ

or even

2 ESC _OLD_EXTENSION QGGQQQGGGG

... though that last version is needlessly redundant and verbose and is not recommended.

Such a definition must not be used except as backwards compatibility support for an escape-dependent _OLD_EXTENSION that predates ELF 1.0.0.

This specification uses tag and not structure type to indicate escape preservation because the main motivating case (DATE) applies it to all of the several structure types that share that tag.
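A non-normative sketch of collecting escape preservation structures into the mapping assumed by the parsing sketch earlier in this document; the payloads passed in would be those of the "ESC" structures.

def preserved_escape_types(esc_payloads):
    """Build a mapping from escape-preserving tag to its set of preserved
    escape types, e.g. from the payload "DATE D"."""
    table = {}
    for payload in esc_payloads:
        tag, types = payload.split(None, 1)
        # (A fuller implementation would reject two structures specifying
        #  different sets for the same tag, as required above.)
        table.setdefault(tag, set()).update(types)
    return table

# preserved_escape_types(["DATE D"])             ==  {"DATE": {"D"}}
# preserved_escape_types(["_OLD_EXTENSION QG"])  ==  {"_OLD_EXTENSION": {"Q", "G"}}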

Tags

Definitions

A tag is a string that matches production Tag

Tag ::= [0-9a-zA-Z_]+

A tag should be no more than 15 characters in length.

[GEDCOM 5.5.1] required tags to be unique within the first 15 characters and no more than 31 characters in length. As the memory constraints that motivated those requirements are no longer common, ELF has reduced this to recommended status instead.

A tag should begin with an underscore (_, U+005F) unless it is defined in a FHISO standard.

[GEDCOM 5.5.1] required all tags other than those it defined to begin with an underscore. ELF's use of structure type identifiers largely obviates that need, but it remains recommended in ELF 1.0.0 to support legacy systems that have special-case handling for underscore-prefixed tags. FHISO is considering removing that recommendation in a subsequent version of ELF.
"HEAD" is a valid tag; so is "_UUID". "23" and "UUID" are also valid, but should not be used as they are not defined in a FHISO standard and do not begin with an underscore. "_UNCLE_OF_THE_BRIDE" is valid, but should not be used as it is 19 characters long, more than the 15-character recommended maximum length.

Structure type identifiers are serialised as tags using tag definitions and supertypes, as outlined below.

Supertypes

A supertype definition specifies one structure type identifier that is defined to be a supertype of another.

The following are example supertype definitions in the default ELF schema:

An eventual supertype of a structure type identifier is either

Continuing the previous example,

If X is an eventual supertype of Y, then Y is an eventual subtype of X.

Continuing the previous example,

The supertype definitions in this specification are only intended to facilitate tag definitions and must not be taken to indicate any semantic relationship between the structure types they describe.

It is expected that underlying data models will often define a semantic supertype-like relationship that mirrors the supertype definitions in this document; see [ELF Data Model] for an example of what this might look like. The prohibition against assuming such a relationship from the supertype definitions alone provides a clearer separation between data model and serialisation.
We could decide to REQUIRE that any supertype definition has meaning in the underlying data model; I chose not to do so in this draft as it would require discussing semantics, which this specification otherwise does not need to do.

Tag definitions

The correspondence between tags and structure type identifiers is provided by a set of tag definitions. Each tag definition gives the unique structure type identifier that a particular tag corresponds to if its superstructure type identifier is an eventual subtype of a given superstructure type identifier.

The following are example tag definitions in the default ELF schema:

If a tagged structure has tag "CAUS" and superstructure type identifier elf:MARRIAGE, its structure type identifier is elf:CAUSE_OF_EVENT because of the last of the above tag definitions and because elf:MARRIAGE is an eventual subtype of elf:Event.

The set of tag definitions and supertype definitions must not provide two (or more) different structure type identifiers for any single structure.

The following, taken together, are not permitted

These provide two contradictory tag definitions for the tag "_EX_KIND" as a substructure of an elf:SUBMITTER_RECORD.

The following, taken together, are permitted

These provide two tag definitions for the tag "_EX_KIND" as a substructure of an elf:SUBMITTER_RECORD, but because both provide the same structure type identifier they are permitted.

A tag definition is said to apply to a structure if and only if the structure's structure type identifier is that of the tag definition and its superstructure type identifier is an eventual subtype of the tag definition's superstructure type identifier.

A tag definition is said to apply to a tagged structure if and only if the tagged structure's tag is that of the tag definition and its superstructure type identifier is an eventual subtype of the tag definition's superstructure type identifier.

Serialisation

During serialisation, a conformant application shall ensure the presence of sufficient tag definitions that each structure has a defined tag, creating new tag definitions if needed to achieve this end.

The above is not the same as saying that a tag definition is created for each structure type identifier because a structure with identifier "elf:Undefined" or an undefined tag identifier has a defined tag without a tag definition.

New tag definitions may be selected arbitrarily, subject to the limitations on tags (see §7.1) and tag definitions (see §7.1.2) and to the following:

"CONT", "CONC", "ERROR", and "UNDEF" are special tags that can be created at any location within the dataset during deserialisation.
"elf:Undefined" structures are used for errors and are serialised differently than other structures.
These tags and contexts are reserved for encoding serialisation metadata.
[GEDCOM 5.5.1] never intentionally violates the above RECOMMENDATION, but via a typo it provides both EMAI and EMAIL as tags for elf:ADDRESS_EMAIL. Other aliases exist due to similar mistakes in applications and to multiple extensions inserting the same concept via different tags. The ability to handle these aliases is the reason this is a RECOMMENDATION, not a REQUIREMENT, in ELF.

Each structure is converted to a tagged structure with the tag being

If processing structures into tagged structures in place, it may be easiest to perform a postorder traversal of each structure hierarchy; this way the superstructure of a structure being converted will still have a structure type identifier, not a tag, which will simplify looking up applicable tag definitions.
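A non-normative sketch of that suggestion, assuming structures are held as dictionaries with "type", "payload", "substructures" and a "superstructure" back-reference, and that tag_for looks up the applicable tag definition from a structure type identifier and its superstructure type identifier.

def to_tagged_in_place(structure, tag_for):
    # Substructures first (postorder), so the superstructure still carries its
    # structure type identifier when the tag definition is looked up.
    for sub in structure["substructures"]:
        to_tagged_in_place(sub, tag_for)
    parent = structure.get("superstructure")
    # Records have the superstructure type identifier elf:Document.
    super_type = parent["type"] if parent else "elf:Document"
    structure["tag"] = tag_for(structure["type"], super_type)
    del structure["type"]            # this node is now a tagged structure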

The substructures of a tagged structure are stored in a sequence, not a set. This ordering of substructures of a tagged structure must maintain the relative order of those substructures that were ordered in the corresponding structure. It is recommended that all substructures with the same tag be grouped together, but doing so is NOT required.

Consider the following structure hierarchy

This may be converted to any of the following three tagged structure hierarchies, though the second is not recommended:

However, the following puts the tagged structure graduations in a different order than the corresponding structure graduations and must not be used:

Parsing

When parsing tagged structures into structures, add the structure type identifier from the applicable tag definition.

If there is no applicable tag definition, or if there are multiple applicable tag definitions providing different structure type identifiers, then the structure type identifier shall be elf:Undefined if the tag is UNDEF, or the undefined tag identifier constructed by concatenating elf:Undefined# and the tag otherwise.

The special tag "ERROR" does not require special handling; because it never has a tag definition, it becomes the undefined tag identifier elf:Undefined#ERROR.

References

Normative references

[ANSEL]
NISO (National Information Standards Organization). ANSI/NISO Z39.47-1993. Extended Latin Alphabet Coded Character Set for Bibliographic Use. 1993. (See http://www.niso.org/apps/group_public/project/details.php?project_id=10.) Standard withdrawn, 2013.
[Basic Concepts]
FHISO (Family History Information Standards Organisation). Basic Concepts for Genealogical Standards. Public draft. (See https://fhiso.org/TR/basic-concepts.)
[ASCII]
ANSI (American National Standards Institute). ANSI X3.4-1986. Coded Character Sets -- 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII). 1986.
[ISO 10646]
ISO (International Organization for Standardization). ISO/IEC 10646:2014. Information technology — Universal Coded Character Set (UCS). 2014.
[RFC 2119]
IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, 1997. (See http://tools.ietf.org/html/rfc2119.)
[XML]
W3C (World Wide Web Consortium). Extensible Markup Language (XML) 1.1, 2nd edition. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cowan eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml11/.)

Other references

[GEDCOM 5.5.1]
The Church of Jesus Christ of Latter-day Saints. The GEDCOM Standard, draft release 5.5.1. 2 Oct 1999.
[GEDCOM 5.5]
The Church of Jesus Christ of Latter-day Saints. The GEDCOM Standard, release 5.5. 1996.
[XML Names]
W3C (World Wide Web Consortium). Namespaces in XML 1.1, 2nd edition. Tim Bray, Dave Hollander, Andrew Layman and Richard Tobin, eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml-names11/.)
[ELF Data Model]
FHISO (Family History Information Standards Organisation). Extended Legacy Format (ELF): Data Model.
[Unicode]
The Unicode Consortium. The Unicode Standard – Core Specification, version 12.1.0. See https://www.unicode.org/versions/Unicode12.1.0/.

Appendix A: Default Schema

The following is a minimal ELF file with the default ELF Schema, which includes all tag definitions and supertype definitions listed in [ELF Data Model].

0 HEAD
1 CHAR UTF-8
1 GEDC 
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
2 ELF 1.0.0
1 SCHMA
2 PRFX elf https://terms.fhiso.org/elf/
2 PRFX elfm https://terms.fhiso.org/elf/metadata/
2 ESC DATE D
2 IRI elf:ADDRESS
3 TAG ADDR elf:Agent elf:Event
2 IRI elf:ADDRESS_CITY
3 TAG CITY elf:ADDRESS
2 IRI elf:ADDRESS_COUNTRY
3 TAG CTRY elf:ADDRESS
2 IRI elf:ADDRESS_EMAIL
3 TAG EMAIL elf:Agent
3 TAG EMAI elf:Agent
2 IRI elf:ADDRESS_FAX
3 TAG FAX elf:Agent
2 IRI elf:ADDRESS_LINE1
3 TAG ADR1 elf:ADDRESS
2 IRI elf:ADDRESS_LINE2
3 TAG ADR2 elf:ADDRESS
2 IRI elf:ADDRESS_LINE3
3 TAG ADR3 elf:ADDRESS
2 IRI elf:ADDRESS_POSTAL_CODE
3 TAG POST elf:ADDRESS
2 IRI elf:ADDRESS_STATE
3 TAG STAE elf:ADDRESS
2 IRI elf:ADDRESS_WEB_PAGE
3 TAG WWW elf:Agent
2 IRI elf:ADOPTED_BY_WHICH_PARENT
3 TAG ADOP elf:ADOPTIVE_FAMILY
2 IRI elf:ADOPTION
3 ISA elf:IndividualEvent
3 TAG ADOP elf:INDIVIDUAL_RECORD
2 IRI elf:ADOPTIVE_FAMILY
3 TAG FAMC elf:ADOPTION
2 IRI elf:ADULT_CHRISTENING
3 ISA elf:IndividualEvent
3 TAG CHRA elf:INDIVIDUAL_RECORD
2 IRI elf:AGE_AT_EVENT
3 TAG AGE elf:IndividualEvent elf:Parent1Age elf:Parent2Age
2 IRI elf:ALIAS_POINTER
3 TAG ALIA elf:INDIVIDUAL_RECORD
2 IRI elf:ANCESTOR_INTEREST_POINTER
3 TAG ANCI elf:INDIVIDUAL_RECORD
2 IRI elf:ANNULMENT
3 ISA elf:FamilyEvent
3 TAG ANUL elf:FAM_RECORD
2 IRI elf:ASSOCIATION_STRUCTURE
3 TAG ASSO elf:INDIVIDUAL_RECORD
2 IRI elf:ATTRIBUTE_DESCRIPTOR
3 ISA elf:IndividualAttribute
3 TAG FACT elf:INDIVIDUAL_RECORD
2 IRI elf:AUTOMATED_RECORD_ID
3 TAG RIN elf:Record
2 IRI elf:Agent
2 IRI elf:BAPTISM
3 ISA elf:IndividualEvent
3 TAG BAPM elf:INDIVIDUAL_RECORD
2 IRI elf:BAR_MITZVAH
3 ISA elf:IndividualEvent
3 TAG BARM elf:INDIVIDUAL_RECORD
2 IRI elf:BAS_MITZVAH
3 ISA elf:IndividualEvent
3 TAG BASM elf:INDIVIDUAL_RECORD
2 IRI elf:BINARY_OBJECT
3 TAG BLOB elf:MULTIMEDIA_RECORD
2 IRI elf:BIRTH
3 ISA elf:IndividualEvent
3 TAG BIRT elf:INDIVIDUAL_RECORD
2 IRI elf:BLESSING
3 ISA elf:IndividualEvent
3 TAG BLES elf:INDIVIDUAL_RECORD
2 IRI elf:BURIAL
3 ISA elf:IndividualEvent
3 TAG BURI elf:INDIVIDUAL_RECORD
2 IRI elf:CASTE_NAME
3 ISA elf:IndividualAttribute
3 TAG CAST elf:INDIVIDUAL_RECORD
2 IRI elf:CAUSE_OF_EVENT
3 TAG CAUS elf:Event
2 IRI elf:CENSUS#Family
3 ISA elf:FamilyEvent
3 TAG CENS elf:FAM_RECORD
2 IRI elf:CENSUS#Individual
3 ISA elf:IndividualEvent
3 TAG CENS elf:INDIVIDUAL_RECORD
2 IRI elf:CERTAINTY_ASSESSMENT
3 TAG QUAY elf:SOURCE_CITATION
2 IRI elf:CHANGE_DATE
3 TAG CHAN elf:Record
2 IRI elf:CHANGE_DATE_DATE
3 TAG DATE elf:CHANGE_DATE
2 IRI elf:CHILD_LINKAGE_STATUS
3 TAG STAT elf:CHILD_TO_FAMILY_LINK
2 IRI elf:CHILD_POINTER
3 TAG CHIL elf:FAM_RECORD
2 IRI elf:CHILD_TO_FAMILY_LINK
3 TAG FAMC elf:INDIVIDUAL_RECORD
2 IRI elf:CHRISTENING
3 ISA elf:IndividualEvent
3 TAG CHR elf:INDIVIDUAL_RECORD
2 IRI elf:CONFIRMATION
3 ISA elf:IndividualEvent
3 TAG CONF elf:INDIVIDUAL_RECORD
2 IRI elf:CONTINUED_BINARY_OBJECT
3 TAG OBJE elf:MULTIMEDIA_RECORD
2 IRI elf:COPYRIGHT_GEDCOM_FILE
3 TAG COPR elf:Metadata
2 IRI elf:COPYRIGHT_SOURCE_DATA
3 TAG COPR elf:NAME_OF_SOURCE_DATA
2 IRI elf:COUNT_OF_CHILDREN#Family
3 TAG NCHI elf:FAM_RECORD
2 IRI elf:COUNT_OF_CHILDREN#Individual
3 ISA elf:IndividualAttribute
3 TAG NCHI elf:INDIVIDUAL_RECORD
2 IRI elf:COUNT_OF_MARRIAGES
3 ISA elf:IndividualAttribute
3 TAG NMR elf:INDIVIDUAL_RECORD
2 IRI elf:CREMATION
3 ISA elf:IndividualEvent
3 TAG CREM elf:INDIVIDUAL_RECORD
2 IRI elf:DATE_PERIOD
3 TAG DATE elf:EVENTS_RECORDED
2 IRI elf:DATE_VALUE
3 TAG DATE elf:Event
2 IRI elf:DEATH
3 ISA elf:IndividualEvent
3 TAG DEAT elf:INDIVIDUAL_RECORD
2 IRI elf:DEFAULT_PLACE_FORMAT
3 TAG PLAC elf:Metadata
2 IRI elf:DESCENDANT_INTEREST_POINTER
3 TAG DESI elf:INDIVIDUAL_RECORD
2 IRI elf:DESCRIPTIVE_TITLE
3 TAG TITL elf:MULTIMEDIA_FILE_REFERENCE elf:MULTIMEDIA_LINK elf:MULTIMEDIA_RECORD
2 IRI elf:DIVORCE
3 ISA elf:FamilyEvent
3 TAG DIV elf:FAM_RECORD
2 IRI elf:DIVORCE_FILED
3 ISA elf:FamilyEvent
3 TAG DIVF elf:FAM_RECORD
2 IRI elf:DOCUMENT_SOURCE
3 TAG SOUR elf:Metadata
2 IRI elf:Document
2 IRI elf:EMIGRATION
3 ISA elf:IndividualEvent
3 TAG EMIG elf:INDIVIDUAL_RECORD
2 IRI elf:ENGAGEMENT
3 ISA elf:FamilyEvent
3 TAG ENGA elf:FAM_RECORD
2 IRI elf:ENTRY_RECORDING_DATE
3 TAG DATE elf:SOURCE_CITATION_DATA
2 IRI elf:EVENT#Family
3 ISA elf:FamilyEvent
3 TAG EVEN elf:FAM_RECORD
2 IRI elf:EVENT#Individual
3 ISA elf:IndividualEvent
3 TAG EVEN elf:INDIVIDUAL_RECORD
2 IRI elf:EVENTS_RECORDED
3 TAG EVEN elf:SOURCE_RECORD_DATA
2 IRI elf:EVENT_OR_FACT_CLASSIFICATION
3 TAG TYPE elf:Event
2 IRI elf:EVENT_TYPE_CITED_FROM
3 TAG EVEN elf:SOURCE_CITATION
2 IRI elf:Event
2 IRI elf:FAM_RECORD
3 ISA elf:Record
3 TAG FAM elf:Document
2 IRI elf:FILE_NAME
3 TAG FILE elf:Metadata
2 IRI elf:FIRST_COMMUNION
3 ISA elf:IndividualEvent
3 TAG FCOM elf:INDIVIDUAL_RECORD
2 IRI elf:FamilyEvent
3 ISA elf:Event
2 IRI elf:GEDCOM_CONTENT_DESCRIPTION
3 TAG NOTE elf:Metadata
2 IRI elf:GEDCOM_FORM
3 TAG FORM elf:GEDCOM_FORMAT
2 IRI elf:GEDCOM_FORMAT
3 TAG GEDC elf:Metadata
2 IRI elf:GRADUATION
3 ISA elf:IndividualEvent
3 TAG GRAD elf:INDIVIDUAL_RECORD
2 IRI elf:IMMIGRATION
3 ISA elf:IndividualEvent
3 TAG IMMI elf:INDIVIDUAL_RECORD
2 IRI elf:INDIVIDUAL_RECORD
3 ISA elf:Record
3 TAG INDI elf:Document
2 IRI elf:IndividualAttribute
3 ISA elf:Event
2 IRI elf:IndividualEvent
3 ISA elf:Event
2 IRI elf:LANGUAGE_OF_TEXT
3 TAG LANG elf:Metadata
2 IRI elf:LANGUAGE_PREFERENCE
3 TAG LANG elf:SUBMITTER_RECORD
2 IRI elf:MAP_COORDINATES
3 TAG MAP elf:PLACE_STRUCTURE
2 IRI elf:MARRIAGE
3 ISA elf:FamilyEvent
3 TAG MARR elf:FAM_RECORD
2 IRI elf:MARRIAGE_BANN
3 ISA elf:FamilyEvent
3 TAG MARB elf:FAM_RECORD
2 IRI elf:MARRIAGE_CONTRACT
3 ISA elf:FamilyEvent
3 TAG MARC elf:FAM_RECORD
2 IRI elf:MARRIAGE_LICENSE
3 ISA elf:FamilyEvent
3 TAG MARL elf:FAM_RECORD
2 IRI elf:MARRIAGE_SETTLEMENT
3 ISA elf:FamilyEvent
3 TAG MARS elf:FAM_RECORD
2 IRI elf:MULTIMEDIA_FILE_REFERENCE
3 TAG FILE elf:MULTIMEDIA_LINK elf:MULTIMEDIA_RECORD
2 IRI elf:MULTIMEDIA_FORMAT
3 TAG FORM elf:MULTIMEDIA_FILE_REFERENCE elf:MULTIMEDIA_LINK elf:MULTIMEDIA_RECORD
2 IRI elf:MULTIMEDIA_LINK
3 TAG OBJE elf:Event elf:FAM_RECORD elf:INDIVIDUAL_RECORD elf:SOURCE_CITATION elf:SOURCE_RECORD elf:SUBMITTER_RECORD
2 IRI elf:MULTIMEDIA_RECORD
3 ISA elf:Record
3 TAG OBJE elf:Document
2 IRI elf:Metadata
2 IRI elf:NAME_OF_BUSINESS
3 ISA elf:Agent
3 TAG CORP elf:DOCUMENT_SOURCE
2 IRI elf:NAME_OF_PRODUCT
3 TAG NAME elf:DOCUMENT_SOURCE
2 IRI elf:NAME_OF_REPOSITORY
3 TAG NAME elf:REPOSITORY_RECORD
2 IRI elf:NAME_OF_SOURCE_DATA
3 TAG DATA elf:DOCUMENT_SOURCE
2 IRI elf:NAME_PHONETIC_VARIATION
3 ISA elf:PersonalName
3 TAG FONE elf:PERSONAL_NAME_STRUCTURE
2 IRI elf:NAME_PIECE_GIVEN
3 TAG GIVN elf:PersonalName
2 IRI elf:NAME_PIECE_NICKNAME
3 TAG NICK elf:PersonalName
2 IRI elf:NAME_PIECE_PREFIX
3 TAG NPFX elf:PersonalName
2 IRI elf:NAME_PIECE_SUFFIX
3 TAG NSFX elf:PersonalName
2 IRI elf:NAME_PIECE_SURNAME
3 TAG SURN elf:PersonalName
2 IRI elf:NAME_PIECE_SURNAME_PREFIX
3 TAG SPFX elf:PersonalName
2 IRI elf:NAME_ROMANIZED_VARIATION
3 ISA elf:PersonalName
3 TAG ROMN elf:PERSONAL_NAME_STRUCTURE
2 IRI elf:NAME_TYPE
3 TAG TYPE elf:PERSONAL_NAME_STRUCTURE
2 IRI elf:NATIONAL_ID_NUMBER
3 ISA elf:IndividualAttribute
3 TAG IDNO elf:INDIVIDUAL_RECORD
2 IRI elf:NATIONAL_OR_TRIBAL_ORIGIN
3 ISA elf:IndividualAttribute
3 TAG NATI elf:INDIVIDUAL_RECORD
2 IRI elf:NATURALIZATION
3 ISA elf:IndividualEvent
3 TAG NATU elf:INDIVIDUAL_RECORD
2 IRI elf:NOBILITY_TYPE_TITLE
3 ISA elf:IndividualAttribute
3 TAG TITL elf:INDIVIDUAL_RECORD
2 IRI elf:NOTE_RECORD
3 ISA elf:Record
3 TAG NOTE elf:Document
2 IRI elf:NOTE_STRUCTURE
3 TAG NOTE elf:ASSOCIATION_STRUCTURE elf:CHANGE_DATE elf:CHILD_TO_FAMILY_LINK elf:Event elf:PLACE_STRUCTURE elf:PersonalName elf:Record elf:SOURCE_CITATION elf:SOURCE_RECORD_DATA elf:SOURCE_REPOSITORY_CITATION elf:SPOUSE_TO_FAMILY_LINK
2 IRI elf:OCCUPATION
3 ISA elf:IndividualAttribute
3 TAG OCCU elf:INDIVIDUAL_RECORD
2 IRI elf:ORDINATION
3 ISA elf:IndividualEvent
3 TAG ORDN elf:INDIVIDUAL_RECORD
2 IRI elf:PARENT1_POINTER
3 ISA elf:ParentPointer
3 TAG HUSB elf:FAM_RECORD
2 IRI elf:PARENT2_POINTER
3 ISA elf:ParentPointer
3 TAG WIFE elf:FAM_RECORD
2 IRI elf:PEDIGREE_LINKAGE_TYPE
3 TAG PEDI elf:CHILD_TO_FAMILY_LINK
2 IRI elf:PERSONAL_NAME_STRUCTURE
3 ISA elf:PersonalName
3 TAG NAME elf:INDIVIDUAL_RECORD
2 IRI elf:PHONETIC_TYPE
3 TAG TYPE elf:NAME_PHONETIC_VARIATION elf:PLACE_PHONETIC_VARIATION
2 IRI elf:PHONE_NUMBER
3 TAG PHON elf:Agent
2 IRI elf:PHYSICAL_DESCRIPTION
3 ISA elf:IndividualAttribute
3 TAG DSCR elf:INDIVIDUAL_RECORD
2 IRI elf:PLACE_HIERARCHY
3 TAG FORM elf:DEFAULT_PLACE_FORMAT elf:PLACE_STRUCTURE
2 IRI elf:PLACE_LATITUDE
3 TAG LATI elf:MAP_COORDINATES
2 IRI elf:PLACE_LONGITUDE
3 TAG LONG elf:MAP_COORDINATES
2 IRI elf:PLACE_PHONETIC_VARIATION
3 TAG FONE elf:PLACE_STRUCTURE
2 IRI elf:PLACE_ROMANIZED_VARIATION
3 TAG ROMN elf:PLACE_STRUCTURE
2 IRI elf:PLACE_STRUCTURE
3 TAG PLAC elf:Event
2 IRI elf:POSSESSIONS
3 ISA elf:IndividualAttribute
3 TAG PROP elf:INDIVIDUAL_RECORD
2 IRI elf:PROBATE
3 ISA elf:IndividualEvent
3 TAG PROB elf:INDIVIDUAL_RECORD
2 IRI elf:PUBLICATION_DATE
3 TAG DATE elf:NAME_OF_SOURCE_DATA
2 IRI elf:Parent1Age
3 TAG HUSB elf:FamilyEvent
2 IRI elf:Parent2Age
3 TAG WIFE elf:FamilyEvent
2 IRI elf:ParentPointer
2 IRI elf:PersonalName
2 IRI elf:RECEIVING_SYSTEM_NAME
3 TAG DEST elf:Metadata
2 IRI elf:RELATION_IS_DESCRIPTOR
3 TAG RELA elf:ASSOCIATION_STRUCTURE
2 IRI elf:RELIGIOUS_AFFILIATION
3 TAG RELI elf:Event
2 IRI elf:RELIGIOUS_AFFILIATION#Individual
3 ISA elf:IndividualAttribute
3 TAG RELI elf:INDIVIDUAL_RECORD
2 IRI elf:REPOSITORY_RECORD
3 ISA elf:Agent
3 ISA elf:Record
3 TAG REPO elf:Document
2 IRI elf:RESIDENCE
3 ISA elf:FamilyEvent
3 TAG RESI elf:FAM_RECORD
2 IRI elf:RESIDES_AT
3 ISA elf:IndividualAttribute
3 TAG RESI elf:INDIVIDUAL_RECORD
2 IRI elf:RESPONSIBLE_AGENCY
3 TAG AGNC elf:Event elf:SOURCE_RECORD_DATA
2 IRI elf:RESTRICTION_NOTICE
3 TAG RESN elf:Event elf:FAM_RECORD elf:INDIVIDUAL_RECORD
2 IRI elf:RETIREMENT
3 ISA elf:IndividualEvent
3 TAG RETI elf:INDIVIDUAL_RECORD
2 IRI elf:ROLE_IN_EVENT
3 TAG ROLE elf:EVENT_TYPE_CITED_FROM
2 IRI elf:ROMANIZED_TYPE
3 TAG TYPE elf:NAME_ROMANIZED_VARIATION elf:PLACE_ROMANIZED_VARIATION
2 IRI elf:Record
2 IRI elf:SCHOLASTIC_ACHIEVEMENT
3 ISA elf:IndividualAttribute
3 TAG EDUC elf:INDIVIDUAL_RECORD
2 IRI elf:SEX_VALUE
3 TAG SEX elf:INDIVIDUAL_RECORD
2 IRI elf:SOCIAL_SECURITY_NUMBER
3 ISA elf:IndividualAttribute
3 TAG SSN elf:INDIVIDUAL_RECORD
2 IRI elf:SOURCE_CALL_NUMBER
3 TAG CALN elf:SOURCE_REPOSITORY_CITATION
2 IRI elf:SOURCE_CITATION
3 TAG SOUR elf:ASSOCIATION_STRUCTURE elf:Event elf:FAM_RECORD elf:INDIVIDUAL_RECORD elf:PersonalName
2 IRI elf:SOURCE_CITATION_DATA
3 TAG DATA elf:SOURCE_CITATION
2 IRI elf:SOURCE_DESCRIPTIVE_TITLE
3 TAG TITL elf:SOURCE_RECORD
2 IRI elf:SOURCE_FILED_BY_ENTRY
3 TAG ABBR elf:SOURCE_RECORD
2 IRI elf:SOURCE_JURISDICTION_PLACE
3 TAG PLAC elf:EVENTS_RECORDED
2 IRI elf:SOURCE_MEDIA_TYPE
3 TAG MEDI elf:MULTIMEDIA_FORMAT elf:SOURCE_CALL_NUMBER
2 IRI elf:SOURCE_ORIGINATOR
3 TAG AUTH elf:SOURCE_RECORD
2 IRI elf:SOURCE_PUBLICATION_FACTS
3 TAG PUBL elf:SOURCE_RECORD
2 IRI elf:SOURCE_RECORD
3 ISA elf:Record
3 TAG SOUR elf:Document
2 IRI elf:SOURCE_RECORD_DATA
3 TAG DATA elf:SOURCE_RECORD
2 IRI elf:SOURCE_REPOSITORY_CITATION
3 TAG REPO elf:SOURCE_RECORD
2 IRI elf:SPOUSE_TO_FAMILY_LINK
3 TAG FAMS elf:INDIVIDUAL_RECORD
2 IRI elf:SUBMITTER_NAME
3 TAG NAME elf:SUBMITTER_RECORD
2 IRI elf:SUBMITTER_POINTER
3 TAG SUBM elf:FAM_RECORD elf:INDIVIDUAL_RECORD elf:Metadata
2 IRI elf:SUBMITTER_RECORD
3 ISA elf:Agent
3 ISA elf:Record
3 TAG SUBM elf:Document
2 IRI elf:Structure
2 IRI elf:TEXT_FROM_SOURCE
3 TAG TEXT elf:SOURCE_CITATION elf:SOURCE_CITATION_DATA elf:SOURCE_RECORD
2 IRI elf:TIME_VALUE
3 TAG TIME elf:CHANGE_DATE_DATE elf:TRANSMISSION_DATE
2 IRI elf:TRANSMISSION_DATE
3 TAG DATE elf:Metadata
2 IRI elf:USER_REFERENCE_NUMBER
3 TAG REFN elf:Record
2 IRI elf:USER_REFERENCE_TYPE
3 TAG TYPE elf:USER_REFERENCE_NUMBER
2 IRI elf:VERSION_NUMBER
3 TAG VERS elf:DOCUMENT_SOURCE elf:GEDCOM_FORMAT
2 IRI elf:WHERE_WITHIN_SOURCE
3 TAG PAGE elf:SOURCE_CITATION
2 IRI elf:WILL
3 ISA elf:IndividualEvent
3 TAG WILL elf:INDIVIDUAL_RECORD
2 IRI elf:WITHIN_FAMILY
3 TAG FAMC elf:BIRTH elf:CHRISTENING
1 SOUR https://fhiso.org/elf/
1 NOTE This file was automatically generated from data-model.md
2 CONT by schema-maker.py at 2019-01-04T17:32:07.909701
1 SUBM @fhiso_elf1@
0 @fhiso_elf1@ SUBM
1 NAME FHISO Extended Legacy Format, version 1
0 TRLR


Copyright © 2017–19, Family History Information Standards Organisation, Inc. The text of this standard is available under the Creative Commons Attribution 4.0 International License.