Basic Concepts for Genealogical Standards

This is an exploratory draft of a standard covering basic concepts that are expected to be used in multiple FHISO standards. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.

FHISO's Basic Concepts for Genealogical Standards standard defines various low-level concepts that will be used in many FHISO standards, and whose definitions do not logically belong in any one particular higher-level standard.

The definition of a string which is used in multiple FHISO standards is given in §2 of this standard, together with various related concepts such as characters and whitespace, and §3 defines briefly how FHISO standards use language tags. Terms are defined in §4 as a form of extensible identifier using IRIs; a shorthand notation for terms known as prefix notation is described in §4.1, and §4.2 discusses information that may be retrieved from these IRIs. The notion of a datatype is defined in §5, which also includes details on how to specify a new datatype.

The concepts of classes, properties, the range of a property, and the type of a term are defined in §4.3, §4.4, §4.4.1 and §4.4.2, respectively. They provide an infrastructure for defining extensions to FHISO standards and new, compatible standards in such a way that applications can use a discovery mechanism to find out about unknown components, allowing them to be processed. The facilities in these sections will primarily be of use to parties defining extensions or implementing discovery.

It is anticipated that a future draft will include some standard, low-level datatypes for strings, integers and booleans. A pattern datatype will also be needed for the purpose of §5.1. The notion of cardinality may also be moved here from [CEV Concepts].

Conventions used

Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].

An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.

Derived standards are not allowed to add or remove requirements or prohibitions on the facilities defined herein so as to preserve interoperability between applications. Data generated by one conformant application must always be acceptable to another conformant application, regardless of what additional standards each may conform to.

If a conformant application encounters data that does not conform to this standard, it may issue a warning or error message, and may terminate processing of the document or data fragment.

Indented text in grey or coloured boxes does not form a normative part of this standard, and is labelled as either an example or a note.

Editorial notes, such as this, are used to record outstanding issues, or points where there is not yet consensus; they will be resolved and removed for the final standard. Examples and notes will be retained in the standard.

The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner.

This standard uses prefix notation, as defined in §4.1 of this standard, when discussing specific terms. The following prefix bindings are assumed in this standard:

The particular prefixes assigned above have no relevance outside this standard document, as prefix notation is not used in the formal data model defined by this standard. The notation is simply a convenience to make the standard easier to read.

Characters and strings

The concepts related to strings were originally defined in the CEV Concepts draft. This section has been moved here to be more generally usable.

Characters are specified by reference to their code point number in [ISO 10646], without regard to any particular character encoding. In this standard, characters may be identified by their hexadecimal code point prefixed with "U+".

The character encoding is a property of the serialisation, and not defined in this standard. Non-Unicode encodings are not precluded, so long as it is defined how characters in that encoding correspond to Unicode characters.

Characters must match the Char production from [XML].

Char  ::=  [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
This includes all code points except the null character, surrogates (which are reserved for encodings such as UTF-16 and not characters in their own right), and the invalid characters U+FFFE and U+FFFF.
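As a non-normative illustration, the Char production can be checked with a few range comparisons. The following Python sketch assumes strings are held as Python str values; the names CHAR_RANGES, is_char and is_char_string are hypothetical and not defined by this standard.

```python
# Inclusive code point ranges taken from the Char production.
CHAR_RANGES = ((0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))


def is_char(ch: str) -> bool:
    """Return True if the single character ch matches Char."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in CHAR_RANGES)


def is_char_string(s: str) -> bool:
    """Return True if every character of s matches Char.

    The empty string passes, as a string may have zero characters.
    """
    return all(is_char(ch) for ch in s)
```

Under these assumptions, a string containing the null character, a lone surrogate, U+FFFE or U+FFFF is rejected, and every other sequence of code points is accepted.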

A string is a sequence of zero or more characters, and should only be used to encode textual data.

This definition of a string is identical to the definition of the string datatype defined in [XSD Pt2], used in many XML and Semantic Web technologies.

This definition of a string differs very slightly from JSON's definition of a string, as defined in [RFC 7159], as a JSON string may include the null character (U+0000). This is the only difference between a JSON string and FHISO's definition of a string. As a string should not be used to contain raw binary data, this difference is not anticipated to cause a problem. If an application needs to store binary data in a string, it should encode it in a textual form, for example with the Base64 data encoding scheme defined in [RFC 4648].
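The Base64 approach might be sketched as follows in Python, using the standard library's base64 module. The function names binary_to_string and string_to_binary are hypothetical.

```python
import base64


def binary_to_string(data: bytes) -> str:
    """Encode raw bytes as Base64 text, which uses only ASCII characters."""
    return base64.b64encode(data).decode("ascii")


def string_to_binary(text: str) -> bytes:
    """Recover the original bytes from Base64 text."""
    return base64.b64decode(text)
```

The encoded form contains no null characters or other control characters, so it can be carried in a string as defined here.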

Applications may convert any string into Unicode Normalization Form C, as defined in any version of Unicode Standard Annex #15 [UAX 15].

Normalization Form C and Normalization Form D allow easier searching, sorting and comparison of strings by picking a canonical representation of accented characters. The conversion between Normalization Forms C and D is lossless and therefore reversible, but the initial conversion to either form is not reversible. This allows a conformant application to normalise strings internally and not retain the unnormalised form; however, an application doing so must ensure the string is in Normalization Form C upon export, this being the more usual form for use in documents.
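A minimal Python sketch of normalisation, assuming the standard library's unicodedata module, which implements the normalisation forms for the Unicode version bundled with the interpreter. The function name to_nfc is hypothetical.

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Convert s to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


# A decomposed "e" followed by a combining acute accent composes to U+00E9.
composed = to_nfc("e\u0301")

# The angstrom sign U+212B normalises to U+00C5, and the original code
# point cannot be recovered afterwards: the initial conversion is lossy.
angstrom = to_nfc("\u212b")
```

Converting the composed result to Normalization Form D and back to Form C yields the same string again, which illustrates why conversion between the two forms is reversible once a string has been normalised.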

Characters matching the RestrictedChar production from [XML] should not appear in strings, and applications may process such characters in an implementation-defined manner or reject strings containing them.

RestrictedChar  ::=  [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
                       | [#x7F-#x84] | [#x86-#x9F]
This includes all C0 and C1 control characters except tab (U+0009), line feed (U+000A), carriage return (U+000D) and next line (U+0085).
As conformant applications can process C1 control characters in an implementation-defined manner, they can opt to handle Windows-1252 quotation marks in data masquerading as Unicode. Applications must not treat non-ASCII characters (other than C1 control characters) as ANSEL, the character set properly used in [GEDCOM], as [ANSEL]'s non-ASCII characters do not correspond to RestrictedChars.
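An application that chooses to detect such characters might do so with a regular expression equivalent to the RestrictedChar production. This is a non-normative Python sketch; the names RESTRICTED_CHAR and has_restricted_char are hypothetical, and what the application does on detection remains implementation-defined.

```python
import re

# Character class equivalent to the RestrictedChar production.
RESTRICTED_CHAR = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")


def has_restricted_char(s: str) -> bool:
    """Return True if s contains any character matching RestrictedChar."""
    return RESTRICTED_CHAR.search(s) is not None
```

Tab, line feed, carriage return and next line are not matched. The code points U+0093 and U+0094, where Windows-1252 quotation marks land when such data is mislabelled as Unicode, are matched.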

Conformant applications must be able to store and process strings containing arbitrary characters other than those matching the RestrictedChar production. In particular, applications must be able to handle characters which correspond to unassigned Unicode code points as they may be assigned in future versions of [ISO 10646]. Applications must also be able to handle characters outside Unicode's Basic Multilingual Plane, that is, characters with a code point of U+10000 or higher.

This means applications must not represent strings internally in the UCS-2 encoding which does not accommodate characters outside the Basic Multilingual Plane. The UTF-16 encoding defined in §2.6 of [ISO 10646] provides a 16-bit encoding that is backwards compatible with UCS-2 but allows arbitrary characters to be represented through the use of Unicode surrogate pairs.
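The surrogate pair mechanism can be illustrated with a short, non-normative Python sketch. The function name utf16_code_units is hypothetical; a real application would normally rely on its platform's UTF-16 encoder rather than computing code units by hand.

```python
from typing import List


def utf16_code_units(ch: str) -> List[int]:
    """Return the 16-bit code units that represent the single character ch."""
    cp = ord(ch)
    if cp < 0x10000:
        # Within the Basic Multilingual Plane one code unit suffices.
        return [cp]
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return [high, low]
```

A character such as U+10000 needs two code units, which is why an internal representation limited to one 16-bit unit per character cannot satisfy the requirement above.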

Whitespace is defined as a sequence of one or more space characters, carriage returns, line feeds, or tabs. It matches the production S from [XML].

S  ::=  (#x20 | #x9 | #xD | #xA)+
This definition only includes common ASCII whitespace characters and does not include every character in [ISO 10646] that could be considered to be a whitespace. For example, the vertical tab (U+000B), no-break space (U+00A0) and em space (U+2003) are all excluded.

Whitespace normalisation is the process of discarding any leading or trailing whitespace, and replacing other whitespace with a single space (U+0020) character.

The definition of whitespace normalisation is identical to that in [XML].
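Whitespace normalisation might be implemented as in the following non-normative Python sketch. The names WHITESPACE and normalise_whitespace are hypothetical.

```python
import re

# The S production: one or more of space, tab, carriage return, line feed.
WHITESPACE = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalise_whitespace(s: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    collapsed = WHITESPACE.sub(" ", s)
    # Any leading or trailing run is now a single space, so strip spaces only.
    return collapsed.strip(" ")
```

An explicit character class is used because Python's str.split() and str.strip() without arguments also treat characters such as the no-break space (U+00A0) as whitespace, which is broader than the definition given here.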

In the event of a difference between the definitions of the Char, RestrictedChar and S productions given here and those in [XML], the definitions in the latest edition of the XML 1.1 specification are definitive.

Language tags

The material in this section is new in this draft.

A language tag is a string that is used to represent a human language, and where appropriate the script and regional variant or dialect used. They are commonly used to tag other strings to identify their language in a machine-readable manner.

The language tag shall match the Language-Tag production from [RFC 5646], or from any future RFC published by the IETF that obsoletes [RFC 5646] (hereinafter referred to as RFC 5646's successor RFC), and should be valid, as defined in §2.2.9 of [RFC 5646].

Valid language tags have the meaning that is assigned to them by [RFC 5646] and any successor RFC. Applications may discard any language tag that is not well-formed and replace it with und, meaning an undetermined language, but must not discard any language tag that is well-formed even if it is not valid.

[RFC 5646] says that to be valid, a language tag must consist of tags that have been registered in the [IANA Lang Subtags] registry. This is freely available online in a machine-readable form defined in §3.1.1 of [RFC 5646], and gives the meaning of every tag. Currently it includes:

The meanings of codes in the source ISO standards may change over time, but the procedure set out in §3.4 of [RFC 5646] governing the addition of tags to [IANA Lang Subtags] ensures the meanings there are stable. This particularly affects [ISO 3166-1] country codes which historically have been reused, and may result in a gradual divergence between [ISO 3166-1] and [IANA Lang Subtags]. Applications should therefore avoid using [ISO 3166-1] codes that have not been registered in [IANA Lang Subtags].

A string tagged with the language tag hu-CS must be interpreted by a conformant application as being in the Hungarian language localised for use in the former state of Serbia and Montenegro, because this is how hu and CS are listed in [IANA Lang Subtags]. The code CS is perhaps better known as representing the former state of Czechoslovakia and appears in older lists of [ISO 3166-1] country codes as such, but neither IANA nor FHISO recognise this former meaning.

This is one of five country codes whose meaning has materially changed in [ISO 3166-1], the other four being AI, BQ, GE and SK. In each case, because the reuse occurred before the creation of [IANA Lang Subtags], it is the current meaning that is listed in [IANA Lang Subtags]. If there is further reuse of country codes in the future, [RFC 5646] requires that the current meaning of the tag be retained and a numeric code be given to the new country in [IANA Lang Subtags].

A conformant application may convert any language tag into its canonical form, as defined by §4.5 of [RFC 5646] or an equivalent section of a successor RFC.

The chief purpose of canonical form is to replace deprecated language codes and other subtags with the value found in the Preferred-Value field in [IANA Lang Subtags]. It never results in the removal of a script subtag, even when that subtag is the default script for the language as defined by a Suppress-Script field.
The language tag iw is listed in [IANA Lang Subtags] as a deprecated language code for Hebrew which has now been removed from [ISO 639-1]. Its Preferred-Value field is he, so an application may replace iw with he.
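The replacement of a deprecated language code might be sketched as follows. This non-normative Python example covers only the primary language subtag and assumes a hypothetical, hard-coded table holding the single iw entry from the example above; a real application would load Preferred-Value fields from [IANA Lang Subtags] and would need to handle the other cases described in §4.5 of [RFC 5646].

```python
# Hypothetical extract of Preferred-Value mappings for language subtags.
PREFERRED_VALUE = {"iw": "he"}


def replace_deprecated_language(tag: str) -> str:
    """Replace a deprecated primary language subtag, leaving the rest alone."""
    primary, separator, rest = tag.partition("-")
    preferred = PREFERRED_VALUE.get(primary.lower(), primary)
    return preferred + separator + rest
```

Script and region subtags are passed through unchanged, so a tag such as iw-Hebr-IL becomes he-Hebr-IL with its script subtag retained.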

A conformant application may alter a language tag in any other way that leaves its canonical form unchanged when compared in a case-insensitive manner.

Such changes are permitted for three reasons. First, it allows applications to revert new tags to older deprecated forms when exporting data to an older application. Secondly, it allows applications to remain conformant even if they are basing conversions on an outdated copy of the [IANA Lang Subtags] registry. This is because §3.4 of [RFC 5646] only allows certain compatible changes to the registry. Thirdly, it allows applications to apply the conventional capitalisation of language tags defined in §2.1.1 of [RFC 5646].
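The conventional capitalisation described in §2.1.1 of [RFC 5646] can be reproduced without consulting the registry. The following non-normative Python sketch assumes its input is a well-formed language tag; the function name conventional_case is hypothetical.

```python
def conventional_case(tag: str) -> str:
    """Apply the conventional capitalisation of §2.1.1 of RFC 5646.

    All subtags are lowercase, except that two-letter subtags become
    uppercase and four-letter subtags become titlecase when they are
    neither at the start of the tag nor directly after a singleton.
    """
    subtags = tag.split("-")
    result = []
    for index, subtag in enumerate(subtags):
        after_singleton = index > 0 and len(subtags[index - 1]) == 1
        if index == 0 or after_singleton:
            result.append(subtag.lower())
        elif len(subtag) == 2:
            result.append(subtag.upper())
        elif len(subtag) == 4:
            result.append(subtag.title())
        else:
            result.append(subtag.lower())
    return "-".join(result)
```

Because only letter case changes, the result always compares equal to the input in a case-insensitive comparison, which is the condition the preceding requirement places on such alterations.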

A string which is accompanied by a language tag which identifies the language in which the string is written is called a language-tagged string.

The language tag is not itself part of the string, but is stored alongside it.


The concept of a term was originally defined in the CEV Concepts draft. It has been moved here to be more generally usable. The material in §4.3 and §4.4 is new in this draft, but draws heavily on FHISO's Vocabularies policy.

A term is a form of identifier used in FHISO standards to represent a concept which it is useful to be able to reference. A term consists of a unique, machine-readable identifier, known as the term name, paired with a clearly-defined meaning for the concept or idea that it represents. Term names shall take the form of an IRI matching the IRI production in §2.2 of [RFC 3987].

This standard uses terms to name datatypes, as defined in §5 of this standard, and also for classes and properties, defined in §4.3 and §4.4.
Give an actual example, e.g. xsd:integer.
IRIs have been chosen in preference to URIs because it is recognised that certain culture-specific genealogical concepts may not have English names, and in such cases the human-legibility of IRIs is advantageous. URIs are a subset of IRIs, and all the terms defined in this suite of standards are also URIs.

Term names are compared using the "simple string comparison" algorithm given in §5.3.1 of [RFC 3987]. If a term name does not compare equal to an IRI known to the application, the application must not make any assumptions about the term, its meaning or intended use, based on the form of the IRI or any similarity to other IRIs.

This comparison is a simple character-by-character comparison, with no normalisation carried out on the IRIs prior to comparison. It is also how XML namespace names are compared in [XML Names].
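In most programming languages this comparison is simply string equality. The non-normative Python sketch below uses hypothetical IRIs under example.com, which are assumptions made for illustration and are not terms defined by any standard; the function name same_term_name is likewise hypothetical.

```python
def same_term_name(a: str, b: str) -> bool:
    """Simple string comparison: equal only if identical character by character."""
    return a == b


# Hypothetical term name known to the application.
KNOWN = "https://example.com/events/Baptism"

# Variants that an HTTP client might treat as equivalent to KNOWN.
UPPERCASE_HOST = "https://EXAMPLE.com/events/Baptism"
PERCENT_ENCODED = "https://example.com/events/%42aptism"
```

Neither variant compares equal to the known term name, so an application must not assume either of them has the same meaning.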

The following IRIs are all distinct for the purpose of the "simple string comparison" algorithm given in §5.3.1 of [RFC 3987], even though an HTTP request to them would fetch the same resource.


An IRI must not be used as a term name unless it can be converted to a URI using the algorithm specified in §3.1 of [RFC 3987], and back to an IRI again using the algorithm specified in §3.2 of [RFC 3987], to yield the original IRI.

This requirement ensures that term names can be used in a context where a URI is required, and that the original IRI can be regenerated, for example for comparison with a list of known IRIs. The vast majority of IRIs, including those in non-Latin scripts, have this property. The effect of this requirement is to prohibit the use of IRIs that are already partly converted to a URI, for example through the use of unnecessary percent or punycode encoding.
Of the three IRIs given in the previous example on how to compare IRIs, only the first may be used as a term name. The second and third are prohibited as a result of the unnecessary percent-encoding, and the third is additionally prohibited as a result of unnecessary punycode-encoding.

The terms defined in FHISO standards all have term names that begin Subject to the requirements in the applicable standards, third parties may also define additional terms. It is recommended that any such terms use either the http or preferably the https IRI scheme defined in §2.7.1 and §2.7.2 of [RFC 7230] respectively, and an authority component consisting of just a domain name or subdomain under the control of the party defining the term.

An http or https IRI scheme is recommended because the IRI is used to fetch a resource during discovery, and it is desirable that applications implementing discovery should only need to support a minimal number of transport protocols. URN schemes like the uuid scheme of [RFC 4122] are not recommended as they do not have transport protocols that can be used during discovery.

The preference for an https IRI is because of security considerations during discovery. A man-in-the-middle attack during discovery could insert malicious content into the response, which, if undetected, could cause an application to process user data incorrectly, potentially discarding parts of it or otherwise compromising its integrity. It is harder to stage a man-in-the-middle attack over TLS, especially if public key pinning is used per [RFC 7469].

Prefix notation

Term names are sometimes referred to using prefix notation. This is a system whereby prefixes are assigned to IRIs that occur frequently as the leading portion of a term name. Then, instead of writing the term name in full, the leading portion of the term name is replaced by its prefix followed by a colon (U+003A) separator.

The term name is used in several places in this standard. Instead of writing this in full, if the rdfs prefix is bound to the IRI, then this IRI can be written in prefix form as rdfs:Class.
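Expanding and abbreviating prefix notation might be sketched as follows. This Python example is non-normative; the binding of the prefix ex to an IRI under example.com is a hypothetical assumption made for illustration, as are the function names expand and shorten.

```python
from typing import Dict

# Hypothetical prefix binding used only for this illustration.
PREFIXES = {"ex": "https://example.com/events/"}


def expand(prefixed: str, prefixes: Dict[str, str]) -> str:
    """Expand prefix notation such as ex:Baptism into a full term name."""
    prefix, separator, local = prefixed.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError("no binding for prefix in " + repr(prefixed))
    return prefixes[prefix] + local


def shorten(term_name: str, prefixes: Dict[str, str]) -> str:
    """Write a term name in prefix notation if a bound IRI is its leading portion."""
    for prefix, leading in prefixes.items():
        if term_name.startswith(leading):
            return prefix + ":" + term_name[len(leading):]
    return term_name
```

As prefix notation is only a notational convenience, an application comparing term names would compare the expanded IRIs, not the prefixed forms.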

IRI resolution

It is recommended that an HTTP GET request to a term name IRI with an http or https scheme (once converted to a URI per §3.1 of [RFC 3987]), should result in a 303 "See Other" redirect to a document containing a human-readable definition of the term if the request was made without an Accept header or with an Accept header matching the format of the human-readable definition. It is further recommended that this format should be HTML, and that documentation in alternative formats may be made available via HTTP content negotiation when the request includes a suitable Accept header, per §5.3.2 of [RFC 7231].

A 303 redirect is considered best practice for [Linked Data], so as to avoid confusing the term name IRI with the document containing its definition, which is found at the post-redirect URL. The terms defined in this suite of standards are not specifically designed for use in Linked Data, but the same considerations apply.

Parties defining terms should arrange for their term name to support discovery. This is when an HTTP GET request to a term name IRI with an http or https scheme, made with an appropriate Accept header, yields a 303 redirect to a machine-readable definition of the term.

This standard does not specify a specific version of HTTP, but at the current time, even though HTTP/2 is becoming more popular, HTTP/1.1 is the most widely implemented version of HTTP. While this remains true, applications and discovery servers are encouraged to support HTTP/1.1.

This standard does not define a discovery mechanism, but it is recommended that parties defining terms support FHISO's [Triples Discovery] mechanism, and may additionally support other mechanisms. Support for discovery by applications is optional.

Suppose an application wants to perform discovery on the hypothetical term used in several later examples in this standard. If the application supports FHISO's [Triples Discovery] mechanism, which uses [N-Triples] as its serialisation format, together with some other hypothetical discovery mechanism using the application/x-discovery MIME type, but prefers to use [Triples Discovery], it might make the following HTTP request:

GET /events/Baptism HTTP/1.1
Accept: application/n-triples, application/x-discovery; q=0.9

In this example, the q=0.9 in the Accept header is a quality value which, per §5.3 of [RFC 7231], indicates that the x-discovery format is less preferred than n-triples which by default has a quality value of 1.0.
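On the server side, the preference order can be recovered by parsing the Accept header. The following non-normative Python sketch is a simplification that ignores media type parameters other than q and does not handle wildcards; the function name parse_accept is hypothetical.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media type, quality value) pairs, most preferred first."""
    entries = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        quality = 1.0  # default when no q parameter is given
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                quality = float(value)
        entries.append((media_type, quality))
    # The sort is stable, so equally preferred types keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)
```

For the header in this example, application/n-triples is placed first with the default quality value, ahead of application/x-discovery.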

If the server supports n-triples, it must respond with a 303 redirect:

HTTP/1.1 303 See Other
Vary: Accept

In this case the redirect is to the original IRI but with .n3 appended; however, the actual choice of IRI is up to the party defining the term and running the web server. When a server's response is dependent on the contents of an Accept header, §7.1.4 of [RFC 7231] says that this should be recorded in a Vary header, as it is in this example.

The application would normally then make a second HTTP request to follow the redirect:

GET /events/Baptism.n3 HTTP/1.1
Accept: application/n-triples, application/x-discovery; q=0.9

This request uses the same Accept header as the first, as HTTP redirects contain no information about the MIME type of the destination resource, so at this point the application does not know which discovery mechanism the server is using, or whether the server does not support discovery or HTTP content negotiation and is serving a human-readable definition.

The server's response to this request should be an N-Triples file containing information about the Baptism term.
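An application might construct the initial request as in the following non-normative Python sketch, which uses the standard library's urllib.request module. The host example.com is a hypothetical assumption, as the example above shows only the path of the term name; the names ACCEPT and build_discovery_request are also hypothetical.

```python
import urllib.request

# Prefer N-Triples, but also accept the hypothetical x-discovery format.
ACCEPT = "application/n-triples, application/x-discovery; q=0.9"


def build_discovery_request(term_name: str) -> urllib.request.Request:
    """Build a GET request for a term name, stating the formats understood."""
    return urllib.request.Request(
        term_name,
        headers={"Accept": ACCEPT},
        method="GET",
    )


# Hypothetical term name; no network access happens until the request is opened.
request = build_discovery_request("https://example.com/events/Baptism")
```

Passing the request to urllib.request.urlopen would send it and, by default, follow a 303 redirect. An application would then inspect the Content-Type of the final response before parsing it, since the server may have served a human-readable definition instead.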

A party defining a term may support discovery without using HTTP content negotiation on their web server by serving a 303 redirect to a machine-readable definition of the term unconditionally; however, it is recommended that such servers implement HTTP content negotiation respecting the Accept header.


This section defines a basic type system for terms. It is part of the infrastructure for defining extensions to FHISO standards or new, compatible standards, and is used by applications during discovery, support for which is optional.

Terms are used in many contexts in FHISO standards and it can be useful to have a concise, machine-readable way of stating the use for which a term was defined.

A class is a term used to denote a particular context or use for which other terms may be defined. Standards defining such contexts should define a class to represent that context, and must do so if third parties are permitted to define their own terms for use in that context.

A hypothetical standard might define various terms representing events of genealogical interest that might occur during a person's lifetime. Examples might include:

The standard should provide a class to represent the abstract concept of an event type, and as the class is itself a term, it must have an IRI as its term name. Perhaps it might be:

This class might be referred to as the class of event types.

The words "class" and "type" are used in many contexts in computing. As used here, a class is similar to a datatype of which terms are values, or a class of which terms are instances, or a named enumeration type of which terms are values. FHISO's use of this word does not mean that the other notions associated with the word "class" in object-oriented programming apply here.

The term name of a class is also referred to as its class name.

When a term has been defined for use in the context denoted by some class, that class is referred to as the type of the term.

In prefix notation, with the prefix ex bound to, the type of ex:Baptism from the previous example is ex:EventType.

As a class is a term, defining a class is itself a context in which terms are defined, including by third parties. This means the general concept of a class needs a term defining to represent it. This standard uses the rdfs:Class term for this purpose:
This can be thought of as a class of classes. It is not merely an arcane abstraction: it serves a useful role in discovery. If discovery is carried out on the term name of a class, it is useful to be able to indicate that the term is a class. This can be done by saying the type of the term is rdfs:Class.

Although the rdfs:Class class is defined in §2.2 of [RDF Schema], this standard does not require support for any of the facilities in [RDF Schema], nor are parties defining classes or terms required to do so in a manner compatible with RDF. An implementer may safely use the rdfs:Class class for the purposes of this standard using just the information given in this section without reading [RDF Schema] or otherwise being familiar with RDF.

The decision to use rdfs:Class and other terms from [RDF Schema] is due to FHISO's practice of reusing facilities from existing standards when they are a good match for our requirements, rather than inventing our own versions with similar functionality. It also allows future standards and vendor extensions the option of reusing existing third-party vocabularies where appropriate, as most such vocabularies are also aligned with RDF.

The type of any class is therefore rdfs:Class.

There is no need for a further level of abstraction to represent the type of rdfs:Class. As rdfs:Class is just another class, albeit a fairly special one, the type of rdfs:Class is rdfs:Class.
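The relationship between terms, their types and rdfs:Class might be recorded as in the following non-normative Python sketch. Term names are written in prefix notation for brevity, the ex: terms are the hypothetical ones from the earlier example, and the names TYPE_OF and is_class are hypothetical.

```python
# Hypothetical table recording the type of each term known to an application.
TYPE_OF = {
    "ex:Baptism": "ex:EventType",   # an event type
    "ex:EventType": "rdfs:Class",   # the class of event types
    "rdfs:Class": "rdfs:Class",     # the class of classes is its own type
}


def is_class(term: str) -> bool:
    """Return True if the term is known to have rdfs:Class as its type."""
    return TYPE_OF.get(term) == "rdfs:Class"
```

An unknown term is reported as not being a class, reflecting that an application cannot make assumptions about terms it has not discovered.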
Class definition


This section defines a simple vocabulary for describing terms. It is part of the infrastructure for defining extensions to FHISO standards or new, compatible standards, and is used by applications during discovery, support for which is optional.

During discovery, and in other situations when a formal definition of a particular term is needed, it is useful to have a formalism for providing information about that term.

A property is a particular piece of information that might be provided when defining another entity. The thing being defined is typically a term, and is called the subject of the property.

The subject of the property is only said to be typically a term so that citation element terms (in CEV Concepts) can be made a subclass of property terms. The subject of a citation element is a source which is not a term as we don't require them to be identified by an IRI. It is likely that other genealogical concepts, possibly including individual attributes in ELF, may also be treated as properties whose subjects are not terms. In the case of individual attributes, the subject is an individual which is likely not identified by an IRI.

The property consists of two parts, both of which are required to be present:

The property name shall be a term that has been defined to be used as a property name in the manner required by this standard; a term defined for this purpose is called a property term.

This nomenclature draws a distinction between a property name and a property term. The former is part of a property, and is therefore part of the description of the subject of the property, while the latter is an item of vocabulary referenced by that description. The property name is a property term.

The property value shall be a term, a string, or a language-tagged string.

Properties shall not have a default property value that applies when the property is absent; however, standards may define how a conformant application handles the absence of a property.

Standards which introduce such pieces of information should define property terms to represent them, and must do so if third parties are permitted to define their own terms and if it is recommended or required that these third parties document or otherwise make available the information represented by the property.

An earlier example introduced several hypothetical terms for events of genealogical interest, such as birth, baptism, ordination, emigration and death. Many events can occur multiple times during a person's life: for example, a person might emigrate more than once. But other events cannot by definition occur more than once: birth and death are obvious examples. The number of times something is permitted to occur is sometimes called its cardinality, and if the authors of this hypothetical standard considered it a relevant concept, they should define a property term to represent the concept of cardinality:

If the hypothetical standard allows third parties to define additional types of event, and either recommends or requires that they state the cardinality of the new events, then the standard must define a property term representing cardinality.

The term name of a property term is also referred to as its property term name.

The class of property terms has the following class name:

Class definition
The rdf:Property class is defined in §2.8 of [RDF Schema]. The fact that its term name IRI begins differently to the rdfs:Class term name used in §4.3 of this standard is due to historical reasons. As with the rdfs:Class term, an implementer may safely use the rdf:Property term for the purposes of this standard without reading [RDF Schema].


This section defines a property term to describe the way other properties are to be used. It is part of the infrastructure for defining extensions to FHISO standards or new, compatible standards, and is used by applications during discovery, support for which is optional.

The range of a property term is a formal specification of allowable property values for a property whose property name is that property term. The range shall be a class name or a datatype name.

Datatypes provide a formal description of the values allowed in a particular context. They are defined in §5 of this standard.

When the range is a class, the property value shall be a term whose type is that class; when the range is a datatype, the value associated with the property shall be a string in the lexical space of that datatype.

An earlier example gave a hypothetical cardinality property term that might be used when defining genealogical events. Most likely, the property value of this property would be a representation of "one" or "unbounded", depending on whether the event is one that can occur just once, or whether it can occur multiple times. The party defining this property would need to consider how best to represent these two values.

One option is to define two terms to represent these options, say:

The context in which these two terms can be used is when specifying a cardinality, so a Cardinality class would be defined:

The type of SinglyOccuring and MultiplyOccuring would be Cardinality, and the range of the cardinality property would be the Cardinality class. Having a property and the class that serves as its range only differing in capitalisation is a common idiom.

A second option is to use two strings to represent the possible cardinalities, perhaps "1" and "unbounded". A datatype would then be defined whose lexical space consisted of just these two strings, and the datatype given a name like:

In this option, the range of the cardinality property would be this datatype rather than a class.

A third and likely preferable option would be to name the cardinality property differently, say canOccurMultiply, so that its range could be a standard boolean datatype like xsd:boolean.
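Checking a property value against a range might be sketched as follows. This Python example is non-normative: the Term record, the CLASSES and DATATYPES registries and the function value_in_range are all hypothetical, term names are written in prefix notation for brevity, and the xsd:boolean check assumes the lexical space given in [XSD Pt2].

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Term:
    name: str  # term name, in prefix notation for this sketch
    type: str  # class name of the term's type


# Hypothetical registries of the classes and datatypes an application knows.
CLASSES = {"ex:Cardinality"}
DATATYPES: Dict[str, Callable[[str], bool]] = {
    # Each entry tests membership of a datatype's lexical space.
    "xsd:boolean": lambda s: s in {"true", "false", "1", "0"},
}


def value_in_range(value: Union[Term, str], range_name: str) -> bool:
    """Return True if value is an allowable property value for the range."""
    if range_name in CLASSES:
        # The value must be a term whose type is the range class.
        return isinstance(value, Term) and value.type == range_name
    if range_name in DATATYPES:
        # The value must be a string in the datatype's lexical space.
        return isinstance(value, str) and DATATYPES[range_name](value)
    raise ValueError("unknown range: " + range_name)
```

Under the first option above, a SinglyOccuring term of type ex:Cardinality would be accepted for a range of ex:Cardinality, while under the third option the strings "true" and "false" would be accepted for a range of xsd:boolean.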

This standard has already defined one property term, namely the rdf:type property term in §4.4.2. The type of a term is the class which denotes the context in which it can be used. Therefore the range of rdf:type is rdfs:Class, as shown in the property definition table in §4.4.2.

Standards which define property terms should specify their range, and must do so if third parties are permitted to define their own terms and if it is recommended or required that these third parties document or otherwise make available the information represented by the property term.

This is the same wording that is used in §4.4 to specify when a property term must be defined. In circumstances where a property term must be defined, its range must also be defined.

The range of a property term is itself a property which is defined as follows:

Property definition

The range of the rdfs:range property is defined as rdfs:Class, despite the fact that the property value of an rdfs:range property can be either a class name or a datatype name. At an RDF level this is valid because

rdfs:Datatype rdfs:subClassOf rdfs:Class .

This standard does not currently have the notion of a subclass, which makes this problematic.

We may need to introduce the concept of the domain of a property term, which is currently in our Vocabularies policy. Careful consideration will be needed before the domain is introduced to ensure it does not cause forwards compatibility problems if new uses are found for the property.

The type property

This section defines a property term to denote the type of a subject. It is part of the infrastructure for defining extensions to FHISO standards or new, compatible standards, and is used by applications during discovery, support for which is optional.

The type of a term, as introduced in §4.3, is a piece of information which should normally be provided, albeit often implicitly, when defining a term. As such it needs a property term to represent it. This standard uses the rdf:type term for this purpose:

Property definition
The rdf:type property term is defined in §3.3 of [RDF Schema]; however, implementers may safely use this property term for the purposes of this standard without reading [RDF Schema].


The concepts related to datatypes were originally defined in the CEV Concepts draft. This section and its subsections have been moved here to be more generally usable.

A datatype is a term which serves as a formal description of the values that are permissible in a particular context. Being a term, a datatype is identified by a term name which is an IRI. The term name of a datatype is also referred to as its datatype name.

A datatype has a lexical space which is the set of strings which are interpreted as valid values of the datatype. The definition of a datatype shall state how each string in its lexical space maps to a logical value, and state the semantics associated with those values.

This definition of a datatype is sufficiently aligned with XML Schema's notion of a simple type, as defined in [XSD Pt2], that XML Schema's simple types can be used as datatypes in this standard. Best practice on how to get an IRI for use as the term name of XML Schema types can be found in [SWBP XSD DT]. Similarly, this standard's definition of a datatype is very similar to the definition of a datatype in [RDF Concepts], and RDF datatypes can be used as datatypes in this standard.

XML Schema defines an integer type in §3.4.13 of [XSD Pt2] which is well suited for use in this standard. XML Schema does not give its types IRIs, but it does give them ids, and following the best practice advice given in §2.3 of [SWBP XSD DT] gives it the following IRI:

This same type is also recommended for use in RDF by §5.1 of [RDF Concepts] which explicitly gives it the IRI above.

The lexical space of this datatype is the space of all strings consisting of a finite-length sequence of one or more decimal digits (U+0030 to U+0039, inclusive), optionally preceded by a + or - sign (U+002B or U+002D, respectively). Thus the string "137" is within the lexical space of this datatype, but "20.000" and "四十二" are not, despite being normal ways of representing integers in certain cultures.
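
The lexical space described above can be checked mechanically. The following is a minimal sketch, assuming Python's re module as a stand-in for whatever regular expression facility an application uses; the function name in_integer_lexical_space is hypothetical and not part of this standard.

```python
import re

# Optional sign, then one or more ASCII decimal digits (U+0030 to U+0039).
# [0-9] is used instead of \d because \d also matches non-ASCII digits in Python.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

def in_integer_lexical_space(s: str) -> bool:
    """Return True if the whole of s is in the lexical space of xsd:integer."""
    return INTEGER_LEXICAL.fullmatch(s) is not None

print(in_integer_lexical_space("137"))     # True
print(in_integer_lexical_space("20.000"))  # False
print(in_integer_lexical_space("四十二"))    # False
```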

The mapping from lexical representations to logical values need not be one-to-one. If a datatype has multiple lexical representations of the same logical value, a conformant application must treat these representations equivalently and may change a string of that datatype to be a different but equivalent lexical representation.

This allows applications to store such strings internally using an entity (such as a database field or a variable) of some appropriate type without retaining the original lexical representation.
The XML Schema integer datatype used in the previous example is one where the mapping from lexical representation to value is many-to-one rather than one-to-one. This is because the lexical space includes strings with a leading + sign as well as superfluous leading 0s, which means that "00137", "+137" and "137" all represent the same underlying value: the number one hundred and thirty-seven. Because conformant applications may convert strings between equivalent lexical representations, they may store them in a database in an integer field and regenerate strings in a canonical representation.
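
As a rough illustration of regenerating a canonical representation, the sketch below parses a string into a native integer and writes it back out. It assumes Python and treats the shortest decimal form, with no + sign and no leading zeros, as canonical; the function name canonical_integer is hypothetical.

```python
import re

# Lexical space of xsd:integer: optional sign, then ASCII decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")

def canonical_integer(s: str) -> str:
    """Map any lexical representation of an integer to one canonical string."""
    if _INTEGER.fullmatch(s) is None:
        raise ValueError(f"{s!r} is outside the lexical space of xsd:integer")
    # int() discards the + sign and leading zeros; str() writes the value back.
    return str(int(s))

for lexical in ("00137", "+137", "137"):
    print(canonical_integer(lexical))  # prints 137 each time
```

The explicit check against the lexical space comes first because Python's int() is more permissive than xsd:integer; for example, it tolerates surrounding whitespace.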

Strings outside the lexical space of a datatype must not be used where a string of that datatype is required. If an application encounters any such strings, it may remove them from the dataset or may convert them to a valid value in an implementation-defined manner. Any such conversion that is applied automatically by an application must either be locale-neutral or respect any locale given in the dataset.

XML Schema defines a date type in §3.3.9 of [XSD Pt2] which has a lexical space based on [ISO 8601] dates. If, in a dataset that is somehow identified as being written in German, an application encounters the string "8 Okt 2000" in a context where an XML Schema date is expected, it may convert this to "2000-10-08". However, an application encountering the string "8/10/2000" must not conclude this represents 8 October or 10 August unless the document includes a locale that uniquely determines the date format. In this case, information that the document is in English is not sufficient, as different English-speaking countries have different conventions for formatting dates.
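
One way an application might honour this rule is sketched below in Python. The table NUMERIC_DATE_FORMATS and the function to_xsd_date are hypothetical, and the two locale entries are illustrative assumptions only; the point is that a bare "en" locale has no entry and so no conversion is attempted.

```python
from datetime import datetime
from typing import Optional

# Hypothetical table: only locales that uniquely determine the order of
# day and month in a numeric date are listed. Plain "en" is deliberately absent.
NUMERIC_DATE_FORMATS = {
    "en-GB": "%d/%m/%Y",
    "en-US": "%m/%d/%Y",
}

def to_xsd_date(s: str, locale: Optional[str]) -> Optional[str]:
    """Convert a numeric date to the xsd:date form, or return None if unsafe."""
    fmt = NUMERIC_DATE_FORMATS.get(locale)
    if fmt is None:
        return None  # the locale does not determine the date format
    try:
        return datetime.strptime(s, fmt).date().isoformat()
    except ValueError:
        return None  # the string does not fit the expected format

print(to_xsd_date("8/10/2000", "en-GB"))  # 2000-10-08
print(to_xsd_date("8/10/2000", "en-US"))  # 2000-08-10
print(to_xsd_date("8/10/2000", "en"))     # None
```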

This standard uses the rdfs:Datatype class as the class of datatypes, defined as follows:

Class definition


A party defining a datatype shall specify a pattern for that datatype. This is a regular expression which provides a constraint on the lexical space of the datatype. Matching the pattern might not be sufficient to validate a string as being in the lexical space of the datatype, but a string that fails to match the pattern is guaranteed not to be in the lexical space.

Patterns are included in this standard to provide a way for an application to find out about the lexical space of an unfamiliar datatype through discovery.

Our current intention is to define our own dialect of regular expression.

We also need to specify exactly what matching a pattern means. In particular we want the complete string to match the pattern, so that "Sept 2017" does not match the pattern [0-9]{4}, despite the lack of ^...$ around the pattern.
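
The difference between matching the complete string and matching a substring can be illustrated as follows. This sketch uses Python's re module purely as an assumption for illustration, since FHISO's own regular expression dialect is not yet defined; matches_pattern is a hypothetical helper name.

```python
import re

def matches_pattern(pattern: str, s: str) -> bool:
    """Return True only if the whole of s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches_pattern("[0-9]{4}", "2017"))       # True
print(matches_pattern("[0-9]{4}", "Sept 2017"))  # False

# A substring search, by contrast, would accept "Sept 2017":
print(re.search("[0-9]{4}", "Sept 2017") is not None)  # True
```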

The XML Schema date type mentioned in a previous example has the following pattern (here split onto two lines for readability — the second line is an optional timezone which the XML Schema date type allows).


This pattern matches strings like "1999-02-31". Despite matching the pattern, this string is not part of the lexical space of this date type as 31 February is not a valid date.
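
The two-stage nature of validation, pattern first and full lexical check second, can be sketched in Python. The pattern used here is a deliberately simplified assumption (four-digit year, no timezone) and is not the XML Schema date pattern itself; classify_date is a hypothetical function name.

```python
import re
from datetime import date

# Simplified, hypothetical pattern: YYYY-MM-DD with no timezone.
SIMPLE_DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def classify_date(s: str) -> str:
    """Report how far s gets through pattern and calendar validation."""
    if SIMPLE_DATE_PATTERN.fullmatch(s) is None:
        return "fails pattern"
    try:
        date.fromisoformat(s)  # rejects impossible dates such as 31 February
    except ValueError:
        return "matches pattern but is not a valid date"
    return "valid date"

print(classify_date("Sept 2017"))   # fails pattern
print(classify_date("1999-02-31"))  # matches pattern but is not a valid date
print(classify_date("2000-10-08"))  # valid date
```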

The property term representing the pattern of a datatype is defined as follows:

Property definition
The range of types:pattern is types:Pattern, which will be the datatype for FHISO's regular expression dialect.
This standard does not use xsd:pattern as the property term, even though it is used as a predicate in OWL 2. Its use would pose a difficulty because none of the relevant W3C specifications indicate what the rdfs:domain of xsd:pattern is supposed to be. Possibly it is an owl:Restriction, which would be incompatible with this use. Using xsd:pattern would also require us to use precisely the form of regular expression defined in Appendix G of [XSD Pt2].

A datatype with a pattern other than .* is known as a structured datatype, while one with a pattern of .* is known as an unstructured datatype.


A datatype may be defined as a subtype of another datatype which is referred to as its supertype. This is used to provide a more specific version of a more general datatype. The lexical space of the subtype shall be a subset of the lexical space of the supertype, and if an application is unfamiliar with the subtype it may process it as if it were the supertype. The subtype must be defined in such a way that at most this results in some loss of meaning but does not introduce any false implications about the dataset.

This does not require a subtype to define a pattern if the supertype does. Because the lexical space of the subtype must be a subset of that of the supertype, the pattern of the supertype may be used if the subtype does not define one. This might be done if additional restrictions made on the lexical space of the subtype cannot readily be expressed using a regular expression.
It is only the lexical space of the subtype that is required to be a subset of the lexical space of the supertype. The set of strings that match the pattern of the subtype might not necessarily be a subset of that of the supertype. This is because the pattern is permitted to match strings outside the lexical space, as in the example of the date "1999-02-31".
This section needs an example.
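
As one possible illustration, the Python sketch below models a datatype with an optional pattern and an optional supertype. The Datatype class, the https://example.com/LeapYear datatype and both helper functions are hypothetical and not defined by this standard. A leap-year restriction is chosen because it cannot readily be expressed as a regular expression, so the subtype defines no pattern of its own.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class Datatype:
    name: str
    pattern: Optional[str] = None            # None: no pattern defined here
    supertype: Optional["Datatype"] = None   # None: no supertype

def effective_pattern(dt: Datatype) -> str:
    """Use the nearest pattern found by walking up the supertype chain."""
    current: Optional[Datatype] = dt
    while current is not None:
        if current.pattern is not None:
            return current.pattern
        current = current.supertype
    return ".*"  # nothing in the chain constrains the string

def nearest_known(dt: Datatype, known: Set[str]) -> Optional[Datatype]:
    """Find the closest datatype in the chain that the application knows."""
    current: Optional[Datatype] = dt
    while current is not None:
        if current.name in known:
            return current
        current = current.supertype
    return None

INTEGER = Datatype("xsd:integer", r"[+-]?[0-9]+")
LEAP_YEAR = Datatype("https://example.com/LeapYear", supertype=INTEGER)

print(effective_pattern(LEAP_YEAR))                   # [+-]?[0-9]+
print(nearest_known(LEAP_YEAR, {"xsd:integer"}).name) # xsd:integer
```

An application that has never heard of the hypothetical LeapYear datatype can thus still process its strings as integers, losing only the knowledge that the year is a leap year.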

The property term representing the supertype of a datatype is defined as follows:

Property definition

An alternative option is to use the rdfs:subClassOf property term; however, it is anticipated that it will be desirable to have a property term whose domain is exactly rdfs:Datatype. The domain of rdfs:subClassOf is rdfs:Class; nevertheless, it is possible to apply rdfs:subClassOf to datatypes because

rdfs:Datatype rdfs:subClassOf rdfs:Class .

In order to make our subTypeOf property term accessible to RDF reasoners, we should document that

</types/subTypeOf> rdfs:subPropertyOf rdfs:subClassOf .

We will need a way of explicitly saying that a datatype has no supertype. In RDF, all datatypes are subtypes of rdfs:Literal, so this datatype can be used as a special value to signify that.

The concept of a subtype in this standard corresponds to XML Schema's concept of derivation of a simple type by restriction per §3.16 of [XSD Pt1]. XML Schema does not have a concept compatible with this standard's notion of an abstract datatype, as in XML Schema only complex types can be abstract. If it is desirable to describe a FHISO abstract datatype in XML Schema, it should be defined as a normal simple type, with the information that it is abstract conveyed by another means.

Abstract datatypes

A datatype may be defined to be an abstract datatype. An abstract datatype is one that must only be used as a supertype of other types. A string must not be declared to have a datatype which is an abstract datatype. Abstract datatypes may specify a pattern and shall have a lexical space.

The lexical space of an abstract datatype and any pattern defined on it serve to restrict the lexical space of all its subtypes. If no such restriction is desired, the lexical space may be defined as the space of all strings.

The property that represents whether or not a datatype is an abstract datatype has the following property name:
Property definition
Are abstract datatypes a necessary part of our data model at all? They were introduced to allow an AbstractDate datatype, but is it necessary for this datatype to be an abstract datatype?

Language-tagged datatypes

A language-tagged datatype is a datatype whose values are language-tagged strings consisting of both a string from the lexical space of the datatype and a language tag to identify the language in which that particular string is written.

Because the language tag is not part of the lexical space of the datatype, and is not embedded in the string, a pattern cannot be used to constrain the language tag.

Language-tagged datatypes should be used whenever a datatype is needed to represent textual data that is in a particular language or script and which cannot automatically be translated or transliterated as required, and should not be used otherwise.

In a context where a year Anno Domini is required, a language-tagged datatype should not be used, and the lexical space of the datatype should encompass strings like, say, "2015". Even though an application designed for Arabic researchers might need to render this year as "٢٠١٥" using Eastern Arabic numerals, this conversion can be done entirely in the application's user interface, so a language-tagged datatype is not required and should not be used.

The [CEV Vocabulary] defines a datatype for representing the names of authors and other people, which has the following term name:

A person's name is rarely translated in the usual sense, but may be transliterated. For example, the name of Andalusian historian صاعد الأندلسي might be transliterated "Ṣā‘id al-Andalusī" in the Latin script. Because machine transliteration is far from perfect, a language-tagged datatype should be used to allow an application to store both names.

An author's name may also be respelled to conform to the spelling and grammar rules of the reader's language. An Englishman named Richard may be rendered "Rikardo" in Esperanto: the change of the "c" to a "k" being to conform to Esperanto orthography, while the final "o" marks it as a noun. The respelling would be tagged eo, the language code for Esperanto.
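
A language-tagged string can be modelled as a pair of a string and a language tag. The Python sketch below is illustrative only: the LangString class and the name_for function are hypothetical, and the en tag on the original spelling is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class LangString:
    value: str  # a string from the lexical space of the datatype
    lang: str   # a language tag, kept separate from the string itself

def name_for(names: List[LangString], lang: str) -> str:
    """Pick the form tagged with lang, falling back to the first form given."""
    for name in names:
        # Language tags are compared case-insensitively.
        if name.lang.lower() == lang.lower():
            return name.value
    return names[0].value

names = [LangString("Richard", "en"), LangString("Rikardo", "eo")]

print(name_for(names, "eo"))  # Rikardo
print(name_for(names, "fr"))  # Richard
```

Storing both forms side by side lets an application present whichever suits the reader without attempting machine respelling or transliteration.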

Patterns may be defined for language-tagged datatypes as for other datatypes. Because patterns only constrain the lexical space of the datatype, they cannot be used to constrain the language tag in the value of a language-tagged datatype.

A datatype that is not a language-tagged datatype is called a non-language-tagged datatype.

This means the classification of datatypes as language-tagged or non-language-tagged is orthogonal to their classification as structured or unstructured. It is anticipated that most non-language-tagged datatypes will be structured datatypes.
The AgentName datatype from the previous example is a microformat which is constrained by a pattern, meaning it is a structured datatype, but it is also a language-tagged datatype as names can be translated and transliterated.

Subtypes may be defined of language-tagged datatypes as well as of other datatypes. If the supertype is a language-tagged datatype then the subtype must also be; and if the supertype is not a language-tagged datatype then the subtype must not be.

Standard datatypes

This standard endorses the use of the following three datatypes defined in [XSD Pt2] to represent strings, booleans and integers.

Other datatypes from that standard may also be suitable for use with FHISO technologies. They are not included here as, when this standard was written, there was no particular demonstrated need for them.
FHISO will need one or more date datatypes, but it is currently anticipated that the xsd:date type will not be sufficient for FHISO's purposes. This will be addressed, either in a future draft of this standard, or in a separate standard.

The xsd:string datatype

Some FHISO standards make limited use of the xsd:string datatype defined in §3.3.1 of [XSD Pt2]. This is an unstructured non-language-tagged datatype which has the following properties:

Datatype definition
Pattern .*
Supertype none
Abstract false

It is a general-purpose datatype whose lexical space is the space of all strings; however it is not a language-tagged datatype and therefore it should not be used to contain text in a human-readable natural language.

This type is not the ultimate supertype of all non-language-tagged datatypes. This is because many other XML Schema datatypes, including xsd:date and xsd:integer, are not defined as subtypes of xsd:string in XML Schema.

Use of this datatype is generally not recommended: data that is in a human-readable form should use a language-tagged datatype, while data that is not human-readable should use a structured datatype.

If an application encounters a string with the xsd:string datatype in a context where a language-tagged string would be permitted, the application may change the datatype to rdf:langString and assign the string a language tag of und, meaning an undetermined language.
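
The permitted conversion can be sketched as follows in Python. The TypedString class and the as_language_tagged function are hypothetical names, and datatypes are identified here by their prefixed names for brevity rather than by full IRIs.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "xsd:string"
RDF_LANGSTRING = "rdf:langString"

@dataclass(frozen=True)
class TypedString:
    value: str
    datatype: str
    lang: Optional[str] = None  # only set for language-tagged strings

def as_language_tagged(s: TypedString) -> TypedString:
    """Retype an xsd:string as rdf:langString with an undetermined language."""
    if s.datatype == XSD_STRING:
        return TypedString(s.value, RDF_LANGSTRING, "und")
    return s  # anything else is returned unchanged

plain = TypedString("Parish register", XSD_STRING)
print(as_language_tagged(plain))
```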

The xsd:string datatype is included in this standard in order to align this data model more closely with the RDF data model, and in particular the [CEV RDFa] bindings which use this datatype as the default when no language tag is present. The above rule allowing conversion to rdf:langString means that applications may ignore the xsd:string datatype.

The xsd:boolean datatype

A boolean is a datatype with precisely two logical values: true and false. FHISO standards represent booleans using the xsd:boolean datatype defined in §3.3.2 of [XSD Pt2]. This is a structured non-language-tagged datatype which has the following properties:

Datatype definition
Pattern true|false|1|0
Supertype none
Abstract false

The lexical space of this datatype includes four different strings so that the two logical values of the datatype each have two alternative lexical representations. The value true may be represented by either "true" or "1", while the value false may be represented by either "false" or "0". Conformant applications shall not attach any significance to which of the alternative lexical representations is used, and may replace any instance of "1" in a boolean string with "true", or "0" with "false", but not vice versa. Where possible, the numeric representations, "0" and "1", should not be used.
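
The one-way replacement described above might be implemented as in this Python sketch; canonical_boolean is a hypothetical function name.

```python
# Each of the four strings in the lexical space maps to its alphabetic form.
_CANONICAL_BOOLEAN = {
    "true": "true",
    "1": "true",
    "false": "false",
    "0": "false",
}

def canonical_boolean(s: str) -> str:
    """Replace "1" with "true" and "0" with "false", never the reverse."""
    try:
        return _CANONICAL_BOOLEAN[s]
    except KeyError:
        raise ValueError(
            f"{s!r} is outside the lexical space of xsd:boolean"
        ) from None

print(canonical_boolean("1"))      # true
print(canonical_boolean("false"))  # false
```

Strings such as "TRUE" or "yes" are rejected rather than guessed at, since they lie outside the lexical space.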

The numeric representations are allowed because xsd:boolean allows them, and alignment with the XML Schema datatype is desirable as it is widely used in third-party standards. Appendix E.4 of [XSD Pt2] defines the alphabetic representations, "true" and "false", to be the canonical forms of the datatype, and this standard does similarly.

The xsd:integer datatype

An integer datatype seems like something that FHISO will definitely need, in which case the standard xsd:integer datatype defined in §3.4.13 of [XSD Pt2] is the obvious choice. But at the moment there is no clear use case for an integer type.


Normative references

[ISO 10646]
ISO (International Organization for Standardization). ISO/IEC 10646:2014. Information technology — Universal Coded Character Set (UCS). 2014.
[RFC 2119]
IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, ed., 1997.
[RFC 3987]
IETF (Internet Engineering Task Force). RFC 3987: Internationalized Resource Identifiers (IRIs). Martin Duerst and Michel Suignard, eds., 2005.
[RFC 5646]
IETF (Internet Engineering Task Force). RFC 5646: Tags for Identifying Languages. Addison Phillips and Mark Davis, eds., 2009.
[RFC 7230]
IETF (Internet Engineering Task Force). RFC 7230: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. Roy Fielding and Julian Reschke, eds., 2014.
[RFC 7231]
IETF (Internet Engineering Task Force). RFC 7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. Roy Fielding and Julian Reschke, eds., 2014.
[Triples Discovery]
FHISO (Family History Information Standards Organisation). Simple Triples Discovery Mechanism. Exploratory draft.
[UAX 15]
The Unicode Consortium. "Unicode Standard Annex 15: Unicode Normalization Forms" in The Unicode Standard, Version 8.0.0. Mark Davis and Ken Whistler, eds., 2015.
W3C (World Wide Web Consortium). Extensible Markup Language (XML) 1.1, 2nd edition. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cowan, eds., 2006. W3C Recommendation.

Other references

NISO (National Information Standards Organization). ANSI/NISO Z39.47-1993. Extended Latin Alphabet Coded Character Set for Bibliographic Use. 1993. Standard withdrawn, 2013.
[CEV Vocabulary]
FHISO (Family History Information Standards Organisation). Citation Elements: Vocabulary. Exploratory draft.
The Church of Jesus Christ of Latter-day Saints. The GEDCOM Standard, draft release 5.5.1. 2 Oct 1999.
[IANA Lang Subtags]
IANA (Internet Assigned Numbers Authority). Language Subtag Registry. Online data file.
[ISO 639-1]
ISO (International Organization for Standardization). ISO 639-1:2002. Codes for the representation of names of languages — Part 1: Alpha-2 code. 2002.
[ISO 639-2]
ISO (International Organization for Standardization). ISO 639-2:1998. Codes for the representation of names of languages — Part 2: Alpha-3 code. 1998.
[ISO 639-3]
ISO (International Organization for Standardization). ISO 639-3:2007. Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages. 2007.
[ISO 639-5]
ISO (International Organization for Standardization). ISO 639-5:2007. Codes for the representation of names of languages — Part 5: Alpha-3 code for language families and groups. 2008.
[ISO 3166-1]
ISO (International Organization for Standardization). ISO 3166-1:2006. Codes for the representation of names of countries and their subdivisions — Part 1: Country codes. 2006.
[ISO 15924]
ISO (International Organization for Standardization). ISO 15924:2004. Codes for the representation of names of scripts. 2004.
W3C (World Wide Web Consortium). RDF 1.1 N-Triples. David Beckett, ed., 2014. W3C Recommendation.
[RDF Concepts]
W3C (World Wide Web Consortium). RDF 1.1 Concepts and Abstract Syntax. Richard Cyganiak, David Wood and Markus Lanthaler, eds., 2014. W3C Recommendation.
[RDF Schema]
W3C (World Wide Web Consortium). RDF Schema 1.1. Dan Brickley and R. V. Guha, eds., 2014. W3C Recommendation.
[RFC 4648]
IETF (Internet Engineering Task Force). RFC 4648: The Base16, Base32, and Base64 Data Encodings. S. Josefsson, ed., 2006.
[RFC 7159]
IETF (Internet Engineering Task Force). RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format. T. Bray, ed., 2014.
[RFC 7469]
IETF (Internet Engineering Task Force). RFC 7469: Public Key Pinning Extension for HTTP. C. Evans, C. Palmer and R. Sleevi, eds., 2015.
[SWBP XSD DT]
W3C (World Wide Web Consortium). XML Schema Datatypes in RDF and OWL. Jeremy J. Carroll and Jeff Z. Pan, eds., 2006. W3C Working Group Note.
[UN M.49]
United Nations, Statistics Division. Standard Country or Area Codes for Statistical Use, revision 4. United Nations publication, Sales No. 98.XVII.9, 1999.
[XML Names]
W3C (World Wide Web Consortium). Namespaces in XML 1.1, 2nd edition. Tim Bray, Dave Hollander, Andrew Layman and Richard Tobin, eds., 2006. W3C Recommendation.
[XSD Pt1]
W3C (World Wide Web Consortium). W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. Shudi Gao (高殊镝), C. M. Sperberg-McQueen and Henry S. Thompson, eds., 2012. W3C Recommendation.
[XSD Pt2]
W3C (World Wide Web Consortium). W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. David Peterson, Shudi Gao (高殊镝), Ashok Malhotra, C. M. Sperberg-McQueen and Henry S. Thompson, eds., 2012. W3C Recommendation.

Copyright © 2017, Family History Information Standards Organisation, Inc. The text of this standard is available under the Creative Commons Attribution 4.0 International License.