Warning: This may be an old version of the document. The current version can be found here.
This is a second public draft of the core part of FHISO’s proposed suite of standards on Citation Elements. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.
In particular, some examples in this draft use citation elements that are not even included in the draft Citation Elements: Vocabulary. These elements are very likely to be changed as the vocabulary progresses.
The public tsc-public@fhiso.org mailing list is the preferred place for comments, discussion and other feedback on this draft.
Latest public version: | https://fhiso.org/TR/cev-concepts |
This version: | https://fhiso.org/TR/cev-concepts-20170911 |
Previous version: | https://fhiso.org/TR/cev-concepts-20170626 |
FHISO’s suite of Citation Elements standards provides an extensible framework and vocabulary for encoding all the data about a genealogical source that might reasonably be included in a formatted citation to that source.
This document defines the general concepts used in FHISO’s suite of Citation Elements standards, and the basic framework and data model underpinning them. Other standards in the suite are as follows:
Citation Elements: Vocabulary. This standard defines a collection of citation elements allowing the representation of information normally found in formatted citations to diverse types of source.
Citation Elements: Bindings for RDFa. This standard defines a means by which citation elements may be identified and tagged using RDFa attributes within HTML and XML formatted citations, allowing a computer to extract them in a systematic manner.
Citation Elements: Bindings for GEDCOM X. This standard defines extensions to the GEDCOM X data model and its JSON and XML serialisations to allow citation elements to be represented in GEDCOM X.
Citation Elements: Bindings for ELF. This standard defines how citation elements should be represented in FHISO’s Extensible Legacy Format (ELF), a format based on and compatible with GEDCOM 5.5, but with the addition of a new extensibility mechanism.
Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].
An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.
If a conformant application encounters data that does not conform to this standard, it may issue a warning or error message, and may terminate processing of the document or data fragment.
Indented text in grey or coloured boxes, such as the preceding paragraph, does not form a normative part of this standard, and is labelled as either an example or a note.
The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner.
A source is any resource from which information is obtained during the genealogical research process. Sources come in many forms, including manuscripts, artefacts, books, films, people, recordings and websites. A full mechanism for describing sources is beyond the scope of this standard.
A source derivation is a directional link between two sources, indicating that the first source was derived from, cites or otherwise references the second source. The first source is referred to as the derived source, and the second the base source.
A citation is an abstract reference to a specific source from which information has been used in some context. It should include sufficient detail that a third-party could readily locate the information themselves, assuming the source remains accessible.
A formatted citation is a citation that has been rendered into human-readable form, typically as a sentence or short paragraph that might be used as a footnote, endnote, tablenote or bibliography entry. There is no single standard on the correct form of formatted citations; many different style guides exist, each giving their own rules on how to construct a formatted citation.
A formatted citation produced for use in a footnote on the first use of the source, and conforming to [Chicago] might read:
1 Christian Settipani, Les ancêtres de Charlemagne, 2nd ed. (Oxford: Prosopographia et Genealogica, 2015), 129–31.
The 1 at the start of the citation is the hypothetical footnote number.
A layered citation is a citation that includes information about several sources between which source derivation links exist. The information in a layered citation about a specific source, whether the consulted source or one of the sources from which it was derived, is known as a citation layer. A citation with just a single citation layer is called a single-layer citation.
The citation layer containing the information about the specific source which was consulted is known as the head citation layer. For a single-layer citation, its sole citation layer is necessarily the head citation layer.
A citation to a census return that was consulted on microfilm might contain information about the microfilm as well as information about the census return, as in the following formatted citation from [Evidence Explained]:
1810 U.S. census, York County, Maine, town of York, p. 435 (penned), line 9, Jabez Young; NARA microfilm publication M252, roll 12.
In this example, the information before the semicolon pertains to the census return, while the information after it pertains to the microfilm. The microfilm and the census return are different sources, and a source derivation exists between them as the microfilm is derived from the census return. The information in the citation about the microfilm forms the head citation layer, while the information about the census return forms a separate citation layer. As the citation contains two citation layers, it is an example of a layered citation.
In this example, the head citation layer is not presented first in the formatted citation. Whether the head citation layer is presented first is a matter of style and emphasis, and it is common not to present the head citation layer first when it is a photographic or digital reproduction, as in this case.
A citation element is a logically self-contained piece of information in a citation layer that might reasonably be included in a formatted citation. As this standard does not aim to provide facilities for the exhaustive description of sources, information about sources that is not normally included in formatted citations is not considered to be a citation element. Citation elements are represented in a sufficiently structured and language-independent way that applications can parse and reformat them in different styles and languages as needed.
The accompanying Citation Elements: Vocabulary standard defines many citation elements, covering the information normally found in formatted citations to a wide range of common sources. Applications may define their own citation elements or use those defined by a third-party standard; such citation elements are known as extension citation elements.
Conforming applications must not discard citation elements, except on the instruction of the user or as explicitly permitted in this standard. This applies to unrecognised extension citation elements too, though an application may opt not to display any such citation elements.
Note that the definition of citation element limits it to information that might reasonably appear in a citation; thus, most items of metadata (such as who created the citation and when, or a globally-unique identifier for the citation or its layers) are not properly considered citation elements themselves.
It is anticipated that metadata will be addressed in a future FHISO standard. Initial brainstorming on metadata implementation suggests that this document may be edited slightly to support metadata, perhaps by adding an optional identifier or context pointer to each element. The exact nature of such an edit, or if it will even be necessary, will depend on future development of that metadata standard.
A citation element set is a collection of citation elements that completely encode the information about a source that is present in a particular citation layer.
The example formatted citation to Les ancêtres de Charlemagne is represented by a citation element set containing the following seven citation elements:
“Settipani, Christian”.
“Les ancêtres de Charlemagne”.
“2”.
“Oxford”.
“Prosopographia et Genealogica”.
“2015”.
“129-131”.
The footnote number is not a citation element as it does not pertain to the source. The author and page range are not expressed here in quite the same form as the formatted citation, but an application can readily parse them to convert them to the required format because their format is defined by this standard.
When provided with the citation element set for each citation layer in the citation, knowledge of which is the head citation layer, information about the source derivations between sources referred to in each citation layer, and any necessary internal state, an application ought to be able to produce algorithmically a formatted citation in a reasonable approximation to any mainstream citation style. If higher quality formatted citations are desirable, applications should allow users to manually edit them to fine-tune their presentation, and should store the result for reuse. Formatted citations need not include all the information from a citation element set if the style dictates that certain information is omitted in the relevant context.
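As a rough illustration of producing a formatted citation algorithmically, the following Python sketch renders the seven values from the Les ancêtres de Charlemagne example in an approximation of the footnote shown earlier. The dictionary keys (`author`, `title`, and so on) are hypothetical labels chosen for this sketch, not citation element names defined by FHISO, and the page-range shortening is a deliberate simplification rather than a full implementation of any style guide.

```python
# Hypothetical keys; the values are those of the example citation element set.
ELEMENTS = {
    "author": "Settipani, Christian",
    "title": "Les ancêtres de Charlemagne",
    "edition": "2",
    "place": "Oxford",
    "publisher": "Prosopographia et Genealogica",
    "year": "2015",
    "pages": "129-131",
}


def invert_name(name):
    """Turn 'Family, Given' into 'Given Family'; leave other forms alone."""
    family, sep, given = name.partition(", ")
    return f"{given} {family}" if sep else name


def ordinal(n):
    """English ordinal for a positive integer, e.g. 2 -> '2nd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def short_range(pages):
    """Simplified range shortening: '129-131' -> '129–31' (en dash)."""
    first, sep, last = pages.partition("-")
    if not sep:
        return pages
    if len(first) == len(last) and len(first) > 2 and first[:-2] == last[:-2]:
        last = last[-2:]
    return f"{first}\u2013{last}"


def format_footnote(e):
    """Assemble a footnote-style citation, without the footnote number."""
    return (
        f"{invert_name(e['author'])}, {e['title']}, "
        f"{ordinal(int(e['edition']))} ed. "
        f"({e['place']}: {e['publisher']}, {e['year']}), "
        f"{short_range(e['pages'])}."
    )


print(format_footnote(ELEMENTS))
```

The footnote number and typographic details such as italics are left to the surrounding application, which is consistent with the footnote number not being a citation element.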
Citation element sets should not include citation elements for information that is not normally included in a formatted citation. They are not intended to provide a general mechanism for storing arbitrary information about sources.
Characters are specified by reference to their code point number in [ISO 10646], without regard to any particular character encoding. In this standard, characters may be identified by their hexadecimal code point prefixed with “U+”.
Characters must match the Char production from [XML].
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
A string is a sequence of zero or more characters.
This definition of a string matches the string datatype defined in [XSD Pt2], used in many XML and Semantic Web technologies.
Applications may convert any string into Unicode Normalization Form C, as defined in any version of Unicode Standard Annex #15 [UAX 15].
Characters matching the RestrictedChar production from [XML] should not appear in strings, and applications may process such characters in an implementation-defined manner or reject strings containing them.
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
| [#x7F-#x84] | [#x86-#x9F]
RestrictedChar
s.
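The Char and RestrictedChar productions translate directly into code point range checks. The following Python sketch shows one way an application might test a string against them; the function names are illustrative only and are not part of this standard.

```python
def is_char(ch):
    """True if the character matches the Char production."""
    cp = ord(ch)
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_restricted_char(ch):
    """True if the character matches the RestrictedChar production."""
    cp = ord(ch)
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


def is_valid_string(s):
    """True if every character in the string matches Char."""
    return all(is_char(ch) for ch in s)


def has_restricted_chars(s):
    """True if the string contains characters that should not appear."""
    return any(is_restricted_char(ch) for ch in s)
```

A string can be valid under Char and still contain restricted characters, so an application that wishes to warn about or reject them needs both checks.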
Whitespace is defined as a sequence of one or more space characters, carriage returns, line feeds, or tabs. It matches the production S from [XML].
S ::= (#x20 | #x9 | #xD | #xA)+
Whitespace normalisation is the process of discarding any leading or trailing whitespace, and replacing other whitespace with a single space (U+0020) character.
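A minimal Python sketch of whitespace normalisation follows. It treats only the four characters of the S production as whitespace, so other Unicode spaces such as U+00A0 are left untouched; the function name is an assumption of this sketch.

```python
import re

# One or more of: space, tab, carriage return, line feed (the S production).
_WHITESPACE = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalise_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return _WHITESPACE.sub(" ", text).strip(" ")
```

Python's built-in `str.split()` and `str.strip()` without arguments are avoided here because they also treat other Unicode whitespace characters as separators.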
In the event of a difference between the definitions of the Char, RestrictedChar and S productions given here and those in [XML], the definitions in the latest edition of the XML 1.1 specification are definitive.
A term consists of a unique, machine-readable identifier, known as the term name, paired with a clearly-defined meaning for the concept or idea that it represents. This standard uses terms as datatypes and citation element names, as defined in §2 and §3 of this standard respectively. Term names shall take the form of an IRI matching the IRI production in §2.2 of [RFC 3987].
Term names are compared using the “simple string comparison” algorithm given in §5.3.1 of [RFC 3987]. If a term name does not compare equal to an IRI known to the application, the application must not make any assumptions about the term, its meaning or intended use, based on the form of the IRI or any similarity to other IRIs.
The following IRIs are all distinct for the purpose of the “simple string comparison” algorithm given in §5.3.1 of [RFC 3987], even though an HTTP request to them would fetch the same resource.
https://éléments.example.com/nationalité
HTTPS://ÉLÉMENTS.EXAMPLE.COM/nationalit%C3%A9
https://xn--lments-9uab.example.com/nationalit%c3%a9
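In Python, comparing two `str` values with `==` compares them code point by code point, with no case folding, percent-decoding or Unicode normalisation, so it can serve as a sketch of the simple string comparison. The function name below is illustrative.

```python
def same_term_name(a, b):
    """Code-point-by-code-point comparison; no normalisation of any kind."""
    return a == b


IRIS = [
    "https://éléments.example.com/nationalité",
    "HTTPS://ÉLÉMENTS.EXAMPLE.COM/nationalit%C3%A9",
    "https://xn--lments-9uab.example.com/nationalit%c3%a9",
]

# Every pair of distinct entries compares unequal.
for i, a in enumerate(IRIS):
    for b in IRIS[i + 1:]:
        print(same_term_name(a, b))
```

An application that normalised these IRIs before comparing them would wrongly treat them as the same term name.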
An IRI must not be used as a term name unless it can be converted to a URI using the algorithm specified in §3.1 of [RFC 3987], and back to an IRI again using the algorithm specified in §3.2 of [RFC 3987], to yield the original IRI.
The terms defined in this standard all have term names that begin https://terms.fhiso.org/. Subject to the requirements herein, third parties may also define additional terms for use as datatypes or citation elements. It is recommended that any such terms use the http or preferably https IRI scheme defined in §2.7.1 and §2.7.2 of [RFC 7230] respectively, and an authority component consisting of just a domain name or subdomain under the control of the party defining the extension citation elements.
An http or https IRI scheme is recommended because the IRI is used to fetch a resource during discovery, and it is desirable that applications implementing discovery should only need to support a minimal number of transport protocols. URN schemes like the uuid scheme of [RFC 4122] are not recommended as they do not have transport protocols that can be used during discovery.
The preference for an https IRI is because of security considerations during discovery. A man-in-the-middle attack during discovery could insert malicious content into the response, which, if undetected, could cause an application to process user data incorrectly, potentially discarding parts of it or otherwise compromising its integrity. It is harder to stage a man-in-the-middle attack over TLS, especially if public key pinning is used per [RFC 7469].
Term names in the Citation Elements standard are sometimes referred to in prefix notation. This is a system whereby prefixes are assigned to IRIs that occur frequently as the leading portion of a term name. Then, instead of writing the term name in full, the leading portion of the term name is replaced by its prefix followed by a colon (U+003A) separator.
The term name https://terms.fhiso.org/sources/title is used in several of the examples in this standard. Instead of writing this in full, if the cev prefix is bound to the IRI https://terms.fhiso.org/sources/, then this IRI can be written in prefix form as cev:title.
The following prefix bindings are assumed in this standard:
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
xsd | http://www.w3.org/2001/XMLSchema# |
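Expanding prefix notation is a simple string substitution. The sketch below uses the three bindings assumed in this standard plus the cev binding from the earlier example; the function name and the decision to return unrecognised names unchanged are assumptions of this sketch.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cev": "https://terms.fhiso.org/sources/",  # binding used in the example
}


def expand(name):
    """Expand 'prefix:local' to a full IRI if the prefix is bound."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Not in prefix form (or unknown prefix): assume it is already an IRI.
    return name


print(expand("cev:title"))
print(expand("xsd:integer"))
```

Because an IRI scheme such as https is also followed by a colon, the sketch only expands names whose leading portion is a known prefix.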
It is recommended that an HTTP GET request to a term name IRI with an http or https scheme (once converted to a URI per §3.1 of [RFC 3987]) should result in a 303 “See Other” redirect to a document containing a human-readable definition of the term if the request was made without an Accept header or with an Accept header matching the format of the human-readable definition. It is further recommended that this format should be HTML, and that documentation in alternative formats may be made available via HTTP content negotiation when the request includes a suitable Accept header, per §5.3.2 of [RFC 7231].
Parties defining terms may arrange for their term name to support discovery. This is when an HTTP GET request to a term name IRI with an http or https scheme, made with an appropriate Accept header, yields a 303 redirect to a machine-readable definition of the term.
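The following Python sketch shows how a discovery request might be constructed with the standard library. The media type `text/turtle` is purely an assumption for illustration, as this section does not name the machine-readable format, and the sketch only builds the request object without sending it. It also assumes the term name is already in URI form (ASCII only).

```python
from urllib.request import Request


def discovery_request(term_name, media_type="text/turtle"):
    """Build (but do not send) a GET request asking for a machine-readable
    definition. The default media type is a placeholder assumption."""
    return Request(term_name, headers={"Accept": media_type}, method="GET")


req = discovery_request("https://terms.fhiso.org/sources/title")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would follow a 303 redirect automatically, since the default opener handles redirects.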
A datatype is a term which serves as a formal description of the values that are permissible in a particular context. Being a term, a datatype is identified by a term name which is an IRI. The term name of a datatype is also referred to as a datatype name.
A datatype has a lexical space which is the set of strings which are interpreted as valid values of the datatype. The definition of a datatype shall state how each string in its lexical space maps to a logical value, and state the semantics associated with those values.
XML Schema defines an integer type in §3.4.13 of [XSD Pt2] which is well suited for use in this standard. XML Schema does not give its types IRIs, but it does give them ids, and following the best practice advice given in §2.3 of [SWBP XSD DT] gives it the following IRI:
http://www.w3.org/2001/XMLSchema#integer
This same type is also recommended for use in RDF by §5.1 of [RDF Concepts] which explicitly gives it the IRI above.
The lexical space of this datatype is the space of all strings consisting of a finite-length sequence of one or more decimal digits (U+0030 to U+0039, inclusive), optionally preceded by a + or - sign (U+002B or U+002D, respectively). Thus the string “137” is within the lexical space of this datatype, but “20.000” and “四十二” are not, despite being normal ways of representing integers in certain cultures.
The mapping from lexical representations to logical values need not be one-to-one. If a datatype has multiple lexical representations of the same logical value, a conformant application must treat these representations equivalently and may change a string of that datatype to be a different but equivalent lexical representation.
The integer datatype used in the previous example is one where the mapping from lexical representation to value is many-to-one rather than one-to-one. This is due to the lexical space including strings with a leading + sign as well as superfluous leading 0s, and means that “00137”, “+137” and “137” all represent the same underlying value: the number one hundred and thirty-seven. Because conformant applications may convert strings between equivalent lexical representations, they may store them in a database in an integer field and regenerate strings in a canonical representation.
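A short Python sketch of the integer example: one function tests membership of the lexical space, and another maps every equivalent lexical representation to a single canonical string. The explicit `[0-9]` check matters because Python's `int()` on its own also accepts digits from other scripts, which lie outside this lexical space. Function names are illustrative.

```python
import re

# Optional sign followed by one or more ASCII decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def in_integer_lexical_space(s):
    """True if s is in the lexical space of the integer datatype."""
    return _INTEGER.fullmatch(s) is not None


def canonical_integer(s):
    """Return one canonical lexical form for the value that s represents."""
    if not in_integer_lexical_space(s):
        raise ValueError(f"not in the integer lexical space: {s!r}")
    return str(int(s))


print(canonical_integer("00137"), canonical_integer("+137"))
```

Storing `int(s)` in a database integer field and regenerating `str(value)` on output is one way of applying the permitted conversion between equivalent lexical representations.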
Strings outside the lexical space of a datatype must not be used where a string of that datatype is required. If an application encounters any such strings, it may remove them from the dataset or may convert them to a valid value in an implementation-defined manner. Any such conversion that is applied automatically by an application must either be locale-neutral or respect any locale given in the dataset.
XML Schema defines a date type in §3.3.9 of [XSD Pt2] which has a lexical space based on [ISO 8601] dates. If, in a dataset that is somehow identified as being written in German, an application encounters the string “8 Okt 2000” in a context where an XML Schema date is expected, it may convert this to “2000-10-08”. However, an application encountering the string “8/10/2000” must not conclude this represents 8 October or 10 August unless the document includes a locale that uniquely determines the date format. In this case, information that the document is in English is not sufficient as different English-speaking countries have different conventions for formatting dates.
A language-tagged datatype is a datatype whose value consists of both a string from the lexical space of the datatype and a language tag to identify the language, and where appropriate the script and regional variant, in which that particular string is written. The language tag shall match the Language-Tag production from [RFC 5646].
Language-tagged datatypes should be used whenever a datatype is needed to represent textual data that is in a particular language or script and which cannot automatically be translated or transliterated as required, and should not be used otherwise.
Consider a year represented by the string “2015”. Even though an application designed for Arabic researchers might need to render this year as “٢٠١٥” using Eastern Arabic numerals, this conversion can be done entirely in the application’s user interface, so a language-tagged datatype is not required and should not be used.
The [CEV Vocabulary] defines a datatype for representing the names of authors and other people, which has the following term name:
https://terms.fhiso.org/sources/AgentName
A person’s name is rarely translated in the usual sense, but may be transliterated. For example, the name of Andalusian historian صاعد الأندلسي might be transliterated “Ṣā‘id al-Andalusī” in the Latin script. Because machine transliteration is far from perfect, a language-tagged datatype should be used to allow an application to store both names. In this case, they would be tagged ar and ar-Latn respectively, meaning the Arabic language in its default script and in the Latin script.
An author’s name may also be respelled to conform to the spelling and grammar rules of the reader’s language. An Englishman named Richard may be rendered “Rikardo” in Esperanto: the change of the “c” to a “k” being to conform to Esperanto orthography, while the final “o” marks it as a noun. The respelling would be tagged eo, the language code for Esperanto.
A datatype that is not a language-tagged datatype is called a non-language-tagged datatype.
A party defining a datatype shall specify a pattern for that datatype. This is a regular expression which provides a constraint on the lexical space of the datatype. Matching the pattern might not be sufficient to validate a string as being in the lexical space of the datatype, but a string that fails to match the pattern is guaranteed not to be in the lexical space.
\p), though it seems likely this will improve in the next version of ECMAScript. Another option is to use the form defined in Appendix G of [XSD Pt2] which is much less widely supported, but has the advantage of being the standard form for defining datatypes in XML and RDF.
The string “Sept 2017” does not match the pattern [0-9]{4}, despite the lack of ^…$ around the pattern.
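In other words, a pattern has to match the whole string rather than a substring of it. In Python's `re` module this corresponds to `re.fullmatch` rather than `re.search`, as the sketch below shows. Python's regular expression dialect is used here only for illustration and is not necessarily the dialect in which patterns are defined.

```python
import re


def matches_pattern(pattern, s):
    """True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


print(matches_pattern(r"[0-9]{4}", "Sept 2017"))   # whole-string match fails
print(matches_pattern(r"[0-9]{4}", "2017"))        # whole-string match succeeds
print(re.search(r"[0-9]{4}", "Sept 2017") is not None)  # a substring search would succeed
```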
The XML Schema date type mentioned in the previous example has the following pattern (here split onto two lines for readability; the second line is an optional timezone which the XML Schema date type allows).
-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])
(Z|(\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?
This pattern matches strings like “1999-02-31”. Despite matching the pattern, this string is not part of the lexical space of this date type as 31 February is not a valid date.
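The difference between matching the pattern and being in the lexical space can be sketched in Python as a two-stage check: the pattern acts as a cheap first filter, and a calendar check then rejects impossible days. The leap-year rule below assumes the proleptic Gregorian calendar and treats a leading minus sign as a negative year number; both are assumptions of this sketch, as are the function names.

```python
import re

DATE_PATTERN = re.compile(
    r"-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
    r"(Z|(\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"
)


def days_in_month(year, month):
    """Days in the month, using the proleptic Gregorian leap-year rule."""
    if month == 2:
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 29 if leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


def matches_date_pattern(s):
    """Stage 1: the string matches the pattern in its entirety."""
    return DATE_PATTERN.fullmatch(s) is not None


def in_date_lexical_space(s):
    """Stage 2: the string matches the pattern and names a real day."""
    m = DATE_PATTERN.fullmatch(s)
    if m is None:
        return False
    year = int(m.group(1))
    if s.startswith("-"):
        year = -year
    return int(m.group(3)) <= days_in_month(year, int(m.group(2)))


print(matches_date_pattern("1999-02-31"), in_date_lexical_space("1999-02-31"))
```

A string that fails the first stage can be rejected immediately, which is the guarantee that a pattern provides.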
A datatype with a pattern other than .* is known as a structured datatype, while one with a pattern of .* is known as an unstructured datatype. It is expected that most datatypes in common use, other than the rdf:langString datatype defined in §2.4.1, will be structured datatypes.
The AgentName datatype used to represent the names of authors and other people is a microformat which is constrained by a pattern, meaning it is a structured datatype, but it is also a language-tagged datatype as names can be translated and transliterated.
A datatype may be defined as a subtype of another datatype which is referred to as its supertype. This is used to provide a more specific version of a more general datatype. The lexical space of the subtype shall be a subset of the lexical space of the supertype, and if an application is unfamiliar with the subtype it may process it as if it were the supertype. The subtype must be defined in such a way that at most this results in some loss of meaning but does not introduce any false implications about the dataset.
1999-02-31
”.
A datatype may be defined to be an abstract datatype. An abstract datatype is one that must only be used as a supertype of other types. A string must not be declared to have a datatype which is an abstract datatype. Abstract datatypes may specify a pattern and shall have a lexical space.
Subtypes may be defined of language-tagged datatypes as well as of other datatypes. If the supertype is a language-tagged datatype then the subtype must also be; and if the supertype is not a language-tagged datatype then the subtype must not be.
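The properties of a datatype described so far (name, pattern, supertype, abstractness and whether it is language-tagged) can be modelled as a small record. The Python sketch below is one possible in-memory representation, not a structure defined by this standard; the subtype `https://example.com/types/PositiveYear` is hypothetical. It checks the language-tagging rule for subtypes and shows how an application might fall back to a supertype it understands.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Datatype:
    name: str
    pattern: str = ".*"
    supertype: Optional[str] = None
    abstract: bool = False
    language_tagged: bool = False


def check_subtype(sub: Datatype, sup: Datatype) -> None:
    """Raise ValueError if sub is not a permissible subtype of sup."""
    if sub.supertype != sup.name:
        raise ValueError(f"{sub.name} does not name {sup.name} as its supertype")
    if sub.language_tagged != sup.language_tagged:
        raise ValueError("subtype and supertype must agree on language tagging")


def nearest_understood(name: str, definitions: Dict[str, Datatype],
                       understood: Set[str]) -> Optional[str]:
    """Walk up the supertype chain to a datatype the application understands."""
    seen: Set[str] = set()
    current: Optional[str] = name
    while current is not None and current not in seen:
        if current in understood:
            return current
        seen.add(current)
        definition = definitions.get(current)
        current = definition.supertype if definition else None
    return None


XSD_INTEGER = Datatype("http://www.w3.org/2001/XMLSchema#integer", r"[+-]?[0-9]+")
POSITIVE_YEAR = Datatype("https://example.com/types/PositiveYear",  # hypothetical
                         r"[1-9][0-9]*", supertype=XSD_INTEGER.name)
DEFINITIONS = {d.name: d for d in (XSD_INTEGER, POSITIVE_YEAR)}

check_subtype(POSITIVE_YEAR, XSD_INTEGER)
print(nearest_understood(POSITIVE_YEAR.name, DEFINITIONS, {XSD_INTEGER.name}))
```

Falling back in this way loses the more specific meaning of the subtype, which is the loss of meaning that the subtype definition is required to tolerate.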
This standard gives special treatment to three datatypes defined in third-party standards.
The rdf:langString datatype
Any language-tagged datatype that is not defined to be a subtype of some other datatype shall implicitly be considered to be a subtype of the rdf:langString datatype defined in §2.5 of [RDFS]. This datatype is an unstructured language-tagged datatype and has the following properties:
Name | http://www.w3.org/1999/02/22-rdf-syntax-ns#langString |
Pattern | .* |
Supertype | none |
Abstract | no |
No constraints are placed on the lexical space of this datatype; the only restriction placed on the use or semantics of this datatype is that it should contain text in a human-readable form.
rdfs:Resource, rdfs:Literal, xsd:anyType, xsd:anySimpleType and xsd:anyAtomicType would serve, but this needs careful consideration of the differences between datatypes in XML Schema, RDF and this standard. At present there is no compelling need for either of these additional supertypes.
The xsd:string datatype
This standard makes limited use of the xsd:string datatype defined in §3.3.1 of [XSD Pt2]. This is an unstructured non-language-tagged datatype which has the following properties:
Name | http://www.w3.org/2001/XMLSchema#string |
Pattern | .* |
Supertype | none |
Abstract | no |
It is a general-purpose datatype whose lexical space is the space of all strings; however it is not a language-tagged datatype and therefore it should not be used to contain text in a human-readable natural language.
xsd:date and xsd:integer are not defined as subtypes of xsd:string in XML Schema.
Use of this datatype is generally not recommended: data that is in a human-readable form should use a language-tagged datatype, while data that is not human-readable should use a structured datatype.
If an application encounters a string with the xsd:string datatype, it may change the datatype to rdf:langString and assign the string a language tag of und, meaning an undetermined language.
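The permitted conversion can be sketched in a few lines of Python. The `TaggedString` record is a hypothetical in-memory representation of a string with its datatype and optional language tag, introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class TaggedString:
    value: str
    datatype: str
    lang: Optional[str] = None


def promote_plain_string(s: TaggedString) -> TaggedString:
    """Convert an xsd:string to an rdf:langString tagged 'und'."""
    if s.datatype == XSD_STRING:
        return TaggedString(s.value, RDF_LANGSTRING, "und")
    return s  # every other datatype is left untouched


print(promote_plain_string(TaggedString("Oxford", XSD_STRING)))
```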
The xsd:string datatype is included in this standard in order to align this data model more closely with the RDF data model, and in particular the [CEV RDFa] bindings which use this datatype as the default when no language tag is present. The above rule allowing conversion to rdf:langString means that applications may ignore the xsd:string datatype.
The rdfs:Resource datatype
This standard also makes use of the rdfs:Resource type defined in §2.1 of [RDFS] as the class of everything that can be expressed in RDF. In these Citation Elements standards, its use is more specific, and it is used as a datatype to represent resources identified by IRIs. In this context a resource might be a document or file that can be retrieved from that IRI, but it also includes any physical or abstract concept that is merely identified by an IRI.
The rdfs:Resource datatype is used to represent the website from which an online source can be retrieved.
Using rdfs:Resource in a more specific manner than RDF does not introduce an incompatibility between RDF and this FHISO standard. This is because, in RDF terminology, rdfs:Resource is not a datatype but something more general. All literals in RDF have a datatype, but IRIs are a distinct class of entity which do not have an RDF datatype. Instead the thing they represent has a type, and rdfs:Resource is the most general possible type. A further complication is that all RDF datatypes are also subclasses of rdfs:Resource, but as rdfs:Resource is not itself an RDF datatype, it cannot appear in contexts where an RDF datatype is expected. For the purpose of FHISO’s Citation Elements standards, the rdfs:Resource datatype is not a supertype of any other datatype.
The lexical space of this datatype is the space of valid IRIs matching the IRI production in §2.2 of [RFC 3987]. It is a non-language-tagged datatype with the following properties:
Name | http://www.w3.org/2000/01/rdf-schema#Resource |
Pattern | [a-z][a-z0-9+.-]+:[^ ]+ |
Supertype | none |
Abstract | no |
Applications must not define subtypes of rdfs:Resource.
If subtypes of rdfs:Resource are permitted, they will almost certainly be used to describe the type of resource being referenced, rather than the type of IRI used to reference it. Therefore it is very unlikely that subtypes of rdfs:Resource will be permitted to define a pattern or further constrain the lexical space of the datatype.
xsd:anyURI datatype. In RDF, the two are entirely unrelated as xsd:anyURI is used in RDF as the datatype of a literal, whereas rdfs:Resource is used as the type of the resource referenced by an IRI. But there may be a use case to make the two interchangeable, much as xsd:string is with an rdf:langString tagged with the language tag und.
A union of datatypes is an unordered list of one or more different datatypes.
A union of datatypes may contain language-tagged datatypes, non-language-tagged datatypes, or a mixture of both.
The lexical space of a union of datatypes is the union of the lexical space of each of its constituent datatypes.
Unions of datatypes are used as the range of citation elements, as defined in §4.2. In several cases a union of the following two datatypes is used:
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
https://terms.fhiso.org/dates/AbstractDate
The former is an unstructured datatype, while the latter is an abstract datatype which serves as the supertype for various structured datatypes for dates. The inclusion of an abstract datatype provides a point of extensibility.
In the data model defined by this standard, a citation element consists of two parts, both of which are required: a citation element name and a citation element value.
A citation element set is defined to be an ordered list of citation elements; conformant applications may reorder the list subject to the following constraints:
The relative order of citation elements must be preserved when they have the same ultimate super-element (as defined in §4.1 of this standard).
When a citation element set contains a citation element with the citation element name https://terms.fhiso.org/sources/localisedElement, the previous element in the citation element set with a different citation element name is referred to as its localisation base. The localisation base of any localisedElement citation element must not change if a citation element set is reordered.
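The localisation base rule can be sketched in Python as follows. For brevity each citation element is modelled as a `(name, value)` tuple with a plain string value rather than a full localisation set, and a localisedElement with no preceding element of a different name is given a base of `None`; both are assumptions of this sketch. The check shown covers only the localisation base constraint, not the constraint on elements sharing an ultimate super-element.

```python
from collections import Counter

LOCALISED = "https://terms.fhiso.org/sources/localisedElement"


def localisation_bases(elements):
    """Map the index of each localisedElement to the index of its base."""
    bases = {}
    last_other = None
    for i, (name, _value) in enumerate(elements):
        if name == LOCALISED:
            bases[i] = last_other
        else:
            last_other = i
    return bases


def base_pairs(elements):
    """Multiset of (localisedElement, localisation base) pairs."""
    return Counter(
        (elements[i], elements[b] if b is not None else None)
        for i, b in localisation_bases(elements).items()
    )


def reordering_keeps_bases(before, after):
    """True if no localisedElement has changed its localisation base."""
    return base_pairs(before) == base_pairs(after)


title = ("https://terms.fhiso.org/sources/title", "Η Γενεαλογία των Κομνηνών")
local = (LOCALISED, "La généalogie des Comnènes")
other = ("https://example.com/terms/publisher", "Example Press")  # hypothetical

print(reordering_keeps_bases([title, local, other], [other, title, local]))
print(reordering_keeps_bases([title, local, other], [title, other, local]))
```

In the second reordering the localised string would attach to the publisher instead of the title, which is the kind of change the constraint forbids.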
localisedElements per §4.3.1 of this standard, and then removing them from the citation element set.
The localisedElement citation element term and the notion of its localisation base were called translatedElement and its translation base in earlier drafts of this standard.
The citation element name identifies the nature of the information contained in a particular citation element. It shall be a term that has been defined to be used as a citation element name in the manner required by §4 of this standard; a term defined for this purpose is called a citation element term.
The [CEV Vocabulary] defines a citation element term for the title of a source. Its term name is:
https://terms.fhiso.org/sources/title
A dataset might contain many citation elements with this as their citation element name.
The citation element value is the content of the citation element which shall be a localisation set. A localisation set is an ordered list of strings, which applications should whitespace-normalise. Each string in a localisation set should contain the same information, but translated, transliterated or otherwise localised.
Each string in a localisation set shall be tagged with a datatype, and shall additionally be tagged with a language tag if and only if the specified datatype is a language-tagged datatype. The language tag shall match the Language-Tag
production from [RFC 5646], and should contain a script subtag per §2.2.3 of [RFC 5646] when transliteration has occurred.
The title citation element defined in the [CEV Vocabulary] would normally contain strings tagged with the rdf:langString datatype. An example title citation element might contain a localisation set with three rdf:langString strings in the following order:
“Η Γενεαλογία των Κομνηνών” with language tag el, the language code for Greek in [ISO 639-1];
“Hē Genealogia tōn Komnēnōn” with language tag el-Latn, Latn being the code for the Latin script in [ISO 15924]; and
“La généalogie des Comnènes”, tagged with the language code fr.
Although the language tag is required for language-tagged datatypes, it need not be explicit in the serialisation. A serialisation format may provide a mechanism for stating the document’s default language tag, and may provide a global default which should be a language-neutral choice such as und, defined in [ISO 639-2] to mean an undetermined language. In the absence of an explicit or implicit language tag, applications must not apply their own default, and must treat the string as if it had the language tag und.
HTML uses the lang attribute to provide a default language tag for the document or a part of the document. Thus, if the document begins <html lang="pt-BR">, it is not necessary to tag each string separately for them to be understood to be in Brazilian Portuguese. HTML does not define a default language tag that applies in the absence of a lang attribute, and applications must not apply one.
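The fallback order described above can be sketched as follows. This is only an illustration: the function name and its parameters are hypothetical, and a real serialisation format would define how the document default and the format's global default are actually supplied.

```python
from typing import Optional

UNDETERMINED = "und"  # ISO 639-2 code for an undetermined language


def effective_language_tag(explicit: Optional[str],
                           document_default: Optional[str] = None,
                           format_default: Optional[str] = None) -> str:
    """Return the language tag to use for a language-tagged string.

    Preference order: an explicit tag on the string, then the document's
    default, then the serialisation format's global default. The application
    never substitutes a default of its own; it falls back to "und".
    """
    for candidate in (explicit, document_default, format_default):
        if candidate:
            return candidate
    return UNDETERMINED
```

For instance, `effective_language_tag(None, "pt-BR")` gives `"pt-BR"`, while `effective_language_tag(None)` gives `"und"`.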
If localisation sets are being serialised in XML, it is recommended that the special xml:lang attribute defined in §2.12 of [XML] is used to encode the language tag.
Similarly, a datatype is required, but it need not be explicit in the serialisation. A serialisation format may specify a format default datatype that applies when none is given explicitly. Ordinarily, if a format default datatype is specified, it should be the rdf:langString datatype defined in §2.4.1 of this standard.
rdf:langString is recommended. The datatype correction mechanism defined in §4.4 of this standard allows a conformant application to correct the datatype of strings that have incorrectly defaulted to rdf:langString. In practice it is anticipated that many applications will apply datatype correction during import, and therefore the format default datatype becomes a fallback that applies if the citation element term does not define its own default datatype, or if this is unknown.
The [CEV RDFa] standard makes rdf:langString the format default datatype in most circumstances. Thus the citation element extracted from the following HTML fragment is interpreted as an rdf:langString string, even though it is not explicitly tagged as such:
<i lang="en" property="title">The Complete Peerage</i>
Where possible, the first string in the localisation set should be the untranslated, and ideally untransliterated, form of the citation element value. If it is known that the only available values are translations, the first string in the localisation set should be an empty string tagged with the language tag und, and the translations listed afterwards. An empty string in a localisation set means that its value is unknown, rather than that this particular translation is literally an empty string.
Conformant applications may reorder the localisation set, but must leave the first string first, so that applications wishing to use the original, untranslated, untransliterated form can do so.
rdf:langString strings, except that JSON’s object notion, as given in §4 of [RFC 7159], does not preserve order. One possible solution is to append some private use subtag (per §2.2.7 of [RFC 5646]) to the first language tag.
In a localisation set which contains more than one string with the same datatype and language tag, or more than one string with the same datatype if it is a non-language-tagged datatype, any string other than the first non-empty string with that datatype and, if relevant, language tag is known as a duplicate string.
If an application encounters a localisation set with duplicate strings, it should ignore the value of any duplicate strings and may deduplicate the localisation set; where possible it should not deduplicate a localisation set that has been reordered from its serialised form.
To deduplicate a localisation set, the application first notes the datatype and, if present, the language tag of the first string in the localisation set. Next, all duplicate strings are deleted from the localisation set. Finally, if a string with the noted datatype and language tag remains after deduplication, the application shall reorder the localisation set to ensure it is the first string in the deduplicated localisation set; if there is not, the application shall insert an empty string with that datatype and language tag as the first string in the localisation set.
If an application needs to merge two or more localisation sets, the contents of each localisation set shall be combined in the order specified by this standard, and the application should deduplicate the resultant localisation set.
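The deduplication and merging steps can be sketched in Python as below. This is a minimal illustration rather than a normative algorithm: the `TaggedString` type and the function names are hypothetical, a language tag of `None` is assumed to mark a non-language-tagged datatype, and merging is assumed to concatenate the sets in the order they are passed.

```python
from typing import List, NamedTuple, Optional


class TaggedString(NamedTuple):
    value: str
    datatype: str
    language: Optional[str] = None  # None for non-language-tagged datatypes


def _key(s: TaggedString):
    return (s.datatype, s.language)


def deduplicate(loc_set: List[TaggedString]) -> List[TaggedString]:
    if not loc_set:
        return []
    first_key = _key(loc_set[0])  # step 1: note the first string's tags
    counts = {}
    for s in loc_set:
        counts[_key(s)] = counts.get(_key(s), 0) + 1
    kept, satisfied = [], set()
    for s in loc_set:  # step 2: delete duplicate strings
        k = _key(s)
        if counts[k] == 1:
            kept.append(s)  # the only string with these tags: not a duplicate
        elif s.value and k not in satisfied:
            kept.append(s)  # first non-empty string with these tags
            satisfied.add(k)
    # step 3: restore a string with the noted tags to the first position
    head = [s for s in kept if _key(s) == first_key]
    if not head:
        head = [TaggedString("", *first_key)]
    return head + [s for s in kept if _key(s) != first_key]


def merge(*loc_sets: List[TaggedString]) -> List[TaggedString]:
    combined = [s for loc_set in loc_sets for s in loc_set]
    return deduplicate(combined)
```

Note that an empty string is dropped only when another string shares its datatype and language tag; a lone empty string (such as the `und` placeholder for an unknown original) is preserved.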
If a citation element has a citation element value which is an empty localisation set, that citation element should be discarded.
A citation element term is a term which has been defined specifically for use as a citation element name in the following manner. The party defining the citation element term shall provide a description of the intended purpose of the citation element term which should be made freely available to all interested parties, preferably by an HTTP request as described in §1.4.2 of this standard. In addition, the definition shall state:
A citation element term may be defined as a sub-element of another citation element term which is referred to as its super-element. This is used to provide a refinement of a general citation element term. If an application is unfamiliar with the sub-element it may process it as if it were the super-element, with its citation element value unchanged. The sub-element must be defined in such a way that this only results in some loss of meaning, and does not imply anything false about the cited source.
The [CEV Vocabulary] defines a citation element term with the name
https://terms.fhiso.org/sources/creatorName
which contains the name of a person, organisation or other entity who created or contributed to the creation of the source. Several sub-elements of it are defined, including
https://terms.fhiso.org/sources/interviewerName
which contains the name of an interviewer when the source is an interview. An interviewer can certainly be considered to have contributed to the creation of the interview.
The [CEV Vocabulary] also defines a citation element with the name
https://terms.fhiso.org/sources/recipientName
which contains the name of the party to whom a source such as a letter is addressed. In many respects it is similar to the sub-elements of creatorName, but because a recipient of a letter cannot be said to have contributed to the creation of the letter, and might not even be aware of its existence if it were not delivered, the recipientName element cannot be defined as a sub-element of creatorName.
The range of a sub-element shall be the same as that of its super-element.
Any sub-element of a single-valued super-element must be single-valued.
The super-element list of a citation element term is an ordered list of IRIs defined inductively as follows. If the citation element term is not a sub-element, then its super-element list contains just that citation element term. Otherwise, its super-element list is the super-element list of its super-element to which its own citation element term is appended.
The ultimate super-element of a citation element term is defined as the first IRI in its super-element list.
The ultimate single-valued super-element of a single-valued citation element term is defined as the first IRI in its super-element list that is a single-valued citation element term.
The most-refined common super-element of a collection of citation element terms is defined as the last IRI that appears in the super-element list of every citation element term in the collection. It is only defined for citation element terms that share an ultimate super-element.
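These inductive definitions translate directly into code. The sketch below assumes a hypothetical `parents` mapping from each citation element term to its super-element (or `None` if it has none), and assumes the mapping contains no cycles; neither the mapping nor the function names are defined by this standard.

```python
from typing import Dict, Iterable, List, Optional


def super_element_list(term: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the super-element list, ultimate super-element first."""
    chain = []
    current: Optional[str] = term
    while current is not None:
        chain.append(current)
        current = parents.get(current)  # unknown terms are treated as top-level
    chain.reverse()
    return chain


def ultimate_super_element(term: str, parents: Dict[str, Optional[str]]) -> str:
    return super_element_list(term, parents)[0]


def most_refined_common_super_element(terms: Iterable[str],
                                      parents: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the last IRI shared by every super-element list, or None."""
    lists = [super_element_list(t, parents) for t in terms]
    common = None
    for column in zip(*lists):  # walk the lists in step from the root
        if all(iri == column[0] for iri in column):
            common = column[0]
        else:
            break
    return common
```

Because each super-element list is a path from the ultimate super-element down to the term, the IRIs common to every list always form a shared prefix, so the last entry of that prefix is the most-refined common super-element. A result of `None` corresponds to the case the standard leaves undefined, where the terms do not share an ultimate super-element.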
The range of a citation element term is a union of datatypes, which describes what citation element values are valid in a citation element with this citation element name.
The [CEV Vocabulary] defines a datatype for representing the names of authors and other people, which has the following term name:
https://terms.fhiso.org/sources/AgentName
A union of datatypes consisting of just this one datatype is used as the range of several citation element terms defined in the [CEV Vocabulary] including:
https://terms.fhiso.org/sources/editorName
Citation element terms with non-textual citation element values such as numbers or dates should have ranges that include one or more non-language-tagged datatypes.
The [CEV Vocabulary] defines an abstract datatype called AbstractDate which is used as the supertype of all structured datatypes for dates; it has the following term name:
https://terms.fhiso.org/dates/AbstractDate
Several citation element terms have a range consisting of a union of AbstractDate and rdf:langString. One such citation element term is:
https://terms.fhiso.org/sources/publicationDate
Because this citation element typically has non-textual values, frequently just a year, its range should include a non-language-tagged datatype: in this case, AbstractDate. The inclusion of rdf:langString is to allow dates that cannot readily be represented in any of the available structured formats. An example might be a termly university publication dated “Michaelmas term, 1997”.
AbstractDate has not been discussed yet.
A datatype is said to be compatible with the range if either it is one of the datatypes listed in the range, or it is a subtype of a datatype that is compatible with the range.
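The compatibility test amounts to walking up the supertype chain until a datatype listed in the range is found. The following sketch assumes each datatype has at most one supertype, recorded in a hypothetical `supertypes` mapping; the IRI used for GregorianDate in the usage note is likewise an assumption, as this draft does not give one.

```python
from typing import Dict, Optional, Set


def is_compatible(datatype: str,
                  range_: Set[str],
                  supertypes: Dict[str, Optional[str]]) -> bool:
    """True if the datatype, or any of its supertypes, is listed in the range."""
    seen = set()
    current: Optional[str] = datatype
    while current is not None and current not in seen:
        if current in range_:
            return True
        seen.add(current)  # guards against a malformed, cyclic hierarchy
        current = supertypes.get(current)
    return False
```

With a range of `{AbstractDate, rdf:langString}` and a mapping that records an assumed `https://terms.fhiso.org/dates/GregorianDate` as a subtype of AbstractDate, the GregorianDate datatype is compatible with the range while AgentName is not.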
A string in a localisation set which is used as a citation element value is said to be invalid if, after datatype correction has occurred per §4.4 of this standard, either the string is tagged with a datatype that is not compatible with the range of the citation element term used as the citation element name, or the string is outside the lexical space of that datatype. Conformant applications should take steps to avoid creating localisation sets containing invalid strings.
Applications may use one or more discovery mechanisms to obtain the information needed to determine which strings are invalid.
If the range of the citation element term includes one of the following datatypes, applications should change the datatype of the invalid string to that datatype:
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
http://www.w3.org/2001/XMLSchema#string
If the range does not include either of these datatypes, applications may discard any strings that are found to be invalid. It is recommended that this should be done prior to deduplicating a localisation set, and it may be done at other times. A conformant application must not discard any string unless it is known to be invalid or as otherwise permitted by this standard.
Exceptionally, a conformant application may also discard any string which it has credible reason to believe contains malware or illegal content, or any string that is so long that the application cannot reasonably handle it.
The cardinality of a citation element term records how many semantically distinct values it can have. A multi-valued citation element term is one that can logically have multiple values in a single citation layer. It should be reserved for situations where the values genuinely contain different information, and not used to accommodate transliterations, translations, or variant forms of something that is logically a single value. Citation element terms that are not multi-valued are single-valued.
The https://terms.fhiso.org/sources/title citation element term is defined to be single-valued, as citations do not refer to the same source by multiple titles (though they may translate or transliterate the title), so a citation element set must not contain more than one citation element with this citation element name; but it may contain several https://terms.fhiso.org/sources/authorName citation elements, as that is defined to be multi-valued to accommodate sources with several authors.
In a citation element set which contains more than one citation element whose citation element names have the same ultimate single-valued super-element, any citation element other than the first citation element with that ultimate single-valued super-element is known as a duplicate citation element.
Citation element sets should not contain duplicate citation elements, and an application should take steps to avoid creating duplicate citation elements.
When duplicate citation elements are present, they may be deduplicated. To deduplicate a citation element set, the application should replace all the citation elements with a common ultimate single-valued super-element with a single replacement citation element with the following properties:
Consider the following citation element set, written in a hypothetical JSON format:
[ "title": [ "fr": "Les ancêtres des Charlemagne",
"en": "The Ancestors of Charlemagne" ],
"title": [ "fr": "Les Ancêtres des Charlemagne",
"de": "Die Vorfahren von Karl dem Großen" ] ]
Assuming the title citation element term is single-valued, an application may deduplicate the citation element set by merging the two localisation sets in order to get the following:
[ "title": [ "fr": "Les ancêtres des Charlemagne",
"en": "The Ancestors of Charlemagne",
"fr": "Les Ancêtres des Charlemagne",
"de": "Die Vorfahren von Karl dem Großen" ] ]
After merging the localisation sets, §3.2.2 says the application should deduplicate the resultant localisation set. This removes the second French title to give the following:
[ "title": [ "fr": "Les ancêtres des Charlemagne",
"en": "The Ancestors of Charlemagne",
"de": "Die Vorfahren von Karl dem Großen" ] ]
These rules mean that single-valued citation elements with the same ultimate single-valued super-element (in this example, with the same citation element name) are assumed to be given in order of preference for the purpose of deduplicating the merged localisation set, with the most preferred value first.
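The worked example above can be reproduced with a short Python sketch. It is deliberately limited: it only handles duplicate citation elements that share the same citation element name (not distinct sub-elements of a common single-valued super-element), it treats every string as an rdf:langString identified by its language tag, and it assumes the replacement element takes the position of the first duplicate. All type and function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

LangString = Tuple[str, str]            # (language tag, value), all rdf:langString
Element = Tuple[str, List[LangString]]  # (citation element name, localisation set)


def dedupe_strings(strings: List[LangString]) -> List[LangString]:
    """Drop duplicate strings, keeping the first string's language tag in front."""
    if not strings:
        return []
    counts = Counter(lang for lang, _ in strings)
    kept, done = [], set()
    for lang, value in strings:
        # A unique tag is never a duplicate; otherwise keep the first non-empty string.
        if counts[lang] == 1 or (value and lang not in done):
            kept.append((lang, value))
            done.add(lang)
    first = strings[0][0]
    head = [s for s in kept if s[0] == first] or [(first, "")]
    return head + [s for s in kept if s[0] != first]


def dedupe_elements(elements: List[Element], single_valued: Set[str]) -> List[Element]:
    result: List[Element] = []
    position = {}   # single-valued name -> index of its first element in result
    merged = set()  # names whose localisation sets were actually merged
    for name, strings in elements:
        if name in position:
            result[position[name]][1].extend(strings)  # merge in order
            merged.add(name)
        else:
            if name in single_valued:
                position[name] = len(result)
            result.append((name, list(strings)))
    return [(n, dedupe_strings(s) if n in merged else s) for n, s in result]
```

Multi-valued citation elements are passed through untouched, so two authorName elements remain two separate citation elements.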
This standard needs to define how to merge citation element sets. The following text is a start towards that.
If an application needs to merge two or more citation element sets, the contents of each citation element set shall be combined in order. The application shall identify any sets of duplicate citation elements in the combined citation element set and deduplicate them according to the rules above. An application may use one or more discovery mechanisms to attempt to obtain machine-readable definitions of any extension citation element used in the citation element set before identifying duplicate citation elements.
However the merger of multi-valued elements requires thought too. Even though the data model doesn’t require deduplication, it is still necessary to prevent duplication of, say, authors.
Conformant applications must ensure that in citation elements whose citation element names are multi-valued, the localisation set in each citation element value remains separate.
The authorName citation element term is defined to be multi-valued because a source may have multiple authors, and each of them may have names that have been transliterated into different scripts. Suppose a researcher wants to cite the Anglo-Japanese Treaty document of 1902 which was (at least nominally) authored by the Marquess of Lansdowne and Count Hayashi Tadasu whose name is written in kanji as 林 董.
The following hypothetical JSON serialisation is not allowed as it flattens localisation sets so it is no longer possible to determine how many authors there are, and which names are translations of which others.
[ { "name": "https://terms.fhiso.org/sources/title",
    "lang": "en", "value": "The Anglo-Japanese Treaty" },
  { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "en", "value": "Lord Lansdowne" },
  { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "ja", "value": "林 董" },
  { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "ja-Latn", "value": "Hayashi Tadasu" } ]
In this example, the datatype of each string has been omitted on the assumption that it defaults to rdf:langString and is corrected via the mechanism specified in §4.4 of this standard.
This is an example of a list-flattening format that does not conform to this specification; a list-flattening format that does conform to this specification is found in the next example.
A serialisation format that does not keep the localisation sets of each citation element value separate is called a list-flattening format, and this standard provides a facility to allow such formats to comply with this standard by introducing a special citation element term with the following properties:
Name | https://terms.fhiso.org/sources/localisedElement |
Range | unspecified |
Cardinality | multi-valued |
Super-element | none |
Default datatype | none |
The localisedElement citation element term has no range specified. No other citation element terms defined in accordance with this standard may have an unspecified range.
In a list-flattening format, an application must consider every value to be a separate citation element value, and therefore to be a localisation set with one element.
When a localisation set with two or more strings needs to be serialised in a list-flattening format, the first string must be serialised according to the normal rules of the format, and subsequent strings must be serialised as if they were separate citation elements, but with the localisedElement citation element term in place of the actual citation element name. This special citation element indicates that its value is not a distinct citation element and should instead be appended to the localisation set of its localisation base (i.e. the last preceding citation element which is not a localisedElement), and the localisedElement removed from the citation element set.
The hypothetical JSON serialisation in the last example can be fixed by using a localisedElement to serialise the transliterated version of Hayashi’s name:
[ { "name": "https://terms.fhiso.org/sources/title",
    "lang": "en", "value": "The Anglo-Japanese Treaty" },
  { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "en", "value": "Lord Lansdowne" },
  { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "ja", "value": "林 董" },
  { "name": "https://terms.fhiso.org/sources/localisedElement",
    "lang": "ja-Latn", "value": "Hayashi Tadasu" } ]
The two authorName elements are assumed to be separate citation elements and therefore to refer to different authors. The use of localisedElement signifies that this is not a different author. It immediately follows an authorName citation element with the value 林 董, and its value (“Hayashi Tadasu”, tagged as ja-Latn) should be appended to that localisation set.
localisedElements occurs. Ideally an application should do it during the process of reading a list-flattening format, but may do it later or not at all. If the application subsequently serialises the data in a non-list-flattening format, the localisedElements may still be present. Therefore applications reading a non-list-flattening format should cope with the possibility of localisedElements being present.
If the localisation set in the localisation base already contains a string with the same datatype and language tag, an application must not overwrite or duplicate a language tag; the localisedElement should be ignored and may be removed from the citation element set.
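Reading a list-flattening format might look like the following sketch, which folds each localisedElement into the localisation set of its localisation base. It assumes the flat input is a list of dictionaries with hypothetical "name", "lang" and "value" keys, that every string is an rdf:langString (so the language tag alone identifies a clash), and that a localisedElement with no localisation base is simply skipped; the standard does not prescribe that last choice.

```python
from typing import Dict, List, Tuple

LOCALISED = "https://terms.fhiso.org/sources/localisedElement"

LangString = Tuple[str, str]            # (language tag, value)
Element = Tuple[str, List[LangString]]  # (citation element name, localisation set)


def unflatten(flat: List[Dict[str, str]]) -> List[Element]:
    elements: List[Element] = []
    for item in flat:
        string = (item["lang"], item["value"])
        if item["name"] == LOCALISED:
            if elements:
                base_strings = elements[-1][1]  # localisation base's set
                # Ignore the localisedElement if the base already has a
                # string with the same language tag.
                if all(lang != string[0] for lang, _ in base_strings):
                    base_strings.append(string)
            continue  # the localisedElement itself is removed
        # Every other value starts a new localisation set of one string.
        elements.append((item["name"], [string]))
    return elements
```

Applied to a flat list holding a title, two authorName values and a trailing localisedElement, this yields three citation elements, with the transliterated name appended to the localisation set of the second author.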
The use of list-flattening formats is not recommended except where there is a good technical reason. The use of localisedElements other than in list-flattening formats is not recommended.
A citation element term may have a default datatype defined. When a default datatype is defined, it is used to provide an optional datatype correction mechanism for correcting the datatype of a string in the localisation set of a citation element value in certain situations. The default datatype must be a datatype that is compatible with the range of the citation element term.
Datatype correction shall not be carried out unless the datatype of the string prior to datatype correction is one of the following datatypes, and not just a subtype of one of them:
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
http://www.w3.org/2001/XMLSchema#string
http://www.w3.org/2000/01/rdf-schema#Resource
rdf:langString. Support for xsd:string and rdfs:Resource is only included in this datatype correction mechanism to accommodate certain corner cases in RDF processing that could arise in the [CEV RDFa] bindings.
Datatype correction shall only be applied to a string if it appears in a citation element whose citation element name is a citation element term that has a default datatype, and if that default datatype is a datatype whose pattern is known to the application, and if the string matches that pattern.
At any time when an application encounters a string which is eligible for datatype correction according to the above criteria, it may replace its datatype with the default datatype. It is recommended that applications apply datatype correction during or shortly after the import of data in any serialisation format that defines a format default datatype of rdf:langString.
The hypothetical JSON format used in several earlier examples included the following citation element:
[ { "name": "https://terms.fhiso.org/sources/authorName",
    "lang": "ja", "value": "林 董" } ]
This hypothetical format is supposed to default datatypes to rdf:langString, as recommended by this standard.
The authorName citation element is defined in the [CEV Vocabulary] to have the following default datatype:
https://terms.fhiso.org/sources/AgentName
This datatype in turn defines the following pattern:
([^!#$%&@{|}]+@)?[^!#$%&@{|}]+(\|[^!#$%&@{|}]*(\|[^!#$%&@{|}]+)?)?
The string “林 董” matches this pattern (specifically, it matches the second [^!#$%&@{|}]+ part of the pattern), and therefore datatype correction will change the datatype to this AgentName datatype.
AgentName will almost certainly need changing as the AgentName datatype is properly specified.
The publicationDate citation element term defined in the [CEV Vocabulary] has a range which is the union of the AbstractDate and rdf:langString datatypes; its default datatype is GregorianDate, a subtype of AbstractDate with the following pattern:
-?[0-9]{4,}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?
A citation element set might contain a publicationDate citation element whose localisation set contains the following two strings, both tagged with the language tag en and datatype rdf:langString (presumably implicitly as the result of no datatype being given in the serialisation):
Michaelmas term, 1997
1997-10
The former string is not remotely close to matching the pattern for the GregorianDate datatype, so it is unaffected by datatype correction; however the latter string does match the pattern and so datatype correction may change its datatype to GregorianDate.
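The eligibility rules for datatype correction can be illustrated with the sketch below, using the publicationDate example. It makes several assumptions that are not fixed by this draft: the IRI given for GregorianDate is hypothetical, "matches the pattern" is taken to mean that the whole string matches, and the function name and parameters are invented for illustration.

```python
import re
from typing import Optional, Pattern, Tuple

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDFS_RESOURCE = "http://www.w3.org/2000/01/rdf-schema#Resource"
CORRECTABLE = {RDF_LANGSTRING, XSD_STRING, RDFS_RESOURCE}

# Hypothetical IRI: this draft does not specify one for GregorianDate.
GREGORIAN_DATE = "https://terms.fhiso.org/dates/GregorianDate"
GREGORIAN_PATTERN = re.compile(
    r"-?[0-9]{4,}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?")


def correct_datatype(value: str, datatype: str, language: Optional[str],
                     default_datatype: str, pattern: Pattern[str],
                     default_is_language_tagged: bool = False
                     ) -> Tuple[str, str, Optional[str]]:
    """Return (value, datatype, language) after optional datatype correction."""
    # Only exact matches of the three correctable datatypes are eligible,
    # and only when the string matches the default datatype's pattern.
    if datatype not in CORRECTABLE or not pattern.fullmatch(value):
        return value, datatype, language
    if default_is_language_tagged:
        # A string that had no language tag must be tagged "und".
        return value, default_datatype, language or "und"
    # Non-language-tagged datatypes carry no language tag.
    return value, default_datatype, None
```

Under these assumptions, “Michaelmas term, 1997” is returned unchanged as an rdf:langString tagged en, while “1997-10” becomes a GregorianDate with no language tag.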
This is an example of where a localisation set might usefully contain both language-tagged datatypes and non-language-tagged datatypes. The former gives the date in the correct form for inclusion in a formatted citation, while the latter allows an application to parse the date, for example to highlight contemporary sources to a user.
GregorianDate datatype, let alone whether it is actually the default datatype of the publicationDate citation element term. If such a datatype is specified, it is unlikely to have precisely the pattern given above. Nevertheless, it is safe to assume that this citation element term will have a default datatype that is some structured datatype for dates.
Matching the pattern of a datatype does not guarantee the string necessarily belongs to the lexical space of that datatype, so it is possible that datatype correction might turn a valid unstructured string into an invalid string. An application should not perform datatype correction when it knows the result would be an invalid string.
rdf:langString or xsd:string rather than being discarded.
Applications should try to ensure that no strings are entered which match the pattern of the default datatype but are outside its lexical space. One strategy for ensuring this is to suggest an alteration to the string that would prevent it from matching the pattern; however applications must not make such an alteration other than at the instruction of the user.
The string “1999-02-31” matches the pattern for a GregorianDate but is nonetheless outside the lexical space of that datatype as there was no such date. A conformant application might warn the user that this is not a valid Gregorian date; if the user confirms they really did mean to enter an unstructured string that looks like an invalid Gregorian date, the application may alter the string to make it not match the pattern. One way this could be done would be appending “(sic)” to the string; another option is to append an invisible Unicode character such as U+2060 (word joiner).
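The difference between matching a pattern and lying within a lexical space can be made concrete with a small check. The sketch below assumes that the lexical space of the (as yet unspecified) GregorianDate datatype excludes days that do not exist in the given month under proleptic Gregorian leap-year rules; the function names are hypothetical.

```python
import re

GREGORIAN_PATTERN = re.compile(
    r"-?[0-9]{4,}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?")


def days_in_month(year: int, month: int) -> int:
    if month == 2:
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 29 if leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


def in_lexical_space(value: str) -> bool:
    """True if the string matches the pattern and names a real calendar date."""
    m = GREGORIAN_PATTERN.fullmatch(value)
    if m is None:
        return False
    if m.group(4) is None:
        return True  # a year, or a year and month, is always a real date
    year = int(value[:m.start(1)])  # everything before the "-MM-DD" part
    month, day = int(m.group(2)), int(m.group(4))
    return day <= days_in_month(year, month)
```

An application could call such a check before datatype correction and, when it fails for a string that nonetheless matches the pattern, warn the user rather than silently producing an invalid string.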
If datatype correction would result in replacing a non-language-tagged datatype with a language-tagged datatype, then the application must tag the string with the language tag und.
xsd:string or rdfs:Resource datatypes, which this standard discourages when the data is indeed language-tagged.
In the data model defined in this standard, a citation layer is represented by a citation element set containing the information in the citation layer.
A citation is represented with the following three parts:
In the common case of a single-layer citation, the set of layer derivation links will be empty, and the sole citation layer present must be the head citation layer. This means that a single-layer citation can be represented using just a citation element set.
Applications should not reorder the list of citation layers, other than at the request of the user. The order of the citation layers is an indication of the preferred order for displaying the citation layers, and should begin with the one considered most important. This is not necessarily the head citation layer. Applications may ignore this order when displaying or formatting citation layers.
When the sources represented by two citation layers are linked by a source derivation, a layer derivation link is used to encode this. It has three parts, all of which are required:
The two references to citation layers in the layer derivation link shall refer to citation layers present in the current citation.
The source derivation type shall be either an IRI defined in accordance with a future FHISO standard on source derivation types, or the following IRI which represents the most general case of derivation supported in this data model:
https://terms.fhiso.org/sources/derivedFrom
prov:wasDerivedFrom or prov:wasInfluencedBy properties from [PROV-O] instead of inventing our own derivedFrom term?
Applications may discard any IRI that they know does not conform to the above requirement.
derivedFrom source derivation type. The Source Derivation Vocabulary standard will also provide a mechanism for third parties to provide their own extension source derivation types, and provide a means of determining whether a given IRI is a source derivation type. If that document is ready for standardisation at the same time as this document, the previous paragraph will be updated to reference it.
A citation layer is directly derived from another citation layer if there exists a layer derivation link whose derived reference is to the former citation layer and whose base reference is to the latter citation layer. The direct base citation layer set of a citation layer is the set of citation layers from which the first citation layer is directly derived.
The complete base citation layer set of a citation layer is defined recursively as follows. The citation layer itself is part of its complete base citation layer set. It also contains every citation layer in the complete base citation layer set of every citation layer in its direct base citation layer set.
The complete base citation layer set of the head citation layer shall contain every citation layer in the citation. If an application encounters a citation for which this is not the case, it may discard any citation layers that are not in the complete base citation layer set of the head citation layer.
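The recursive definition of the complete base citation layer set is a reachability computation over the layer derivation links. The sketch below represents each link as a hypothetical `(derived, base)` pair of layer identifiers, omitting the source derivation type, and uses an explicit stack so that accidental cycles in the links do not cause infinite recursion.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Link = Tuple[Hashable, Hashable]  # (derived layer, base layer)


def complete_base_set(layer: Hashable, links: Sequence[Link]) -> Set[Hashable]:
    """Return the layer together with every layer it is derived from."""
    result = {layer}
    stack = [layer]
    while stack:
        current = stack.pop()
        for derived, base in links:
            if derived == current and base not in result:
                result.add(base)
                stack.append(base)
    return result


def unreachable_layers(layers: Sequence[Hashable], head: Hashable,
                       links: Sequence[Link]) -> List[Hashable]:
    """Layers that an application may discard, in their original order."""
    reachable = complete_base_set(head, links)
    return [layer for layer in layers if layer not in reachable]
```

A citation is well formed in this respect when `unreachable_layers` returns an empty list for its head citation layer.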
Copyright © 2017, Family History Information Standards Organisation, Inc. The text of this standard is available under the Creative Commons Attribution License.