A microformat for creator’s names
This standard defines a class for storing the name of a creator of a genealogical source in a citation element. The word “creator” is used rather than “author” so as also to include editors, compilers, painters and stonemasons. The format is designed to accommodate names conforming to a wide range of cultural norms, not just Western ones; however it does not aim to cover all variants of names, just those that might reasonably be expected to appear in a genealogical citation.
It is explicitly not a goal of this microformat to allow the storage of additional data on creators, such as their email addresses, phone numbers or academic affiliations. The representation of corporate authors such as companies, organisations or government departments is also outside the scope of this microformat.
Basics
The key words must, must not, required, shall, shall not, should, should not, recommended, may, and optional in this standard are to be interpreted as described in [RFC 2119].
The grammar given here uses the same EBNF notation as [XML], except that grammar symbols always have initial capital letters. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming values may be accepted and processed by a conforming application in an implementation-defined manner.
Characters are specified by reference to their code point number in [ISO/IEC 10646], without regard to any particular character encoding. In this standard, characters may be identified by their hexadecimal code point prefixed with “U+”.
Characters must match the Char production from [XML].
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
The value of a citation element is a string, defined as a sequence of characters.
This definition of a string corresponds to the xs:string datatype defined in [XSD Pt2], used in many XML and Semantic Web technologies.
Characters matching the RestrictedChar production from [XML] should not appear in strings, and applications may process such characters in an implementation-defined manner or reject strings containing them.
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
| [#x7F-#x84] | [#x86-#x9F]
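The two productions above can be checked mechanically. The following Python sketch is illustrative only and is not part of this standard; the function names is_valid_string and has_restricted_chars are hypothetical. It tests whether every character of a string matches Char, and whether any character matches RestrictedChar.

```python
import re

# Char: U+0001 upwards, excluding surrogates, U+FFFE and U+FFFF.
_CHAR = re.compile(r"[\u0001-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*")

# RestrictedChar: control characters other than tab (U+0009), line feed
# (U+000A), carriage return (U+000D) and next line (U+0085).
_RESTRICTED = re.compile(
    r"[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F-\u0084\u0086-\u009F]"
)


def is_valid_string(value: str) -> bool:
    """Hypothetical helper: True if every character matches Char."""
    return _CHAR.fullmatch(value) is not None


def has_restricted_chars(value: str) -> bool:
    """Hypothetical helper: True if any character matches RestrictedChar."""
    return _RESTRICTED.search(value) is not None
```

An application might reject a value when is_valid_string returns False, and choose between rejecting and otherwise processing a value when has_restricted_chars returns True.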
Applications may convert any string into Unicode Normalization Form C, as defined in any version of Unicode Standard Annex #15 [UAX 15].
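As a minimal sketch of this conversion, Python's standard unicodedata module can produce Normalization Form C. The function name to_nfc is hypothetical, and the result depends on the version of the Unicode data bundled with the Python interpreter in use.

```python
import unicodedata


def to_nfc(value: str) -> str:
    """Hypothetical helper: convert a string to Normalization Form C."""
    return unicodedata.normalize("NFC", value)


# "e" followed by U+0301 (combining acute accent) composes to U+00E9.
decomposed = "Me\u0301ndez Silva, Rodrigo"
composed = to_nfc(decomposed)
```

After this conversion, composed holds one character fewer than decomposed, because the base letter and combining accent have been replaced by a single precomposed character.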
Whitespace is defined as a sequence of one or more space characters, carriage returns, line feeds, or tabs. It matches the production S from [XML].
S ::= (#x20 | #x9 | #xD | #xA)+
Whitespace normalisation is the process of discarding any leading or trailing whitespace, and replacing other whitespace with a single space (U+0020) character. Many citation element values are whitespace-normalised before being processed, and in such elements, the production S collapses to a single space (U+0020).
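Whitespace normalisation as described above might be sketched in Python as follows. The function name normalise_whitespace is hypothetical. The sketch deliberately matches only the four characters in the S production rather than using Python's own notion of whitespace, which would also treat the non-breaking space (U+00A0) as whitespace.

```python
import re

# The S production: one or more of space, tab, carriage return, line feed.
_S = re.compile(r"[ \t\r\n]+")


def normalise_whitespace(value: str) -> str:
    """Hypothetical helper: collapse each run of whitespace to one space
    and discard leading and trailing whitespace."""
    return _S.sub(" ", value).strip(" ")
```

For instance, normalise_whitespace("  William III \n&&\tMary II ") yields "William III && Mary II", while a non-breaking space inside a name is left untouched.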
In the event of a difference between the definitions of the Char, RestrictedChar and S productions given here and those in [XML], the definitions in the latest edition of the XML 1.1 specification are applicable.
List types
This section defines a general pattern that can be used for serialising an ordered list of one or more items into a citation element value by separating them with a ListSeparator consisting of a repeated ampersand (U+0026).
ListSeparator ::= S '&&' S
ListSeparator
, both in terms of the punctuator and whether surrounding whitespace should be required.
By convention in this standard, the grammar production for such a list is the name of the individual item’s production suffixed by List. The CreatorsNameList production follows this convention:
CreatorsNameList ::= CreatorsName ( ListSeparator CreatorsName )*
For example, a source created jointly by William III and Mary II would have a CreatorsNameList of “William III && Mary II”.
Applications should whitespace-normalise any CreatorsNameList value before processing or serialising it.
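As an illustration, a whitespace-normalised CreatorsNameList can be split into its individual CreatorsName values on the separator " && ". The Python sketch below assumes the input conforms to the grammar; the function name split_creators_name_list is hypothetical.

```python
import re


def split_creators_name_list(value: str) -> list:
    """Hypothetical helper: split a CreatorsNameList into CreatorsName strings."""
    # Whitespace-normalise first, so that each ListSeparator (S '&&' S)
    # collapses to exactly " && ".
    normalised = re.sub(r"[ \t\r\n]+", " ", value).strip(" ")
    return normalised.split(" && ")
```

Splitting on the literal separator is sufficient for conforming values, because this standard requires an ampersand occurring within a name to be escaped.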
The class name for the class of values encoding a list of the names of the creators of a genealogical source that matches this production is:
http://terms.fhiso.org/type/CreatorsNameList
Name variants
A CreatorsName contains four separate variants of a name.
- The sort data used to order a list of names, and which is never displayed.
- The bibliographic version of the name to display in a bibliography sorted by author.
- The short version that may be used in subsequent references to that author’s work.
- The natural version, giving the full name as it is normally written in a first reference.
All these forms might be used by a single citation style. A work following The Chicago Manual of Style would use the natural version on the first reference note:
1. Douglas Richardson, Royal Ancestry (Salt Lake City: 2013), II 484.
Then, when the work is next cited, the short version would be used:
7. Richardson, Royal Ancestry, IV 426.
Finally, in the source list or bibliography, the bibliographic version would be used:
Richardson, Douglas. Royal Ancestry. Salt Lake City: 2013.
The natural version of the creator’s name is generally the name in whatever form it is given in the work. It is normal for any titles or post-nominals to be dropped unless they are necessary to distinguish two people of the same name, but the CreatorsName format does not enforce that.
For example, the natural version of a name might be “David Dumville”. The quotation marks do not form part of the syntax.
The bibliographic version of a name is normally produced by reordering the natural version of the name to bring the most significant parts to the front, and inserting commas, typically to mark where the natural version’s order was disturbed. The intended format is the one commonly used in bibliographies and indexes which aims to appear logical when sorted; nevertheless, this standard does not require a specific choice of bibliographic version format.
For example, a bibliographic version might be “Νυσταζοπούλου-Πελεκίδου, Μαρία”, if written in Greek.
The sort data will usually be the same as the bibliographic version, but in some cultures the correct sorting of names requires information that is not normally displayed. For Japanese names, the sort data will commonly be an entirely separate version of the name written in kana, as names are sorted based on pronunciation, not their kanji spelling.
The short version of the name is typically just the main component of the name, as used in a formal or academic context to refer to a person who has already been referred to. For Western names this is normally the surname, but in cultures where the norm is for people to have a given name and a patronymic, this would be the given name. The short version should not be used to encode arbitrary shortened forms of names, such as versions with initials in place of given names, or diminutive forms of names.
For example, the short version of the name Örnólfur Þorsson is “Örnólfur”; Þorsson is a patronymic derived from his father’s name, Þor.
The name format
The general format for a CreatorsName allows the inclusion of between one and four name variants, represented by Name in the grammar production below. The four Names are, in order, the sort data, the bibliographic version, the short version and the natural version; all except the bibliographic version are optional.
CreatorsName ::= ( Name S '@' S )? Name
( S '|' ( S Name )? ( S '|' S Name )? )?
The class name for the class of values encoding the names of the creators of a genealogical source that matches this production is:
http://terms.fhiso.org/type/CreatorsName
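The structure of the CreatorsName production can be illustrated with a small Python sketch that separates a value into its explicitly given variants, without applying any defaults. The names ExplicitVariants and parse_creators_name are hypothetical, and the parser is deliberately lenient: it does not verify every detail of the grammar, such as the whitespace required around the separators, and it leaves escaped characters as they are.

```python
import re
from typing import NamedTuple, Optional


class ExplicitVariants(NamedTuple):
    """Hypothetical container; None marks a variant that was not given."""
    sort_data: Optional[str]
    bibliographic: str
    short: Optional[str]
    natural: Optional[str]


def parse_creators_name(value: str) -> ExplicitVariants:
    text = re.sub(r"[ \t\r\n]+", " ", value).strip(" ")
    sort_data = None
    if "@" in text:
        # Optional sort data precedes the '@' separator.
        head, text = text.split("@", 1)
        sort_data = head.strip(" ")
        if not sort_data:
            raise ValueError("empty sort data before '@'")
    fields = [field.strip(" ") for field in text.split("|")]
    if len(fields) > 3:
        raise ValueError("too many '|' separators")
    fields += [""] * (3 - len(fields))
    bibliographic, short, natural = fields
    if not bibliographic:
        raise ValueError("the bibliographic version is required")
    return ExplicitVariants(sort_data, bibliographic, short or None, natural or None)
```

Escaped characters are best expanded only after this split, so that an escaped | or @ is never mistaken for a separator.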
Default sort data
If sort data is present, it precedes the bibliographic version, and is separated from it by an @ character (U+0040). If the sort data is not explicitly given in the CreatorsName, it defaults to the bibliographic version.
For example, “Homer” and “Homer @ Homer” are treated identically; the omitted sort data in the former has been written explicitly in the latter. Japanese names will normally need the sort data to be given explicitly. Emperor Meiji might be represented as “めいじてんのう @ 明治天皇”, where the first part gives the pronunciation of Meiji-tennō in hiragana.
Because the sort data is the leading substring of the CreatorsName, and if omitted defaults to the bibliographic version which would then be the leading substring (and cannot be omitted), a collation algorithm applied to the whole CreatorsName would normally do the same as if applied to just the sort data.
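The defaulting rule for sort data might be sketched in Python as follows. The function name sort_data is hypothetical, and the input is assumed to be a conforming CreatorsName. Python's built-in string comparison orders by code point, which is only a stand-in here for a locale-aware collation algorithm.

```python
_S_CHARS = " \t\r\n"  # the characters of the S production


def sort_data(creators_name: str) -> str:
    """Hypothetical helper: the explicit sort data if given, otherwise
    the bibliographic version."""
    head, separator, _rest = creators_name.partition("@")
    if separator:
        return head.strip(_S_CHARS)
    # No '@': the bibliographic version is everything before the first '|'.
    return creators_name.partition("|")[0].strip(_S_CHARS)


names = ["Settipani, Christian", "Homer @ Homer", "Kong Deyong | Kong"]
ordered = sorted(names, key=sort_data)
```

Here ordered places “Homer @ Homer” first, then “Kong Deyong | Kong”, then “Settipani, Christian”.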
Default short version
If the short version is present, it comes immediately after the bibliographic version, and is separated from it by a | character (U+007C).
For example, in a CreatorsName, the name Kong Deyong might be written “Kong Deyong | Kong” if transliterated into the Latin script, or “孔德庸 | 孔” in Chinese characters.
If the short version is not explicitly given in the CreatorsName, it defaults to a substring of the bibliographic version. If the bibliographic version contains no comma (U+002C), then the short version defaults to the whole bibliographic version. If a comma is present, the short version defaults to the whitespace-normalised substring up to but excluding the first comma.
For example, the name Christian Settipani might be written in a CreatorsName as “Settipani, Christian”. This lacks an explicit short version, which defaults to “Settipani”, the substring up to the first comma. This could have been written explicitly as “Settipani, Christian | Settipani”.
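The default short version can be computed with a few lines of Python. This is a sketch only; the function name default_short is hypothetical, and its argument is assumed to be the bibliographic version already extracted from a CreatorsName.

```python
import re


def default_short(bibliographic: str) -> str:
    """Hypothetical helper: the short version implied by a bibliographic
    version when no explicit short version is given."""
    # partition() returns the whole string as the head when no comma is
    # present, which covers the no-comma case.
    head, _comma, _rest = bibliographic.partition(",")
    return re.sub(r"[ \t\r\n]+", " ", head).strip(" ")
```

Applied to “Settipani, Christian” this gives “Settipani”; applied to “Kong Deyong” it gives the whole name, which is why an explicit short version is needed in that case.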
Default natural version
If the natural version is present, it comes immediately after the short version, separated by a | character (U+007C); if the short version was omitted, then it follows the bibliographic version separated by two | characters with a space between them.
For example, the CreatorsName “Renell, James, Major | | Major James Renell” contains both a bibliographic version and a natural version; the short version (which would have been between the two | characters) has been omitted and correctly defaults to “Renell”, the substring of the bibliographic version up to the first comma.
If the natural version is not explicitly given in the CreatorsName, it defaults to a value determined from the bibliographic version as follows.
If the bibliographic version contains no comma (U+002C), then the natural version defaults to the bibliographic version.
For example, a name might be written “Kyaw Thet” in a CreatorsName; the omitted natural version defaults to the bibliographic version as it contains no comma.
If the bibliographic version contains precisely one comma, the natural version defaults to the whitespace-normalised string created by concatenating the substring of the bibliographic version after the comma, a space (U+0020), and the substring before the comma.
For example, the name Rodrigo Méndez Silva might be written in a CreatorsName as “Méndez Silva, Rodrigo”. Not only Western names invert the natural version to form the bibliographic version. Andalusian historian Said al-Andalusi (صاعد الأندلسي) may be stored “al-Andalusī, Ṣā‘id” in the Latin script; or untransliterated, showing the characters in memory order, “ا ل أ ن د ل س ي , ص ا ع د”.
If the bibliographic version contains more than one comma, the natural version defaults to the whitespace-normalised string created by concatenating the substring of the bibliographic version between the first and second commas, a space (U+0020), the substring before the first comma, and the substring from (and including) the second comma to the end of the string.
For example, the CreatorsName string “Moriarty, G. Andrews, Jr.” contains no explicit natural version. The default is formed by concatenating the text between the commas (“G. Andrews”), a space (“ ”), the text before the first comma (“Moriarty”), and the text from the second comma to the end (“, Jr.”). This results in the correct natural version: “G. Andrews Moriarty, Jr.”.
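The three cases for the default natural version can be sketched together in Python. The function name default_natural is hypothetical, and its argument is assumed to be the bibliographic version already extracted from a CreatorsName, with any escaped characters still in their escaped form.

```python
import re


def default_natural(bibliographic: str) -> str:
    """Hypothetical helper: the natural version implied by a bibliographic
    version when no explicit natural version is given."""
    # Split on at most the first two commas; later commas stay in the tail.
    parts = bibliographic.split(",", 2)
    if len(parts) == 1:
        # No comma: the natural version is the bibliographic version.
        result = bibliographic
    elif len(parts) == 2:
        # One comma: text after the comma, a space, text before the comma.
        result = parts[1] + " " + parts[0]
    else:
        # Two or more commas: the middle part, a space, the first part,
        # then the second comma and everything following it.
        result = parts[1] + " " + parts[0] + "," + parts[2]
    return re.sub(r"[ \t\r\n]+", " ", result).strip(" ")
```

Applied to “Moriarty, G. Andrews, Jr.” this gives “G. Andrews Moriarty, Jr.”. Applied to “Renell, James, Major” it gives “James Renell, Major”, so a form such as “Major James Renell” has to be given explicitly.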
Escaping characters
A Name (used to represent any of the four name variants) is a sequence of characters or escaped characters which does not begin or end with whitespace.
Name ::= NameChar ( ( NameChar | S )* NameChar )?
NameChar ::= ( Char - ( S | ReservedNameChar ) ) | EscapedChar
The characters %, &, @ and | (U+0025, U+0026, U+0040 and U+007C) have specific uses in a CreatorsNameList. If a name naturally contains one of these characters, or any other character matching the ReservedNameChar production, it must be escaped.
The characters !, # and $ (U+0021, U+0023 and U+0024) are assigned no meaning in this standard, but are reserved for future use. The characters { and } (U+007B and U+007D) are used in this standard, but not in a context that would require them to be escaped elsewhere; nevertheless, this standard requires them to be escaped to allow additional future use of them.
ReservedNameChar ::= '!' | '#' | '$' | '%' | '&' | '@' | '{' | '|' | '}'
The % character (U+0025) introduces an escaped character, after which the hexadecimal code point of one character in [ISO/IEC 10646] is given, enclosed in braces (U+007B and U+007D). The code point must be for a single character that matches the Char production. Applications may reject strings that contain escaped characters corresponding to characters that match the RestrictedChar production. Escaped characters may be used to represent any character, not just those matching ReservedNameChar, but this is not recommended; applications may replace any unnecessary escapes with the unescaped character.
EscapedChar ::= '%{' [0-9A-Fa-f]+ '}'
For example, the symbols ! and | are used to write click consonants in a number of Southern African languages. Properly these are written using characters U+01C3 and U+01C0, respectively, but these are missing in many fonts and hard to enter on many keyboards; the similar-looking ASCII characters U+0021 and U+007C are often substituted. Thus the name of Namibian chief ǃNanseb gaib ǀGâbemab may be written “%{21}Nanseb gaib %{7C}Gâbemab”, but a representation using the correct characters (U+01C3 and U+01C0), unescaped, is recommended.
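The escaping mechanism might be implemented along the following lines in Python. This is a sketch rather than a definitive implementation: the names escape_name and unescape_name are hypothetical, and the choice to raise an exception for a code point outside the Char production is an assumption about how an application might handle non-conforming input.

```python
import re

_RESERVED = "!#$%&@{|}"  # the ReservedNameChar production
_ESCAPED = re.compile(r"%\{([0-9A-Fa-f]+)\}")  # the EscapedChar production


def _matches_char(code_point: int) -> bool:
    return (0x1 <= code_point <= 0xD7FF
            or 0xE000 <= code_point <= 0xFFFD
            or 0x10000 <= code_point <= 0x10FFFF)


def escape_name(name: str) -> str:
    """Hypothetical helper: escape only the reserved characters."""
    return "".join(
        "%{{{:X}}}".format(ord(ch)) if ch in _RESERVED else ch for ch in name
    )


def unescape_name(name: str) -> str:
    """Hypothetical helper: expand every escaped character."""
    def expand(match):
        code_point = int(match.group(1), 16)
        if not _matches_char(code_point):
            raise ValueError("escaped code point does not match Char")
        return chr(code_point)
    return _ESCAPED.sub(expand, name)
```

With these definitions, escape_name("!Nanseb gaib |Gâbemab") gives “%{21}Nanseb gaib %{7C}Gâbemab”, and unescape_name reverses it. Unescaping is best applied to an individual Name after the surrounding CreatorsName has been split on its separators.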
If a name naturally contains a comma, the defaulting rules listed above may yield incorrect results. In these cases the explicit forms of those name parts should be given.
The escaped character %{2C} can hide a comma from the defaulting rules.
Stylistic recommendations
This section gives recommendations on how to use the CreatorsName format for maximum interoperability.
It is recommended that diacritics are preserved in all name variants, including the sort data.
For example, the surname Wałęsa should be written “Wałęsa” and not “Walesa”. If desired, an application can strip the diacritics when formatting a citation, but it is not generally possible for an application to restore lost diacritics. Algorithms such as the Unicode Collation Algorithm [UTS 10] can handle the sorting of Unicode data according to the requirements of various locales.
It is recommended that titles and post-nominals be dropped from the authors of published sources unless they are necessary to distinguish two people of the same name.
For example, the natural version of an author’s name would normally be given simply as “George John Armytage”, without any title. However, were one of Princess Michael of Kent’s historical books being cited, the bibliographic version of her name might be given as “Michael, of Kent, Princess”: simply putting “Michael” (or “Marie”) would be ambiguous, and as a member of the British Royal Family she does not normally use a surname.
If initials or other abbreviations are given, it is recommended that they should be formatted with a period (U+002E) and a space (U+0020) after each initial or abbreviation, unless followed by another punctuation mark or the end of the string, in which case the space should be omitted. The non-breaking space character (U+00A0) should not be used to separate initials.
For example, a natural version containing initials is written “A. J. P. Taylor”, and the bibliographic version of the name of former US Supreme Court justice Lewis F Powell Jr is “Powell, Lewis F., Jr.”. Per this recommendation there is a period after “Jr” even though some style guides say it is not needed as the last letter of “junior” has not been omitted. How these are presented in a formatted citation is beyond the scope of this standard, so an application formatting a citation to Taylor’s work would be free not to print the periods or spaces.
When a name contains a single element that is written as two words, but is logically a single, indivisible entity, a non-breaking space character (U+00A0) may be used to separate them.
For example, the name of Henry St. George the younger might be written in a CreatorsName as “St.%{A0}George, Henry, the younger”; the use of the escaped character %{A0} rather than a literal non-breaking space character is not recommended, but is done here for clarity. The suffix “the younger” is retained to distinguish him from his father, who had the same name and had also been Garter King of Arms. When formatting a citation, applications are not required to honour the non-breaking space.
If additional detail has been added to the author’s name that was not present in a published source, it is recommended that square brackets (U+005B and U+005D) are used to enclose the additional content.
For example, where a published source identifies its author only by initials, the CreatorsName could be written “C[okayne], G. E.”.
If it is considered desirable to include the original form of a transliterated name, it is recommended that it is placed after the name in parentheses (U+0028 and U+0029).
For example, “Akiko, of Mikasa, Princess (彬子女王)” shows that Akiko is spelt 彬子 rather than, say, 淳子.
Surname particles
In this section, a surname particle refers to a short word that may appear before the main part of a surname (or other similar name part), and that may or may not be regarded as part of the surname.
It is recommended that surname particles are written in lower case in the bibliographic version if that is the author’s preference or the convention in that culture, even if they appear at the start of a string.
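For example, the bibliographic version of the name of Charles de Gaulle would be “de Gaulle, Charles”, but that of the French author Jean de La Fontaine would be “La Fontaine, Jean de”. This is because it is conventional in French to capitalise “La” but not “de”.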
The process for determining whether a surname particle should be put in front of the surname or after the given name in the bibliographic version is complicated and culture-specific, and outside the scope of this standard. It is the responsibility of the party entering the data to determine the correct form.
Sort data should not be used to suppress surname particles for the purpose of sorting.
References
Normative references
- [ISO/IEC 10646]
- International Organization for Standardization. ISO/IEC 10646-1:2014. Information technology — Universal Coded Character Set (UCS).
- [RFC 2119]
- Internet Engineering Task Force. RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. See http://tools.ietf.org/html/rfc2119
- [UAX 15]
- The Unicode Consortium. UAX 15: Unicode Normalization Forms. Unicode Standard Annex. See http://unicode.org/reports/tr15/
- [XML]
- World Wide Web Consortium. Extensible Markup Language (XML) 1.1. W3C Recommendation. See https://www.w3.org/TR/xml11/
Other references
- [UTS 10]
- The Unicode Consortium. UTS 10: Unicode Collation Algorithm. Unicode Technical Standard. See http://unicode.org/reports/tr10/
- [XSD Pt2]
- World Wide Web Consortium. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Recommendation. See https://www.w3.org/TR/xmlschema11-2/
- [Vocab]
- Family History Information Standards Organisation. Preferred Nature of Vocabularies. FHISO draft policy. See http://tech.fhiso.org/policies/vocabularies