[TSC-public] Format for Raw Source Content

Thomas Wetmore ttw4 at verizon.net
Thu Jan 8 10:26:49 EST 2015


I have seen no FHISO posts over the holidays. Hope all are healthy and ready for a new year.

I have picked up some old DeadEnds threads to implement a few ideas based on recent discussions. One thing has become clear: we need a format for expressing raw source content [not that this hasn’t been obvious for a long time]. I believe this format should be defined by FHISO and included in the overall FHISO standard.

Based on the DeadEnds XML format, I have created an example showing raw source content taken from a 1905 New York state census. See below. A few things to note about this:

1. I have used XML format and the enclosing entity is named “content”.
2. Content can contain any number of entities. In this example there is a single “cluster” entity.
3. A cluster entity contains a set of entities that were extracted from a source together because they are related to one another in some way. In this case the cluster consists of a census household.
4. Obvious entity types within a source content cluster are event, person and source. All three are in the example below. Some don’t like calling a census household an “event”, so if you’re one of them, call it something else.
5. The person entities are “raw personas” [which implies there are “cooked personas” to contend with later]. Another email will follow up on that.
6. No entity has a “universal” id — the ids that do exist [p1, p2, …] are local within the cluster. They could be reused in other clusters.
7. The source entity is the source for all the other entities [so the other entities need not refer to it].
8. I have captured essentially all the original data from the source here, and interpreted nothing.
9. Well, except I have chosen to use Gedcom name format, where the surname is in slashes — I interpreted the names to the point where I have made that decision.
10. Well, except the birth places were given as “Germany Pol”, so I encoded them as “Germany or Poland.” Those used to dealing with ancestors from eastern Europe are used to these types of issues. Usually this means something like “they were Poles born in Germany,” or “they are Poles born in a region of Germany that used to be part of Poland.” We have to live with these things. [This does indicate the importance of an “OR” operation for genealogical places.]
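
Points 9 and 10 can be made concrete with a small sketch. This is only an illustration in Python, not part of DeadEnds or of any FHISO format: the names `RawName`, `split_gedcom_name` and `place_alternatives` are hypothetical, and splitting a place on the word "or" is an assumption about how an "OR" operation might be encoded.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RawName:
    given: str    # text outside the slashes, e.g. "Frank"
    surname: str  # text between the slashes, e.g. "Grenda"


def split_gedcom_name(text: str) -> RawName:
    """Split 'Given /Surname/' into parts; a name without slashes gets an empty surname."""
    stripped = text.strip()
    match = re.match(r"^(.*?)/(.*?)/(.*)$", stripped)
    if match is None:
        return RawName(given=stripped, surname="")
    before, surname, after = (part.strip() for part in match.groups())
    # Anything written after the closing slash is kept with the given part.
    given = " ".join(p for p in (before, after) if p)
    return RawName(given=given, surname=surname)


def place_alternatives(text: str) -> tuple[str, ...]:
    """Hypothetical OR operation: 'Germany or Poland' -> ('Germany', 'Poland')."""
    parts = re.split(r"\s+or\s+", text.strip())
    return tuple(part.strip() for part in parts if part.strip())


print(split_gedcom_name(" Frank /Grenda/ "))      # RawName(given='Frank', surname='Grenda')
print(place_alternatives(" Germany or Poland "))  # ('Germany', 'Poland')
```

Keeping the alternatives as an ordered tuple preserves what the record said without picking one of the places.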

I am writing the software to “cook” these entities into the form that would be found in a FHISO format suitable for sharing between users and databases. As soon as I have the cooker working I will revisit this example and show the cooked (interpreted, standardized, externalized, universalized, …) form of this same data.

Summary: The important point about “raw” content data is that it be entirely self-contained, make no reference to external entities, and require no magical id values.
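
The self-containment requirement can be checked mechanically. The following is a minimal sketch in Python using the standard `xml.etree.ElementTree` module; it assumes the layout used in this message, where an entity's leading text is its local id and roles refer to persons by that id. The function names and the deliberately broken `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_id(entity):
    """Return the leading text of an entity, used in this layout as its local id."""
    return (entity.text or "").strip()


def check_cluster(cluster):
    """Return a list of problems that would break self-containment of one cluster."""
    problems = []
    seen = set()
    # Person entities are direct children of the cluster; their ids must be unique in it.
    for person in cluster.findall("person"):
        pid = local_id(person)
        if pid in seen:
            problems.append(f"duplicate local id: {pid}")
        seen.add(pid)
    # Every person reference inside a role must resolve within the same cluster.
    for ref in cluster.findall("event/role/person"):
        rid = local_id(ref)
        if rid not in seen:
            problems.append(f"unresolved reference: {rid}")
    return problems


# Deliberately broken sample: p1 is used twice and p2 is never defined.
SAMPLE = """
<content>
<cluster>
<event> census
  <role> head <person> p1 </person> </role>
  <role> wife <person> p2 </person> </role>
</event>
<person> p1 <name> Frank /Grenda/ </name> </person>
<person> p1 <name> Mary /Grenda/ </name> </person>
</cluster>
</content>
"""

for cluster in ET.fromstring(SAMPLE).findall("cluster"):
    print(check_cluster(cluster))
    # ['duplicate local id: p1', 'unresolved reference: p2']
```

Because the check looks only inside one cluster, the same ids can be reused in other clusters without being flagged.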

<content>
<cluster>

<event> census
  <date> 8 June 1905 </date>
  <place> Block J, Election District 4, Ward 11, Buffalo, Erie County, New York, United States </place>
  <address> 13 Geneva Street </address>
  <role> head <person> p1 </person> </role>
  <role> wife <person> p2 </person> </role>
  <role> son <person> p3 </person> </role>
  <role> son <person> p4 </person> </role>
  <role> son <person> p5 </person> </role>
  <role> son <person> p6 </person> </role>
  <role> daughter <person> p7 </person> </role>
</event>

<person> p1
  <name> Frank /Grenda/ </name>
  <color> white </color>
  <gender> male </gender>
  <age> 45 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 14 </yearsInCountry>
  <citizen> yes </citizen>
  <occupation> day laborer </occupation>
</person>

<person> p2
  <name> Mary /Grenda/ </name>
  <color> white </color>
  <gender> female </gender>
  <age> 47 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 12 </yearsInCountry>
  <citizen> yes </citizen>
  <occupation> housework </occupation>
</person>

<person> p3
  <name> ? /Grenda/ </name>
  <color> white </color>
  <gender> male </gender>
  <age> 23 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 12 </yearsInCountry>
  <citizen> yes </citizen>
  <occupation> day laborer </occupation>
</person>

<person> p4
  <name> Frank /Grenda/ </name>
  <color> white </color>
  <gender> male </gender>
  <age> 21 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 12 </yearsInCountry>
  <citizen> yes </citizen>
  <occupation> day laborer </occupation>
</person>

<person> p5
  <name> John /Grenda/ </name>
  <color> white </color>
  <gender> male </gender>
  <age> 14 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 12 </yearsInCountry>
  <citizen> yes </citizen>
</person>

<person> p6
  <name> Peter /Grenda/ </name>
  <color> white </color>
  <gender> male </gender>
  <age> 12 years </age>
  <birth> <place> Germany or Poland </place> </birth>
  <yearsInCountry> 12 </yearsInCountry>
  <citizen> yes </citizen>
  <occupation> school pupil </occupation>
</person>

<person> p7
  <name> Rose /Grenda/ </name>
  <color> white </color>
  <gender> female </gender>
  <age> 6 years </age>
  <birth> <place> United States </place> </birth>
  <citizen> yes </citizen>
  <occupation> at home </occupation>
</person>

<source> New York State Census, 1905 </source>

</cluster>
</content>
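
To show how a program might read a cluster like the one above, here is a rough Python sketch that resolves each role in the event to the persona it names. It is not the DeadEnds reader; the helper names and the "birth.place" key convention are assumptions made for illustration, and the sample is a shortened copy of the first two personas.

```python
import xml.etree.ElementTree as ET


def leading_text(element):
    """Text that appears before an element's first child, stripped of whitespace."""
    return (element.text or "").strip()


def read_personas(cluster):
    """Map each local id to a dict of the persona's raw fields (tag -> text)."""
    personas = {}
    for person in cluster.findall("person"):
        fields = {}
        for child in person:
            if len(child):
                # Nested fields such as <birth><place>...</place></birth> become "birth.place".
                for sub in child:
                    fields[f"{child.tag}.{sub.tag}"] = leading_text(sub)
            else:
                fields[child.tag] = leading_text(child)
        personas[leading_text(person)] = fields
    return personas


def household(cluster):
    """List (role, name, age) for each role in the cluster's events, in document order."""
    personas = read_personas(cluster)
    rows = []
    for role in cluster.findall("event/role"):
        ref = role.find("person")
        persona = personas.get(leading_text(ref), {}) if ref is not None else {}
        rows.append((leading_text(role), persona.get("name"), persona.get("age")))
    return rows


SAMPLE = """
<content>
<cluster>
<event> census
  <role> head <person> p1 </person> </role>
  <role> wife <person> p2 </person> </role>
</event>
<person> p1
  <name> Frank /Grenda/ </name>
  <age> 45 years </age>
  <birth> <place> Germany or Poland </place> </birth>
</person>
<person> p2
  <name> Mary /Grenda/ </name>
  <age> 47 years </age>
  <birth> <place> Germany or Poland </place> </birth>
</person>
</cluster>
</content>
"""

cluster = ET.fromstring(SAMPLE).find("cluster")
for row in household(cluster):
    print(row)
# ('head', 'Frank /Grenda/', '45 years')
# ('wife', 'Mary /Grenda/', '47 years')
```

All values are kept as the raw strings from the source; nothing is interpreted at this stage.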




