[TSC-public] Format for Raw Source Content

Albert Emmerich albert.emmerich at freenet.de
Fri Jan 9 09:35:46 EST 2015

Hi Tom,
I am at your level 1, and even at that level I want to put the data into a
format standardized by FHISO.

Your example is good: Robert Thomas, where you cannot tell from the source
whether these are two given names or whether a surname is included.
  <originalName> Robert Thomas </originalName>
is what you know. On level 1 there is no way to put anything like "Robert
/Thomas/" or "/Robert/ Thomas" into the data. That is interpretation - the
next step after getting the source into the database. The same applies to
many other features, such as place names, dates in any (possibly unknown)
calendar, occupations... And you may not be able to take that next step
until you have looked at other sources, too, and started to form some
hypotheses.

I like the proposal to use the <originalTag> with any tag where you are not
sure about the substructure, taking the original content from the source
letter by letter. This could be a very good way to transfer data whose
internal structure we are not yet sure about.
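
A minimal Python sketch of that idea, under stated assumptions: the helper
names (original_element, level1_record), the <sourceRecord> wrapper element
and the occupation value are invented here for illustration and are not part
of any FHISO draft. It only shows source text being stored verbatim under an
"original" tag, with no interpretation applied.

```python
import xml.etree.ElementTree as ET


def original_element(tag: str, verbatim: str) -> ET.Element:
    """Wrap verbatim source text in an <originalTag> element, uninterpreted."""
    # "name" -> "originalName"; the naming rule is an assumption of this sketch.
    elem = ET.Element("original" + tag[:1].upper() + tag[1:])
    elem.text = verbatim
    return elem


def level1_record(fields) -> ET.Element:
    """Build a level-1 record: every field is kept letter by letter."""
    record = ET.Element("sourceRecord")  # hypothetical wrapper element
    for tag, verbatim in fields:
        record.append(original_element(tag, verbatim))
    return record


# "Robert Thomas" is the example from this thread; "weaver" is made up.
record = level1_record([("name", "Robert Thomas"), ("occupation", "weaver")])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Any split into given name and surname would be added in a later step, next
to these elements rather than in place of them.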


-----Original Message-----
From: Thomas Wetmore, 9 January 2015 02:20

> And I think the raw data part must be part of the whole FHISO standard,
> otherwise we end up with a lot of incompatible raw data formats, and
> none of them agreed on as the one we use to exchange our data. The
> customizing for source type must be at the software level, not in the
> data format

I agree when I assume by the raw data part you are referring to what I
called level 2 in my post. Apparently I am much more lax about doing some
interpretation when preparing the source content records. For example,
deciding surname, figuring out places when the place names used are odd or
in a non-standard language, converting between calendars, and so on. Yeah,
we must keep the actual, original strings, but the data that is to be used
algorithmically by our software can [should?] be standardized in my opinion
[even at the source content level].

For example, what if you find a record for a person named Robert Thomas
(both are common given names and surnames in English) and because of the
context you are not sure whether the person is named Robert Thomas or Thomas
Robert? I would record the persona like this:

  <name> Robert /Thomas/ </name>
  <name> Thomas /Robert/ </name>
  <originalName> Robert Thomas </originalName>

I’m guessing I would not get a lot of agreement that this is a good idea. I
won’t get into my philosophy of how to record names, as few agree with that
either, but by using this technique I get tremendous advantages. I have the
original name, and I also have two possible names the person could have. In
my software everyone is indexed under all their possible <name>’s, so this
fellow will show up in all searches [and all other algorithms that might be
applied] to either Robert Thomas or Thomas Robert. If I can ever figure it
out later, well maybe I would go back and get rid of one of those names.
It’s not really raw data so it might be fair game to remove upon later
reflection and discovery.
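
The indexing described above could be sketched in Python roughly as
follows. This is only an illustration under assumptions: the Persona and
NameIndex classes and the name_key normalization are invented for this
example and are not taken from the actual software discussed in this post.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Persona:
    original_name: str                          # verbatim string from the source
    names: list = field(default_factory=list)   # candidate "Given /Surname/" forms


def name_key(name: str) -> str:
    """Normalize 'Robert /Thomas/' to 'robert thomas' for lookup."""
    return " ".join(name.replace("/", " ").lower().split())


class NameIndex:
    """Index each persona under every one of its candidate names."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, persona: Persona) -> None:
        for name in persona.names:
            self._by_key[name_key(name)].append(persona)

    def search(self, query: str) -> list:
        return list(self._by_key.get(name_key(query), []))


persona = Persona("Robert Thomas", ["Robert /Thomas/", "Thomas /Robert/"])
index = NameIndex()
index.add(persona)

# The same persona is found under either reading of the name.
found_a = index.search("Robert Thomas")
found_b = index.search("Thomas Robert")
```

Dropping one candidate later only means removing it from names and
rebuilding the index; original_name is never touched.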

> What’s wrong with it?


> Albert

