[TSC-public] Format for Raw Source Content

Thomas Wetmore ttw4 at verizon.net
Thu Jan 8 13:15:30 EST 2015


The local id’s are certainly one issue. There is a rather large question lurking: “Will FHISO decide that every person record everywhere should be assigned a universally unique id?” I hope and assume the answer will be yes. But why? One reason is convenience: every time we add a record we assign it a UUID to guarantee that the record has a unique id within its local context (database or transmission file). In that case we are just using the UUID as a mechanism to prevent id clashes. Another reason is more global: FHISO might anticipate a future in which every record belongs to a global space of records that all users have access to. Just imagine that every birth record from some register has its own UUID, and everybody who has researched that register and extracted a record from it has a copy of that record with the same UUID. I see this as a golden glow over the horizon that we will never attain, but it is still nice to think about.
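
To make the two uses concrete, here is a minimal Python sketch. It is an illustration only, not anything FHISO has specified. The namespace URL, the entry-key format, and both function names are hypothetical. A random UUID covers the "no clashes in my file" case. A name-based UUID is one possible way for independent extractors to arrive at the same id for the same register entry, assuming they agree on the namespace and on how entries are keyed.

```python
import uuid

# Hypothetical namespace for one register; not an FHISO-defined value.
REGISTER_NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/registers/some-birth-register")


def local_record_id() -> str:
    """Random (version 4) UUID: only meant to avoid id clashes in a database or file."""
    return str(uuid.uuid4())


def shared_record_id(entry_key: str) -> str:
    """Name-based (version 5) UUID: the same entry key always yields the same id."""
    return str(uuid.uuid5(REGISTER_NS, entry_key))
```

Two researchers calling shared_record_id with the same key get the same id without consulting each other, which is the hard part of the "global space" idea: everyone would have to agree on the key.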

Another issue is the lack of interpretation. In the census example I gave, the date of the “event” and the ages of the persons at the time of the event are all given. From these we will want to infer an estimated birth year, so we should add a “birth vital” to each persona based on the birth place (also given) and the estimated birth year. This is not “conclusion” data in the sense that I use the term. It is an intelligent “interpretation” of the raw data to make it more useful for making real conclusions (if you prefer, decisions) later. Some genealogists might want to add “residence” facts to each of the personas as well. Where and when does that interpretation occur?
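
The interpretation step described above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not the DeadEnds software: the Persona class, its field names, and the "abt" date prefix are all hypothetical, and the sample person is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    # Hypothetical record shape, invented for this illustration.
    name: str
    age_at_census: int
    birth_place: str
    facts: list = field(default_factory=list)


def add_birth_vital(persona: Persona, census_year: int) -> Persona:
    """Attach an estimated birth fact derived from the census year and stated age."""
    # The true birth year is this value or one less, so the date is marked approximate.
    estimated_year = census_year - persona.age_at_census
    persona.facts.append({
        "type": "birth",
        "date": f"abt {estimated_year}",
        "place": persona.birth_place,
        "interpreted": True,  # flags the fact as derived, not stated in the source
    })
    return persona


p = add_birth_vital(Persona("Example Person", 34, "Connecticut"), 1850)
print(p.facts[0]["date"])  # abt 1816
```

Keeping the "interpreted" flag on the fact is one way to record that the birth year was inferred from the raw data and is not a conclusion.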

Combining these two points I view the conversion of “raw, uninterpreted, local source content data” into the “interpreted, UUID’ed, content records” as the portal between the world of a data extracting or data indexing organization and the world of full FHISO record-hood. Thought experiment: should organizations who are extracting data be required to assign UUID’s that meet FHISO criteria? I wouldn’t say so. But what if they use software that converts the data they have extracted from their own formats, into standard FHISO personas? Clearly then they should.

Am I being two-faced? Because I am implying both that an organization extracting raw data should not need to use FHISO UUID’s, but also that they should use an XML-based extraction standard promulgated by FHISO. This just points to the importance of the interpretation step in my mind. The conversion of raw source content to interpreted personas (and ancillary records) is for me the entrance point into FHISO.

But clearly data extractors are free to extract in other formats — and they do — they all have their formats. Clearly CSV files (really, any file holding spreadsheet-like data) are used all over the place. I think the real answer will be that we will end up with lots of little conversion programs that will take data from a plethora of existing formats and convert them all to the same FHISO formats.
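
One of those little conversion programs might look like the following Python sketch. It is only an illustration: the column names (name, age, birthplace), the output dictionary shape, and the function name are assumptions, since no FHISO persona format exists yet. The sample rows are invented.

```python
import csv
import io
import uuid


def csv_to_personas(text: str) -> list:
    """Convert spreadsheet-like CSV text into a list of persona dictionaries."""
    personas = []
    for row in csv.DictReader(io.StringIO(text)):
        personas.append({
            "id": str(uuid.uuid4()),  # id assigned at conversion time, not by the extractor
            "name": row["name"].strip(),
            "age": int(row["age"]),
            "birthplace": row["birthplace"].strip(),
        })
    return personas


sample = "name,age,birthplace\nJan Kowalski,40,Poland\nAnna Kowalska,38,Poland\n"
for persona in csv_to_personas(sample):
    print(persona["name"], persona["age"], persona["birthplace"])
```

Note that the extractor's CSV carries no UUIDs at all. They are assigned by the converter, which matches the idea that the conversion step is the entrance point into FHISO.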

Certainly that is what I am experimenting with. I’m trying to track my wife’s ancestors back into Poland, and this is difficult genealogy. We have hundreds of mentions of persons with no apparent rhyme or reason to them. Clearly we are going to start with a monstrous batch of personas and then try to organize them into real individuals. I am using this problem as a test case for my software. The format I gave in the example is a “DeadEnds-like” format for recording raw data. That data goes into software that converts the clusters into “official” DeadEnds event, persona, and source records. Once that software is working a little better than it is now, I will interpret the census record given in the example and show the results. Since we have no FHISO standards to experiment with yet, I am using DeadEnds formats for that purpose.

Tom Wetmore

> On Jan 8, 2015, at 12:23 PM, Luther Tychonievich <tychonievich at gmail.com> wrote:
> That said, is there any reason why a single-source GEDCOM5/GEDCOMX/GrampsXML/... file could not do this job?  In other words, what aspect of "raw source content" makes it need a distinct discussion?  Is it just the local IDs
