[TSC-public] Format for Raw Source Content

Louis Kessler lkessler at lkessler.com
Fri Jan 9 01:49:28 EST 2015

Tom said:
> Does this mean that you and Jan both think that the concept of the household 
> is too “interpreted” to use as a basis in extracting raw data from a census? 

You really want to draw me in on this, Tom?  Well, at least it is down to the basics.

Yes. Any concept is an interpretation. And no matter how obvious the concept may seem to be to one person, another person may interpret it differently. The data needs to be made of just the data and its structure. The structure of the data (i.e., rows, columns, headings, groupings) reflects physical associations in the data. I don't want to see any interpretations of the data or the physical associations in the data. I want it all presented as primary data, i.e. as raw as possible.

So I give it zero tolerance. I don't want the slightest possibility of getting someone else's incorrect interpretation messing me up or preventing me from finding the record because the wrong interpretation got it indexed wrong.

> If you check censuses that organize by household you will find that almost 
> all of them give each household a specific index number, and keep them grouped
> by those numbers. Isn’t that data indexing part of the raw source data?

That's fine. Then household number 50 will have line numbers 305 to 312 associated with it. That association is important. But the meaning of the association should not be interpreted for me.
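
A minimal sketch of that idea in Python, assuming hypothetical field names ("line", "household_no") and made-up rows rather than any real census. The household number is kept as just another recorded value, and the association with line numbers is derived mechanically, with no meaning attached to it:

```python
from collections import defaultdict

# Hypothetical transcribed rows: only the line number and the household
# number as written on the page. Field names are illustrative assumptions.
rows = [{"line": n, "household_no": 50} for n in range(305, 313)]
rows.append({"line": 313, "household_no": 51})

def lines_by_household(rows):
    """Group line numbers by the household number recorded in the source."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["household_no"]].append(row["line"])
    return dict(groups)

associations = lines_by_household(rows)
print(associations[50])  # [305, 306, 307, 308, 309, 310, 311, 312]
```

Nothing here says what a "household" is; the code only records which lines share the same number in the original.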

> You could treat that family index as another field/column for each person, 
> but that’s just the same effect that I’m after with a slightly different 
> organization. Are you saying the household is not a useful concept? If you 
> think it is a useful concept, how would you handle it?

Of course the household is a useful concept. But I want to apply that concept when I am ready to interpret it. I want primary data. I don't want secondary data. 

> When you are doing research and extracting data from a census, do you extract
> the data for every person on the page with your family of interest, or do
> you just extract the data for the family of interest? Or do you do something 
> between the two, maybe extract nearby families that you think might prove 
> of interest eventually.

Now you're egging me on here, Tom. 

The source records I extract will be whatever I think, at the time, may be relevant to my research.

Whatever gets extracted will get added to my genealogy software as a source. The physical groupings in the source, whether they are called a "household", a "line", a "page" or whatever, make up the source records for that source. The data I will include here will be the raw data with physical associations and field names where applicable. Each source record will resemble the original as closely as it can, with the minimal amount of interpretation possible. 

> Just think about what it would mean to extract “just the facts” from a census 
> as a source. Wow.

Yes. Wouldn't it be great to always get the info as close as possible to what is in the original document?

> If you have an event with a date, and the age of a person at that date, you 
> can estimate the person’s birth to plus or minus a year or two. Would you call 
> that interpreted data? Or would you call it just another form of the provided 
> information? Would you include that estimated birth year anywhere? Or would you 
> expect software to infer it when appropriate? If you were to include that 
> estimated birth year anywhere, where would that be?

These to me are all interpretations. They are the current conclusions that I (or some software) am inferring from the raw data. I would include this with my conclusion person. I would state with my conclusion person all the interpretation that I made of the raw data, e.g. my estimate of the birth year was based on blah blah blah. I'd use one of the GEDCOM date forms, e.g. ABT 1850 or BET 1845 AND 1855 depending on what I think fits best - not what someone else interpreted it to be. Or maybe state CAL 1850 or "INT 1850 (assuming 25 years before marriage)" which GEDCOM allows. If I let my software estimate it for me, hopefully the software will use the tag GEDCOM recommends and have it as: EST 1850.
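
As a rough illustration of the software-side estimate, here is a small Python sketch. The function names and the one-year slack are assumptions made for this example, not anything GEDCOM or a particular program prescribes. The output belongs with the conclusion person, not with the source record:

```python
def estimate_birth_year(event_year: int, age: int) -> str:
    """Return a GEDCOM-style EST date from an event year and a stated age."""
    return f"EST {event_year - age}"

def birth_year_range(event_year: int, age: int, slack: int = 1) -> str:
    """Return a GEDCOM-style BET ... AND ... range around the estimate.

    The slack (in years) is an assumed tolerance chosen by the researcher.
    """
    center = event_year - age
    return f"BET {center - slack} AND {center + slack}"

# Hypothetical example: a person listed as age 30 at an 1880 event.
print(estimate_birth_year(1880, 30))  # EST 1850
print(birth_year_range(1880, 30))     # BET 1849 AND 1851
```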

> > This small difference in thinking is the thing I don’t like about Tom's ideas 
> > of personas, because I think of personas as mini-conclusions not belonging with 
> > the source data.
> Given that you don’t like it at the persona level, do you like it anywhere else instead?

I do see why you like the personas. They are mini-conclusions that are bite-sized and people-based. You've expressed the analogy to having them as cards that you can mix and match and collect together the ones you think are the same. Your conclusion people then only need to point to the personas that make them up. And it can even be multi-layer as personas can point to other personas.

I see how that all can work. But no, I don't want it. 

I would sooner just work with the non-interpreted raw data sources as the cards. I want to move those around and find the sources that relate to the conclusions I already have or that allow me to create new conclusions. My conclusions would then link directly to the sources that support (or conflict with) them. My conclusions are not just people. They are also facts, events, dates, places, relationships, groups, notes, stories, and interpretations of photos, postcards and other media.

In my opinion, this is better and I think simpler. Conclusions point directly to the sources, and therefore sources link directly back to the conclusions they pertain to. No intermediate placeholder is necessary.
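
One way to picture the direct linking is the following Python sketch. The class, field names and record ids are hypothetical, chosen only to illustrate the shape of the data. Each conclusion lists the source records that support or conflict with it, and the reverse index from sources back to conclusions is derived from those links:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Conclusion:
    """A conclusion (person, event, date, ...) with direct source links."""
    id: str
    statement: str
    supports: list = field(default_factory=list)   # ids of supporting source records
    conflicts: list = field(default_factory=list)  # ids of conflicting source records

def sources_to_conclusions(conclusions):
    """Build the reverse index: source record id -> ids of linked conclusions."""
    index = defaultdict(list)
    for c in conclusions:
        for source_id in c.supports + c.conflicts:
            index[source_id].append(c.id)
    return dict(index)

# Hypothetical conclusions and source record ids, for illustration only.
conclusions = [
    Conclusion("C1", "birth EST 1850", supports=["S1"], conflicts=["S2"]),
    Conclusion("C2", "resided in household 50", supports=["S1"]),
]
back_links = sources_to_conclusions(conclusions)
print(back_links)  # {'S1': ['C1', 'C2'], 'S2': ['C1']}
```

There is no intermediate object between a conclusion and its sources in this sketch; the back-links are computed, not stored separately.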

