GeneaBloggers

Saturday, 2 January 2016

Anatomy of a Source



What is a source? When are sources independent and when are they not? How do citations describe related sources? These questions may seem to have obvious answers, but a detailed analysis of the relationships is essential when working electronically with sources.

Traditional genealogical work may take these relationships for granted, but in software these considerations are too often reduced to a mere reference-note citation added at the end of some narrative, or to an electronic bookmark attached at some point in a tree. By ‘working electronically with sources’, I mean source-based genealogy using a computer, and so the organisation of entities related to both sources and citations is then a fundamental consideration. The use of computer software as a tool during research, rather than simply for the maintenance of some database afterwards, is still not the norm, but it will be — one day — and not before time.

Figure 1 - Anatomy of a Source.

Background Knowledge and Observations

Let’s begin with some basic knowledge to set the scene. I will frequently refer to terms and concepts from the works of Elizabeth Shown Mills, and also Thomas W. Jones, which I hope will be obvious enough that I don’t have to cite every single occurrence.

Expressed as succinctly as I can, information is semantic data (data with meaning), and a source is the someone, something, or somewhere from which the information was obtained. Citations are statements that identify such sources, and these are most commonly recognised as the sentences used in footnotes/endnotes. In truth, a footnote/endnote may contain multiple citation sentences, each referencing a distinct source, but we’ll ignore that here for simplicity. For the curious, an example may be found at Cite Seeing.

A citation has a number of purposes: intellectual honesty (not claiming prior work as your own), allowing your sources to be independently assessed by the reader, and allowing the strength of your information sources to be assessed. To this end, there are a number of core principles to good citations: identification, description, and evaluation. The identification of the source is what existing software tends to focus on most, and it amounts to naming it and citing its location. If we include the location of the specific information within the source then these details are conveniently summarised as the five Ws: who, what, when, where-is, and where-in.

One of the most important mechanisms in a citation sentence is that of layers: the segments, separated by semicolons, that are used to describe the provenance of the source information, or of the source itself, and to provide analytical notes. When only one copy of a source exists, or when it is very rare with only a few copies existing, then citing the repository is the correct thing to do — this contrasts with widely published materials such as books and newspapers. When a source identifies the source of its own information, that earlier source is usually termed the source-of-the-source, and the corresponding layer would indicate this with a preceding “citing” or “the author cites”. However, provenance may have been determined by independent means, or there may be a mixture, as in this example:

“Literary Miscellanea: Sketch of a Railway ‘Navvy’ ”, book extract, Bath Chronicle and Weekly Gazette (23 June 1859): p.6; citing: [Samuel Smiles,] The Story of the Life of George Stephenson [, Railway Engineer (London: John Murray, 1859)]; abridged by the author from the original and larger work: The Life of George Stephenson (1857).

A point to note is that a layer isn’t just another source; it could also be a repository, it could be provenance details, or it could be analytical notes. This means that layers cannot be modelled by simply linking software entities that describe sources.

In one of the papers submitted to FHISO[i], a set of desirable properties was presented for citations. One of these (section 2.10), described as Canonical, suggested that citations should be one-to-one with sources. Expressed symbolically for two sources, S1 and S2, and their respective citations, C(S1) and C(S2), this would mean that S1=S2 and C(S1)=C(S2) would be equivalent statements. However, real-life citations can never be that precise.

A case where this immediately breaks down is when the same citation is expressed in different languages for users from different locales. In other words, even for two absolutely identical sources, with identical provenance and evaluation, there could be several different citations required for the same information. Going further, there may be a need to support different citation styles (e.g. CMOS, MLA), or different variants of a reference note for the first and subsequent usage. These facts mean that any organisation of software entities must support a one-to-many relationship between sources and citations.

Real sources are not all equivalent, even when the associated information has a common origin. Sources may be categorised as original, derivative, or authored works. The latter is effectively a hybrid: original in its opinions and conclusions, but derivative in its information. An original is material in its first oral or recorded form, and a derivative source is one produced by copying an original, or by manipulating its content, such as transcripts, abstracts, extracts, translations, and databases. Image copies are a special sub-category of derivatives that include digital scans and photographic facsimiles. Since they should capture the information exactly as it was, they are often treated the same as originals. However, they are still technically derivatives since the contrast may be lacking, or a film may be scratched, or the resolution too low; even the loss of colour in a monochrome image may have removed essential information. This doesn’t necessarily mean that a copy is always poorer than an original since you could have a damaged original and a copy that was made before the damage occurred — each case has to be evaluated on its own terms.

One of the really difficult areas to deal with — and one I’ve been slowly building up to with this summary of basic knowledge — is where derivatives were formed by manipulation of the content. This is because there are so many ways that they can be formed, and the associated chain of provenance can be difficult to determine. This subject was recently covered by Sue Adams in a series of blog-posts culminating in The Original in Context. For my own example, I’m choosing the baptism of an Amelia Kirk at Nottingham St Mary in 1809. The Nottinghamshire Archive has the original parish registers, and also the bishops’ transcripts: those hand-copied derivatives that were sent to the diocesan centre each month. The archive also has image copies of the parish registers on microfiche. The Nottinghamshire Family History Society (NottsFHS) has a searchable database of transcribed details that may be purchased on CD. Findmypast has incorporated a copy of the NottsFHS transcriptions into their own online databases. Ideally, the information in these derivative forms should be identical to that in the original, but that would be a rare event indeed.

Given that the information should (in principle) be the same, and ideally will not differ too much, how best would copies of alternative derivatives be organised so that they can be worked on (electronically) together? The derivation path may not be a straight line — there could be branches — and so you may have different copies that you want to compare and contrast. Treating them as wholly independent would be both wasteful and inaccurate. In other words, the required organisation must acknowledge the common origin of the information while still allowing the sources to have their own citations, their own evaluations, and their own resources (digital images, paper-based images, textual transcriptions, etc.).

There are two main ways that source references can be related to similar source references: by containment and by derivation. We’ve just looked at the derivation case where the information has a common origin but has been copied or manipulated along the way; containment is the case where you’re looking at a part of a larger source unit. The most common example of this may be when referring to specific pages or chapters in a book, but it can apply to many sources, including separate households or schedules on a given census page. One requirement, here, is that it must be possible for the software to know that two source references are to parts of the same unit (or item, in archival terms), or that one of them is, itself, a reference to that larger unit. For example, that two source references are to pages in the same book, or that one is to the book as a whole.

When working with information from different parts of a source, each distinct where-in reference will have an associated context which must, at least, specify the where and when. For instance, if citing pages from a biography then they may mention the subject when he was in a particular city, and during a particular time frame. In a census page, information for two different households would have been taken at the same time but they would have distinct addresses. This ability to dissect a source, and to characterise references according to their context, is essential if assimilating information for later analysis.
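To make these two requirements concrete, here is a minimal sketch (in Python, purely illustrative and not tied to any particular data model) of how software might record that two source references belong to the same larger unit, while giving each its own where/when context. All of the names, and the example census identifiers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceRef:
    """A reference into (or at) a larger source unit; purely illustrative."""
    unit_id: str                  # identifies the parent unit, e.g. a book or a census page
    where_in: Optional[str]       # e.g. "p. 123" or "schedule 12"; None means the unit as a whole
    place: Optional[str] = None   # the 'where' context of the information at this point
    date: Optional[str] = None    # the 'when' context of the information at this point

def same_unit(a, b):
    """True if both references point into (or at) the same source unit."""
    return a.unit_id == b.unit_id

def is_whole_unit(ref):
    """True if the reference is to the unit as a whole rather than to a part of it."""
    return ref.where_in is None

# Two households on the same (hypothetical) census page share a unit but have distinct contexts.
h1 = SourceRef("census-1861-page-45", "schedule 12", place="Vernon Street", date="7 Apr 1861")
h2 = SourceRef("census-1861-page-45", "schedule 13", place="Peveril Street", date="7 Apr 1861")
assert same_unit(h1, h2) and not is_whole_unit(h1)
```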

Implementations

I’ll first look at STEMMA’s source-citation relationships. As of V4.0, the Source entity connects to multiple Citation entities and/or multiple Resource entities (e.g. media files). This structure has gradually evolved through trying to model my existing source-based research.

Figure 2 - STEMMA Source entity.


Looking at each of the functional issues mentioned above:

Support for derivatives. The Source entity brings together source information that has a common origin so that it can be compared and contrasted, and generally assimilated in one place. In other words, it does not represent one unique source, but sources with a common information origin. The context section embraces citations for all of those corresponding sources. Scheduled for V4.1: requiring the <Quality> setting to be in the respective <Frame> elements.

Support for containment. The SourceLet sections allow the dissection of a source into parts with a related context, and the associated citations would refer to specific where-in parts of the sources already identified outside of the SourceLets. Each SourceLet can also provide where and when contextual details for the associated information. The option to have two tiers of SourceLets — one for related derivatives and one for those specific where-in parts — was not taken, for the sake of simplicity. Scheduled for V4.1: a ‘WhereIn’ attribute on selected optional citation-elements so that a single Citation entity (with a unique URI) can model references both to the source as a whole and to specific parts of it.

Support for citation language, modes, and styles. The Citation entity supports sets of preformatted citation text strings in alternative languages and citation modes (e.g. first reference note and subsequent ones). There is currently no support for alternative citation styles such as CMOS or MLA. These preformatted strings are all optional, but the same Citation entity also supports a mandatory set of citation-elements, such as author, title, and publisher in the case of a book. These may be tagged with semantic types from alternative taxonomies, such as Dublin Core, and inclusively so if desired.

Support for analysis. Terms such as quality, reliability, and credibility may be used to describe a source or some information obtained from it. Analysis of the source information as a whole, including the derivatives with the same origin, is all done within a single Source entity. Although comments and observations on those individual derivatives can still be made, they are not divorced from the context of that shared provenance.

Support for layers. The Citation entity models the layers in a citation using its parent hierarchy (see below). This is possible because its extreme flexibility allows it to model any of the following: a simple citation, a repository, provenance details, and even attribution. The fact that the layers are supported by the Citation entity, and not by the Source entity, is dictated by the observation that layers are not just other sources.
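The following is a rough structural sketch of those relationships, written as Python dataclasses purely for illustration; it is not STEMMA’s actual XML syntax, and the field names are my own shorthand for the entities and sections just described.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Citation:
    """One citation layer; further layers are modelled by chaining to a parent Citation."""
    elements: dict                       # citation-elements, e.g. {"author": ..., "title": ...}
    layer_type: str = "Citing"           # e.g. Citing, Repository, Transcription (see the table below)
    parent: Optional["Citation"] = None  # the next layer out, if any
    formatted: dict = field(default_factory=dict)  # optional preformatted strings keyed by (language, mode)

@dataclass
class SourceLet:
    """A dissected part of the source, with its own informational context."""
    where_in: str                        # e.g. "p. 6" or "schedule 12"
    place: Optional[str] = None
    date: Optional[str] = None
    transcription: Optional[str] = None

@dataclass
class Source:
    """Assimilates sources whose information has a common origin."""
    citations: list                                  # one Citation chain per derivative/copy
    resources: list = field(default_factory=list)    # images, documents, artefacts, etc.
    sourcelets: list = field(default_factory=list)   # dissected parts, each with its own context
```

On this picture, a parish register entry and an online transcription derived from it would appear as two Citation chains within the same Source, rather than as two unrelated sources.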

As of V4.0, the citation layers may be characterised according to the terms in the following table. Scheduled for V4.1 is an additional Reworked category.

Layer-type     Comments
Abstract       A brief summary or a précis of --
Citing         Information cited by the source. Source-of-the-Source.
Comment        Analytical comments.
DB             Database extract (usually cited in first layer)
DBImages       Database extract with images
Extract        Extracted portion from --
ImageCopy      Scan, photocopy, photograph, etc.
MediaCopy      Media conversion from --
Provenance     Other provenance information, differing from ‘Citing’.
Repository     Location of source.
Transcription  Transcribed details from --
Translation    Translated details from --


In GEDCOM, there are Repository, Source Repository Citation, Source Record, and Source Citation records. These are designed to implement a relatively straightforward, but limited, citation→source→repository model, with scope for supporting bibliographic citations — not for source analysis or source-based genealogy.


GEDCOM-X has a wider focus than GEDCOM, although the current draft specification is known to be incomplete in this area at the time of writing.

Figure 3 - GEDCOM-X SourceDescription entity.


Although the SourceDescription relates to a unique source, there are sets of linkages to other sources related by derivation and by containment. It’s probably too early to see how this would work in practice, but the fact that ‘analysis documents’ are tied to individual SourceDescriptions would suggest that some collective analysis of derivatives would be difficult to organise. Also, the treatment of parts of a source (i.e. containment) by distinct SourceDescriptions would seem to compound the issues of collective analysis.

There is no specific support for layers; they cannot be handled by those derivation links since, as I’ve already mentioned, layers are not just other sources. Each SourceDescription may have multiple citations supporting different languages but there is no explicit consideration of alternative citation modes (e.g. first/subsequent), or styles (e.g. MLA). The SourceCitation entity is not hierarchical, and there are currently no citation-elements in either SourceCitation or SourceDescription.

The SourceReference appears to be provided solely to allow the attribution information of a SourceDescription to be overridden.

Relevance to the Reader

While software developers may understand what I’ve written here, I know there will be a significant number of other people thinking ‘I don’t get it. My software already handles sources’. There’s probably a good reason for this, and I want to mention it in rounding off this article.

A scenario that those same people might associate with more closely is adding a marriage date to their tree. They might have found a record with the date and place recorded, so they add it to their tree and then include a citation for the record — or an electronic bookmark if it was found in the online databases of the tree’s host. Where does my “organisation of entities related to both sources and citations” fit into that? Well, in that scenario, it has little relevance. As my second paragraph suggested, most software only thinks about sources and citations as they appear in this limited type of scenario. But if we’re doing source-based genealogy then that organisation is a fundamental requirement.

Traditional genealogists — and good researchers everywhere — would look at more than the one date and place mentioned in such a source; they would look at other information, and the associated context of that information. If anything caught their eye as potentially significant, or as requiring further research, then they would note it. That may happen in their heads (yikes!), or be written with pencil & paper, or captured on their computer using a text editor (e.g. Notepad) or some rich-media editor (e.g. Evernote). That note-taking is essentially what I mean by “assimilation” of the source information, and my approach to source-based genealogy is just incorporating the note-taking and the initial analysis into my main genealogy program. There’s no onus on me to draw conclusions, or to attach any of it to a tree; it is a working area where sources can be dissected, and information partially digested, so that I can find it and use it later.

I personally find this very natural since my career has hitherto involved developing and using cutting-edge software, but I also appreciate that the majority would not feel as comfortable relying on software to this extent — especially when most of it appears to be form-fill data entry of conclusions, with little support for basic methodology or for representing real-life scenarios.


[i] Luther Tychonievich, "Desirable Citation Properties", FHISO Call For Papers, CFPS 112 (http://fhiso.org/files/cfp/cfps112.pdf : accessed 15 Dec 2015); this paper was not listed on the main 'papers received' page (http://tech.fhiso.org/cfps/papers) but was referenced from other papers.

Saturday, 19 December 2015

Organising Photographs



The question of how to organise your photographs in your file store, or even your general genealogical files, is a frequent one. Everyone has their own preferred scheme, but I want to try and add a different perspective on this subject. The suggestions I will make will use Windows as an example, but similar techniques will be possible elsewhere.

The ultimate brick wall that everyone will stub their toes on is that there are multiple ways of categorising their photographs, and a simple filename cannot adequately embrace them all. For instance, naming them by person (but which person if there’s a group), or by surname (again, which surname in, say, a wedding group), by event, by place, or by date. Every attempt to achieve this using just the filename will be a compromise of some sort.

Figure 1 - There are different ways of grouping the same pictures.

I have written, before, that my own choice is to group the files by their provenance, and then rely on a software application to present them in other ways, and to associate them with descriptions, stories, timelines, and so on: Hierarchical Sources. There are some issues with using a specialised application, but we will come to that in a moment.

Keywords

Another option is to use keywords, such that each picture can have an arbitrary number of keywords relating to personal names, surnames, places, events, dates, or whatever you want. This is good for finding related pictures in a large set, but is it ideal for organising them in a browsable way? Although products such as Adobe Photoshop Lightroom organise pictures by keyword, this is just another type of specialised application, and so has the same issues that I hinted at above; the keywords really need to be a core feature of the operating system.

Windows has a feature called Tags: effectively user-defined keywords that can be added to your files, but prior to Windows 7 only Microsoft Office documents supported them, and there weren’t many tools to make use of them. Before I talk about them, let me first present a bargain-basement equivalent that would work under, say, Windows XP (yes, there are still people who use XP). For illustration, let’s assume we have a folder with the following three image files in it:

Ann_Jones_1985.jpg
Jane_Smith_1980.jpg
Joan_James_1983.jpg

By dividing-up a filename using a character such as underscore, you’re effectively providing sets of keywords. Their order is not really important as files can be found no matter where the relevant keywords appear. Words can be grouped together to create compound keywords by using a different character, such as Joan-Smith_John-James_Marriage_1970.jpg.

The old Windows XP Search box could search on multiple filename parts, separated by semicolons, and this would achieve a search-by-keyword.

Figure 2 - Searching by keyword under Windows XP.

In this example, the search is looking for all files with either “Jane” or “James” in their name. The actual ordering of the keywords in a filename might be chosen so that the default sorting achieves some vaguely useful grouping.
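For anyone who wants the mechanics spelled out, here is a small Python sketch of this bargain-basement scheme: it splits a filename into its underscore-separated keywords, and mimics the semicolon-separated, case-blind substring search just described. It is only an approximation of what the XP Search box actually did.

```python
from pathlib import Path

def keywords(filename):
    """Underscores separate keywords; hyphens bind words into a compound keyword,
    e.g. 'Joan-Smith_John-James_Marriage_1970.jpg' -> {'joan-smith', 'john-james', 'marriage', '1970'}."""
    return {part.lower() for part in Path(filename).stem.split("_") if part}

def xp_style_match(filename, query):
    """Semicolon-separated terms; any substring match in the filename counts (case-blind)."""
    name = filename.lower()
    return any(term.strip().lower() in name for term in query.split(";") if term.strip())

files = ["Ann_Jones_1985.jpg", "Jane_Smith_1980.jpg", "Joan_James_1983.jpg"]
print([f for f in files if xp_style_match(f, "Jane; James")])
# -> ['Jane_Smith_1980.jpg', 'Joan_James_1983.jpg']
```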

So how did this change under Windows 7? For a start, its new-style Search box allows Boolean operators so that you can now type “Jane OR James” (equivalent to the XP example, above) or “Jane AND James” (for which none of the three example files would have matched).

Another change in Windows 7 was that the support for file Tags was greatly increased. The file Properties dialog, on its Details tab, will show a Tags field if the current file-type supports them. Clicking to the right of the Tags label allows you to enter multiple keywords, separated by semicolons, and these are then hidden away inside the file’s meta-data.

Figure 3 - Entering keyword Tags in Windows 7.

In the Search box, where we had previously typed separate filename parts, we can now use terms such as “tag:Jane”, and it will then search for files with those Tags rather than ones with particular words in their filenames.

Figure 4 - Searching by Tags in Windows 7.

Again, we can use the Boolean operators to say something like “tag:Jane OR tag:James”. OK, so what are the advantages of this scheme over the bargain-basement one using just the filename? Both schemes allow Boolean operators, and both operate case-blind. However, those Tags are discrete items of meta-data and so leave you to name the file any way you want. Also, the Tag names are matched as complete words and so there’s no risk of an accidental match, such as “Ann” matching “Anne” and “Anna”, etc.
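For completeness, those Tags can also be read programmatically. For JPEG files, Windows writes them into the file’s EXIF meta-data in the XPKeywords field (tag 0x9C9E, encoded as UTF-16), so a rough Python/Pillow sketch along the following lines should recover them. This assumes the Tags were indeed written by Windows into that field, and Pillow’s handling of the raw value varies between versions.

```python
from PIL import Image

XP_KEYWORDS = 0x9C9E  # EXIF tag where Windows stores its 'Tags' for JPEG files

def read_tags(path):
    """Return the semicolon-separated Windows Tags stored in a JPEG, if any."""
    raw = Image.open(path).getexif().get(XP_KEYWORDS)
    if raw is None:
        return []
    if not isinstance(raw, (bytes, bytearray)):   # some versions return a tuple of ints
        raw = bytes(raw)
    text = raw.decode("utf-16-le", errors="ignore").rstrip("\x00")
    return [t.strip() for t in text.split(";") if t.strip()]

print(read_tags("Jane_Smith_1980.jpg"))  # e.g. ['Jane', 'Smith', '1980']
```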

Windows 7 also allows you to sort your files by their Tags — look on the View menu, under Sort by — but keywords are still primarily a way of finding content rather than presenting it. If the advantages are so great for organising pictures, or any files, by their provenance, and for relying on a specialised application to present them in a much richer fashion — with the added context of stories, timelines, and so on — then why don’t we all do it that way?

This subject came up in a Google Hangout in the DearMYRTLE's Genealogy Community, hosted by Pat Richley-Erickson (aka DearMYRTLE), on 19 Jan 2015. Twice during that Hangout — once at 15:00 into the recording, and then later at 35:40 — Pat made the astute observation that relatives (and especially the younger ones) will just want to browse some “cool old photographs” and not mess around with a specialised application. It’s sad but true that if there isn’t a description directly visible when they open the file then they won’t find the details. Remember that in the traditional family albums there would usually have been something written under each picture.

The technology is there to put a description inside each picture — in that same meta-data area where the Tags live — and this could even include an optional “wire frame” diagram that could be overlaid to identify individuals in the picture. That could have relevant links for each of those people to the data held in your specialised genealogy application. You would probably have to write your own picture-viewer application in order to see all that content, but you would then be back to the same problem again.

Proxies

When you click on a file, your operating system checks what application is registered for opening a file of that type. Although you may use the same application for all your image file-types, it is possible to make the association type-specific; for instance, using Microsoft Paint (mspaint.exe) for *.bmp files and Microsoft Office Picture Viewer (ois.exe) for *.jpg files. However, each association is fixed for a given file-type.

It is possible, though, to go via an intermediary application to make an intelligent choice for you. This would mean the vendor of your genealogy application producing a very tiny proxy application that looks at the image file you’re trying to open, and then determines whether to load it in the default image viewer (for that file-type) or in their own genealogy application.

Figure 5 - Using a proxy viewer.

I have written a sample C program that demonstrates this principle using alternative viewers for plain text files: proxy.c. This looks to see if the image filename ends in a genealogical identifier of the form “ID-identifier” (e.g. Joan_James_1983_ID-1AF92G.jpg). It would be just as feasible to involve the folder path in its decision-making, or even to look inside at the file’s meta-data for an identifier there, but this scheme was simpler.

If this sample proxy finds such an identifier then it launches a specialised viewer with the arguments <filename> <identifier>, and in all other cases it launches the default viewer with the single argument <filename>. When configured correctly, those young relatives could happily click on images anywhere on the computer, and each would open in the appropriate viewer depending on whether or not it is part of your genealogy collection.

Yes, this would need some help from your genealogy application to ensure that your files have the correct identifier in their name, and hiding that information inside each file’s meta-data would be cleaner. What about the configuration, though? The proxy has to take over a number of file associations (one for each of the image types you’re interested in), and remember what their default viewers were so that it can invoke them when necessary. Well, that turns out to be quite easy: during installation, it would simply displace the existing default viewer for each file-type, and pass that to the proxy as either another argument or via a command-line option. This also serves as a way of saving the file-path of each default viewer. A later uninstall would then have those previous file-paths available so that it could put things back exactly as they were.
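As an illustration of the idea (and only that; the author’s actual sample is the C program proxy.c linked above), a Python analogue of the proxy’s decision-making might look like this. The viewer paths are placeholders for whatever an installer would have recorded, and the “_ID-identifier” filename convention is the one from the example above.

```python
import re
import subprocess
import sys

# Placeholder paths; a real installer would record the displaced default viewer here.
GENEALOGY_VIEWER = r"C:\Genealogy\GeneaViewer.exe"
DEFAULT_VIEWER   = r"C:\Windows\System32\mspaint.exe"

def open_image(path):
    """Route an image to the genealogy viewer if it carries an identifier, else to the default viewer."""
    match = re.search(r"_ID-([A-Za-z0-9]+)\.[^.]+$", path)   # e.g. Joan_James_1983_ID-1AF92G.jpg
    if match:
        subprocess.Popen([GENEALOGY_VIEWER, path, match.group(1)])
    else:
        subprocess.Popen([DEFAULT_VIEWER, path])

if __name__ == "__main__":
    open_image(sys.argv[1])
```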

This approach can also be applied to non-image file-types, such as Word documents. This could make the difference between a machine that happens to hold your genealogy data, and a “genealogy machine”. Who knows, maybe someone will do this now.

Thursday, 10 December 2015

Our Days of Future Passed — Part III



This is the final part of my trilogy on the philosophy behind STEMMA V4.0. Part I covered its application to both arboreal (tree) genealogy and event-based genealogy, while Part II covered narrative genealogy; I now want to expand on its support for source-based genealogy.

Source-based genealogy is both a research orientation and an organising principle where the source is the primary focus. The majority of software, and especially Web sites, are focused on conclusions; users are asked to provide names, dates, and locations without having to indicate where or how their information originated. At best, they might be given the opportunity to retrospectively add some citation or electronic bookmark. When starting with the source, though, all the relevant resources (images, documents, artefacts) can be organised according to the source provenance and structure, a citation can be created as soon as information is acquired, and the information can be assimilated before you decide how and where to use it.

People like Louis Kessler have advocated source-based genealogy for several years, and the term itself has displaced the more naïve notion of evidence-based genealogy. Since evidence is only information that we believe supports or refutes some claim then a focus on that alone would ignore the source of the information, and any context or other information therein. For instance, imagine that you have used certain information from a source to substantiate a particular claim. What happens if you later feel that the same source might help with a different claim, made elsewhere? Do you have to assimilate the contents all over again in order to decide whether that’s true or not? How would you become aware of its possible relevance?

Link Analysis

Let’s think how source-based genealogy might work, conceptually, and especially the assimilation phase. Anyone who remembers studying text books in preparation for an examination may also recall annotating pieces of text: underlining phrases or circling sections that we believed were going to be important, and which we wanted to ensure that we fully grasped. What we were doing is reinforcing our mental model, or mind-map, and creating structure and order from the text.

I’m sure you’ve all seen detective films, or TV series, where someone solves a complex puzzle using notes and images on a pin-board with string connecting the pieces together. This technique really does exist, and it’s called link analysis. It’s a type of graphic organiser used to evaluate relationships (or connections) between nodes of various types, and it has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, and medical diagnosis. Although most online sources present it in terms of software implementations, it is much older, possibly going back before WWII.[1]


Figure 1 - Conceptual link diagram.

This rather imaginative depiction of the concept illustrates some of the advantages and benefits quite nicely. The information of interest in each source is somehow marked-out, and used to build any of: concepts, prototype subjects[2], conjectures, and logic steps — whatever the researcher wants from them. Because of the freedom in choosing those pieces, this method is as useful to my non-goal-directed Source Mining as it is to goal-directed research. Essentially, one approach is aimed more at collecting material to paint a history or biography of subjects, whereas the other is focused on solving a specific problem or proving one-or-more claims.

When those pieces are connected all the way up to the conclusion entities in your data, then it also provides a trail by which some user could drill-down[3] in order to see how a conclusion was arrived at, what information was used as evidence, and where that information originally came from.

When a marked item is a person reference then it is effectively the same as what’s often termed a persona, and the ability to connect personae from different sources provides a way of supporting multi-tier personae. Although I have previously been critical of the use of personae because of the separation of the person reference from its original source context — including place, date, and relationship context that might be instrumental in establishing the identity behind the reference — this approach retains a connection to the source information, and even allows it to follow the subsequent use of a persona. This is particularly important because it appears to be an inclusive handling of certain contrasting approaches advocated by other data researchers: Tom Wetmore has long believed in personae, and Louis Kessler in source-based genealogy, but those approaches have sometimes appeared to be contradictory and have led to strong differences of opinion.

Source and Matrix Entities

Part I of this series introduced STEMMA’s Source entity as joining together citations and resources (such as transcriptions, images, documents, and artefacts) for a given source of information, but it encompasses much more than this. The semantic mark-up, described in Part II, allows arbitrary items of information to be labelled in a transcription. This includes subject references (person, place, animal, and group), event references, date references, and any arbitrary word or phrase. Those labelled items, called profiles, can be linked together using simple building-blocks in order to add structure, interpretation, or deduction to them. STEMMA doesn’t mandate the use of a visual link chart, or link diagram, since that would be something for software products to implement, but it does include the essential means of representation.

The Source entity normally specifies the place and date associated with the information, but these can also be linked to profiles if they have to be associated with some interpretation, or even some logic. For instance, when information relates to a place whose identity cannot be resolved beyond doubt from the mere place reference.

When a source comprises discrete or disjointed parts — such as a book’s pages, or a multi-page census household — or it contains anterior (from a previous time) references — such as a diary, chronological narrative, or recollections during a story — then smaller sets of linked profiles can be grouped within the Source entity using SourceLet elements, and these may have their own place and date context. Each of those discrete parts may have its own separate transcription and specific citations, although they’re related to the same parent source by containment — a subject for a future presentation. The network of linked profiles can bring together information and references from these different SourceLets for analysis.

The Source entity is a good tool for assimilating the information from a given source in a general and re-usable way. However, that information may need correlating with similar information from other sources, and this process may need to be repeated for different problems and with different goals. STEMMA accomplishes this with a related Matrix entity[4] that carries those networks of linked profiles outside of their source context and allows them to be worked on together.

Figure 2 - Mechanics of a link diagram.
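As a purely illustrative sketch of those mechanics (not STEMMA syntax, and with clearly hypothetical example data), the profiles and links might be represented along these lines, with a matrix-like working set carrying profiles drawn from more than one source:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """A marked-out fragment of source information: a subject reference, a date, or any phrase."""
    kind: str    # e.g. "person", "place", "date", "text"
    text: str    # the fragment as written in the source
    note: str = ""                               # written reasoning attached at this point
    links: list = field(default_factory=list)    # connections to other profiles

def link(a, b, note=""):
    """Connect two profiles, recording the written reasoning for the connection."""
    a.links.append((b, note))
    b.links.append((a, note))

# A person reference from the 1809 baptism example, and a hypothetical later reference;
# joining two such prototype-person profiles is the equivalent of a multi-tier persona.
baptism_ref = Profile("person", "Amelia Kirk", note="baptism, Nottingham St Mary, 1809")
later_ref   = Profile("person", "Amelia Kirk", note="hypothetical later record, for illustration only")
link(baptism_ref, later_ref, note="same name and parish; the identity argument would be written here")

# A matrix-like working set correlating profiles drawn from different sources.
matrix = [baptism_ref, later_ref]
```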

Notice, as usual, that the building of these networks, and the association of them with corresponding conclusion entities, is independent of the relationship those entities have with their respective hierarchies and events (Part I), and narrative (Part II). In other words, the four main approaches to genealogy that I identified (arboreal, event-based, narrative, and source-based) can be inclusive of each other.

Compare and Contrast

Part I introduced the concept of Properties — items of extracted and summarised information — that could be associated directly with subject entities. The Source entity, which is part of the informational sub-model rather than the conclusional sub-model, can also achieve this but with additional flexibility. Because it is working directly with source information, and not obliged to make any final conclusions, it means that references to incidental people, or otherwise unidentified subjects, can still be assimilated but left in the Source entity for possible future use. The power of this should be obvious where, say, a referenced person later turns out to be a relative or in-law. Another difference is the vocabulary used to describe data and relationships. The Property mechanism uses a normalised computer vocabulary so that information can be consistently categorised as name, occupation, residence, etc., and relationships can be categorised precisely as things like spouse, mother, son, and so on. In the Source entity, though, what you record and what you call it are free choices; if you encounter a relationship provided as grandchild, nephew/niece, or cousin, where the interpretation may not be obvious, then you can keep it as-written and work on it. For the masochists, a comparison of these two mechanisms being applied to the same source may be found at: Census Roles.

It might be said that ignorance of prior work is bad during any type of research and development, but my software history is full of such cases where it has yielded a route to genuine innovation.[5] When I finally decided to look at whether other data models had addressed this approach, I was surprised to find that the old GenTech project from 1998 had documented a similar approach. GenTech and STEMMA both tried to build a network of extracted information and evidence from “source fragments” — and actually used this same term — but the similarities applied mainly to the intention rather than to the implementation.

The GenTech data model V1.1 is hard to read because it has no real examples, it presupposes a concrete database implementation — which I’m not alone in pointing out to be inappropriate in a data model specification — and it talks exclusively about evidence, and analysing evidence, rather than information. The latter point is technically incorrect when assimilating data from source fragments since the identification of evidence, or the points at which information can be considered evidence, is dependent upon the researcher and the process being applied rather than some black-and-white innate distinction.

GenTech’s ASSERTION statement is the core building-block for its network. This is simply a 5-tuple[6] comprising {subject-1 type/id, subject-2 type/id, value} that relates two “subjects”. Those subjects are limited to its PERSONA, EVENT, GROUP, and CHARACTERISTIC entities — concepts which differ from STEMMA’s use of the same terms — and there are some seemingly arbitrary rules for which can be specified together. This restricted vocabulary means that it does not clearly indicate how its CHARACTERISTICs are associated with a particular time-and-place context (I couldn’t even work out how); it has no orthogonal treatment of other historical subjects (STEMMA terminology), such as place, group, or animal; and it cannot handle source fragments with arbitrary words and phrases. By contrast, STEMMA’s profiles can deal with source fragments containing references to persons, places, groups, animals, events, dates, or arbitrary pieces of text. Its <Link> element is the low-level building-block that connects these together, and to other profiles, but with much more freedom. For instance, the equivalent of a multi-tier persona is achieved by simply connecting two prototype-person profiles. GenTech uses its GROUP entity to achieve this, and effectively overloads that entity for grouping PERSONA and CHARACTERISTIC records rather than using it only to model real-world groups.
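To make the structural contrast concrete, here is the shape of the 5-tuple as described above, next to a note on the freer profile link. This is only an illustration of the shapes involved, not either model’s actual schema, and the example values are hypothetical.

```python
from collections import namedtuple

# The shape described for GenTech's ASSERTION: two typed subjects plus a value.
Assertion = namedtuple("Assertion",
                       ["subject1_type", "subject1_id", "subject2_type", "subject2_id", "value"])

# Hypothetical example: attaching a characteristic to a persona; the subject types are
# restricted to PERSONA, EVENT, GROUP, and CHARACTERISTIC.
a = Assertion("PERSONA", "P1", "CHARACTERISTIC", "C7", "occupation as written in the source")

# A STEMMA-style link, by contrast, can join profiles of any kind (including arbitrary text)
# and carry written reasoning with it, as in the Profile/link sketch shown earlier.
```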

Some other philosophical differences include the fact that STEMMA profiles represent snapshots of information and knowledge at a particular point in the assimilation process (or the correlation process, in the case of Matrix); the actual information effectively flows between those profiles. This is hard to describe in detail, and I may save it for a later post. Another difference is that the profiles allow steps of logic and reasoning to be represented in natural language; the connections are not just a bunch of data links or database record IDs. That text would be essential if some user wanted to drill-down to understand where a claim or figure originated, and STEMMA allows multi-lingual variants to be written in parallel. Reading GenTech’s section 1.4.2 suggests (to me) that its ASSERTION may have more in common with STEMMA’s Property mechanism than with its Source and Matrix entities.

An interesting corollary is that conclusions are easily represented in software data models, and they will usually employ precise taxonomies/ontologies to characterise data (such as a date-of-birth, or a biological father), or equivalent structures (such as a tree). In effect, these conclusions are designed to be read by software in order to populate a database or to graphically depict, say, biological lineage. Source information, on the other hand, cannot be categorised to that extent — it was originally intended to be humanly readable, it must be assimilated and correlated by a human, and all analysis must be understood later by other humans.

There have been a number of attempts to represent the logical analysis of source information using wholly computerised elements (see FHISO papers received and Research Process, Evidence & GPS) but these are far removed from employing real text. As a result, they lose that possibility of drilling-down from a conclusion to get back to a written human analysis, and to the underlying information fragments in the sources. While allowing analytic notes to be added directly to data items might be one simplistic provision, connecting notes and concepts together to build structure must have written human explanation, not “logic gates” and other such notions. One reason for these overtly computerised approaches could be that software designers feel an onus on them to support “proof” in the mathematical sense rather than in the genealogical sense; a result of possibly misunderstanding the terminology (see Proof of the Pudding).

Concluding Remarks

So what have I accomplished in this trilogy? I have given insights into a published data model for micro-history that has orthogonal treatment of subjects and inclusive handling of hierarchies, events, narrative, and sources. Did I expect more? Well, I did hope for more, when I first started the project, but it was not to be. There are people in the industry that I have failed to engage, for whatever reason, and that means that the model will probably finish as another curiosity on the internet shelf, along with efforts such as GenTech and DeadEnds. Complacency and blind acceptance have left genealogy in a deep rut. In the distant future — when technophobe is simply a word in the dictionary — people of all ages will look back at this decade and laugh at what our industry achieved with its software.  When paper-based genealogy (using real records) has probably gone down the same chute as paper-based libraries, and we're left with software genealogy and online records, then we'll wish that we had built bigger bridges between those worlds. If our attitudes and perceptions don’t rise above the horizon then we'll never see the setting sun until it’s too late, and we might as well say: RIP Genealogy!





[1] Nancy Roberts & Sean F. Everton, "Strategies for Combating Dark Networks", Journal of Social Structure (JoSS), volume 12 (https://www.cmu.edu/joss/content/articles/volume12/RobertsEverton.pdf : accessed 24 Oct 2015), under “Introduction”; publication date uncertain but its References section suggests 2011.
[2] In STEMMA terms: persons, places, groups, or animals.
[3] A BI (business intelligence) process where selecting a summarised datum or a hierarchical field, usually with a click in a GUI tool, reveals the underlying data from which it was derived. The term was used routinely during the early 1990s when a new breed of data-driven OLAP product began to emerge.
[4] "An environment or material in which something develops; a surrounding medium or structure", from Oxford Dictionaries Online (http://www.oxforddictionaries.com/definition/american_english/matrix : accessed 28 Oct 2015), s.v. “matrix”, alternative 1.
[5] Modern software development is less about ground-up development and more about creating solutions by bolting together off-the-shelf (e.g. open-source or proprietary) components. Both have merit but the pendulum is currently in the other quadrant.
[6] An n-tuple (abbreviated to “tuple”) is a generalisation of double, triple, etc., that describes any finite ordered list. The term was popularised in mathematics, and more recently in OLAP technologies such as Holos and later Microsoft OLAP.