GeneaBloggers

Friday, 22 April 2016

When the Digital Age Makes You Scream



Last year, I wrote about the woes of using findmypast for newspaper searches at When the Digital Age Hinders. I now want to follow that up with an evaluation of another newspaper search engine; it’s starting to look like reliable searches are a thing of all our pasts.

In the findmypast case, there was no visible attempt to address my issues. I made the company aware of the article but there was no comment and no requests for more information, merely a half-hearted attempt to blame the British Newspaper Archive (BNA) for some of the issues. Having tried to use it today (22 Apr 2016), the problems are still there, and even seem to be more intrusive than before.

Figure 1 - Punishment of Sisyphus.[1]

Also today, I needed to perform some searches of The Gazette (UK). Although I had used this for previous research, I recalled that they had redesigned some aspect of their interface; maybe it had been improved.

Unfortunately, I couldn’t get my searches to yield anything resembling what I asked for. I clicked on the little link entitled “Help with search”, just above the text-search field. Rather than giving some textual instructions, though, this directed me to a private — and hence inaccessible — YouTube video called "searching the Gazette" at https://www.youtube.com/watch?v=yJI--4jBhyM. Being part-way through your research, and then being expected to stop and watch some video, is not a helpful plan. After meandering around the site, I found a different link for the same video at https://www.thegazette.co.uk/videos/searching-the-gazette. This was nearly 3 minutes long and eventually informed me that putting a phrase in quotation marks would return only notices containing that exact phrase. But that’s what I had been doing!

I was actually searching for the phrase “Kirk of Nottingham” in all notice types between 1840 and 1860, and it gave me 17 hits. However — yes, you’ve guessed — I couldn’t find my exact phrase in any of them. It took me a while but I did find that each of them contained “Kirk of” but not “Kirk of Nottingham”. Now I wouldn’t insult the designers of this system by suggesting that they only honour the first two words in any exact search phrase, so it must be some inbuilt relaxation of the search terms.

The problem is that it doesn’t deliver what they said it would. They said it would only deliver notices that contain the exact phrase. I have seen several form-fill search engines that describe a field as “exact phrase”, only for the phrase to be relaxed later, probably by some component of the search software. I did notice that the BNA search fields have an “Exact search” tick-box but I couldn’t tell whether it did precisely that from the textual synopses of the hits, and I wasn’t going to take a subscription just on the off-chance.

So what gives here? If product managers are adding support for exact phrases then why is it not working? Why isn’t the failure being picked up during testing? Are the testers listening to software designers rather than reading functional specifications? A possible explanation is that the guts of the search engine are being brought in from some other company, and the testers test only the software developed in-house (well, ignoring the aforementioned help link) rather than testing that the whole system functions correctly as far as an end-user is concerned.

Why am I complaining? Well, the odd thing in The Gazette case is that none of the textual synopses correctly showed the context of the hit it was delivering to me. The following are just the first four of those 17 hits, so see for yourself:

Publication Date     18 June 1858
The London Gazette, Issue 22154, Page 2989
woo3, of Dewsbury, in the said county, Share Broker, and John Worsnop, of Oleckheaton aforesaid. Painter, all the real and personal estate and effects, whatsoever and wheresoever, of them, the said Wi…
View The London Gazette, Issue 22154, Page 2989 page

Publication Date     3 April 1849
The London Gazette, Issue 20964, Page 1093
awarded and issued forth against William Wood, of Wad- dington, in the county of Lincoln, Licensed Victualler, will sit on the 27th day of April instant, at eleven of the clock in the forenoon, at the…
View The London Gazette, Issue 20964, Page 1093 page

Publication Date     9 November 1849
The London Gazette, Issue 21036, Page 3371
at the Birmingham District :Cour.t of Bankruptcy, at Nottingham, in order .to Audit ,the Accounts of .the Assignees of the estate and effects qf the said""bankrupt under the said Fiat, "pursuant to th…
View The London Gazette, Issue 21036, Page 3371 page

Publication Date     17 November 1848
The London Gazette, Issue 20916, Page 4152
NATHANIEL ELLISON, Esq. Her Majesty's Com- missitfrter of the Newcastle-upon-Tyne District Court of Bankruptcy, the Commissioner authorized to act under a Fiat in of Bankruptcy, bearing date the 25th


There isn’t even a single instance of “Kirk” visible, and so the only option is to load each page image and perform a local search via Ctrl+F. This is just not good enough!




[1] Titian (1490–1576), “Sisyphus”, painting (between 1548 and 1549); attribution: [Public domain], via Wikimedia Commons.

Monday, 4 April 2016

The Great Wave



Figure 1 - The North Wind goes over the sea.[1]


 
Fixed grey faces gaze silent through their glaze, evoking memories of a time gone by and of lives that were,
But where is yesterday? When was tomorrow? An eternity of fleeting moments lost to the seething surf,
Lives beyond capture by any tree-lined garden of descent, graven on some far-distant tablet beyond the sight of man, save the muse of Parmenides,
For lifeless and textureless is the world beyond the duir, daubed in questionable hue from the palette of want.

Lain waste, their legacies of stone; the First Ones, long passed into shadow as though but a dream, bequeathing debt immortal,
Fading, failing, falling into darkness, unto dissolution, decay and dust, fettered by the illimitable dice,
Events and stories forsaken by the relentless arrow to lie fallow in their starless stasis betwixt the pages of every passing instant,
Celestial progenies cast adrift and abandoned by nature’s unmindful Doxa, unworthy of all remembrance.

From mankind, whose words be louder than its thoughts, writ in blackest quill, the dogma that serves the meek through the mighty,
While sages patiently strive to learn the magical music by which all things dance, but finding only purest melody and no tempo,
A world of adamantine illusion, wishfully tamed and told through the balancing runes,
A lost chord stripped of all harmony, and the myste with two faces poised in suspended masquerade.

So great the gift of time — the giver of Life, the æon of being, the conjurer of cause, and yet the arrow by which we fall,
Weep not for the past, nor for the lost moments, parting kisses, or stolen memories,
The price we pay for life is change — is wax and wane — is loss and gain, but the thief of days wields an arrow fashioned and shaped by conscious minds,
Taking the coin from our mouths, the thief knowingly smiles back with our own faces.

To the end of days, when all our suns have set on the crimson tide of life's blood,
When all our whispers have fallen silent on the Bible-black firmament, and their echoes have all flown their paper prisons and coulomb cages,
From the last glimmers of life, dream-dashed and robbed of love and hope, still clinging to its rock,
One final desperate cry will be heard afore The Great Wave:

...There can be no rhyme for there was no reason!


Figure 2 - Just as they bent down to take the rose a big dense snowdrift came and carried them away.[2]
 







[1] "The North Wind goes over the sea", illustration, East of the Sun and West of the Moon, Kay Rasmus Nielsen (1886–1957), illustrator (1914); attribution: [Public domain], via Wikimedia Commons.
[2] "Just as they bent down to take the rose a big dense snowdrift came and carried them away", illustration, East of the Sun and West of the Moon, Kay Rasmus Nielsen (1886–1957), illustrator (1914); attribution: [Public domain], via Wikimedia Commons.

Saturday, 26 March 2016

Dynamic Genealogical Data



Yes, I will be discussing data models and software issues in this post — sigh! — but hopefully in a manner that may be instructive to those who want to know more. As well as introducing certain important concepts, I want to illustrate some typically tricky decisions that have to be made, before rounding off with a novel way of presenting custom data to the end-user. Know-it-alls who already have a software background can just fast-forward to the “Reporting” section.

Some time ago, I submitted a paper to FHISO on the subject of object models[1] in which I explained their relevance to scripting languages, query languages, and general dynamic data access. Admittedly, this paper was aimed at a software audience, but let’s try and pull it apart to explain what these languages are, and the difference between a data format, a data model, and an object model.

Dynamic Data

Data Model

A data model is a formalised description of the relevant data entities (e.g. person, or place), their properties (e.g. names, sex, coordinates, etc.), and their relationships (e.g. biological lineage, or place of birth). Issues such as indexes and database schemas are not applicable to a data model, as it is a more abstract definition of the data’s structure, or rather its pattern. But issues such as cardinality (how many items of one type can be related to another), ordinality (the ordering of items related to one instance of another), and optionality (whether an item is mandatory or optional), are relevant.

Let’s consider one aspect of a genealogical data model to help illustrate this point: biological lineage. Every person had one mother and one father, even if they are unidentified; there cannot be more than one of each (ignoring the possibility of donor DNA) but some representations of this will be better than others.

If the parent person entity (e.g. a mother) points to each individual child of hers then it accommodates the erroneous situation where multiple mothers might point to the same child. However, the converse of having each person entity point to their respective mother and father enforces the cardinal integrity of the relationship without having to perform constant error checking.

Entity relationship schemes

It should be noted, here, that the direction of the link makes no difference to the ability to find children-of-a-parent or parents-of-a-child; both can be indexed from either one of the schemes.
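The child-to-parent scheme can be sketched in a few lines of JavaScript. This is only an illustration under assumed names: Person, mother, father, and indexChildren are hypothetical and are not taken from STEMMA or any product. Each person holds at most one mother link and one father link, and the children-of-a-parent lookup is derived by indexing those links.

```javascript
// Hypothetical sketch: each person points to their parents, so a person
// can never end up with two mothers or two fathers.
class Person {
  constructor(name, mother = null, father = null) {
    this.name = name;
    this.mother = mother; // at most one link; null means "not identified"
    this.father = father;
  }
}

// Build a parent -> children index from the child -> parent links.
function indexChildren(people) {
  const index = new Map();
  for (const child of people) {
    for (const parent of [child.mother, child.father]) {
      if (parent === null) continue;
      if (!index.has(parent)) index.set(parent, []);
      index.get(parent).push(child);
    }
  }
  return index;
}

// Made-up sample data.
const mary = new Person("Mary");
const john = new Person("John");
const ann = new Person("Ann", mary, john);
const tom = new Person("Tom", mary); // father not identified

const childrenOf = indexChildren([mary, john, ann, tom]);
console.log(childrenOf.get(mary).map((p) => p.name)); // [ 'Ann', 'Tom' ]
```

With the links in this direction there is simply nowhere to store a second mother, so the cardinality is enforced by the structure itself rather than by validation code.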

So what if you don’t know the father? Well, having a missing father or mother link could easily be recognised as an indication that they are unidentified, but suppose that you have some incomplete details for them? Suppose, for instance, that you don’t know the name of the mother but you have her date of birth. This is where we get into controversial territory, and where I’m going to make a very bold statement. There are many threads that advocate the substitution of underscores, question marks, or some special text such as “Unknown”, “LNU” (Last Name Unknown), etc., for a missing name. In a software context, all of these are the wrong way to handle the situation. This is not a personal belief; avoiding such substitutions is best practice in a profession in which I have spent decades. It doesn’t matter who is making those recommendations, and there are no special cases for genealogists.

OK, rant over; now let me clarify this. Ignoring the fact that any alphabetic text may not translate well when sharing your data with someone from a different locale, the choice over which substitution to use for a missing name, or any missing datum, should not be given to the end-user, or even to a specific software product. More than that, good software would represent non-value conditions of a datum, such as unknown, not applicable, or erroneous, in a different domain to real values so that there is zero chance of a clash. For instance, consider the difference in a SQL database between NULL (a special condition in a column) and “NULL” (a normal textual value in a column). Also, the display value by which those special data conditions are represented to the end-user is a choice by the product, and not dictated by the software representation in the corresponding data format or data model.
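One way to keep non-value conditions in a different domain to real values can be sketched as follows. The names UNKNOWN, NOT_APPLICABLE, DISPLAY_TEXT, and displayValue are assumptions made for this illustration, as are the display strings and the sample records.

```javascript
// Hypothetical sketch: non-value conditions are unique symbols, so they
// can never clash with a real textual value such as "Unknown".
const UNKNOWN = Symbol("unknown");
const NOT_APPLICABLE = Symbol("not applicable");

// The display text is a choice made by the product (and its locale),
// not something stored in the data itself.
const DISPLAY_TEXT = {
  en: new Map([[UNKNOWN, "(unknown)"], [NOT_APPLICABLE, "(n/a)"]]),
  fr: new Map([[UNKNOWN, "(inconnu)"], [NOT_APPLICABLE, "(sans objet)"]]),
};

function displayValue(value, locale = "en") {
  return typeof value === "symbol" ? DISPLAY_TEXT[locale].get(value) : value;
}

// Made-up records: one missing surname, one genuine surname "Unknown".
const mother = { surname: UNKNOWN, birthYear: 1832 };
const oddity = { surname: "Unknown", birthYear: 1840 };

console.log(displayValue(mother.surname));       // (unknown)
console.log(displayValue(mother.surname, "fr")); // (inconnu)
console.log(mother.surname === oddity.surname);  // false
```

The stored data never contains the placeholder text, so sharing it with someone in a different locale changes only what is displayed.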

I appreciate that some software may not follow these best-practices, but it is important to understand why this is bad for everyone. Tamura Jones produced an excellent article in 2013 related to this subject that discussed the impact of using acronyms and other invented values.[2] As I recently commented on one of Randy Seaver’s blogs, “Fake, Fudged, Dummy, and other such ‘special’ values were bad choices even in the 1970s”.

We’ve just looked at some choices in the relationships between two entities, and in the representation of non-value conditions. These cases may provide a basic insight into typical design issues affecting a data model, but what is a data model good for? It allows two products — a producer and a consumer — to agree on the structure of real data being exchanged between them.  When real data is stored in a file (or serialised as bytes for some other purpose, such as transmission over a communications link) then the data format employs a given syntax. That data format is largely irrelevant in comparison to the data model; the GEDCOM data model could be expressed in its own proprietary data format, or in XML, or some other format, and it would be a straightforward mechanical operation to convert from one format to another if they all conform to the same model.
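To make the distinction between data format and data model concrete, here is a minimal sketch in which one tiny model (a person with a name and a birth year) is written in two formats. The line-based format, the function names, and the sample person are all invented for this example; it does not represent GEDCOM or any real format.

```javascript
// Hypothetical sketch: one data model serialised in two data formats.
const toJson = (person) => JSON.stringify(person);
const fromJson = (text) => JSON.parse(text);

// A made-up line-based format: one "key=value" pair per line.
const toLines = (person) =>
  `name=${person.name}\nbirthYear=${person.birthYear}`;
const fromLines = (text) => {
  const fields = Object.fromEntries(
    text.split("\n").map((line) => {
      const at = line.indexOf("=");
      return [line.slice(0, at), line.slice(at + 1)];
    })
  );
  return { name: fields.name, birthYear: Number(fields.birthYear) };
};

// Converting between formats is mechanical: parse into the model, then
// write the model out again in the other format.
const asLines = "name=Ann Kirk\nbirthYear=1841"; // made-up sample data
const asJson = toJson(fromLines(asLines));
console.log(asJson); // {"name":"Ann Kirk","birthYear":1841}
```

Neither converter needs to know anything about genealogy; both rely only on the shape that the shared model defines.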

However, such files are a very static way to exchange data, and they have to be loaded into some organised indexed form before they can be interrogated, navigated, or manipulated. One example of such a form is a database, but this is not the only form and genealogical products have other choices (see Do Genealogists Really Need a Database?). The following diagram provides a simplistic depiction of how a program may access indexed data in a disk-resident database or in memory, but other variations are possible, such as a memory-resident index pointing to files on disk, as might be the case with a collection of image files.

Data access

STEMMA deliberately describes its data format as a source format in order to attach extra semantics; this draws from the term source code, as used in programming, in order to emphasise that the data is a definitive source for other forms, whether generated by transformation or by indexing, and not simply an exchange format.

Object Model

A program that accesses a database is constrained by the data-types allowed in its columns, the nature of its indexes, normalisation of its table entities, and the associated query language used to access those tables (mostly but not always SQL). Software that uses a query language directly, rather than having some abstraction layer between it and the target database, may be reducing both its longevity and its portability.

An in-memory index is more efficient and more flexible, and the ever-increasing memory capacity of modern machines means that it is totally practical to have both the data and the index in memory together. But what would the in-memory data look like?

The modern answer to this question is objects. In object orientated programming (OOP), an object is a software entity representing one instance of some real-life entity. For instance, a Person object[3] would represent one named person, and a Place object one named place. Objects of the same type are instantiated (i.e. created) from a template called a class, which defines not only the allowable properties (e.g. names and sex for a person) but also small segments of code, called methods, that may be invoked on the associated objects. For instance, there may be a method to test whether any of the names stored in the current object matches a particular name provided as a parameter. All products utilising that class in their programming would therefore use a consistent algorithm, as implemented by the designer of the associated object model: the set of related classes intended to cooperate in order to deliver access to that data. An object model is not uniquely defined by a given data model, but they do go hand-in-hand; every object model has an underpinning data model. However, whereas a data model talks about entity relationships, it is the object model that talks about actual data linkages, indexes, and issues of efficient data access.

A very important aspect of OOP is software inheritance. This is where one class is derived from another class and allows it to share unchanged portions while overriding others; the intention being to create a new class for a more specialised type of entity. The following is an illustration of how it might be applied to the various subjects of historical research, each level providing specialised classes derived from more generic ones in the previous level.


Software inheritance
In this illustration, the handling of names could be shared for all the subject types, and inherited from the historical-subject base class, whereas the hierarchy for animate subjects (i.e. lineage) would be different from that for inanimate subjects.
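The idea can be sketched in JavaScript as follows. The class names and the hierarchy() method are assumptions made for this example rather than a transcription of the illustration, and the word "containment" for inanimate subjects is likewise only a guess at what such a hierarchy might be called.

```javascript
// Hypothetical base class: name handling is written once here and
// inherited by every type of historical subject.
class HistoricalSubject {
  constructor(...names) {
    this.names = names;
  }
  nameContains(token) {
    const wanted = token.toLowerCase();
    return this.names.some((n) => n.toLowerCase().includes(wanted));
  }
}

// The two branches differ in the kind of hierarchy they support.
class AnimateSubject extends HistoricalSubject {
  hierarchy() {
    return "lineage";
  }
}

class InanimateSubject extends HistoricalSubject {
  hierarchy() {
    return "containment"; // assumed: e.g. a place inside a larger place
  }
}

// More specialised classes add nothing here, but could override methods.
class Person extends AnimateSubject {}
class Place extends InanimateSubject {}

const person = new Person("Tony Proctor");
const place = new Place("Nottingham");
console.log(person.nameContains("proctor"), person.hierarchy()); // true lineage
console.log(place.nameContains("Notting"), place.hierarchy());   // true containment
```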

We might then redraw the earlier data-access diagram as follows in order to show that the object layer forms an effective abstraction:


Abstracted data access
The program sees the same application programming interface (API), irrespective of whether the data is in local memory, in some database, or even across some network connection. NB: The index associated with the objects would be implicit in their class definitions, and not really a separate entity as implied by this diagram.

I’ve just mentioned access across a network, where your data may be on a separate machine (the server) to that of your program (the client), e.g. in the “cloud”. This is a case for which query languages are well suited. You see, if a data server has millions of records available but the program on your client machine only wants to see a handful that satisfy some specific criteria, it would be extremely inefficient to transport all the records up to the client machine, across a typically slow network; it is much easier to push the criteria down to the server and let software there sort it out. This is effectively what happens when a SQL query is sent to a server hosting a database; the query may be as small as a single line and the returned records would be just the ones of interest. There are other forms of query language, such as MDX for multi-dimensional data queries, and the form of the returned data is highly dependent on the nature of the query.

A scripting language is usually considered to be an interpreted language, meaning one executed directly from its source-code representation (as entered by the programmer) rather than being first compiled into a set of instructions that the machine can execute directly. Because the source has to be parsed before it can be interpreted, such languages incur a performance penalty, but their speed of deployment and ease of maintenance make them very useful for small-scale applications. Languages such as JavaScript and VBScript are common examples, but there is little difference from the aforementioned query languages, other than in their intended function. Many database systems even support a mechanism where segments of script can be held as stored procedures that can be invoked later by a special type of query statement. The relevance of this is that when a scripting language has access to an object model then you have a very powerful and flexible means of data access.

Suppose we want to express a custom genealogical query — one not supported intrinsically by our product — to look at all the events in our timeline, then look at all the persons connected to those same events, and then to select just the ones whose name(s) have the token "Jesson" in them. If a standard object model is defined then it doesn’t matter whether we express our query using a standard scripting language or some proprietary one associated with the product.

This example uses a Java-like syntax.

Person me = new Person("Tony Proctor", 1956);
for (Event e : me.allEvents()) {
    for (Person other : e.allPersons()) {
        if (other.nameContains("Jesson")) {
            // ...do something with this other person...
        }
    }
}

This example uses a VB6-like syntax.

Dim me As New Person
me.setPersonName ("Tony Proctor")
me.setDateOfBirth (1956)
Dim e As Event
Dim other As Person
For Each e In me.allEvents()
    For Each other In e.allPersons()
        If (other.nameContains("Jesson")) Then
            ' ...do something with this other person...
        End If
    Next other
Next e

The syntax is very different but the essential elements of the object model, such as class names and method names, are the same in the two examples.
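For readers who want something they can actually run, the same query can be written in JavaScript against a tiny in-memory object model. Only the method names allEvents(), allPersons(), and nameContains() come from the two examples; the class internals, the event titles, and the extra people are made up for this sketch and do not represent a real product's API.

```javascript
// Hypothetical, minimal in-memory object model.
class Person {
  constructor(name, birthYear) {
    this.name = name;
    this.birthYear = birthYear;
    this.events = [];
  }
  allEvents() {
    return this.events;
  }
  nameContains(token) {
    return this.name.includes(token);
  }
}

class Event {
  constructor(title, persons) {
    this.title = title;
    this.persons = persons;
    for (const p of persons) p.events.push(this); // link in both directions
  }
  allPersons() {
    return this.persons;
  }
}

const me = new Person("Tony Proctor", 1956);
const ann = new Person("Ann Jesson", 1900);  // made-up test data
const john = new Person("John Smith", 1901); // made-up test data
new Event("Wedding", [me, ann, john]);
new Event("Census", [me, ann]);

// The query itself has the same shape as the Java-like and VB6-like versions.
const found = new Set(); // a Set, so a person at several events appears once
for (const e of me.allEvents()) {
  for (const other of e.allPersons()) {
    if (other.nameContains("Jesson")) found.add(other);
  }
}
console.log([...found].map((p) => p.name)); // [ 'Ann Jesson' ]
```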

So, in summary, irrespective of where the data is coming from (memory, database, or afar), if an object model is available then any processing or queries can be expressed through scripting languages. Furthermore, those segments of code can be pushed down to the data server in order to achieve efficient retrieval in the case of remote data stores.

Citations

One instance where I have found it beneficial to expose an object model was in the processing of citation-elements in order to generate a formatted citation. A citation-element is a discrete datum that can be identified in a citation, such as an author, title, publication date, etc. When a system generates a formatted citation using a citation-template then it takes various citation-element values and inserts them at selected places in a textual template. The onus is on the hosting software to provide the citation-template with all the relevant values, and this can be a heavy burden if it doesn’t have intimate knowledge of the actual template.

Suppose that the citation is required to name the Genealogical Publishing Co. of Baltimore. How much information should the hosting program pass to the citation-template to let the reader know where that company is? Is a prefix of “Baltimore:” enough, or does it require “Baltimore, Maryland:”, or does it require “Baltimore, Maryland, US:”? Remember that the reader may not be American, and if you think they should know all the US states then would you be aware of all the regions in their country?

The alternative approach is to pass a Place object to the citation-template and allow it to invoke the associated methods to obtain the specific information that it requires for the current template and the current user. The same approach can be applied to other data such as a Person object (allowing access to name-handling methods), or a Date object (allowing the formatting of dates from alternative calendars in short/medium/long/full forms).
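As a rough sketch of that approach, the Place class below exposes a single method that a template can call to get as much of the place hierarchy as it wants. The class, the qualifiedName() method, and the two template functions are hypothetical names chosen for this example; they are not STEMMA's actual interface.

```javascript
// Hypothetical Place object: the template asks for what it needs rather
// than receiving a pre-formatted string from the hosting program.
class Place {
  constructor(name, parent = null) {
    this.name = name;
    this.parent = parent;
  }
  // Returns the name followed by up to `levels` enclosing places.
  qualifiedName(levels) {
    const parts = [this.name];
    let p = this.parent;
    while (p !== null && parts.length <= levels) {
      parts.push(p.name);
      p = p.parent;
    }
    return parts.join(", ");
  }
}

const us = new Place("US");
const maryland = new Place("Maryland", us);
const baltimore = new Place("Baltimore", maryland);

// Two hypothetical templates that make different choices for the reader.
const domesticTemplate = (place, publisher) =>
  `${place.qualifiedName(0)}: ${publisher}`;
const internationalTemplate = (place, publisher) =>
  `${place.qualifiedName(2)}: ${publisher}`;

const publisher = "Genealogical Publishing Co.";
console.log(domesticTemplate(baltimore, publisher));
// Baltimore: Genealogical Publishing Co.
console.log(internationalTemplate(baltimore, publisher));
// Baltimore, Maryland, US: Genealogical Publishing Co.
```

The hosting program passes the same Place object in both cases; the decision about how much context the reader needs sits entirely in the template.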

Reporting

A more recent application of an object model occurred in STEMMA’s narrative support. The inclusion of a segment of script in a STEMMA table entity allowed it to fetch specific data to populate the cells at the time the associated Narrative entity was rendered. This basically meant that a Narrative entity could be used as a report writer (in software terminology) to fetch up-to-date data matching a custom query and to present it along with narrative or other data.

To explain the significance of this, let’s look at the HTML equivalent since it is also possible to dynamically populate an HTML table. Once upon a time, programmers used to do this by injecting new HTML source into the table using the JavaScript document.write() function. For instance:


document.write("<tr>");

This allowed, say, the rows and cells of a table to be generated from some data such as the results of a SQL query or the contents of a JavaScript array. This is a very old method, and has acknowledged security risks, as well as performance implications in certain cases. An equivalent way of injecting HTML source is to modify the innerHTML property of a given node. For instance:

node.innerHTML = "<b>Hello World</b>";

A better approach, though, would be to use methods provided by the HTML Document Object Model (DOM): the tree of node objects into which an HTML document is compiled. For instance, document.createElement() and node.appendChild(). In the specific case of tables, W3C defined DOM methods such as tbody.insertRow() and tr.insertCell() to help build the table body more easily.

The STEMMA case is slightly different from that of HTML since it does not have an inherent DOM — the contents of a marked-up Narrative entity have to be transformed into some other system for presentation, which might be in an HTML page, a blog page, or a word-processor document — but it does expose a tentative genealogical object model. This means that the STEMMA table entity can specify a segment of script code to call upon that object model and retrieve results for the table contents. It doesn’t do this by dynamically injecting STEMMA source, or by calling on something like the DOM methods to build-up a table; the script populates an object that simply describes the contents of a table — the headings and data cells — and this can be transformed for display in the same manner as a static table would.
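The general pattern of a script populating a table description that is then transformed separately can be sketched like this. It is not STEMMA script: the object shape (headings and rows), the function names, and the sample person are assumptions made for the illustration.

```javascript
// Hypothetical sketch: the script only fills in a description of the
// table; turning that into HTML (or anything else) is a separate step.
function buildTable(persons) {
  return {
    headings: ["Name", "Born"],
    rows: persons.map((p) => [p.name, String(p.birthYear)]),
  };
}

// One possible transformation: a plain HTML table.
// (No escaping of special characters is attempted in this sketch.)
function toHtml(table) {
  const cells = (tag, values) =>
    "<tr>" + values.map((v) => `<${tag}>${v}</${tag}>`).join("") + "</tr>";
  return (
    "<table>" +
    cells("th", table.headings) +
    table.rows.map((r) => cells("td", r)).join("") +
    "</table>"
  );
}

// Another transformation of the same description: tab-separated text.
function toText(table) {
  return [table.headings, ...table.rows].map((r) => r.join("\t")).join("\n");
}

const table = buildTable([{ name: "Ann Jesson", birthYear: 1900 }]); // made-up data
console.log(toHtml(table));
console.log(toText(table));
```

Because the script never touches the output markup, the same description can be rendered for an HTML page, a blog page, or a word-processor document by swapping only the transformation.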

What this all means is that the existing support for representing narrative essays, narrative reports, transcriptions, etc., has also become a tool for dynamically reporting on genealogical data. I don’t need to devise some completely separate tool to achieve this, and I also get the benefit of mix-and-match where dynamically generated tables can be embedded in my narrative reports. And just in case anyone hasn’t realised yet, if the underlying data is subsequently modified then I will see the very latest data the next time I view the associated Narrative entity.



[1] Tony Proctor, "Proposal to Create a Standard Run-time Object Model", FHISO Call For Papers, CFPS 19 (http://tech.fhiso.org/cfps/files/cfps19.pdf : accessed 9 Mar 2016).
[2] Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml : accessed 9 Mar 2016).
[3] Using capitalisation, here, merely to distinguish software entities from the real-life entities that they are representing.