Some of you may have noticed a bit of a hiatus in my posts. I’ve been a bit under the weather, feeling lethargic and run down, and not in a good frame of mind for writing. Lucky you, I am in the mood now.
Remember a few posts back I mentioned working on a CMS reporting tool? Well, I’ve been working with a client on a tool for managing parallel development, which is closely related to the type of reporting I had in mind for the CMS reporting tool. The client’s focus is on Dimensions, but in the interests of creating a flexible system I’ve been designing with other underlying tools in mind (although, as is often the way, timescales and pragmatism mean the implementation is sometimes more tool specific than the original design). The result of this has been some extended mental gymnastics that have resulted in some useful and hopefully interesting thoughts that I would like to share with you now.
The parallel development tool needs to report on, analyse and, ultimately, update the host SCM database. There are a number of special considerations that I won’t go into now, but these created much candle burning and banging of my head on desks, walls and other solid objects as I struggled to resolve some apparently simple problems that turned out to be rather tricky. (The sad thing is, as is often the case with these things, once I had an answer it turned out to be deceptively simple.) The upshot of all this cranial damage covers a wide variety of topics, too many for a single post, but we can make a start.
Identification
How do we identify things?
Each tool has its own method of identifying the things it controls. In Dimensions items are ultimately identified by some sort of unique specification; files are identified by an Object Spec, projects (worksets) are identified by Project Specs and so on. Each of these Specs provides a unique reference within the context of a specific Dimensions base database. Subversion identifies things by their position in its virtual file system: path and revision. Again, Subversion references are qualified by the repository in which they are held. Since Subversion repositories can be referred to by any number of methods there is no one unique ‘name’ for an item in Subversion once you step beyond the bounds of the repository (I can, for example, use several host names to refer to the same repository).
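To make the Subversion point concrete, here is a minimal sketch. The `SvnRef` class and both host names are made up for illustration, and I am simply assuming the two URLs reach the same repository; nothing here talks to a real Subversion server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvnRef:
    """Hypothetical reference to a Subversion item: repository URL, path, revision."""
    repo_url: str
    path: str
    revision: int


# Two made-up host names that we assume resolve to the same repository.
a = SvnRef("http://svn.example.com/repo", "/trunk/file1.txt", 42)
b = SvnRef("http://svn-mirror.example.com/repo", "/trunk/file1.txt", 42)

# Inside the repository, path and revision agree...
same_within_repo = (a.path, a.revision) == (b.path, b.revision)

# ...but the fully qualified references differ, so a naive comparison
# treats them as two different items.
same_globally = a == b

print(same_within_repo, same_globally)
```

The reference is only as unique as the name we happen to use for the repository that qualifies it.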
With this in mind, how can we uniquely refer to a single controllable entity? (I shall use the term ‘Artefact’ to represent a uniquely identifiable thing.) For simplicity we’ll focus our attention on a file; after all, in software configuration management (SCM) this is the most common sort of Artefact we want to control. As I discussed briefly in a previous post, files exist along two histories: the Revision History (in which new Artefacts are created for each revision of the file), and the Lifecycle History (in which an Artefact may undergo change without changing its revision, that is, without becoming a new Artefact). Dimensions does a nice job of recording these two histories, but our challenge here is to differentiate identities.
Do we assign the identity to the Artefact’s revisions, or do we assign the identity to each Lifecycle historic state?
To illustrate, consider file1.txt. This file will be subject to the simple lifecycle illustrated below.

A very simple lifecycle
Starting in the state NEW we assign the file1.txt Artefact a revision of 1. The Artefact’s two histories and its change log are illustrated below.

file1.txt - a simple change log
The developer chooses to change the original file1.txt, but while it is in lifecycle state NEW we do not require a new Artefact; the resulting edited version of file1.txt remains revision 1. The lifecycle state also remains NEW, so although we have travelled the In-place Edit path in the lifecycle, the result is an edit event but no change in either the state or the revision of file1.txt. Using state and revision to identify the Artefacts (the one before the edit and the one after the edit) is therefore not sufficient to account for the two different edited versions of file1.txt.
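The collision can be sketched in a few lines. The `Version` tuple, the `identity` function and the file contents are all invented for illustration; they are not taken from Dimensions or any other tool.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical snapshot of file1.txt at one point in its history."""
    state: str
    revision: int
    content: bytes


def identity(v: Version) -> tuple:
    # Identity built from lifecycle state and revision only.
    return (v.state, v.revision)


before_edit = Version("NEW", 1, b"original text\n")
after_edit = Version("NEW", 1, b"original text, edited in place\n")

catalogue = {}
catalogue[identity(before_edit)] = before_edit
catalogue[identity(after_edit)] = after_edit  # same key: replaces the first entry

print(len(catalogue))
```

Both snapshots map to the key `("NEW", 1)`, so the catalogue ends up holding a single entry and the pre-edit content can no longer be referred to.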
We could simply state that we are not interested in such edits, or we could assign some ‘point’ revision to distinguish these intra-state revisions. Both of these options have been adopted by tools in the past, but when trying to generalise a solution we need to accommodate as many possibilities as we can. Generalising a solution should allow us to sit atop any underlying implementation. To arrive at a generalisation we need to explore the problem domain further.
Derived Identities
Perhaps the answer lies in content-based addressing (CBA). With CBA we identify Artefacts by creating a signature based upon intrinsic attributes of the item itself. This works very well for electronic data (in our case, files). There are many CBA schemes available and they are widely used in peer-to-peer systems for identifying files. The current darling of the CBA world is the SHA-2 family of hashing functions. These hashing functions process the stream of bytes in a file (or any other source of data), producing a single fixed-length ‘summary’ value which is usually much shorter than the original (224, 256, 384 or 512 bits long in the case of SHA-2).
CBA has huge appeal for our purpose in uniquely identifying an Artefact. A CBA is almost guaranteed to be unique to the data being encoded (I say almost because any hash function must ultimately collide, with two data sets producing the same hash, since a potentially infinite set of data is being mapped onto a finite set of hash codes). A CBA is intimately related to the Artefact’s characteristics and so no assignment authority is required (see Assigned Identities a little later). This also means that given an Artefact we can work out its unique identity, providing we know the hash function and the attributes used for identification (a file’s data, for example).
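As a minimal sketch of a derived identity, here is SHA-256 from Python’s standard `hashlib` module applied to a file’s bytes. The `content_id` name and the `sha256:` prefix are my own conventions for this example, not part of any tool.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identity from the content alone (hypothetical helper)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


original = b"The quick brown fox\n"
copy_elsewhere = bytes(original)      # same bytes, held somewhere else
edited = b"The quick brown fox!\n"    # one character added

print(content_id(original) == content_id(copy_elsewhere))  # identical content
print(content_id(original) == content_id(edited))          # different content
```

Anyone who holds the bytes and knows the hash function arrives at the same identity without consulting anybody, and the smallest edit yields a different one.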
CBAs work very well for electronic data. Unfortunately, CBAs do not work well for the general case. How, for example, might we assign a CBA to a physical Artefact?
Assigned Identities
This brings us to assigned identities. An assigned identity is one that, unsurprisingly, is assigned by some agent. Examples of assigned identities are legion. The ISBN you find on any book you buy is an assigned identity, not for that specific book, but for the data and form of that book (hardback version of the first edition of XYZ). Similarly, almost all products have a UPC (Universal Product Code) — the one that you find encoded in bar codes. Again, UPCs are assigned to a ‘type’ rather than the specific item; so the UPC code 50036 90803 refers to JBL Duet speakers, but not to the specific pair of JBL speakers sitting on my desk.
To ensure the uniqueness of assigned identities two conditions must be met: the scope of the identity must be well defined (ISBNs apply only to books, UPCs apply only to products, and not, for example, to ideas or patents), and there must be some central agency who defines and issues codes (or at least issues blocks of codes to be issued by approved agents).
Assigned identities are not derived from the Artefact that they represent, so there is no way, when presented with an Artefact, to establish that Artefact’s identity, other than by reference to some catalogue to see whether it already has an assigned identity. This could be a very long process if we relied on the data as a whole. ISBNs, for example, can be looked up using the book’s format, title, and edition. These three attributes form the key by which the ISBN can be referenced in a catalogue. This presumes the existence and universal availability of a definitive catalogue.
Another form of assigned identity is the Universally Unique Identifier (UUID). A UUID is a string of 128 bits. Some versions of the scheme use hashing functions to condense a set of attributes, but the source data is not the file’s content; it is a series of attributes that are themselves (hopefully) unique, such as a timestamp and a network address, or a namespace and a name. Other versions simply use random numbers. There are many implementations of the UUID scheme and they use different attributes to generate the UUID itself. Like CBA, a UUID is almost guaranteed to be unique, certainly sufficiently unique for most practical purposes.
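Python’s standard `uuid` module exposes several of these schemes, which makes for a quick sketch. The artefact name below is a made-up URL used only for illustration.

```python
import uuid

# Version 4: built from random numbers, so every call yields a new identity.
random_a = uuid.uuid4()
random_b = uuid.uuid4()

# Version 5: a SHA-1 hash of a namespace plus a name. The name here is a
# hypothetical URL; the same inputs always give the same UUID.
name = "http://cm.example.com/artefacts/file1.txt"
named_a = uuid.uuid5(uuid.NAMESPACE_URL, name)
named_b = uuid.uuid5(uuid.NAMESPACE_URL, name)

print(random_a != random_b)      # two assignments, two identities
print(named_a == named_b)        # repeatable, given the same name
print(len(random_a.bytes) * 8)   # size in bits
```

Note that even the hashed variant is derived from the name we chose to give the Artefact, not from the Artefact’s content.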
So where does all this leave us?
An identity based on CBA is great for known data sets because the identity is generated from the data itself. We can therefore recreate the identity independently of any central authority. We can take a file and work out its unique identity and then use this unique identity within our tools. This, as we shall see in coming weeks, has significant advantages for resolving configuration data across distributed databases.
The problems with CBAs are that they do not translate well into the physical realm (we need to establish some characteristic of the physical object that can be uniquely selected for encoding into data and subsequently used to generate the CBA), and they are quite costly to generate (we need to read and process the entire data set to create the CBA).
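The cost is easy to see in a sketch: to derive the identity, every byte has to pass through the hash function. The `content_id_stream` helper and the chunk size are assumptions for illustration, and an in-memory stream stands in for a real file.

```python
import hashlib
import io


def content_id_stream(stream, chunk_size=64 * 1024):
    """Hash a stream chunk by chunk; returns (hex digest, bytes read)."""
    h = hashlib.sha256()
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
        bytes_read += len(chunk)
    return h.hexdigest(), bytes_read


payload = b"x" * 200_000  # stand-in for the contents of a large file
digest, bytes_read = content_id_stream(io.BytesIO(payload))

print(bytes_read)  # the whole payload was read to produce one identity
```

Reading in chunks keeps memory use flat, but it does nothing about the need to read the whole file; the work grows with the size of the data.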
Assigned identities that rely on an assigning authority (like the ISBN) require significant organisation and coordination. Each assigning authority must be organised to ensure that each identity is assigned uniquely. And here we run into another question related to the other form of assigned identity, UUIDs.
What do we mean by ‘uniquely assigned’? Suppose I create a file containing the text ‘The quick brown fox jumped over the lazy dog’ and I assign it a UUID. Some time later, you create a file with identical content and assign it a UUID. The two UUIDs will be different. Are we then dealing with two Artefacts?
In one sense we are. They are two distinct incarnations of the file. But in any useful sense, we are not. They are in fact the same file. Just as if I took my original file and copied it to another location in my file system I would consider it the same file. Or if you copied your file to my file system I would be unable to distinguish your file from mine. They are the same ‘thing’, but we have assigned different identities.
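The two views can be put side by side in a short sketch, using random (version 4) UUIDs for the assigned identity and SHA-256 for the derived one. The dictionaries and the `derived` helper are purely illustrative.

```python
import hashlib
import uuid

text = b"The quick brown fox jumped over the lazy dog"

# Each of us creates the file independently and assigns it a UUID.
my_file = {"assigned": uuid.uuid4(), "content": text}
your_file = {"assigned": uuid.uuid4(), "content": bytes(text)}


def derived(artefact) -> str:
    """Content-based identity of a hypothetical artefact record."""
    return hashlib.sha256(artefact["content"]).hexdigest()


print(my_file["assigned"] == your_file["assigned"])  # assigned identities differ
print(derived(my_file) == derived(your_file))        # derived identities agree
```

The assigned identities say there are two Artefacts; the derived identity says there is one.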
This is the problem with all assigned identities: how do we ensure that the ‘thing’ being identified has no identity already assigned?
Is there an answer?
Yes and no. The work currently underway with the semantic web project (quick overview here) offers some hope. There is still a reliance on ‘authority’ to assign identities, but I think there is much to be learned from the general approach, and using a combination of derived and assigned identities should allow easy reconciliation of information between CM data sources. I see the problem of configuration management (and particularly identification) as a special case of identification within the semantic web project.
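One way the combination might work is sketched below: each data source keeps its own assigned identities, and a derived identity is used to spot where they refer to the same content. The `dm:` and `svn:` strings are made-up placeholders, not real Dimensions or Subversion reference formats, and `reconcile` is a hypothetical helper.

```python
import hashlib
from collections import defaultdict


def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Made-up (assigned identity, content) records from two CM data sources.
source_a = [("dm:file1.txt#1", b"hello\n"), ("dm:file2.txt#1", b"other\n")]
source_b = [("svn:/trunk/file1.txt@42", b"hello\n")]


def reconcile(*sources):
    """Group assigned identities by derived identity; keep only the overlaps."""
    index = defaultdict(set)
    for records in sources:
        for assigned_id, data in records:
            index[content_id(data)].add(assigned_id)
    return {cid: ids for cid, ids in index.items() if len(ids) > 1}


matches = reconcile(source_a, source_b)
for cid, ids in matches.items():
    print(cid[:12], sorted(ids))
```

Neither source has to give up its own naming scheme; the derived identity acts only as the join key between them.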
This is a fascinating area of research (well, it is to me) and one that I believe will be increasingly important as CM moves into ‘the cloud’. As the distinction between CM data under your direct control and that controlled by proxy through service providers and third parties is eroded, the need to consolidate CM information held in disparate tool sets and managed by different agencies will only grow. This is a problem that we must solve.
I spent some time a couple of years back considering Topic Maps and Resource Description Framework (RDF) as possible SCM schemas. Although they are functionally equivalent, RDF seems to have won out in the standards battle. If you are interested in seeing where I think information management is going in the 21st century (and where CM must also go), check out the semantic web.