A change can be viewed in two ways: conceptually or literally. What I mean by this distinction is that when I say the requested change is to “correct spelling mistakes in the poem” I am specifying conceptually what the change is to achieve (and after the fact, what the change achieved). On the other hand, we are used to dealing with change in a more literal sense, as a set of revisions, for example, “I edited file1 in this change”. In this post I discuss some of the issues and implications of these two views of change. Read the rest of this entry »

Toward a CM Ontology
May 22, 2010
As I suggested in a previous post, I think the future of CM (and most especially SCM) lies substantially with the semantic web. My reasoning is simple: CM is about information management, and this information needs to be shared, controlled and updated across increasingly diverse organisations and systems. To provide this facility we need a lingua franca, a common means to control and consolidate information between disparate sources. The semantic web provides the means to achieve this information management and exchange.
The great advantage of semantic web over efforts such as the now defunct Application Lifecycle Framework (ALF) is that it requires no agreement between vendors (beyond using semantic web technology). The weakness of efforts such as ALF is always that they demand buy-in from the main tool vendors. A substantial number need to agree to develop and support the new standard.
Certainly semantic web is no panacea, but at least if Vendor A chooses one semantic representation of CM information and Vendor B chooses another they can still communicate by creating a correspondence rule set between the two representations (a little like XSLT can transform one XML into another — only a little though, semantic web has much more to offer).
So vendors need to agree to use and provide semantic web representations for CM information? No, not really. Most tools provide APIs that would allow this information to be interpreted from, or added to, any existing tool. Certainly a non-trivial effort, but one that is at least feasible. Better still, if multiple implementations are created for any tool these can again be consolidated using semantic web techniques.
The real power of semantic web technology comes from two sources; the abstraction of information semantics, and the ability to draw inferences from this information. Once you have an ontology, some inference rules and semantic relationships between ontologies, your inference rules will work across ontologies — neat. What does all the gobbledy-gook mean? It means that if Vendor A develops an ontology with a set of inference rules (rules for extracting more information from the underlying information) then Vendor B can map their ontology onto Vendor A’s and use Vendor A’s inference rules too. Actually it’s even better. User X can extend the rules and have them apply to Vendor A and/or Vendor B’s information sets equally, even if the original inference rules were designed for only Vendor A.
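To make the idea a little more concrete, here is a minimal sketch in plain Python rather than a real semantic web toolkit. Everything in it is an assumption for illustration: the predicate names (`a:derivedFrom`, `b:successorOf`, `a:hasAncestor`), the identifiers and the single transitive rule are all hypothetical. It shows a rule written only against Vendor A's vocabulary producing results for Vendor B's data once a correspondence between the two vocabularies is supplied.

```python
# Facts as (subject, predicate, object) triples. All names are hypothetical.
VENDOR_A = {
    ("a:file1-r2", "a:derivedFrom", "a:file1-r1"),
    ("a:file1-r3", "a:derivedFrom", "a:file1-r2"),
}
VENDOR_B = {
    ("b:doc-v2", "b:successorOf", "b:doc-v1"),
    ("b:doc-v3", "b:successorOf", "b:doc-v2"),
}

# Correspondence rule set: Vendor B's predicate means the same as Vendor A's.
B_TO_A = {"b:successorOf": "a:derivedFrom"}


def translate(triples, mapping):
    """Rewrite predicates using the correspondence; leave the rest untouched."""
    return {(s, mapping.get(p, p), o) for s, p, o in triples}


def infer_ancestors(triples, predicate="a:derivedFrom", inferred="a:hasAncestor"):
    """Rule written for Vendor A: following derivedFrom repeatedly gives ancestors."""
    ancestors = {(s, o) for s, p, o in triples if p == predicate}
    changed = True
    while changed:
        # Join every pair (s -> o) with every pair (o -> o2) to get (s -> o2).
        new = {(s, o2) for s, o in ancestors for s2, o2 in ancestors if o == s2}
        changed = not new <= ancestors
        ancestors |= new
    return {(s, inferred, o) for s, o in ancestors}


combined = VENDOR_A | translate(VENDOR_B, B_TO_A)
facts = infer_ancestors(combined)
```

After this runs, `facts` contains `("b:doc-v3", "a:hasAncestor", "b:doc-v1")`, a conclusion about Vendor B's data reached by a rule that knows nothing about Vendor B. A real implementation would use RDF and an ontology language rather than tuples and a dictionary.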
Brilliant. Problem solved then? Sadly, no. Although this all offers promise of a way forward there remains a lot of work to establish these semantic descriptions and, as many have discovered before, agreeing on the precise meaning of each semantic element is nontrivial in its own right. Not that this should stop us attempting the task.

Absence, CM Tool, identities, and some thoughts on the future of CM
April 7, 2010
Some of you may have noticed a bit of a hiatus in my posts. I’ve been a bit under the weather, feeling lethargic and run down, and not in a good frame of mind for writing. Lucky you, I am in the mood now.
Remember a few posts back I mentioned working on a CMS reporting tool? Well, I’ve been working with a client on a tool for managing parallel development, which is closely related to the type of reporting I had in mind for the CMS reporting tool. The client’s focus is on Dimensions, but in the interests of creating a flexible system I’ve been designing with other underlying tools in mind (although, as is often the way, timescales and pragmatism mean the implementation is sometimes more tool specific than the original design). The result of this has been some extended mental gymnastics that have resulted in some useful and hopefully interesting thoughts that I would like to share with you now.
The parallel development tool needs to report on, analyse and, ultimately, update the host SCM database. There are a number of special considerations that I won’t go into now, but these created much candle burning and banging of my head on desks, walls and other solid objects as I struggled to resolve some apparently simple problems that turned out to be rather tricky. (The sad thing is, as is often the case with these things, once I had an answer it turned out to be deceptively simple.) The upshot of all this cranial damage covers a wide variety of topics, too many for a single post, but we can make a start.
Identification
How do we identify things?
Each tool has its own method of identifying the things it controls. In Dimensions items are ultimately identified by some sort of unique specification; files are identified by an Object Spec, projects (worksets) are identified by Project Specs and so on. Each of these Specs provides a unique reference within the context of a specific Dimensions base database. Subversion identifies things by their position in its virtual file system: path and revision. Again, Subversion references are qualified by the repository in which they are held. Since Subversion repositories can be referred to by any number of methods there is no one unique ‘name’ for an item in Subversion once you step beyond the bounds of the repository (I can, for example, use several host names to refer to the same repository).
With this in mind, how can we uniquely refer to a single controllable entity? (I shall use the term ‘Artefact’ to represent a uniquely identifiable thing.) For simplicity we’ll focus our attention on a file; after all, with software configuration management (SCM) this is the most common sort of Artefact we want to control. As I discussed briefly in a previous post, files exist along two histories: the Revision History (in which new Artefacts are created for each revision of the file), and the Lifecycle History (in which an Artefact may undergo change without changing its revision, without becoming a new Artefact). Dimensions does a nice job of recording these two histories, but our challenge here is to differentiate identities.
Do we assign the identity to the Artefact’s revisions, or do we assign the identity to each Lifecycle historic state?
To illustrate, consider file1.txt. This file will be subject to the simple lifecycle illustrated below.
Starting in the state NEW we assign the file1.txt Artefact a revision of 1. The Artefact’s two histories and its change log are illustrated below.
The developer chooses to change the original file1.txt, but while in lifecycle state NEW we do not require a new Artefact, so the resulting edited version of file1.txt remains revision 1. The lifecycle state also remains NEW, so although we have travelled the In-place Edit path in the lifecycle, the result is an edit event but no change in either the state or the revision of file1.txt. Using state and revision to identify the Artefacts (the one before the edit and the one after the edit) is therefore not sufficient to account for the two different edited versions of file1.txt.
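The collision can be shown in a few lines. This is only a sketch: the `Snapshot` class, its `key` method and the file contents are hypothetical, invented for illustration, and do not reflect how Dimensions or any other tool stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One observed version of an Artefact (hypothetical model)."""
    name: str
    revision: int
    state: str
    content: bytes

    def key(self):
        """Identity built from name, revision and lifecycle state only."""
        return (self.name, self.revision, self.state)


# file1.txt before and after an in-place edit while still in state NEW.
before = Snapshot("file1.txt", 1, "NEW", b"first draft\n")
after = Snapshot("file1.txt", 1, "NEW", b"first draft, edited in place\n")

same_key = before.key() == after.key()          # identities collide
same_content = before.content == after.content  # yet the data differs
```

Here `same_key` is true while `same_content` is false: two different versions of the file share one identity, which is exactly the gap described above.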
We could simply state that we are not interested in such edits or we could also simply assign some ‘point’ revision to distinguish these intra-state revisions. Both of these options have been adopted by tools in the past, but when trying to generalise a solution we need to accommodate as many possibilities as we can. Generalising a solution should allow us to sit atop any underlying implementation. To arrive at a generalisation we need to explore the problem domain further.
Derived Identities
Perhaps the answer lies in content-based addressing (CBA). With CBA we identify Artefacts by creating a signature based upon intrinsic attributes of the item itself. This works very well for electronic data (in our case, files). There are many CBA schemes available and they are widely used in peer-to-peer systems for identifying files. The current darling of the CBA world is the SHA-2 family of hashing functions. These hashing functions process the stream of bytes in a file (or any other source of data), producing a single ‘summary’ string of characters which is usually much shorter than the original (224, 256, 384 or 512 bits long for the SHA-2 family).
CBA has huge appeal for our purpose in uniquely identifying an Artefact. A CBA is almost guaranteed to be unique to the data being encoded (I say almost because any CBA will ultimately collide, with two data sets producing the same hash, since a potentially infinite set of data is being mapped onto a finite set of hash codes). A CBA is intimately related to the Artefact’s characteristics and so no assignment authority is required (see Assigned Identities a little later). This also means that given an Artefact we can work out its unique identity, providing we know the hash function and the attributes used for identification (a file’s data, for example).
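As a minimal sketch of a derived identity, the following uses SHA-256 from Python's standard `hashlib` module. The function names `content_address` and `content_address_of_file` and the chunk size are my own choices for illustration; the only attribute used for identification here is the raw byte content.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive an identity from the bytes themselves (SHA-256, hex encoded)."""
    return hashlib.sha256(data).hexdigest()


def content_address_of_file(path, chunk_size=65536):
    """Same identity for a file on disk, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


original = content_address(b"first draft\n")
copy = content_address(b"first draft\n")
edited = content_address(b"first draft, edited in place\n")
```

Anyone holding the same bytes arrives at the same identity without consulting an authority, so `original` equals `copy`, while `edited` differs. Note that the whole data set has to be read to produce the identity.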
CBAs work very well for electronic data. Unfortunately, CBAs do not work well for the general case. How, for example, might we assign a CBA to a physical Artefact?
Assigned Identities
This brings us to assigned identities. An assigned identity is one that, unsurprisingly, is assigned by some agent. Examples of assigned identities are legion. The ISBN you find on any book you buy is an assigned identity, not for that specific book, but for the data and form of that book (hardback version of the first edition of XYZ). Similarly, almost all products have a UPC (Universal Product Code) — the one that you find encoded in bar codes. Again, UPCs are assigned to a ‘type’ rather than the specific item; so the UPC code 50036 90803 refers to JBL Duet speakers, but not to the specific pair of JBL speakers sitting on my desk.
To ensure the uniqueness of assigned identities two conditions must be met: the scope of the identity must be well defined (ISBNs apply only to books, UPCs apply only to products, and not, for example, to ideas or patents), and there must be some central agency who defines and issues codes (or at least issues blocks of codes to be issued by approved agents).
Assigned identities are not derived from the Artefact that they represent, so there is no way, when presented with an Artefact, to establish that Artefact’s identity, other than by reference to some catalogue to see whether it already has an assigned identity. This could be a very long process if we relied on the data as a whole. ISBNs, for example, can be looked up using the book’s format, title, and edition. These three attributes form the key by which the ISBN can be referenced in a catalogue. This presumes the existence and universal availability of a definitive catalogue.
Another form of assigned identity is the Universally Unique Identifier (UUID). Some versions of the UUID scheme also use hashing functions to condense a set of attributes to a string of 128 bits, but the source data is not the file’s content but a series of attributes that are themselves (hopefully) unique. There are several versions of the UUID scheme and they use different sources (timestamps, network addresses, names or random numbers) to generate the UUID itself. Like CBA, a UUID is almost guaranteed to be unique, certainly sufficiently unique for most practical purposes.
So where does all this leave us?
An identity based on CBA is great for known data sets because the identity is generated from the data itself. We can therefore recreate the identity independent of any central authority. We can take a file and work out its unique identity and then use this unique identity within our tools. This, as we shall see in coming weeks, has significant advantages for resolving configuration data across distributed databases.
The problems with CBAs are that they do not translate well into the physical realm (we need to establish some characteristic of the physical object that can be uniquely selected for encoding into data and subsequently used to generate the CBA), and they are quite costly to generate (we need to read and process the entire data set to create the CBA).
Assigned identities that rely on an assigning authority (like the ISBN) require significant organisation and coordination. Each assigning authority must be organised to ensure that each identity is assigned uniquely. And here we run into another question related to the other form of assigned identity, UUIDs.
What do we mean by ‘uniquely assigned’? Suppose I create a file containing the text ‘The quick brown fox jumped over the lazy dog’ and I assign it a UUID. Some time later, you create a file with identical content and assign it a UUID. The two UUIDs will be different. Are we then dealing with two Artefacts?
In one sense we are. They are two distinct incarnations of the file. But in any useful sense, we are not. They are in fact the same file. Just as if I took my original file and copied it to another location in my file system I would consider it the same file. Or if you copied your file to my file system I would be unable to distinguish your file from mine. They are the same ‘thing’, but we have assigned different identities.
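The two views can be put side by side in a short sketch using Python's standard `uuid` and `hashlib` modules. The dictionary layout and the key names `assigned` and `derived` are hypothetical; random (version 4) UUIDs are just one of the possible UUID versions, chosen here as an assumption.

```python
import hashlib
import uuid

text = b"The quick brown fox jumped over the lazy dog"

# I create the file and assign it an identity; later you do the same.
my_file = {
    "assigned": uuid.uuid4(),                    # random, issued at creation time
    "derived": hashlib.sha256(text).hexdigest(),  # computed from the content
}
your_file = {
    "assigned": uuid.uuid4(),
    "derived": hashlib.sha256(text).hexdigest(),
}

assigned_match = my_file["assigned"] == your_file["assigned"]
derived_match = my_file["derived"] == your_file["derived"]
```

The assigned identities differ (`assigned_match` is false) even though the content, and therefore the derived identity, is the same (`derived_match` is true).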
This is the problem with all assigned identities; how to ensure that the ‘thing’ being identified has no identity already assigned.
Is there an answer?
Yes and no. The work currently underway with the semantic web project (quick overview here) offers some hope. There is still a reliance on ‘authority’ to assign identities, but I think there is much to be learned from the general approach and using a combination of derived and assigned identities will allow easy reconciliation of information between CM datasources. I see the problem of configuration management (and particularly identification) as a special case of identification within the semantic web project.
This is a fascinating area of research (well, it is to me) and one that I believe will be increasingly important as CM moves into ‘the cloud’. As the distinction between CM data under your direct control and that controlled by proxy through service providers and third parties is eroded, the need to consolidate CM information across disparate tool sets managed by different agencies will only grow. This is a problem that we must solve.
I spent some time a couple of years back considering Topic Maps and Resource Description Framework (RDF) as possible SCM schema. Although they are functionally equivalent, RDF seems to have won out in the standards battle. If you are interested in seeing where I think information management is going in the 21st century (and where CM must also go), check out the semantic web.

Revision histories with more than one root
March 18, 2010
Most of the time when we deal with revision history we are dealing with a directed acyclic graph with a single root. Most item revision histories develop from a single starting revision, as illustrated below.
If two items belonging to different revision histories are combined we produce a graph with more than one root, as illustrated below where file1.txt revision 3 has been combined (merged) with file2.txt revision 2 to produce file1.txt revision 4 (or arguably we might call it file2.txt revision 3).
This second situation is often overlooked when considering product integrity. Tools often do not track merge operations between two revision histories like this, so the information about the merge is not available for analysis. The difficulty arises when, in our example, file2.txt is developed to revision 3 under the same change as revision 2 (suppose CR1 was only partially tested when the merge took place and a defect was found and resolved in revision 3).

It should be obvious that file1.txt revision 4 contains only part of change CR1; it therefore lacks integrity. Checking this and reporting it to the user is very difficult if the merge operation is not properly tracked.
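If the merge is tracked, the check itself is straightforward. The sketch below assumes a hypothetical representation: a `PARENTS` table recording each revision's parents (including the merge) and a `CHANGE` table recording which revisions belong to CR1. The `name@revision` labels and the function names are invented for illustration.

```python
# Parent links for every revision; the merge gives file1@4 two parents.
PARENTS = {
    "file1@1": [],
    "file1@2": ["file1@1"],
    "file1@3": ["file1@2"],
    "file2@1": [],
    "file2@2": ["file2@1"],
    "file2@3": ["file2@2"],
    "file1@4": ["file1@3", "file2@2"],
}

# Revisions made under each change request (only CR1 is given in the example).
CHANGE = {"file2@2": "CR1", "file2@3": "CR1"}


def ancestry(revision):
    """Return the revision plus everything reachable through parent links."""
    seen = set()
    stack = [revision]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def roots(revision):
    """Starting revisions (no parents) found in the revision's ancestry."""
    return {r for r in ancestry(revision) if not PARENTS[r]}


def partial_changes(revision):
    """Changes with some, but not all, of their revisions in the ancestry."""
    included = ancestry(revision)
    report = {}
    for change in set(CHANGE.values()):
        members = {r for r, c in CHANGE.items() if c == change}
        missing = members - included
        if members & included and missing:
            report[change] = sorted(missing)
    return report
```

With this data, `roots("file1@4")` returns both `file1@1` and `file2@1`, and `partial_changes("file1@4")` reports that CR1 is missing `file2@3`. Remove the `"file2@2"` parent from the merge record and the problem becomes invisible, which is the situation when a tool does not track the merge.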

Items have history
March 17, 2010
As those of you who have been following this blog for any time will know, I am currently looking in some detail at parallel development, specifically how it can be managed safely by non-expert version managers. I have used parallel development with much success on many projects but codifying my knowledge into a tool is proving challenging, and interesting too. In this post I will be considering item histories. Read the rest of this entry »

Preventing ticket ping-pong
February 27, 2010
This is the first opportunity I have had for a while to put something on this blog: busy, busy, busy. (I can tell you that I am building up a fairly sizeable backlog of articles on parallel development and I will, I promise, get round to publishing them soon.) In the meanwhile, here’s a brief thought on designing ticketing systems.
Many ticketing systems, supporting processes such as incident management, are prone to ticket ping-pong where tickets are bounced from queue to queue as people keep forwarding the problem to get it off their queue. This is simple to control once you see the problem. Read the rest of this entry »

In the beginning…
February 18, 2010
…was the definition.
In this article I am going to lay out my definitions for some terminology that will become increasingly important as I develop my CMS model.
The terms I will be discussing are as follows.
- Project
- Product
- Stream
- Branch
- Configuration Item
- Revision
- Delta
- Configuration
- Component
- Repository
- Configuration Management Database
- Record
At this point I caution the reader that these definitions are deliberately quite loose and informal. Each will be expanded, refined, rewritten and formalised as I work through articles in this blog. For now, my working definitions are as follows.
Project
A coordinated effort, usually conducted by several individuals, to deliver a Product. Project describes the totality of planning and activity required to gather requirements and interpret these into Product.
Product
That which is to be delivered by a Project. Products include, but are not limited to:
- Executable software
- Documents — manuals, design documents, requirements, installation guides, administration and maintenance manuals
- Hardware — computers, network components, any other physical components required as part of the Product
- Training materials — exercise versions of data or system components, trainer presentations, training the trainer material, sandbox systems for trainees
- Source code — when developing for a 3rd party. Source code may also be a deliverable in interpreted languages or when delivering web content such as HTML.
- Media — video, graphics, audio
Stream
Projects often consist of more than one piece of development. A common strategy is to manage these pieces of development as a sort of sub-project. Timescales of these Streams are overlapped to allow the project to compress its overall timescale.
Branch
An implementation technique used in development to manage simultaneous changes to common items. In software development Branches are common and used to allow two or more developers to work on the same source code simultaneously.
Configuration Item
A configuration item is an item within the configuration management system that is the focus for change management.
Revision
Each time an item is modified and submitted into version control, a new revision is created. In this way an item’s history can be traced by looking back through the sequence of revisions.
Delta
The difference between two revisions.
Configuration
A specific arrangement of item revisions.
Component
An item that is subject to version control, but is not elevated to the status of a configuration item.
Repository
A safe store for item revisions.
Configuration Management Database
A database containing information about each item revision and their relationships to one another and to records.
Record
A set of data that is subject to a process or workflow but not necessarily version control. Records normally carry information used to account for an item’s current disposition or the current state of a process or workflow.
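These working definitions are loose enough that several data models would fit them. Purely as an illustration, here is one minimal sketch covering Revision, Delta, Repository and Configuration. The class and function names, the use of a unified diff for the Delta, and the dictionary used for the Configuration are all assumptions of mine, not part of the definitions.

```python
import difflib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    """One submitted version of an item."""
    item: str
    number: int
    text: str


@dataclass
class Repository:
    """A safe store for item revisions, keyed by (item, revision number)."""
    revisions: dict = field(default_factory=dict)

    def submit(self, item, text):
        """Each submission of a modified item creates the next revision."""
        number = 1 + max((n for (i, n) in self.revisions if i == item), default=0)
        revision = Revision(item, number, text)
        self.revisions[(item, number)] = revision
        return revision

    def history(self, item):
        """The item's revisions in order, oldest first."""
        return [r for (i, _), r in sorted(self.revisions.items()) if i == item]


def delta(old, new):
    """The difference between two revisions, as unified-diff lines."""
    return list(difflib.unified_diff(
        old.text.splitlines(), new.text.splitlines(),
        f"{old.item}@{old.number}", f"{new.item}@{new.number}", lineterm=""))


repo = Repository()
r1 = repo.submit("poem.txt", "Shall I compair thee\n")
r2 = repo.submit("poem.txt", "Shall I compare thee\n")
repo.submit("notes.txt", "draft\n")

# A Configuration: a specific arrangement of item revisions.
configuration = {"poem.txt": 2, "notes.txt": 1}
```

Calling `delta(r1, r2)` yields the removed and added lines for the spelling correction, and `repo.history("poem.txt")` traces the item back through its sequence of revisions. Records, Configuration Items and the Configuration Management Database are left out of the sketch because their definitions depend on process and workflow details not yet given.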

Fences and Ambulances
February 17, 2010
Suppose you are in charge of a cliff edge. Your task is to maintain the views from the cliff, but keep visitors safe. You can construct fences along the top of the cliff, to stop people falling over, or you can place ambulances at the foot of the cliff, to clear up once someone falls over. Read the rest of this entry »

CMS Tool: High-level architecture
February 11, 2010
Continuing my musings about a universal configuration management tool, I’ve drafted the basic architecture. This is summarised in the following diagram (after the break). Read the rest of this entry »

Parallel development: theory and practice
February 10, 2010
Having spent the past couple of weeks with a client working through the issues that need to be carefully considered when version controlling software, and in particular how to manage and control parallel development, I have come to four conclusions:
- People are often more afraid of the perceived problems than the practical realities of parallel development.
- People do not truly appreciate the problems and practical realities of version control and parallel development.
- There is a need for more theoretical work on the topic and, perhaps more significantly, a need for more formal expression of the process and problems involved.
- There is a quite substantial book that could be written on the subject.



