A Path To Data Integration
In the previous page, addressing the challenge of achieving inter- and intra-system integration was introduced as a necessary step towards an acceptable quality of representation throughout the information systems involved. It was also noted that one key aspect that really undermines the quality of representation is inconsistency. If two or more parts of the data 'ecosystem' choose even slightly differing representations for what they believe* to be the same thing, then there is going to be a potentially costly mismatch between them. Given that data-based systems are required to represent many different things, it is guaranteed that systemic inconsistency will arise. One of the primary mechanisms for creating a means of representation within a computer-based system is to develop a model; a data model... one of the other challenges is to make sure that it is used effectively!
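As a purely illustrative sketch (the systems, field names and units below are hypothetical), here is how two systems might each hold what they believe to be the same pump:

```java
// Two hypothetical systems describing what they believe is the same pump,
// using different identifiers, field names, units and numeric types.
record PumpRecordSystemA(String assetTag, double flowRateLitresPerMinute) {}
record PumpRecordSystemB(String equipmentId, float flowRateGallonsPerMinute) {}

public class RepresentationMismatch {
    public static void main(String[] args) {
        var a = new PumpRecordSystemA("P-101", 454.2);
        var b = new PumpRecordSystemB("PUMP_101", 120.0f);
        // Nothing in the data itself says these refer to the same pump, and the
        // unit difference (litres/min vs US gallons/min) is implicit knowledge
        // that is easily lost in a point-to-point mapping between the two models.
        System.out.println(a + "  vs  " + b);
    }
}
```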
Developing good data models is challenging. Even good data models, where they are employed in information systems today**, are typically limited by their initial requirements and are often fixed in scope or modified during use in ways that undermine their original, internal consistency. They can also be complex. This creates an immediate barrier to interaction with other information systems and typically results in point-to-point integration activities that are 'lossy' (i.e. there is an imperfect match between the mapped, independently created, data models). These losses can be very costly but are often not apparent (or declared) to the end users of the systems. One way around these issues is to introduce an information architecture that the participating systems map into. It should be noted that this does not mean that the systems are centralised; it means that there is a consistent approach to achieving the data quality, including representational consistency, that is needed across the participating systems.
Firstly, the information architecture needs to allow the participating systems to make use of, and stay up to date with, a consistent data model of what they need to interact with. The details will typically be specific to the purpose of the information system at large (e.g. from data consistency within a single team or small organisation to a multi-national initiative involving many thousands of organisations), but all uses will require an architecture that is compatible with the simple view in the following figure:

This architecture indicates that three separate systems are making use of an external, accessible repository of "Master and Reference Data". This phrase encompasses the data model (colloquially called Reference Data in the Magma Core Wiki) and permitted values that are based on the Reference Data (likewise called Master Data). Reference Data covers the structured set of concepts that are allowed to be used in the participating systems. It can be thought of as a library of permitted classes that define what the systems are concerned with, and should rarely (if ever) change once created - although continuous additions are allowed, as will be illustrated later. Master Data contains instances of the data model in which there is a common interest in making shared use. As a minimum requirement, systems A, B and C must use only the Reference and Master Data to which there is shared access when exchanging data with each other (this also implies that querying each other's systems is permitted, but does not imply that access to all data is allowed). The Reference Data need not be co-located - distributed Reference Data Libraries are a good idea if managed well - and Master Data sets can be co-located or distributed as required too. The main point is that the quality of these 'libraries' is key to the point-to-point query and exchange between the participating systems. Each of them may be managed for the benefit of the wider set of systems while also being distributed, for example for resilience and maintainability purposes (all of which requires sufficient quality management). A key point is that these libraries need not be 'owned' by any of the systems that depend upon them, as long as there is agreement around the governance and quality management of what is shared.

As an aside, although it is only a minimum requirement that these systems use the shared data model for data exchange between them, it makes a lot of sense for systems to use this data model internally. This gives them the benefit of the integrity of the shared data model (it is an assumption for the purposes of this paragraph that such integrity exists) and also makes the activity of mapping from their internal data model to the externally shared one, for exchange, trivial and not lossy. The green arrows from the shared Master and Reference Data are unidirectional to indicate that the systems use the repository to acquire the data that specifies the model and shared values, but these interfaces can be dynamically queryable and, importantly, the Master and Reference Data can be added to and extended respectively (again, as part of a suitably managed process).
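A minimal Java sketch of this arrangement, from the point of view of one participating system, might look like the following (the interface and type names are invented here for illustration and are not Magma Core's API):

```java
import java.util.Optional;
import java.util.UUID;

// The shared repository as seen by a participating system: Reference Data supplies
// the permitted classes (the model), Master Data supplies the shared instances.
interface ReferenceDataLibrary {
    Optional<ReferenceClass> findClassByName(String name); // e.g. "ClassOfPump"
}

interface MasterDataStore {
    Optional<MasterRecord> findById(UUID id);               // governed, shared instances
}

record ReferenceClass(UUID id, String name) {}
record MasterRecord(UUID id, UUID referenceClassId, String label) {}

// System A reads (unidirectional green arrow) from the shared repository and only
// exchanges data with B and C in terms of the shared Reference and Master Data.
class SystemA {
    private final ReferenceDataLibrary referenceData;
    private final MasterDataStore masterData;

    SystemA(ReferenceDataLibrary referenceData, MasterDataStore masterData) {
        this.referenceData = referenceData;
        this.masterData = masterData;
    }

    MasterRecord prepareForExchange(UUID sharedId, String expectedClassName) {
        ReferenceClass refClass = referenceData.findClassByName(expectedClassName)
                .orElseThrow(() -> new IllegalArgumentException("Unknown Reference Data class"));
        return masterData.findById(sharedId)
                .filter(m -> m.referenceClassId().equals(refClass.id()))
                .orElseThrow(() -> new IllegalArgumentException("Not shared Master Data of that class"));
    }
}
```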
We can now turn to the assumption that we can have a shared data model that is up to the task of meeting the needs of the participating systems (in fact, to be pedantic, it is the needs of the users of the participating systems that are important; it is an Enterprise challenge, not just a technical one). This is a very tall order. If it is possible, then it could transform the way we use and manage data in future information systems. It is this challenge that Magma Core is focussed on assisting with. You may have spotted one remaining part of the diagram above that hasn't been mentioned yet. It is labelled "TLO" and is the foundation for the Reference Data and all the other data based upon it. This root of the model can be called a Top Level Ontology***. The use of the term "Ontology" implies a construction that is up to the task of handling the concepts that are needed, and "Top Level" indicates an ontology that can deal with (almost) everything; Life, the Universe and Everything (that could possibly be 'in' it). This TLO must also be a data model itself, providing the required roots for consistent use; a data model that is universally normalised, to enable computers (and humans) to employ logic! If you are new to the subject of foundational ontology then a good place to start is the Stanford Encyclopedia of Philosophy: Logic and Ontology. A quick read will illustrate that there is a lot to ontology and logic, much of it hotly debated.
Pulling a lot of conceptual thinking about metaphysics and logic together in a data model is a considerable task... and there isn't just one way to do it. However, there are many more wrong ways to do it than right ways, so the first thing the Magma Core team had to do was pick a suitable TLO. One that has been thoroughly documented is in a book by Dr Matthew West**** called Developing High Quality Data Models (HQDM). The provided links explain some of the background that contributed to the data model in this book. We shall, for the moment, continue with the assumption that this TLO is fit for our purpose.
We then need to construct a means by which we can use the TLO. Dr West's data model can be used as a modelling guide for specific modelling tasks, but another approach, given that we want to support the continuous integration of information, is to adopt it in its entirety in a live information system environment. This way, if our assumption is correct, we have something that can be used as a basis for consistency in an enduring way. The TLO becomes the root model of the data environment, with all other data within our system being absolutely conformant with it. Diagrammatically, the data hierarchy indicated in the first figure can be summarised in the following pyramid:

At the top of the pyramid is the hierarchical root of the model, the founding entity within the TLO. The TLO provides the modelling framework for all the data instances that make up the Reference Data and, as a result, the Reference Data provides the framework for all the data instances that are generated during the information system's operations. All the data is therefore consistent with it. Where to draw the lines that separate the TLO from the Reference Data, and the Reference Data from the Information Operations Data, is just a design/implementation decision. As there is the potential for full consistency from top to bottom, the hierarchy can be sliced many ways. However, the layers shown in the pyramid are a logical way of treating different parts of the data model's 'semantic' and formal structure. The TLO at the top needs to be a stable base from which all the other data representations are constructed. If it doesn't include the foundational parts that really must be used consistently across a wide community, then there is a lower chance of those parts being adhered to, because they end up 'buried' in Reference Data. Conversely, if there is too much in it then it can become unwieldy and hard for a wide community to adopt. This decision was made for us, as we 'just' decided to adopt the data model in Dr West's book as 'our' TLO. As presented in his book, the data model is a conceptual framework for data modelling, but we took the advice in his book and selected some prudent properties (attributes) for the root object, 'thing', so that it could be used as a native data model. An example root attribute is a unique identifier, so that each entity can be uniquely referenced; importantly, these attributes don't add any formal meaning to the structure in the model other than uniqueness.
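To make that concrete, here is a hedged sketch (the class names are illustrative and deliberately simplified, not a reproduction of Magma Core's classes) of a root 'thing' whose only attribute is a unique identifier, with a Reference Data entity extending it:

```java
import java.util.UUID;

// The root of the hierarchy: every entity is a 'thing' with a unique identifier.
// The identifier adds no formal meaning beyond uniqueness.
abstract class Thing {
    private final UUID id;

    protected Thing(UUID id) {
        this.id = id;
    }

    public UUID id() {
        return id;
    }
}

// A Reference Data entity that (ultimately) extends the root, illustrating how all
// lower layers of the pyramid remain conformant with the TLO.
class KindOfFunctionalSystem extends Thing {
    private final String name;

    KindOfFunctionalSystem(UUID id, String name) {
        super(id);
        this.name = name;
    }

    public String name() {
        return name;
    }
}
```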
The Reference Data is the next critical part. As it can be dynamically modified (added to) as information requirements surface, it plays a key, ongoing role. Managing it can, and should, be a continuous process and, being widely accessible, it can be available to all systems that are in operation and allowed to access it (not just systems 'in development'). As mentioned above, the Reference Data can be thought of as being similar to a set of class libraries that allow lower-level data instances to be created as instances or members of those classes (both are required, but class membership is different from being an instance of a class). As we haven't yet introduced how this may be implemented in code, it is not advisable to conclude that the term "class" here is synonymous with an object class in an Object-Oriented Programming language - however, they aren't unrelated! Reference Data won't change frequently - it should be essentially immutable - and can be thought of as the library of data model patterns that allow the data model to be populated as Information Operations require.
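The distinction between being a programming-language instance and being a member of a Reference Data class can be sketched as follows (all names here are hypothetical):

```java
import java.util.Set;
import java.util.UUID;

// A Reference Data 'class' is itself a data instance, and membership of it is an
// asserted relationship - not the same thing as Java-level instantiation.
record RefClass(UUID id, String name) {}            // e.g. "ClassOfPump", held as Reference Data
record Individual(UUID id, Set<UUID> memberOf) {}   // an operational data instance

public class MembershipExample {
    public static void main(String[] args) {
        RefClass classOfPump = new RefClass(UUID.randomUUID(), "ClassOfPump");

        // pump101 is a Java instance of Individual, but it is a member of the
        // Reference Data class only because that relationship has been asserted.
        Individual pump101 = new Individual(UUID.randomUUID(), Set.of(classOfPump.id()));

        System.out.println("Member of " + classOfPump.name() + "? "
                + pump101.memberOf().contains(classOfPump.id()));
    }
}
```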
Information Operations (IO) result in the population of the data model 'within' the participating systems. This is where the high-volume data is. The requirements for handling, storing or querying this data may result in it being held in dedicated data stores that are optimised for the performance task at hand, but the key point is that, in order to generate and process these data instances, there must be full conformance with the TLO via the shared Reference Data (even if a mapping is used for optimisation purposes). This is a key point. Magma Core provides the facility to use the data model (HQDM) in a strictly compliant form, but storage systems can be mapped from this to enable efficient storage and processing where required, as long as representational consistency is not compromised (or, where it is, the loss is accepted).
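As a hedged illustration of that kind of mapping (the types below are invented; Magma Core's actual storage arrangements are not shown), a conformant data instance can be flattened into a store-friendly shape and recovered without loss:

```java
import java.util.UUID;

// A conformant, Reference-Data-aligned observation and a flat, storage-optimised row.
// The mapping is reversible, so representational consistency is preserved.
record ConformantObservation(UUID id, UUID memberOfKind, long timestampMillis, double value) {}

record StorageRow(String id, String kindId, long ts, double v) {}

final class ObservationMapper {
    static StorageRow toStorage(ConformantObservation o) {
        return new StorageRow(o.id().toString(), o.memberOfKind().toString(),
                o.timestampMillis(), o.value());
    }

    static ConformantObservation fromStorage(StorageRow r) {
        return new ConformantObservation(UUID.fromString(r.id()),
                UUID.fromString(r.kindId()), r.ts(), r.v());
    }
}
```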
Finally, you may have noticed that Master Data has not been included in the data pyramid. This is because it can be considered an important subset of IO Data that needs to be accessible to many/all participating systems. In addition, there is nothing to stop Master Data being held alongside Reference Data, but Master Data can't be created without the patterns of classes declared in the Reference Data from which it is populated.
This overview has indicated an approach, based on full adoption of a Top Level Data Model (formed around a TLO), that can enable representational consistency across multiple systems. The next step is to look at our experimental approach to implementing it.
Next: Introducing Magma Core
* "Belief", 'knowing' and justifications for believing or knowing is a big topic in itself (the theory of which is often called epistemology). It is highly relevant to information systems as they increasingly represent more and more about the world. There is scope to expand on this in the Magma Core Wiki in the future but for the moment it focusses on the representational foundation that can provide a basis for epistemic considerations.
** There has been a trend for a decade or more not to use overt data models in many new systems. This can appear to allow greater flexibility, but there is a severe cost in the loss, or absence, of the meaningful records that would be required to make the information contained in those systems useable and manageable.
*** A cautionary remark here is that the ontological considerations are bound to metaphysics and logic and, specifically, do not rely on linguistic descriptions for the root of the model.
**** Dr. West was not involved in the development of Magma Core.