Data quality: How do you quantify yours?

Being able to measure the quality of your data is vital to the success of any data management programme. Here, Peter Eales, Chairman of KOIOS Master Data, explores how you can define what data quality means to your organization, and how you can quantify the quality of your dataset.

In the business world today, it is important to provide evidence of what we do, so let me pose this question to you: how do you currently quantify the quality of your data?

If you have recently undertaken an outsourced data cleansing project, it is quite likely that you underestimated the internal resource that it takes to check this data when you are preparing to onboard it. Whether that data is presented to you in the form of a load file, or viewed in the data cleansing software the outsourced party used, you are faced with thousands of records to check the quality of. How did you do that? Did you start by using statistical sampling? Did you randomly check some records in each category? Either way, what were you checking for? Were you just scanning to see if it looked right?

The answer to these questions lies in understanding what, in your organization, constitutes good quality data, and then understanding what that means in ways that can be measured efficiently and effectively.

The Greek philosophers Aristotle and Plato captured and shaped many of the ideas we have adopted today for managing data quality. Plato’s Theory of Forms tells us that whilst we have never seen a perfectly straight line, we know what one would look like, whilst Aristotle’s Categories showed us the value of categorising the world around us. In the modern world of data quality management, we know what good data should look like, and we categorise our data in order to help us break down the larger datasets into manageable groups.

In order to quantify the quality of the data, you need to understand, then define the properties (attributes or characteristics) of the data you plan to measure. Data quality properties are frequently termed “dimensions”. Many organizations have set out what they regard as the key data quality dimensions, and there are plenty of scholarly and business articles on the subject. Two of the most commonly attributed sources for lists of dimensions are DAMA International, and ISO, in the international standard ISO 25012.

There are a number of published books on the subject of data quality. In her seminal work Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information™ (Morgan Kaufmann, 2008), Danette McGilvray emphasises the importance of understanding what these dimensions are and how to use them in the context of executing data quality projects. A key callout in the book emphasises this concept.

“A data quality dimension is a characteristic, aspect, or feature of data. Data quality dimensions provide a way to classify information and data quality needs. Dimensions are used to define, measure, improve, and manage the quality of data and information.
The data quality dimensions in The Ten Steps methodology are categorized roughly by the techniques or approach used to assess each dimension. This helps to better scope and plan a project by providing input when estimating the time, money, tools, and human resources needed to do the data quality work.

Differentiating the data quality dimensions in this way helps to:
1) match dimensions to business needs and data quality issues;
2) prioritize which dimensions to assess and in which order;
3) understand what you will (and will not) learn from assessing each data quality dimension; and
4) better define and manage the sequence of activities in your project plan within time and resource constraints”.

Laura Sebastian-Coleman, in her work Measuring Data Quality for Ongoing Improvement (2013), sums up the use of dimensions as follows:

“if a quality is a distinctive attribute or characteristic possessed by someone or something, then a data quality dimension is a general, measurable category for a distinctive characteristic (quality) possessed by data.

Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. They allow us to understand quality in relation to a scale or different scales whose relation is defined. A set of data quality dimensions can be used to define expectations (the standard against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset”.

Tim King and Julian Schwarzenbach, in their work Managing Data Quality – A practical guide (2020), include a short section on data characteristics. It reminds readers that the choice of dimensions depends on the perspective of the user, which takes us back to Plato and his Theory of Forms, from where the phrase “beauty lies in the eye of the beholder” is derived. According to King and Schwarzenbach, quoting DAMA UK (2013), the six most common dimensions to consider are:

  • Accuracy
  • Completeness
  • Consistency
  • Validity
  • Timeliness
  • Uniqueness
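As a minimal sketch of how some of these dimensions might be turned into numbers, the Python below scores four of them (completeness, uniqueness, validity and timeliness) as simple ratios over a small set of records. The records, field names, allowed unit codes and the 365-day freshness threshold are all hypothetical, and the formulas are one common way of scoring each dimension rather than a definition taken from DAMA UK or ISO. Accuracy and consistency are left out because they need a trusted reference source to compare against.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": "A1", "description": "Bearing, ball, 6204", "unit": "EA", "updated": date(2020, 5, 1)},
    {"id": "A2", "description": "", "unit": "EA", "updated": date(2020, 4, 12)},
    {"id": "A2", "description": "Coupling, flexible", "unit": "each", "updated": date(2017, 1, 3)},
    {"id": "A4", "description": None, "unit": "EA", "updated": date(2020, 2, 20)},
]

ALLOWED_UNITS = {"EA", "KG", "M"}  # assumed list of valid unit codes


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, field):
    """Number of distinct values for the field divided by the number of rows."""
    return len({r[field] for r in rows}) / len(rows)


def validity(rows, field, allowed):
    """Share of rows whose value appears in the allowed set."""
    return sum(1 for r in rows if r[field] in allowed) / len(rows)


def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return fresh / len(rows)


scores = {
    "completeness(description)": completeness(records, "description"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(unit)": validity(records, "unit", ALLOWED_UNITS),
    "timeliness(updated)": timeliness(records, "updated", date(2020, 6, 1), 365),
}

for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Scores like these only become meaningful once your organization has agreed which fields matter and what threshold counts as acceptable for each dimension.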

The book also offers a timely reminder that international standard ISO 8000-8 is an important standard to reference when looking at how to measure data quality. ISO 8000-8 describes fundamental concepts of information and data quality, and how these concepts apply to quality management processes and quality management systems. The standard specifies prerequisites for measuring information and data quality and identifies three types of data quality: syntactic; semantic; and pragmatic. Measuring syntactic and semantic quality is performed through a verification process, while measuring pragmatic quality is performed through a validation process.

In summary, there is plenty of resource out there that can help you with understanding how to measure the quality of your data, and at KOIOS Master Data, we are experts in this field. Give us a call and find out how we can help you.

Contact us

+44 (0)23 9387 7599

info@koiosmasterdata.com

About the author

Peter Eales is a subject matter expert on MRO (maintenance, repair, and operations) material management and industrial data quality. Peter is an experienced consultant, trainer, writer, and speaker on these subjects. Peter is recognised by BSI and ISO as an expert in the subject of industrial data. Peter is a member of ISO/TC 184/SC 4/WG 13, the ISO standards development committee that develops standards for industrial data and industrial interfaces, ISO 8000, ISO 29002, and ISO 22745. Peter is the project leader for edition 2 of ISO 29002, due to be published in late 2020. Peter is also a committee member of ISO/TC 184/WG 6, which published the standard for asset intensive industry interoperability, ISO 18101.

Peter has previously held positions as the global technical authority for materials management at a global EPC, and as the global subject matter expert for master data at a major oil and gas owner/operator. Peter is currently chief executive of MRO Insyte, and chairman of KOIOS Master Data.

KOIOS Master Data is a world-leading cloud MDM solution enabling ISO 8000 compliant data exchange

International trade and counterfeiting challenges: a new digital solution that will traverse the borders – Part 2

Part 2 – Introducing K:blok – the digital solution to international trade and counterfeit challenges

Introduction

In February 2019, we (KOIOS Master Data) embarked on a successful year-long research and development project focusing on “Using ISO 8000 Authoritative Identifiers and machine-readable data to address international trade and counterfeiting challenges”. This project was funded by Innovate UK, part of UK Research and Innovation. ISO 8000 is the international standard for data quality.

Part one of this article explains the challenges HMRC and UK PLC face due to counterfeiting and misclassification when importing into the UK, and outlines the digital solution to those challenges on the basis of which we won our Innovate UK grant.

This part of the article (part two) outlines the development progress made towards building a digital solution, how machine learning and natural language processing techniques were used during the year-long project and how the project can move forward.

K:blok – technology to traverse borders

To tackle the challenges outlined in part one, we developed a new software product, K:blok.

K:blok is a cloud application that allows importers to create a digital contract between the parties involved in the cross border movement of goods from the manufacturer to the importer/buyer. These parties can include: manufacturers, shippers, freighters, insurers and lawyers, amongst others.

The contract brings together, in a single source, the various pieces of data that are required to import a product into the UK successfully and efficiently, including data that is not currently captured in any software system:

  • ISO 8000 compliant, machine readable, multilingual product descriptions produced by the manufacturer of the products;
  • ISO 8000 compliant Authoritative Legal Entity Identifiers (ALEIs) for each organisation that participates in the trade;
  • Accurate commodity codes for each product, the quantity of products, serial numbers and anti-counterfeit information (only visible to the manufacturer, the buyer and HMRC) to help validate the authenticity of the product;
  • Trade specific information required for insurance and accountability, for example: the trade incoterm;
  • Licensing and trading information about the parties in the contract, for example: Economic Operators Registration and Identification (EORI) number;
  • Information regarding the route the product is taking, for example: the port of import into the UK, port of export from the original country of export, vessel/aircraft numbers and locations of the change of custody of the consignments.

The contract is digital, machine readable, can be exchanged without loss of meaning and is suitable for interoperating with distributed ledger technology, like blockchain.
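To make the idea of a machine-readable contract more concrete, here is a minimal sketch of how the data listed above might be held in a single structure and serialised to JSON. This is not the K:blok data model: the class names, field names and placeholder values are all hypothetical, and a real contract would carry many more of the items described above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Party:
    role: str        # e.g. "manufacturer", "buyer", "shipper", "insurer"
    alei: str        # Authoritative Legal Entity Identifier (placeholder values below)
    eori: str = ""   # EORI number, where applicable


@dataclass
class LineItem:
    description: str
    commodity_code: str
    quantity: int
    serial_numbers: List[str] = field(default_factory=list)


@dataclass
class TradeContract:
    incoterm: str
    port_of_export: str
    port_of_import: str
    parties: List[Party] = field(default_factory=list)
    items: List[LineItem] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the whole contract so it can be exchanged between systems."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def problems(self) -> List[str]:
        """List gaps that would be worth resolving before the goods move."""
        found = []
        if not any(p.role == "manufacturer" for p in self.parties):
            found.append("no manufacturer listed")
        for item in self.items:
            if not item.commodity_code:
                found.append(f"no commodity code for: {item.description}")
        return found


# Placeholder identifiers and codes; none of these are real values.
contract = TradeContract(
    incoterm="FOB",
    port_of_export="PORT-OF-EXPORT-PLACEHOLDER",
    port_of_import="PORT-OF-IMPORT-PLACEHOLDER",
    parties=[
        Party(role="manufacturer", alei="ALEI-PLACEHOLDER-1"),
        Party(role="buyer", alei="ALEI-PLACEHOLDER-2", eori="EORI-PLACEHOLDER"),
    ],
    items=[
        LineItem("Deep groove ball bearing", "CODE-PLACEHOLDER", 100),
        LineItem("Flexible shaft coupling", "", 20),
    ],
)

print(contract.problems())
print(contract.to_json())
```

Because every field is captured before the goods move, a simple check such as `problems()` can flag missing data at the point where it is still cheap to fix.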

This data can be accessed and used by any of the participants of the contract and analysed by HMRC. All of this data is captured before the goods are moved which, in turn, provides an intelligence layer and pre-arrival data on goods for HMRC analytics, to enable resources to be targeted at consignments deemed high risk.

This single source of data also provides buyers with an audit trail for their purchased products that begins with the original manufacturer. This trail assists with authenticating the product received and can form the basis of an efficient global trusted trader scheme.

Natural language processing will help avoid misclassification

As discussed in part one, misclassification leads to the UK losing billions in tax revenue. Misclassification is both intentional and unintentional. Reducing the unintentional misclassification could save the UK millions in tax revenue.

There is a fundamental flaw in the current process of tariff code assignment. The party that currently assigns the tariff code is not usually the manufacturer of the product. Therefore, the party does not have the technical knowledge to classify the product correctly. This party also rarely has a full description of the product and resorts to using a basic description from an invoice to assign the code.

Currently, HMRC provides an online lookup and email service to enable UK businesses to assign the correct tariff code. However, there are concerns that the service is not time efficient. This concern will only get worse as more companies may have to classify their goods once the UK leaves the European Union (EU).

Therefore, as part of our project, we worked with two students from the University of Southampton, studying Computer Science with Machine Learning, to create an additional application programming interface (API) that links with the government tariff code API and uses natural language processing techniques to score a similarity between an input product description and the potential mapping to the correct tariff code.

This is accessible by manufacturers using the KOIOS software to link their ISO 8000 compliant product specifications to the correct commodity code for trading with the UK.

Techniques such as term frequency-inverse document frequency (tf-idf) and K-means were integrated into this API. Support Vector Machine (SVM), Random Forest and a Deep Neural Network (2 layers) have also been explored to improve the accuracy of the algorithm.
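For readers unfamiliar with tf-idf, the sketch below shows the general idea of scoring the similarity between a product description and candidate tariff headings. It is not the API built during the project: it uses only the Python standard library, a smoothed idf formula chosen for illustration, a crude plural-stripping step instead of a proper stemmer, and three made-up headings with placeholder codes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text, split it into words and strip simple plurals."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Crude plural stripping so that "bearings" matches "bearing";
    # a real system would use a proper stemmer or lemmatiser.
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (count / len(tokens)) * idf[t] for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(description, headings):
    """Rank candidate headings by similarity to the product description."""
    codes = list(headings)
    vectors = tfidf_vectors([headings[c] for c in codes] + [description])
    query = vectors[-1]
    scored = [(code, cosine(query, vec)) for code, vec in zip(codes, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical headings; the codes and wording are placeholders, not real tariff entries.
headings = {
    "CODE-A": "ball bearings",
    "CODE-B": "tapered roller bearings including cone and tapered roller assemblies",
    "CODE-C": "clutches and shaft couplings including universal joints",
}

ranking = rank("deep groove ball bearing, single row", headings)
for code, score in ranking:
    print(f"{code}: {score:.3f}")
```

The richer the input description, the more terms there are to match against, which is one reason a full manufacturer specification is likely to classify better than a short invoice description.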

The API successfully improves on the searching capabilities of the government online lookup service within the product areas explored in this project – which were bearings and couplings.

KOIOS are uniquely positioned to continue the development of digital solutions for the UK PLC

Our Innovate UK project provides a foundation to achieve more efficient, cost-effective, cross border trading and to reduce counterfeit activities. We believe that data standards, including ISO 8000, can play a huge part in digitising and automating this process further.

We are ideally suited and uniquely positioned to continue the research and development of both the K:blok platform and the machine learning tariff classifier.

We also believe there is an opportunity to digitise the outdated, human readable tariff classification into a digital classification, using the international standards ISO 22745 and ISO 29002. These data standards sit at the core of all of the products in the KOIOS Software Suite. A digital version of the tariff classification will improve the accuracy, speed and reliability of computer automation.

Join us in our vision

Our successful Innovate UK project was a step in the right direction to improving international trade and reducing counterfeiting. Brexit also provides a great opportunity for the UK to become a world leader in using technology across borders and to set the standard for countries to follow.

In the coming months, we will continue to engage with the UK Government/HMRC and continue to look for opportunities to fund our research and development.

If you think that you can add value to this project and would like to explore how we could collaborate then please get in touch at info@koiosmasterdata.com

Contact us

+44 (0)23 9387 7599

info@koiosmasterdata.com

International trade and counterfeiting challenges: a new digital solution that will traverse the borders – Part 1

Part 1 – The cost of counterfeit goods and misclassification to the UK

Introduction

In February 2019, we (KOIOS Master Data) embarked on a successful year-long research and development project focusing on “Using ISO 8000 Authoritative Identifiers and machine-readable data to address international trade and counterfeiting challenges”. This project was funded by Innovate UK, part of UK Research and Innovation. ISO 8000 is the international standard for data quality.

This part of the article (part one) explains the impact that counterfeit goods and misclassification of products have on UK PLC, and the proposed solution that won us the Innovate UK grant.

A GBP 11 billion impact: and poor data exchange is the root of the problem

Counterfeit products and misclassification of products, when importing into the UK, cause major challenges for commercial organisations and the economy in the UK. These challenges increase a business’s exposure to risk, including risks to consumer health, safety and well-being.

The impact of global counterfeiting on the UK economy is increasing. The Organisation for Economic Co-operation and Development (OECD) states that forgone sales for UK companies due to infringement of their intellectual property (IP) rights in global trade amounted to GBP 11 billion and at least 86,300 jobs were lost due to counterfeiting and piracy in 2019.

Protection from counterfeiting could save some organisations thousands of pounds: for example, Greek customs seized 17,000 bearings, purporting to be from SKF, worth €1m in a single anti-counterfeiting operation.

When importing into the UK, importers are required to declare a commodity code for the products being imported. The commodity code is used to collect duty and VAT and dictates the restrictions and regulations, including the requirement for licensing, when importing or exporting the product.

Often the importer of the product does not have the technical knowledge to classify the product correctly. This, in combination with the complexity of the tariff code system currently adopted by the European Union (EU), and subsequently by the UK, causes many cases of misclassification. These cases are both intentional and unintentional.

Misclassification causes incorrect duty and VAT ratings to be applied to companies importing products, and also distorts trade statistics. Fraudulent misclassification leads to the UK losing billions in tax revenue.

Importers currently make customs declarations using the Customs Handling of Import and Export Freight (CHIEF) system, with some importers transitioning to the newer Customs Declaration Service (CDS).

The current importing process is not stringent enough and information is declared too late in the process. This results in a lack of transparency of the origin of products and a lack of quality data supporting the import and trade.

Therefore, Customs have the near impossible task of identifying and intercepting counterfeit or misclassified products. Customs activities increase spending by the UK Government on customs checks and delay trading activities.

International trade and counterfeit challenges: there is a digital solution

We believe that the challenges facing HMRC and the organisations that suffer from counterfeit goods can be solved with a stringent digital solution. A digital solution that captures:

  • A quality description of the products in a consignment;
  • The regulatory/licensing requirements on the products and the importer – for example: the commodity code of the product and the Economic Operators Registration and Identification (EORI) number; and
  • The parties involved in the trade – for example: manufacturers, shippers, freighters, insurers and lawyers, amongst others,

in a timely manner (pre-arrival to the UK border). This assured, single source of data can then be used by all parties in the supply chain, including HMRC and border forces. HMRC will then be able to use this trusted data to better target resources on more risky consignments and the platform can be a requirement for inclusion in a trusted trader programme.

This digital solution can be taken further so that the importer using the platform can establish a purchase order with the seller.

We also believe that we can help to reduce the misclassification of products by:

  1. Putting the responsibility of classifying the product on the manufacturer of the product, rather than the importer; and
  2. Assisting the manufacturers with classifying the product by using ISO 8000 compliant, machine readable product specifications and machine learning techniques to search the current human readable tariff classification.

Without a digital data solution for automating tariff code assignment and establishing the provenance of products, no significant improvements to the current state of play can be achieved.

The proposed solution would enable HMRC to:

  • reduce administration;
  • eliminate errors;
  • restrict growing levels of fraud in the digital economy; and
  • target resources effectively through collecting pre-arrival data on goods.

This proposal formed the fundamental basis of our successful Innovate UK grant application.

The next part of this article outlines the development progress made towards building a digital solution, how machine learning and natural language processing techniques were used during the year-long project and how the project can move forward.

Contact us

If you think that you can add value to this project and would like to explore how we could collaborate then please get in touch.  

+44 (0)23 9387 7599

info@koiosmasterdata.com

Whole of life visualisation of master data for engineering entities

I am grateful to Peter Eales for his invitation to contribute this Blog, for the interest of the wider Master Data community.

David Rew MChir FRCS

Consultant Surgeon, University Hospital Southampton

Over the past ten years, I have led a small development team at University Hospital Southampton NHS Foundation Trust in the creation and implementation of a radical approach to the Electronic Patient Record interface. Our product, UHS Lifelines [1], is a prime exemplar of the Stacked, Synchronised Timeline and Iconographic (SSTI) structure. It is live and in daily use across the hospital for some 2.5M individual patient record sets.

This system has evolved from the Lifelines project work of the Human Computer Interaction Laboratory of the University of Maryland in the mid 1990s [2], which was not further developed into practical applications at that time.

The UHS Lifelines system permits the display, navigation and interrogation of all electronic records (documents, reports and events) for any one patient on a single computer screen, where the X axis displays continuously incremental time, and the Y axis displays the subject taxonomy for which e-data is available for the individual patient. Each icon acts as a dynamic window to the underlying document or cluster of documents, reports and events.
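The layout described above can be sketched as a simple grid, with subject categories as rows and time buckets as columns. The Python below is only an illustration of that stacked, synchronised timeline idea and not the UHS Lifelines implementation: the categories, dates and titles are invented, yearly buckets are an assumption, and a text table stands in for the icons.

```python
from collections import defaultdict
from datetime import date

# Hypothetical record events: (category, date, title). Values are illustrative only.
events = [
    ("Imaging", date(2015, 3, 2), "Report 1"),
    ("Pathology", date(2016, 7, 19), "Report 2"),
    ("Letters", date(2018, 1, 5), "Letter 1"),
    ("Letters", date(2018, 11, 23), "Letter 2"),
    ("Pathology", date(2019, 6, 30), "Report 3"),
]


def build_grid(events):
    """Group events into {category: {year: [titles]}} for a stacked timeline."""
    grid = defaultdict(lambda: defaultdict(list))
    for category, when, title in events:
        grid[category][when.year].append(title)
    return grid


def render(grid, first_year, last_year):
    """Return text lines: one header row of years, then one row per category."""
    years = range(first_year, last_year + 1)
    width = max(len(category) for category in grid)
    lines = [" " * width + " " + " ".join(str(y) for y in years)]
    for category in sorted(grid):
        cells = []
        for y in years:
            count = len(grid[category].get(y, []))
            # Each cell stands in for an icon: the number of documents, or "." for none.
            cells.append((str(count) if count else ".").center(4))
        lines.append(category.ljust(width) + " " + " ".join(cells))
    return lines


grid = build_grid(events)
lines = render(grid, 2015, 2019)
print("\n".join(lines))
```

Because every row shares the same time axis, a reader can see at a glance which categories were active in which period, and each cell can then be opened to reveal the documents behind it.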

UHS Lifelines sits efficiently over the data sets, which may be sourced from a number of different packages and software systems, so as to load the interface in real time.

The image is a live screenshot in an outpatient clinic of an older lady who has multiple co-morbidities and who has needed multiple hospital admissions. Each icon opens a frame to display the underlying content.

The purpose of dynamic data visualisation

Data exists to help human beings to make decisions in the real world. A well designed data visualisation system will minimise the time and effort needed to reconstruct the story from the data, while maximising the quality of the decision from an overview of the entire data set. This fits the mantra of Emeritus Professor Ben Shneiderman of the University of Maryland, who mandated that any data visualisation system should start with a global Overview of the data, while allowing the observer to Zoom In on any point of interest, Filter Out extraneous material, and secure Details on Demand [3]. 

The core purpose of UHS Lifelines is therefore clear. It is to help the clinician to make better clinical decisions in less time, at least risk and with less fatigue, than would be possible with any other format of electronic information retrieval and display. An excellent data visualisation system efficiently conveys primary information through visual imagery in the form of patterns, colours, shapes and relative positions. This format directly addresses the fast processing powers of the visual cortex in the brain, before a word is read.

Paradoxically, the format of UHS Lifelines also (in effect) abolishes time. A document of any age can be located and opened at the same speed as a contemporary document, without the need to search through back catalogues, or across multiple screens, windows, tabs and lists, to find historic records.

The wider potential of the SSTI (Lifelines) interface

It will be immediately apparent that UHS Lifelines is the first-in-class exemplar of a far wider range of applications of the SSTI concept, including many in Master Data Management. The concept of whole of life visualisation applies as readily to an engineered entity as to a human life. Indeed, many engineered systems have life spans which are as long as, or longer than, the longest human life.

This poses many challenges in master data management as the life of the entity evolves from the initial concept, through the technical drawing and test phases, to the engineered finished product, be it a ship, aircraft, railway network or oil and gas pipeline system.

Each product may in turn contain thousands or millions of individual components, each with their own critical life histories from conception to final immolation.

Each element of any engineered product will thus be associated with a huge volume of data in many different subject fields, which include:

  • Original drawings and design elements;
  • Test data for the component by itself and within the engineered system;
  • Ownership of the designs, of the components and of the complete system;
  • Manufacturing, including the prime assembler and one or many part suppliers;
  • The legislative environment in each jurisdiction in which the system and the components are used;
  • Wear and tear characteristics, signs of failure and actual failure;
  • “Off-label use” of components and systems outwith the original intent and specifications, and so on…

It is self-evident that anyone who comes late to the party will often face huge problems in understanding all aspects of the system which they inherit for ownership, maintenance and/or replacement purposes. The documentation and the costs associated with information management around complex systems can be enormous [4].

The questions therefore arise as to:

  • Whether a data model which resembles the UHS Lifelines model would establish a valuable role across the engineered systems universe;
  • Whether such a model would be a natural home for structured master data across the whole life of the component or system;
  • Whether such a data model could be standardised across the relevant industries, such that data and metadata were portable and transferrable across standard interfaces and taxonomies for the whole life of the entity;
  • Whether the costs of researching and implementing such a system could be estimated and matched to the projected benefits at scale;
  • Whether existing organisations, and in particular the International Organization for Standardization (ISO), could be co-opted into participation and partnership in such a system;
  • Whether and how suitable and persuasive exemplars can be developed at pace and scale;
  • How and where the core data would be stored; and
  • Whether the application of a mandated “whole of life” core data set and appropriate data visualisation systems could progressively be applied to all engineered components and systems, for the benefit of owners, users and maintainers throughout the life of engineered systems and components.

In summary, the Stacked, Synchronised Timeline and Iconographic (SSTI) structure, as exemplified by UHS Lifelines, would appear to offer significant potential for further development as a tool for the visualisation, interrogation and whole-of-life Master Data Management of engineered components and entities. The challenge is how to get this message out across the engineering community, and to develop practical applications.

I hope to return to these matters in future blog posts as the story evolves.

References

[1] Hales A, Cable D, Crossley E, Finlay C and Rew DA. The Design and Implementation of the Stacked, Synchronised and Iconographic Timeline Structured Electronic Patient Record in a UK NHS Global Digital Exemplar Hospital. BMJ Health & Care Informatics 2019 – https://informatics.bmj.com/content/26/1/e100025

[2] Plaisant C, Shneiderman B and Mushlin R. An Information Architecture to Support the Visualization of Personal Histories. Information Processing & Management 1998; 34(5): 581-597.

[3] Shneiderman B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Proceedings IEEE Symposium on Visual Languages 1996; 336-343.

[4] Rogoway T. Boeing Is Being Paid $84 Million Just For Manuals For New Air Force One Jet – The War Zone website, April 15th 2020 – https://www.thedrive.com/the-war-zone/33034/just-manuals-from-boeing-for-new-air-force-one-jets-cost-a-whopping-84-million

About the author

David Rew is a Consultant General Surgeon at University Hospital Southampton NHS Foundation Trust, where he also leads the digital innovation team on the UHS Lifelines project.

From 2017 to 2019, he served with the Strategic Advisory Team for Healthcare Technologies of the UK Engineering and Physical Sciences Research Council.

If you are interested in the research and topics covered in this article and would like to discuss further then please contact D.Rew@soton.ac.uk