
    Data quality dimensions: Accuracy

    There was an interesting article published recently by data quality consultants DPA on the need to assess data quality from an organisational perspective.

    Namely, when assessing data quality, consider not just the data values, but also the organisation's requirements for the data and the “data subject”, i.e. the “thing” that the data represents. If your assessment is not approached in this multi-faceted way, there is a danger that it will miss one or more of the key data quality dimensions.

    The accuracy dimension

    Accuracy is at once the most important of the dimensions and the most difficult to assess using data quality profiling and assessment tools. The reason is that to assess accuracy it's often necessary to refer back to the data subject itself. For example, to assess the accuracy of an asset's attribute data, such as a property, it may be necessary to survey it; in the case of a person, to interview them. These processes can be time consuming and expensive to undertake, and as such leave the data you hold exposed to the risk of deteriorating accuracy over time.

    However, there are strategies that can be adopted to help reduce the frequency or scale of such activities and increase confidence in the accuracy of your data.

    Definition:

    The degree to which data correctly describes the “real world” object or event being described.

    Measure:

        The degree to which the data mirrors the characteristics of the real-world object or objects it represents.

    Authoritative sources of data

    Confidence in accuracy can be improved if the data subject's attributes are validated against a trusted, authoritative data source. Where such data can be sourced, it can be used to augment your data records for accuracy assessment. For example, holding a postcode for an address allows a cross check to determine whether the postcode is active and whether the other address lines, such as street, town and county, are accurate.
    Another example comes from B2B marketing. Holding a registered company number in your data set allows a cross check against Companies House data to determine the accuracy of the company name, registered address, status and other attributes you may hold.
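    As an illustration of this kind of cross check, the sketch below joins address records to a hypothetical authoritative postcode reference file and flags postcodes that are unknown or inactive, and towns that disagree with the reference. The file names and columns (addresses.csv, postcode_reference.csv, postcode, town, is_active) are assumptions made for the example, not part of infoboss or any particular data product.

```python
import pandas as pd

# Hypothetical inputs: our own address records and an authoritative postcode reference.
addresses = pd.read_csv("addresses.csv")           # columns: record_id, postcode, town
reference = pd.read_csv("postcode_reference.csv")  # columns: postcode, town, is_active

# Left join so every source record is retained, matched or not.
checked = addresses.merge(
    reference, on="postcode", how="left", suffixes=("", "_ref"), indicator=True
)

# Flag the accuracy problems described above.
checked["postcode_unknown"] = checked["_merge"] == "left_only"
checked["postcode_inactive"] = checked["is_active"].eq(False)
checked["town_mismatch"] = (
    checked["_merge"].eq("both")
    & checked["town"].str.casefold().ne(checked["town_ref"].str.casefold())
)

# Records needing follow-up with the data subject or an address survey.
flags = ["postcode_unknown", "postcode_inactive", "town_mismatch"]
suspect = checked[checked[flags].any(axis=1)]
print(f"{len(suspect)} of {len(addresses)} records flagged for accuracy review")
```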

    Case study: Customer gender accuracy

    Infoboss is built on search engine technology, which means it can do certain things at a scale that database technology cannot. We were recently asked to determine whether a person's first name and gender, as stored in the client's CRM database, were valid and accurate. The validity rules for the first name were simple: it must be at least two characters long, may be hyphenated, and must be capitalised. Gender was stored in the database as M or F.
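    As a rough sketch of how those validity rules might be expressed in code (the rules come from the case study; the regex and function names are ours and deliberately simplistic):

```python
import re

# First name: starts with a capital, at least 2 characters overall,
# optionally followed by hyphenated parts that also start with a capital.
FIRST_NAME_PATTERN = re.compile(r"^[A-Z][a-z]+(?:-[A-Z][a-z]+)*$")
VALID_GENDERS = {"M", "F"}

def is_valid_first_name(name: str) -> bool:
    return len(name) >= 2 and FIRST_NAME_PATTERN.fullmatch(name) is not None

def is_valid_gender(gender: str) -> bool:
    return gender in VALID_GENDERS

print(is_valid_first_name("Anne-Marie"))  # True
print(is_valid_first_name("jo"))          # False: not capitalised
```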

    To assess accuracy, we curated a list of new birth registration names and their associated gender (source: ONS) for the past five years. The resulting list of circa 15,000 names was de-duplicated, and names like “Alex” that can be either male or female were tagged.

    Once we had our first name candidates, it was possible to augment the CRM data set with the reference gender and each name's frequency of occurrence. Every record then held its original name and gender alongside the reference name and, where a match existed, the reference gender, the name frequency and an indicator of whether the name could be either male or female. Using infoboss analysis tools we were able to quickly identify (a rough sketch of this matching step follows the list below):

    • Names with no match – possibly inaccurate/require independent assessment
    • Names with inaccurate gender
      • Subset of these records where they could be either gender
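
    A minimal sketch of that matching and classification step, assuming the CRM extract and the curated ONS reference list are available as CSV files (the file and column names below are illustrative assumptions, not infoboss functionality):

```python
import pandas as pd

# Hypothetical inputs for illustration only.
crm = pd.read_csv("crm_people.csv")        # columns: person_id, first_name, gender (M/F)
ons = pd.read_csv("ons_first_names.csv")   # columns: first_name, ref_gender, frequency, unisex

# Augment each CRM record with the reference attributes, keeping unmatched rows.
merged = crm.merge(ons, on="first_name", how="left", indicator=True)

no_match = merged[merged["_merge"] == "left_only"]
matched = merged[merged["_merge"] == "both"]

# Gender disagreements, and the subset where the name can be either gender.
gender_mismatch = matched[matched["gender"] != matched["ref_gender"]]
either_gender = gender_mismatch[gender_mismatch["unisex"]]

print(f"No match (review independently): {len(no_match)}")
print(f"Gender mismatch:                 {len(gender_mismatch)}")
print(f"  of which could be either:      {len(either_gender)}")
```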

    This process was not perfect, but it helped the client: it reduced the number of records they needed to accuracy-check with the data subject from a potential 115,000 to just over 4,000. In short, incorporating authoritative data sources into your data quality assessments can help build confidence in the accuracy of your data and potentially reduce the cost of accuracy assessment.

    Accuracy sampling

    It is often useful to generate random samples of representative records to give a better indication of overall data set accuracy. The size of the sample needs to be appropriate for the type of data under scrutiny. For example, a 20% error rate in email addresses for a marketing campaign for electric toothbrushes might be acceptable, but a 20% error rate in mailing addresses when communicating medical test results would be completely unacceptable, a disaster!
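    One common way to choose a sample size (a general statistical approach, not something specific to infoboss) is the standard formula for estimating a proportion within a chosen margin of error. The sketch below computes it and draws a random sample; the numbers are illustrative only.

```python
import math
import random

def sample_size(margin_of_error: float, z: float = 1.96, p: float = 0.5) -> int:
    """Minimum sample size to estimate an error rate within +/- margin_of_error.

    Standard proportion formula n = z^2 * p * (1 - p) / e^2, with p = 0.5
    as the most conservative assumption about the underlying error rate.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

# e.g. estimate the mailing address error rate to within +/- 5% at 95% confidence.
n = sample_size(margin_of_error=0.05)        # 385 records
record_ids = list(range(115_000))            # stand-in for the real record keys
audit_sample = random.sample(record_ids, n)  # records to audit against the data subject
print(n, len(audit_sample))
```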

    Data owners sometimes struggle to undertake these analyses, as they will almost certainly need to go outside their data quality tool “comfort zone” to get the answers. However, infoboss' powerful data analytics, search, browse and discovery features are intuitive and easy to use, enabling the data owner to conduct these sampling exercises themselves.

    Armed with the representative sample, data owners can independently audit and assess the sampled data subjects for accuracy. The results can then be recorded, loaded back into infoboss, joined to the original data and used to estimate the accuracy of the entire data set.
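    To illustrate that last step, the sketch below estimates the overall error rate of the full data set from the audited sample, with a simple normal-approximation confidence interval. This is a generic statistical illustration of the comparison, not a description of an infoboss feature, and the audit figures are invented.

```python
import math

def estimate_error_rate(errors_found: int, sample_n: int, z: float = 1.96):
    """Point estimate and 95% normal-approximation interval for the error rate."""
    p = errors_found / sample_n
    half_width = z * math.sqrt(p * (1 - p) / sample_n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 23 inaccurate records found in an audited sample of 385.
rate, low, high = estimate_error_rate(23, 385)
print(f"Estimated error rate: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```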

    Summary

    At infoboss we recognise that accuracy is the most important data quality dimension, and that there is no substitute for independent audit and assessment of the data subject to verify accuracy. However, there are things that can be done to help the data quality professional gain a better understanding of the prevailing accuracy of their data and to reduce the cost and time such assessments require.

    Look out for our other posts on the six data quality dimensions.

    To discover more about how infoboss can help support your data quality and data protection initiatives, please get in touch.
