The ‘dark data’ conundrum

Originally published in Computer Fraud & Security (Science Direct, Elsevier) [July 2020]

Unstructured “dark” data is both an opportunity and a threat to every business. How your organisation manages this category of data could well determine its success or failure in the years to come. This article explores the risks and opportunities inherent in dark data and suggests ways of improving your organisation’s approach to managing this enigmatic data type.

Back in 2016, IDC predicted that by 2025, there would be a ten-fold increase in the volume of data held within the data estate of the modern enterprise (figure 1). In the same paper they put forward the idea that 37% of an enterprise’s digital universe will contain information of value if analysed (figure 2).

They also predicted that, over time, more than 80% of business data would be of the unstructured (“dark”) variety, the most difficult to manage and control.(1)

There appears to be no reason to doubt these assertions. Looking at the ever-expanding stores of unstructured and unanalysed data within our own data estates, we can clearly see just how much of a challenge they have become to manage and control.

Many believe that dark data management is the next frontier in information governance.(3) The reason is that it is the most difficult category of data to manage and control. This is largely because many businesses do not have effective capture and classification processes in place for this type of data or, worse still, rely on humans to undertake the classification for them. The result is that most organisations have very little insight into the dark data that they hold. Consequently, they store (hoard) the data for compliance reasons or until such time as they can get around to analysing it. This is a flawed approach if IBM are to be believed, as they suggest that 60% of data loses its value immediately.(5)

What is dark data?

Gartner defines dark data as:

“…the information assets organisations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetising). Similar to dark matter in physics, dark data often comprises most organisations’ universe of information assets. Thus, organisations often retain dark data for compliance purposes only. Storing and securing data typically incurs more expense (and sometimes greater risk) than value.”(2)

Dark data can take many forms: text messages, documents, PDFs, spreadsheets, comment fields in databases, chat scripts, social media messages, photographs, images, scanned documents, conversations in collaboration tools, log files, data from IoT devices, survey responses and, of course, email.

Taken together, IDC’s prediction and Gartner’s definition present us with both a business opportunity to be nurtured and a financial and compliance risk to be managed.

The cost of data hoarding!

Consider the storage costs alone. One terabyte of storage costs circa £200 per month.(6) If this 1TB of data is not managed then, factoring in the IDC growth prediction for data, by 2025 it could be 10TB, with a storage cost of circa £2,000 per month (figure 3). With circa 60% of this data potentially having little or no value to the organisation, doing nothing now is effectively committing to a future cost of £1,200 per month for no return. What is potentially worse is that these storage costs are dwarfed if the data is subsequently subject to a data breach: the resulting fines and reputational damage are likely to have a significant and detrimental impact on the business.
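The arithmetic above can be reproduced in a few lines of Python. This is only an illustrative sketch: the price, growth factor and low-value share are the figures quoted in the text, while the function name and the assumption that cost scales linearly with volume are ours.

```python
# Figures quoted in the text; linear price-per-terabyte scaling is an assumption.
COST_PER_TB_PER_MONTH_GBP = 200   # circa £200 per terabyte per month
GROWTH_FACTOR = 10                # IDC's predicted ten-fold growth by 2025
LOW_VALUE_PERCENT = 60            # share of data with little or no value


def projected_monthly_costs(current_tb):
    """Project future storage cost and the part spent on low-value data."""
    future_tb = current_tb * GROWTH_FACTOR
    monthly_cost = future_tb * COST_PER_TB_PER_MONTH_GBP
    low_value_cost = monthly_cost * LOW_VALUE_PERCENT / 100
    return {
        "future_tb": future_tb,
        "monthly_cost_gbp": monthly_cost,
        "low_value_cost_gbp": low_value_cost,
    }


if __name__ == "__main__":
    # One terabyte today, as in the worked example.
    print(projected_monthly_costs(1))
```

Changing the argument to 50 or 500 shows how quickly the same assumptions scale for larger data estates.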

Of course, this example is for an organisation with up to 1TB of data today. What about those that already have tens or even hundreds of terabytes within their data estate?

Compliance is a crucial dimension to the debate. Back in 2018, the management consultancy McKinsey & Company recognised the problem and hinted at a potential way of solving it:

“Companies will need to increase automation and streamline their organisation if they are not to be overwhelmed by the challenge of sustaining [EU] GDPR compliance over the long term. Key building blocks will include tool support, continuing investment in cybersecurity, and improvements to internal processes.”(4)

So how do we automate and streamline our organisational data management processes?

One recommended approach is to build into the organisation’s data management processes a means of automatically collecting, indexing and classifying data as it enters the data estate, and then providing trusted views of the data in a form suited to the needs of the compliance professional. Google™ (and other search-based technologies) solve this problem for data in the public domain (on the Internet) by using automated web crawlers, or bots, to analyse the content of new and changed data and make it available to their search engines for subsequent retrieval.

If such a solution were implemented, it would need to:

Automate the collection, indexation and classification of data from any digital source
Store indexed results in a scalable and secure enterprise data repository
Provide views of the data to information asset owners and data stakeholders throughout the organisation, effectively democratising data assets for owners and stewards within the enterprise
Enforce data management rules such as those for compliance (retention, data subject rights etc.), quality, analytics, portability and system interoperability.
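
To make the index-and-search requirement concrete, here is a deliberately tiny, in-memory sketch in Python. It is not a description of any particular product: the class name `TinyIndex`, its methods and the classification labels are all hypothetical, and a real solution would need a scalable, secure repository rather than a dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A toy inverted index: term -> set of document identifiers."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.labels = {}  # document identifier -> classification label

    def add(self, doc_id, text, label="unclassified"):
        """Index a document's terms and record its classification label."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)
        self.labels[doc_id] = label

    def search(self, query):
        """Return the documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex()
    index.add("hr/salaries.csv", "Salary list for employees", label="special category")
    index.add("jobs/roof.txt", "Repair invoice for roof")
    print(index.search("salary employees"))
```

The point of the sketch is the shape of the process: content is indexed once, as it arrives, so that later searches do not have to rescan every file.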

The ability to automatically index, classify and search data is a key component of any potential solution. The sheer scale of the data means that humans alone cannot do it. For example, we recently conducted a data audit for a property and care service organisation. For them, a typical terabyte of shared drive contained circa 2 million files. Over half the files were image files (photographs) illustrating various levels of repair job completion (figure 4). But almost 925,000 files contained textual information, and these were stored in a folder structure that was over seven levels deep. A manual classification and audit of such a data source could take months to complete and be almost worthless, as the data would have changed (and grown) by the time it was finished. GDPR is heightening the sense of urgency for many, and for this organisation in particular, as the scale of the challenge of meeting GDPR, as alluded to by McKinsey & Company, became apparent.

In undertaking the audit, we discovered data going back to the IBM PC epoch date of 1/1/1980, with filenames in the 8.3 format. The age of these files is perhaps a cause for concern from a data retention and protection perspective. There were also future-dated photographs, taken on a camera with its date and time set incorrectly, meaning that enforcement of the organisation’s data retention policy would be nigh on impossible for these files. We discovered that 26,000 files were duplicates, some copied several times to different locations, and, perhaps more alarmingly, that almost 86,000 files did not have a file type that an application could process! This raises the question of why these files were being stored. From a data security perspective, 62 files were of type .CSV. These tend to be application downloads and are often a source of significant concern to information security professionals, as they frequently contain personal and sensitive information. For this client, almost all of these files contained special category data, and one in particular contained a list of employee salary details!
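
Checks of this kind are straightforward to automate. The Python sketch below is a minimal illustration rather than the tooling used in the audit: it walks a folder, groups files with identical content by SHA-256 hash, flags files whose modification date lies in the future, and lists files with no extension. The function name `audit_folder` is hypothetical, and a missing extension is only a crude stand-in for "no file type that an application could process".

```python
import hashlib
import time
from collections import defaultdict
from pathlib import Path


def audit_folder(root, now=None):
    """Report duplicate, future-dated and extension-less files under root."""
    now = time.time() if now is None else now
    by_hash = defaultdict(list)
    future_dated = []
    no_extension = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Sketch only: reads each file fully into memory; chunk it for large files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
        if path.stat().st_mtime > now:
            future_dated.append(path)
        if not path.suffix:
            no_extension.append(path)
    duplicates = [paths for paths in by_hash.values() if len(paths) > 1]
    return {
        "duplicates": duplicates,
        "future_dated": future_dated,
        "no_extension": no_extension,
    }
```

Run on a schedule, a report like this gives information asset owners a current view of the problem rather than a snapshot that is out of date by the time it is finished.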

A similar breach, although allegedly committed by a disgruntled employee of Morrisons supermarket, led to the publication of the salary details of the entire workforce of 100,000 people. Employees were given too much access to special category information and protection was not monitored or enforced.(8)

It is often very easy for employees to export lists of information from databases, applications and reporting systems, save them to “work in progress” folders and then leave them there (sometimes for years). With 97% of IT leaders worried about insider threat, and 78% believing that employees have accidentally leaked information,(9) there is an obvious and compelling reason to do more to combat this potentially hazardous type of data breach.

Unstructured data processing

Techniques for processing file metadata have been around for a while, i.e. looking at the filename, author, size, creation and modification dates, MIME type and so on. There are also ways of interrogating content using “scan-based” searches. These are fine for relatively small quantities of data, but when several terabytes are being searched they can take a long time to return results in a digestible form, if they return them at all!
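
As a minimal illustration of metadata processing, the Python sketch below reads the basic attributes of a single file using only the standard library. The function name `file_metadata` and the choice of fields are ours. Note that the MIME type is guessed from the filename alone, and that author information is not available from the file system in this way; it would have to come from the document's own properties.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def file_metadata(path):
    """Collect basic file-system metadata for one file."""
    p = Path(path)
    info = p.stat()
    # guess_type works from the filename only; it does not inspect content.
    mime_type, _encoding = mimetypes.guess_type(p.name)
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "mime_type": mime_type,  # None when the extension is not recognised
    }
```

Because nothing here opens the file, metadata can be gathered quickly even across a very large estate, which is why it is usually the first pass before any content analysis.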

For an enterprise solution, it is necessary to use every means available to the system to automate the classification process. Regular expressions (regex) are a standardised way of identifying data that matches a specific pattern in a text-based data stream. There are hundreds of sites on the internet offering regex patterns to match a plethora of data entities, such as credit card numbers, passport numbers, phone numbers, vehicle registration plates, postal codes, bank account numbers and sort codes. These techniques alone can help enormously when classifying a corpus of textual data, but more is often needed, such as an estimate of the propensity of a document to be a certain thing, to answer questions like “Is this a letter?”, “Is this a customer contract?” or “Is this an accident report form?”. For these, we look for specific words and phrases whose presence indicates the likelihood of the document being a particular thing. A document is likely to be a letter, for example, if it contains “Dear <name>” together with “Yours sincerely” or “Yours faithfully”.
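
Both ideas, pattern matching and phrase-based propensity, can be sketched in a few lines of Python. The two patterns below are simplified approximations written for illustration (they are not validated against the official formats), and the function names are hypothetical.

```python
import re

# Simplified, illustrative patterns; real-world formats have more edge cases.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b")
SORT_CODE = re.compile(r"\b\d{2}-\d{2}-\d{2}\b")


def find_entities(text):
    """Return the pattern matches found in the text, grouped by entity type."""
    return {
        "uk_postcode": UK_POSTCODE.findall(text),
        "sort_code": SORT_CODE.findall(text),
    }


def looks_like_letter(text):
    """Apply the 'Dear ... Yours sincerely/faithfully' heuristic."""
    lowered = text.lower()
    has_salutation = re.search(r"\bdear\s+\w+", lowered) is not None
    has_signoff = "yours sincerely" in lowered or "yours faithfully" in lowered
    return has_salutation and has_signoff


if __name__ == "__main__":
    sample = (
        "Dear Mr Smith,\n"
        "Your account 12-34-56 at SW1A 1AA is overdue.\n"
        "Yours sincerely,\nA. Clerk"
    )
    print(find_entities(sample))
    print(looks_like_letter(sample))
```

In practice each rule would contribute to a score rather than a yes/no answer, so that several weak signals can be combined into a likelihood.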

Going further, if you were able to extract from the letter a customer reference number in a particular format, you could use a database of reference information to tag the file with additional data about the likely purpose of the letter, e.g. that this person is currently in a legal dispute with the organisation. Using combinations of machine learning and regex data derivation, alongside reverse lookups of the resulting data, we are able to give a search engine or a casual browser of the data some insight into the purpose and potential meaning of the document and, in turn, classify it!
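
A reverse lookup of this kind might look as follows in Python. Everything specific here is invented for illustration: the `CUST-` reference format, the sample record, and the dictionary standing in for the reference database.

```python
import re

# Hypothetical reference format, e.g. CUST-004217; adjust to the real scheme.
CUSTOMER_REF = re.compile(r"\bCUST-\d{6}\b")

# Stand-in for a reference database keyed by customer reference.
REFERENCE_DATA = {
    "CUST-004217": {"status": "legal dispute"},
}


def tag_document(text, reference_data=REFERENCE_DATA):
    """Find customer references in the text and attach any known details."""
    tags = []
    for ref in sorted(set(CUSTOMER_REF.findall(text))):
        record = reference_data.get(ref)
        if record is not None:
            tags.append({"customer_ref": ref, **record})
    return tags


if __name__ == "__main__":
    print(tag_document("Re: CUST-004217, response to your solicitor's letter"))
```

References that are found in the text but not in the reference data are simply ignored in this sketch; a fuller version might record them for review instead.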

Of course, there are other techniques for processing text-based material, such as sentiment analysis: is this a positive or a negative document? There are also a growing number of artificial intelligence algorithms that rely on a significant corpus of “case type” reference data to determine the propensity of a particular document to be representative of that “case type”. For example, does the document contain content of a scientific nature, or is it related to health or finance? These algorithms are not always 100% accurate, but they are a step forward in helping to automate the process of understanding your dark data. New algorithms are emerging all the time to solve specific issues, for example by analysing the communication patterns of an organisation’s email participants and highlighting a specific person or email exchange as potentially fraudulent because it does not fit the model.

Not all dark data is text based. However, there are tools available to convert scanned documents to text using Optical Character Recognition (OCR) techniques. There are also voice-to-text transcription tools, and even image recognition libraries that process photographic imagery to help generate text-based classifications of the file.

One great use case is anti-fraud. A common problem in the insurance industry is collusion by third parties, such as repair specialists, solicitors and insurance brokers, to inflate repair costs. Invoices with inflated repair costs can often go unnoticed when bundled with several other documents from a solicitor presenting a case for a claimant.(7) Unstructured data mining techniques can be used to identify the repairer from the invoice details and determine if they are a known fraudster, immediately tagging the document and supplementing the metadata for the workflow case. Automated processing systems can do this in seconds, providing insight that might otherwise be overlooked.

Given the potential afforded by illuminating “dark data”, it is worth taking a moment to reflect on the relevance and value of your own organisation’s “dark data” and on what you are going to do today to reduce the long-term cost of hoarding data of no value, mitigate the risks of a data breach, drive efficiencies in business processing, prepare your organisation to exploit AI, increase your competitiveness and deliver tangible improvements in services to your customers.