Could you benefit from text analytics?
Many organisations have large collections of documents stored in shared drives, either in the cloud or on premises. These documents are often referred to as unstructured data, or sometimes by the more sinister term “dark data”. A common characteristic of these data assets is that very little is understood about them. You may be surprised to know that up to 80% of your data storage cost is typically associated with storing this data, and from work with our customers we see that its volume roughly doubles every 2-3 years, with growth accelerating!
We also see that this data presents perhaps the greatest data protection risk, because personal and sensitive data is stored without being minimised and retention policies are not applied. It is also very difficult to find information in these documents because, in many cases, there is no readily accessible means of searching them.
All document files have basic meta data associated with them, such as the filename, path, size, type, and modified and created dates. You can generally see this information when you look at the list of documents in a folder. On some document management systems there may be additional custom meta data attributed to the files, for example their security classification, ownership or business purpose. This custom meta data is rarely managed and depends on people following a process to update it. Over time its quality suffers, which greatly reduces the benefit of holding it.
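As an illustration of how little effort the basic meta data takes to collect, here is a minimal Python sketch that walks a folder and records the fields listed above. The function names `basic_metadata` and `scan_folder` are hypothetical and chosen for this example only. This is not how infoboss or any particular product does it.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def basic_metadata(path):
    """Return the basic meta data the file system holds for one file."""
    p = Path(path)
    info = p.stat()
    # Guess the type from the file extension; fall back to the raw suffix.
    guessed_type, _ = mimetypes.guess_type(p.name)
    return {
        "filename": p.name,
        "path": str(p.resolve()),
        "size_bytes": info.st_size,
        "type": guessed_type or p.suffix.lower(),
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        # Assumption to be aware of: st_ctime is the creation time on Windows,
        # but the time of the last meta data change on most Unix systems.
        "created_or_changed": datetime.fromtimestamp(info.st_ctime, tz=timezone.utc),
    }


def scan_folder(root):
    """Yield basic meta data for every file found under root."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            yield basic_metadata(p)
```

Calling `list(scan_folder("/path/to/shared/drive"))` gives one record per file. Those records could be loaded into a spreadsheet or database as a first structured view of a document store.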
Documents fall broadly into two types: those containing text and those containing images. The latter can be photographs or other types of graphics, but in many cases these documents are scanned copies of text documents. Legal, procurement, customer services (post room) and human resources teams will often have large volumes of these types of files in their shared folder structures.
How can text analytics help?
Using tools like infoboss, it is now possible to build a searchable, structured view of your unstructured data assets that contains not only the basic and custom meta data, but also enhanced meta data derived from the text.
Where images are stored, these can be put through an Optical Character Recognition (OCR) process to derive the text from the file and associate it with the document. Having extracted the text from your documents, you can then apply additional processes to automatically extract data items from the text, such as email addresses, postcodes, telephone numbers, customer references, Companies House numbers, invoice numbers, National Insurance numbers, payment card numbers, bank account numbers and many more. You can also use trained natural language processing (NLP) models to determine what the file might be about or gauge its sentiment, and tag the file accordingly.
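To give a flavour of the extraction step, here is a minimal Python sketch that pulls a few of these data items out of text using regular expressions. The patterns are deliberately simplified assumptions for illustration and the name `extract_items` is hypothetical. Production-grade detection of postcodes, National Insurance numbers or card numbers needs stricter validation than this.

```python
import re

# Simplified, illustrative patterns (assumptions, not validated rule sets).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Rough shape of a UK postcode, e.g. "SW1A 1AA"; upper case only.
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b"),
    # Rough shape of a National Insurance number: two letters,
    # six digits, then a suffix letter A-D, with optional spaces.
    "ni_number": re.compile(r"\b[A-Z]{2} ?\d{2} ?\d{2} ?\d{2} ?[A-D]\b"),
}


def extract_items(text):
    """Return every match for each pattern, keyed by the item type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = (
    "Contact jane.doe@example.com about the invoice. "
    "Office: SW1A 1AA. NI number QQ 12 34 56 C."
)
found = extract_items(sample)
print(found)
```

Each match can be stored against the document as enhanced meta data. Counting the matches per file then gives a first indication of which documents hold personal data.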
The results afford the organisation many benefits:
• Understand data protection risk
• Gauge effectiveness of minimisation and retention policies
• Manage the quality of document meta data
• Provide a structured view of the meta data that can be used to feed data pipelines for business intelligence and reporting
• Provide a searchable repository enabling you to quickly find relevant documents, for example to service Data Subject Access Requests (DSARs)
You may be surprised to learn that it doesn’t take long to get a basic view of your unstructured data assets in place. For most clients a terabyte document store can be generating insight within a couple of days.
If you’d like a demonstration of the art of the possible with your unstructured data assets, please get in touch to discover the benefits for your organisation.