An OCR story
A case study in converting scanned documents into text to generate insights and improve document management processes…
We’re currently working on a client project that requires converting scanned PDF documents into text. We have done this before, but this job involved a larger volume of documents than we normally see, and we felt the experience was worth sharing…
There are many benefits to extracting text from scanned document files. For a start, the documents become searchable, and using tools like infoboss we can not only perform those searches quickly and efficiently, but also process the text further to enhance document metadata, automate document classification, run natural language processing, perform text analysis, or even feed artificial intelligence/machine learning and business intelligence systems to deliver further insights.
To convert the images to text, it is necessary to run an Optical Character Recognition (OCR) process, transforming the scanned document content into digital text files. Infoboss has been built with OCR processing as a standard feature, so once we switched it on, the first thing we needed to understand was how long the job would take. From previous experience we knew that OCR is a heavily CPU-dependent process.
For this job, there are circa 63,000 PDF documents with a total size of 61GB. The machine we were using had 8 virtual processors with a clock speed of 2.1GHz and 32GB RAM.
Before starting, we ran a small job using 8 processor threads on 10 files, each roughly 1MB in size. From this we concluded that the amount of memory (32GB RAM) was not a factor in job duration; however, all 8 processors ran flat out at 100% utilisation for 10 minutes, and then just 2 processors ran flat out for a further 10 minutes, giving a total job duration of 20 minutes. From this test we identified that one document was being processed per thread at an average rate of 100KB/minute, i.e. up to 8 documents of 1MB each every 10 minutes.
A quick calculation showed that the whole job on this system was going to take roughly 52 days (running 24 x 7).
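That quick calculation can be sketched as follows (the 100KB/minute per-thread rate and the corpus size come from the figures above; everything else is simple arithmetic):

```python
# Back-of-envelope duration estimate for the full OCR job.
TOTAL_KB = 61 * 1_000_000     # ~61GB of scanned PDFs
THREADS = 8                   # one document processed per thread
RATE_KB_PER_MIN = 100         # observed per-thread OCR throughput

minutes = TOTAL_KB / (THREADS * RATE_KB_PER_MIN)
days = minutes / (60 * 24)    # roughly 52-53 days running 24x7
```

Note that the estimate scales linearly with thread count, which is why doubling the virtual processors roughly halves the projected duration.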
Although this was understood, it was longer than the client wanted. We needed to get this done more quickly, and ideally within 30 days.
We decided to add more virtual processors and run the task again across a larger sample size. So, we increased the CPU resources to 16 virtual processors.
We split the job into several smaller jobs as follows…
Job              Files    Size
Less than 100KB  24,500   1GB
100KB to 500KB   17,500   4GB
500KB to 5MB     18,000   30GB
5MB to 10MB      2,000    13GB
Over 10MB        1,000    13GB
We started with the ‘100KB to 500KB’ job and after an hour we were able to confirm that our original processing assumptions were valid and that with the additional processors we were now looking at a much more acceptable total job duration of circa 26 days. We also observed that the composition of the data set was such that the vast majority of files (60,000) were less than 5MB in size.
This week we will have been running for circa 13 days, with 60,000 files processed and just over half the data (35GB) converted. We will then kick off the jobs on the bigger files (those over 5MB), but we now know with some degree of confidence exactly how long this will take, and we are already able to start the analysis on the text derived thus far.
Another interesting observation is that for just short of 1% of the files OCR-processed thus far, we have not been able to extract any text. On inspecting a random sample of these files, they are either poor-quality scans (skewed or faint), images that do not contain text, or empty or corrupt PDFs. This is in itself useful to know, as these documents can be analysed independently to better understand their contents and business relevance.
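Flagging those zero-text documents can be as simple as scanning the OCR output. A sketch, assuming each source PDF produces a sibling .txt file of extracted text (a layout we've invented for illustration):

```python
from pathlib import Path

def empty_ocr_outputs(text_dir: Path) -> list:
    """Return the extracted-text files that contain no usable text."""
    flagged = []
    for txt in text_dir.glob("*.txt"):
        # Treat whitespace-only output the same as an empty file.
        if not txt.read_text(errors="ignore").strip():
            flagged.append(txt)
    return flagged
```

The flagged list then becomes its own work queue, for rescanning, manual review, or simply recording why a document yielded no text.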
Finally, the quality of the text derived appears to be sufficient to undertake further analysis, and appears to be a reasonable representation of the original documents. The 30.8GB of scanned files processed so far has generated 1.8GB of text, around 5.8% of the original scanned document size.
Having the text available from scanned documents in a form where it can be processed and analysed opens up many opportunities for the client. You cannot expect a job like this to deliver 100% accurate representations of the original text, but it does shine a light onto fresh insights from hitherto unstructured dark data assets.
Client benefits include:
• Discovery of personal and sensitive data for data protection management
• Quickly search and discover information for the generation of new insights from unstructured data
• Automated classification of documents for data retention and information governance purposes
• Find information to service data subject access and freedom of information requests in an efficient and controlled manner
• Enhance document metadata to support activities like contract analysis and management
Using a platform like infoboss makes all of this possible: we can OCR scanned documents, undertake the metadata enhancement, and provide the search and text analysis tools to process the resulting corpus of data, all in one tool in one location.