Have you heard the term “OCR” and thought to yourself, “what is OCR?” Or are you about to embark on a digital scanning project and feel you must have text-searchable digital files? Whatever your reason for reading our article, we plan for you to come away with a good general understanding of OCR and how you can apply it to your digital conversion projects. Don’t believe all the hype – learn for yourself if OCR is right for you!
OCR, or Optical Character Recognition, is defined by ABBYY as “a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera into editable and searchable data.”
In plain English, OCR technology helps you turn non-searchable documents into searchable documents.
What Does OCR Do?
The OCR process helps you turn an image into a searchable document using technology that recognizes characters and letters and turns them into words and sentences. Our ability to look at a document and almost immediately recognize and understand the different letters, words, and sentences is a capability computers don’t have on their own. When a computer “sees” an image, let’s say a page of printed text, it’s just black and white shapes to it, pixels on a page that don’t actually mean anything; there isn’t an inherent understanding of the letters and words.
What OCR software does is process the characters so that a computer can now read and recognize text: the letters, characters, words, and so on. Once processed through OCR software, a user will have the ability to search scanned documents for keywords and phrases. When you combine document scanning with document recognition and text recognition, you turn your pile of hard copy records into searchable digital files.
Additionally, a scanned document that’s been run through OCR technology can be used as an editable document, basically allowing you to update the text as necessary (in certain situations). An example of this is when libraries digitize their historical collections and OCR the scanned material so that volunteers can read and update articles as necessary. This is a labor-intensive method, since it’s pretty much manual data entry, but it is very useful for specific applications.
Another example would be if you wanted to analyze some information, say from a report, but there are too many files to read through each one manually to find the data you need. You could scan the printed text and use OCR to create searchable files, and from there it comes down to extracting data that’s relevant to your research. It’s not perfect, but it’s probably much more productive than reading dozens or hundreds of pages to find just a few pieces of information!
How Does OCR Work?
In his article “Optical character recognition (OCR)” on the website explainthatstuff.com, Chris Woodford lays out a seven-step OCR process, summarized here:
- If you have hard copy material, use the best version available. Try to avoid documents with tears, stains, stray marks, etc.
- Scan your document to create a digital image. If you already have a digital image, steps 1 and 2 can be skipped.
- The first stage of OCR involves creating a bi-tonal (black/white) version of the document. This reduces recognition to a binary decision in which the OCR software decides if something is there or not.
- Processing the image(s) by character, word, sentence, and line.
- Basic error correction: some OCR programs will identify words that don’t seem correct, but this is not always included. OCR software may output incorrect characters because it interpreted them incorrectly.
- Layout analysis: formatting may be captured correctly, but not always. If your document has tables, images, graphics, and so on, OCR programs can identify these and output them correctly most of the time.
- Once complete, a human eye will be the best determinant if the OCR was processed correctly. This is not always possible, especially if you’re working with thousands or millions of images.
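The bi-tonal step in the list above can be sketched in a few lines of Python. This is only an illustration, not how any particular OCR product works: it assumes a grayscale image stored as rows of values from 0 (black) to 255 (white), and the fixed threshold of 128 is an arbitrary choice made for this example. Real OCR engines usually pick thresholds adaptively.

```python
def binarize(gray_rows, threshold=128):
    """Convert grayscale pixels (0 = black, 255 = white) to 1 (ink) or 0 (blank).

    `threshold` is an illustrative assumption, not a standard value.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


# A tiny, made-up 3x5 "scan": dark strokes on a light, slightly uneven background.
scan = [
    [250, 30, 240, 20, 245],
    [235, 25, 40, 35, 230],
    [245, 15, 238, 10, 200],
]

# Print the bi-tonal result, using "#" for ink and "." for blank paper.
for row in binarize(scan):
    print("".join("#" if bit else "." for bit in row))
```

Once every pixel is either "ink" or "blank," the later steps have clean shapes to work with. A stain or a faded page pushes pixels to the wrong side of the threshold, which is one reason the condition of the original matters so much.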
This describes the general process of OCR, although there are variations to it in many circumstances. In most of the OCR projects we work on, there isn’t going to be a step 7 because the volume of images is so great that no one could ever proofread them all. Additionally, most folks aren’t trying to create a perfect search file – they’re trying to get as much data as they can from OCR software to reduce the overall effort of finding records across their datasets.
If anyone tells you they’ll give you 100% accuracy with OCR, your hackles should go up. OCR isn’t perfect, and you shouldn’t expect it to be. Accuracy rates of OCR applied to typical-quality documents will be in the 98-99% range (see CVISION article). That’s pretty good, but realize that for every 1,000 words there will be 10-20 that aren’t correct.
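The arithmetic behind that estimate is simple enough to check yourself. The sketch below assumes accuracy is measured per word, as the numbers above imply; if a vendor quotes accuracy per character, the share of words containing at least one error will be higher. The function name is ours, made up for this example.

```python
def expected_word_errors(total_words, accuracy):
    """Estimate how many words OCR gets wrong at a given word-level accuracy (0 to 1)."""
    return round(total_words * (1 - accuracy))


# Compare the two ends of the accuracy range for a 1,000-word page.
for accuracy in (0.98, 0.99):
    errors = expected_word_errors(1_000, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} incorrect words per 1,000")
```

Plug in your own word counts to get a feel for how many misses to expect across a whole collection.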
Most of our customers who use OCR aren’t looking for a perfect solution; they want to get as much of their data text-searchable as they can so that they can reduce or eliminate the need for a manual data entry and indexing project, which is usually expensive.
When OCR Is A Good Fit
You may be wondering if OCR is right for your project, and that’s a good thing. It’s not always what you need and you don’t want to spend money when you don’t need to. Below are some instances where OCR is probably the right way to go.
You’re not interested in document-level indexing
Document- or file-level indexing is when you have your digital conversion partner “key” the information that will identify and separate a specific document within a larger project. Keying is capturing the fields and data that identify the document, such as Name, SSN, Case Number, Date of Birth, and so on.
If you decide to index at the document level, you need to have a solid idea of which fields will/may be captured, as well as how the specific documents are identified. For example, if you’re a university registrar getting your student record microfilm rolls scanned, and you want each student record captured at the document level, you’ll probably be interested in Student Name, Date of Birth, and Student ID. Next, you’ll need to describe how your scanning partner will identify when there’s a new document. There may be a flasher of some sort, like a white page with the previously mentioned fields in big type on it, or maybe a document separator that clearly separates when a document starts and ends. Either way, you’ll need to be able to describe the what and the how of document-level indexing.
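To make the “what and how” concrete, here is a rough sketch of how separator pages might be used to split a stream of scanned pages into indexed documents. Everything in it is hypothetical: the page dictionaries, the field names, the file names, and the sample students are invented for illustration and don’t reflect any particular scanning system.

```python
def split_into_documents(pages):
    """Group a stream of scanned pages into documents.

    Each separator page starts a new document and supplies its index fields.
    Pages that appear before the first separator are returned as exceptions.
    """
    documents, exceptions = [], []
    current = None
    for page in pages:
        if page.get("separator"):
            # A separator (flasher) page opens a new document.
            current = {"index": page["fields"], "pages": []}
            documents.append(current)
        elif current is None:
            # No separator seen yet, so we can't tell which document this belongs to.
            exceptions.append(page["image"])
        else:
            current["pages"].append(page["image"])
    return documents, exceptions


# Made-up sample data: two student records and one stray page.
pages = [
    {"image": "0001.tif"},
    {"separator": True, "fields": {"name": "Doe, Jane", "student_id": "A100"}},
    {"image": "0003.tif"},
    {"image": "0004.tif"},
    {"separator": True, "fields": {"name": "Roe, Sam", "student_id": "A101"}},
    {"image": "0006.tif"},
]

documents, exceptions = split_into_documents(pages)
for doc in documents:
    print(doc["index"]["student_id"], doc["pages"])
print("needs review:", exceptions)
```

The stray page that lands in the “needs review” pile is a small version of the exceptions that slow down a real indexing project.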
As an add-on, you might be wondering if every single record you have has to be perfect to get document-level indexing. No, it doesn’t. No project we’ve ever worked on has been “perfect.” What you need to be aware of is that the more discrepancies and exceptions occur, the slower the project goes and the more chance that the project will require additional resources to compensate for uncertainties and inconsistencies.
You’re trying to limit your project spending
OCR can help keep project costs down by eliminating or at least mitigating the need for document-level indexing. If you’re not concerned about always being able to search your database and go directly to the right document, OCR is a plausible way to give you a general search capability.
As an example, if you have newspaper records that you want digitized and OCR processed, our standard method is to index at the Title, Year, and Issue Date level. That seems pretty granular, but there are other methods that index to the article level! Using our indexing method, you’re able to search, explore, and retrieve information using issue dates as a general guide. What you’re not able to do is see every single article that is included in the newspaper and click right to it. That would be very expensive.
Instead, you text search! With your OCR-processed images you can still search for content even though it’s not indexed by article. If you know that you’re looking for your great aunt in a local paper, just type her name and any captured text that includes her will show as a search result. The great thing about text search as compared to indexed data is that your great aunt may show up multiple times across all the newspapers, in articles you’d never know if you didn’t have OCR.
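In code terms, text search over OCR output can be as simple as the sketch below. It assumes the OCR text has already been extracted and stored per page; the page IDs, the sample sentences, and the name being searched are all invented for this example, and a production search tool would do far more (ranking, highlighting, fuzzy matching).

```python
def search_pages(ocr_text_by_page, phrase):
    """Return the page IDs whose OCR text contains `phrase`, ignoring case."""
    needle = phrase.casefold()
    return [
        page_id
        for page_id, text in ocr_text_by_page.items()
        if needle in text.casefold()
    ]


# Made-up OCR output keyed by issue date and page number.
ocr_text_by_page = {
    "1952-03-14_p1": "Mrs. Edna Walsh hosted the garden club on Tuesday.",
    "1952-03-14_p2": "City council approves new road budget.",
    "1957-08-02_p4": "Prize for best preserves awarded to Edna Walsh.",
}

print(search_pages(ocr_text_by_page, "edna walsh"))
```

Note that an exact-match search like this one only finds what the OCR engine read correctly. If a name was captured as “Edna Wa1sh,” that page won’t show up.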
Circling back to the cost factor, you can see that OCR applied across all your digital images gives you a fantastic search capability while not requiring that you spend your money on granular indexing, which may not even get you what you need. And don’t forget that you can always move ahead later with a “phase 2” indexing project if you see the need.
Your documents are made up of good, clean text
All images are not created equal!
There is some bad quality content in the world, and we don’t mean the writing; we’re referring to the visible quality and the condition in which it was digitally captured. This affects OCR because if the OCR engine that’s processing your content can’t recognize it or mistakes some characters for others (bad text recognition), you’ll end up with gibberish or even nothing at all. The content won’t be searchable!
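One rough way to spot this kind of output on a sample of pages is to check how many of the captured words are real words. The sketch below is a simplified illustration under stated assumptions: the tiny vocabulary, the sample sentences, and the function name are made up, and a real check would use a full dictionary for the language of your documents.

```python
import re


def recognized_share(ocr_text, vocabulary):
    """Return the fraction of words in `ocr_text` that appear in `vocabulary`."""
    words = re.findall(r"[a-z0-9']+", ocr_text.lower())
    if not words:
        return 0.0  # nothing captured at all
    known = sum(1 for word in words if word in vocabulary)
    return known / len(words)


# A made-up vocabulary and two versions of the same sentence.
vocabulary = {"the", "board", "met", "on", "monday", "to", "review", "budget"}
clean = "The board met on Monday to review the budget"
damaged = "Tbe b0ard rnet on M0nday to revievv tbe budget"

print(f"clean page:   {recognized_share(clean, vocabulary):.0%} of words recognized")
print(f"damaged page: {recognized_share(damaged, vocabulary):.0%} of words recognized")
```

A low share on a handful of test pages is a sign that the material may not be a good OCR candidate as it stands.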
A digital image that’s considered “good” quality normally comes from typed or printed characters and computer-generated documents. Examples of these include newspapers, COM microfiche, electronically filled out forms or forms that were completed using typewriters, and mostly any documents that aren’t handwritten.
As stated before, though, even “good” quality material can miss out on OCR because of its condition. If you have a nice box of computer-generated text files ready for scanning, but that box was in a damp closet for five years, the pages may not be as pristine as when they were first generated. Water spots, curling of pages due to improper storage, and other factors can cause the physical material to deteriorate or otherwise degrade the quality of the print. When a page goes through the OCR engine, a coffee stain will obscure some of the words and make character recognition difficult, even if it’s computer-generated material. If you plan to use your records in the future, be sure to store them in a way that protects them from damage.
When OCR Is Not A Good Idea
As a reinforcement to the section above, there are a few ways to tell if your project is not a good fit for OCR. Sometimes you just have to go forward without what you really wanted (text-searchable documents) because the juice isn’t worth the squeeze. Here are some examples of times when OCR isn’t a good idea:
Your project includes handwritten documents
Handwriting is not well-suited for OCR processing. We want to clarify that we don’t mean that OCR can never work with handwriting or handwritten text, but that it’s just rarely successful and we don’t really recommend it.
OCR works by finding characters and combining them into words and phrases (see “How Does OCR Work?” earlier in this article), but handwriting is so unique that it’s very difficult for a computer to make sense of it. Without a baseline to recognize characters, it’s spotty and not very reliable when applied to handwritten material.
Handwritten deed document. Probably not the best candidate for OCR!
You already have document-level indexing
If you already have a precise way to find a record, why would you want to use text search? We’ve found that a lot of times a customer mentions OCR when discussing a project because they’ve heard the term before (OCR, text search, searchable PDF, etc.) and they want to include it because they think they need it. This isn’t always the case, especially if their records are already or will be indexed at the file level.
Think of it this way: if you have 100 boxes of student records (not very good for OCR anyway, lots of handwriting, but that’s another conversation!) and each record is in a folder, those folders will be indexed by capturing the student data written on the folder. Once you’ve done this, you’ll be able to locate the record you need by searching for the student by name, date of birth, and possibly a student ID number or a graduation year. Once you’re at the file, why would you need text search?
One counterpoint to not using OCR when you already have document-level indexing is if the digital file is large, say 50 pages or more. Even if you know the exact document you need, if you get there and it’s dozens or hundreds of pages long, it will still take a while to find the exact information you’re looking for. In a case like this, OCR could be useful so you can search the large file and find the data you need more quickly.
Documents with pictures, drawings, and minimal text
Pictures and drawings don’t translate to searchable text, so if you’re scanning a bunch of engineering drawings, you’ll get minimal return unless information in the title block is captured, which isn’t guaranteed.
There may be some information in a graph or on a drawing that can be OCR processed, but make sure you test your theory first before paying for OCR; document-level indexing may be a better choice in an instance like this.
Looking for a digital conversion and OCR solution? Give us a call at 800.359.3546 or email us at firstname.lastname@example.org to discuss your project with one of our sales reps.
To learn more about digital conversion projects we’ve provided a few more links to pages on our site. Now that you know a bit more about OCR and if it’s right for you, read on to take the next step in your digital scanning journey!
“Traditional Microfilm Conversion vs. Digital ReeL” is a comparison between what we call a “traditional conversion” and our microfilm solution, “Digital ReeL.” If you’re not sure of the options you have available when it comes to microfilm scanning, this is a great place to start to get an idea of which one might be best suited for your project.
“How Much Microfilm And Microfiche Do I Have?” is a quick overview of how you can estimate your microfilm collection. We describe a couple of ways to do this so that you can have a general idea of the size of your project.
“What You Should Expect During Your Digital Conversion Project” provides some insight into what you should and should not expect during your digital conversion project as you choose your partner for scanning and conversion.