What is text capture
Text capture is a process rather than a single technology. It is the means by which textual content that resides within physical artefacts (such as books, manuscripts, journals, reports, correspondence etc.) may be transferred from that medium into a machine readable format. My focus here is on the capture of text from digital images that have been rendered from physical artefacts. Such digital images may be made via scanners or digital cameras and stored as digital page images for later access and use.
In some cases digital images of text content are sufficient to satisfy the end user's information needs and provide access to the resource in an electronic format that can be shared online. This sort of digital presentation is very useful for documents where transcribing the content would be difficult, such as handwritten letters or personal notes, but it can be done for any type of text. Readers must recognise the text for themselves to render it meaningful, and may or may not find this easy. However, to enable other computer-based uses of the text content (such as indexing, searching, data mining, copying and pasting), the text must be rendered machine readable.
Machine readable text may be gained through the various text capture processes listed below. Rather than presenting the end user with an electronic ‘photocopy’ of the page, the user is presented with a resource that includes machine readable text. This makes additional computing functions possible, and the most important of these is the ability to index and search text. Without machine readable text the computer will not find that document on Chaucer or the pages that contain the words “Wife of Bath”. As a digital repository of text grows, the only efficient way to navigate through it is with search tools supported by good indexing, and this is something that machine readable text makes practicable.
Thus text capture is a process that should be designed to add value to the text resource. Inherent within the concept of adding value is an assessment of whether the cost of delivering the benefit was commensurate with the value added. Thus the more automated the capture process, the easier it seems to justify the cost for the benefit across a large corpus.
The main methods for text capture are (in order of popularity of use):
- Optical Character Recognition (OCR), sometimes also known as Intelligent Character Recognition (ICR)
- Rekeying
- Handwriting Recognition (HR)
- Voice or speech recognition
Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is a type of document image analysis where a scanned digital image that contains either machine printed or handwritten script is input into an OCR software engine and translated into a machine readable digital text format (like ASCII text).
OCR works by first pre-processing the digital page image into its smallest component parts with layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics, photographs etc are recognised and discarded.
The character blocks are then further broken down into component parts, pattern recognized and compared to the OCR engine's large dictionary of characters from various fonts and languages. Once a likely match is made it is recorded, and the characters in the word block are recognized in turn until all likely characters have been found for that word block. The word is then compared to the OCR engine’s large dictionary of complete words that exist for that language.
Recognition at both the character level and the word level is the key to OCR accuracy: by combining the two, the OCR engine can deliver much higher levels of accuracy. Modern OCR engines extend this accuracy through more sophisticated pre-processing of source digital images and better algorithms for fuzzy matching, sounds-like matching and grammatical measurements to more accurately establish word accuracy.
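As a rough illustration of the word-level dictionary check, here is a minimal sketch using the fuzzy matcher in Python's standard `difflib` module. The tiny `LEXICON`, the `correct_word` helper and the `cutoff` value are all assumptions made for the example; a real OCR engine uses far larger word lists and its own matching algorithms.

```python
import difflib

# Hypothetical mini-lexicon; a real engine would use a far larger word list.
LEXICON = ["bath", "chaucer", "of", "tales", "the", "wife"]


def correct_word(candidate, lexicon=LEXICON, cutoff=0.7):
    """Return the closest lexicon word, or the candidate unchanged if none is close."""
    word = candidate.lower()
    if word in lexicon:
        return word
    # get_close_matches ranks lexicon entries by similarity to `word`
    # and drops anything scoring below `cutoff`.
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Character-level recognition has misread two letters here.
raw_words = ["The", "Wlfe", "of", "Bafh"]
print([correct_word(w) for w in raw_words])
```

The cutoff is a trade-off: set it too low and genuine but unusual words (such as historic spellings) get silently "corrected"; set it too high and obvious misreads slip through.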
Achieving character error rates better than 1 in 5,000 characters (99.98% accuracy) with fully automated OCR is usually only possible with post-1950 printed text (and not that frequently even then). Accuracies of greater than 95% (fewer than 5 in 100 characters wrong) are more usual for text printed between 1900 and 1950, and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong). Thus OCR for historical materials is usually hampered by the expensive and time-consuming need for manual or semi-automated proofreading and correction of the text to get as near to 100% as possible.
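To make these figures easier to compare, the short sketch below converts between "1 in N characters wrong" and percentage accuracy, and estimates how many corrections a page would need. The function names and the page length of 2,000 characters are assumptions chosen purely for illustration.

```python
def accuracy_from_error_rate(one_in_n):
    """Character accuracy (%) when 1 character in `one_in_n` is wrong."""
    return 100.0 * (1 - 1 / one_in_n)


def expected_errors(accuracy_percent, characters):
    """Expected number of wrong characters in a text of the given length."""
    return characters * (100.0 - accuracy_percent) / 100.0


PAGE_CHARACTERS = 2000  # hypothetical page length, for illustration only

print(f"1 in 5000 wrong = {accuracy_from_error_rate(5000):.2f}% accuracy")
for accuracy in (99.98, 95.0, 85.0):
    errors = expected_errors(accuracy, PAGE_CHARACTERS)
    print(f"{accuracy}% accuracy: about {errors:.1f} errors per page")
```

Seen this way, the gap between 99.98% and 95% is not a few percentage points but the difference between an occasional slip and a hundred corrections on every page.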
Optical Character Recognition as a technology is deeply affected by the following factors, each of which can cause OCR problems:
- Scanning methods possible
- Nature of original paper
- Nature of printing
- Text alignment
- Complexity of alignment
- Lines, graphics and pictures
- Nature of document
- Nature of output requirements
OCR Accuracy Example
If our processes should orientate to the intellectual and user aims desired from that resource, then surely our means of measuring success should also be defined by whether those aims are actually achieved? This means we have to escape from the mantra of character accuracy and explore the potential benefits of measuring success in terms of words, and not just any words but those that have more significance for the user searching the resource. When we look at the number of words that are incorrect, rather than the number of characters, the suppliers' accuracy statistics seem a lot less impressive.
For example, given a newspaper page of 1,000 words with 5,000 characters, if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy, if the errors are concentrated in as few words as possible) or a minimum of 500 correct words (50% word accuracy, if every error falls in a different word), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably nearer the higher end than the lower. The fact is: character accuracy of itself does not tell us word accuracy, nor does it tell us the usefulness of the text output. Depending on the number of "significant words" rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
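The arithmetic behind those two bounds can be written out as a small sketch. The function name and parameters are my own for this illustration, and it keeps the simplifying assumption from the example that every word has the same length.

```python
import math


def word_accuracy_bounds(words, avg_word_len, char_accuracy):
    """Return (worst, best) word accuracy for a given character accuracy."""
    total_chars = words * avg_word_len
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: the errors are packed into as few words as possible.
    fewest_wrong_words = math.ceil(wrong_chars / avg_word_len)
    # Worst case: every error lands in a different word.
    most_wrong_words = min(wrong_chars, words)
    best = (words - fewest_wrong_words) / words
    worst = (words - most_wrong_words) / words
    return worst, best


worst, best = word_accuracy_bounds(words=1000, avg_word_len=5, char_accuracy=0.90)
print(f"word accuracy lies between {worst:.0%} and {best:.0%}")
```

Running the same function with other character accuracies shows how quickly the worst case collapses: below 80% character accuracy, it is possible in principle for every word on this page to contain an error.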
The key consideration for newspaper digitization utilizing OCR is the usefulness of the text output for indexing and retrieval purposes. If it were possible to achieve 90% character accuracy and still get 90% word accuracy, then most search engines utilizing fuzzy logic would get in excess of 98% retrieval rate for straightforward prose text. In these cases the OCR accuracy may be of less interest than the potential retrieval rate for the resource (especially as the user will not usually see the OCRed text to notice it isn't perfect). The potential retrieval rate for a resource depends upon the OCR engine's accuracy with significant words, that is, those content words for which users might be interested in searching, not the very common function words such as "the", "he", "it", etc. In newspapers, significant words including proper names and place names are usually repeated if they are important to the news story. This further improves the chances of retrieval – as there are more opportunities for the OCR engine to correctly capture at least one instance of the repeated word or phrase. This can enable high retrieval rates even for OCR accuracies measuring lower than 90%.
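The effect of repetition can be sketched with a simple probability model. This is my own simplification rather than a measured result: it assumes each occurrence of a word is recognised correctly with the same probability, independently of the others, which real pages will not satisfy exactly (a damaged column or an unusual typeface tends to affect every occurrence alike). The 80% word accuracy is a hypothetical figure.

```python
def retrieval_probability(word_accuracy, repetitions):
    """Chance that at least one of `repetitions` occurrences is captured correctly.

    Assumes each occurrence is recognised independently with the same accuracy.
    """
    return 1 - (1 - word_accuracy) ** repetitions


# Hypothetical word accuracy of 80% for a significant word such as a surname.
for count in (1, 2, 3, 4):
    chance = retrieval_probability(0.80, count)
    print(f"{count} occurrence(s): {chance:.2%} chance the story is findable")
```

Under these assumptions a name that appears three times in a story is far more likely to be retrievable than the raw word accuracy alone would suggest.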
So, when we assess the accuracy of OCR output, it is vital we do not focus purely on a statistical measure but also consider the functionality enabled through the OCR's output such as:
- Search accuracy
- Volume of hits returned
- Ability to structure searches and results
- Accuracy of result ranking
- Amount of correction required to achieve the required performance
Rekeying
Rekeying is a process by which text content in digital images is keyed by a human directly via keyboard. This is differentiated from copy typing by the automation and industrialization of the process. Rekeying tends to be offered by commercial companies in offshore locations with India being the leading outsourcing supplier to the United Kingdom.
Rekeying is usually offered in three forms:
- Double rekeying
- Triple rekeying
- OCR with rekeyed correction
Triple rekeying would be expected to achieve an error rate of 1 in 10,000 characters (99.99% character accuracy), whilst double rekeying should achieve between 1 in 2,500 and 1 in 5,000 characters incorrect (99.96% to 99.98% character accuracy).
In the last option, OCR with rekeyed correction, the text is first OCR'd. Wherever the confidence of the engine is lower than a set parameter, or a word does not appear in a dictionary, the word is inspected by a rekeying operator and corrected manually as required. This will normally deliver accuracy of 99.9% or above, depending on the nature of the text. For instance, this mode will generally not be as successful for numeric or tabular texts, as these are problematic formats for OCR and throw so many potential corrections at the operator that double or triple rekeying would be more efficient.
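The selection rule for which words reach the operator can be sketched as follows. The `OcrWord` structure, the sample dictionary and the 0.85 threshold are all hypothetical; real OCR engines report confidence in their own formats and scales.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical OCR engine


DICTIONARY = {"the", "wife", "of", "bath"}  # illustrative word list only
CONFIDENCE_THRESHOLD = 0.85                 # assumed value; tuned per project


def needs_review(word, dictionary=DICTIONARY, threshold=CONFIDENCE_THRESHOLD):
    """Flag a word if the engine was unsure or the word is not in the dictionary."""
    return word.confidence < threshold or word.text.lower() not in dictionary


def review_queue(words):
    """Return only the words a rekeying operator should inspect."""
    return [w for w in words if needs_review(w)]


page = [
    OcrWord("The", 0.99),
    OcrWord("Wlfe", 0.91),  # confident, but not a dictionary word
    OcrWord("of", 0.60),    # a dictionary word, but low confidence
    OcrWord("Bath", 0.97),
]
print([w.text for w in review_queue(page)])
```

With numeric or tabular material almost nothing passes the dictionary test, so nearly every item lands in the queue, which is why full rekeying becomes the more efficient choice for such texts.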
Two main issues with rekeying
- First, it appears relatively expensive because every character carries a conversion cost and thus the direct costs of capture are very apparent. However, the accuracy reached is very high and generally rekeying is cheaper than OCR when the same accuracy is expected. This is because correcting and proofreading OCR is most often more costly than rekeying with cheap offshore services.
- The second and more problematic issue is the need for an extremely clear specification that accounts for all the variations and inconsistencies in the originals. This is to avoid the rekeyers having to make judgments or interpretations of the text. Tanner’s First Law of Rekeying states “operators should only key what they actually see, not what they think”. This is to avoid assumptions and guesses, and to prevent misspelt words in the historic original being ‘corrected’ by the rekeying, which would undermine the veracity of the text. As Claude Monet said "to see we must forget the name of the thing we are looking at", and for rekeying this is a challenge that is overcome by a detailed specification that removes from the keying operator the need to understand any context or language, leaving them just to key the characters they see. Detailed specifications are hard to write and require a large commitment of time and effort before the project has even got underway.
Handwriting recognition
The specialist conversion of handwriting to machine-readable text is referred to as handwriting recognition (HR). The type of HR used in tablets or your smartphone is comparatively accurate because the computer can monitor the characters whilst they are being formed. The form of HR which converts handwriting from retrospective digital images is much more analogous to OCR in looking for word and character blocks. It is thus less accurate due to the sheer number of handwriting styles and the variation within even one person’s style of writing. Most HR is done on forms (e.g. tax or medical forms) specially designed to control the variation in a person’s handwriting by using boxes for each letter and requiring upper case.
I have yet to see commercial or open source software for automatic transcription of, or the creation of searchable indexes from, handwritten historical documents that is really technically efficient and provides a significant cost efficiency over rekeying. I'm willing to stand corrected, so add comments if you know better.
Voice recognition
Voice recognition is a form of transcription where the human operator talks into a microphone connected to their computer and the software translates this into machine readable text. This is relatively inefficient as a human generally talks more slowly than a fast copy typist and there is also a high rate of inaccuracy for any word not normally in the dictionary, such as place names. This may be of use to academics with small amounts of text to transcribe. I merely record the existence of this method for completeness but discard it as generally too inefficient for historical documents.
Deciding on the appropriate method for your needs
Deciding upon a suitable text capture method for a project will be defined by available time and resource balanced against project goals. For instance, it is quite acceptable to deliver just an image of a page of text and allow the end user to read it for themselves. This is cheap and relatively easy to deliver, but lacks a lot of functionality that having machine readable text can offer.
Some benefits of machine readable text are:
- reuse, editing and reformatting of content;
- full text retrieval provides ease of access;
- indexing of content may be automated;
- metadata extraction and interaction with other systems;
- enabling mark-up into XML or HTML;
- accessibility to text content for visually impaired users; and
- delivering content with a much lower bandwidth requirement.
Please note, though, that this is just one chain; there are a number of possible combinations. The Stop Points also give a broad indication of the level of activity and effort required to achieve them: the further down the tree, the more effort the Stop Point is likely to require (and thus the more costly it is likely to be). Projects that have decided which Stop Point they are aiming for can then choose the most appropriate text capture method.
- Stop Point One: – represents just delivering the content via digital images. No text capture required.
- Stop Point Two: Indexing – the text is imported to a search engine and used as the basis for full text searching of the information resource. OCR only.
- Stop Point Three: Full text representation – in this option the text is shown to the end user as a representation of the original document. This requires a much higher level of accuracy to be acceptable and thus assumes validation of captured text. OCR or rekeying.
- Stop Point Four: Metadata – all of the above Stop Points would benefit from additional metadata to help describe and manage the resource. Manually added to the above resources with some automatic extraction possible.
- Stop Point Five: Mark-up – content is presented to the end user with layout, structure or metadata added via XML mark-up.
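To show what Stop Point Two involves in practice, here is a minimal sketch of a full text index built from captured page text. The page identifiers and sample text are invented, and the search simply requires every query word to appear somewhere on the page; a production search engine would add phrase matching, ranking and tolerance for OCR errors.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lower-cased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages containing every word in the query."""
    terms = WORD_PATTERN.findall(query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample pages standing in for OCR output.
pages = {
    "page-1": "Chaucer wrote of the Wife of Bath",
    "page-2": "Roman bath houses in Britain",
}
index = build_index(pages)
print(search(index, "Wife of Bath"))
print(search(index, "bath"))
```

Because the end user sees only the page image and never the indexed text, this Stop Point can tolerate raw OCR output in a way that Stop Point Three cannot.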
Another feature of the above flow is that every stage includes collection assessment, preparation and scanning. It is essential to assess the collection to identify its unique characteristics. These may be physical or content-driven characteristics, but these unique features will drive the digitisation mechanism and help define the required provision and access routes to the electronic version.
There are no standard digitisation projects, and defining the nature of the original materials to be digitised is the essential first step of any conversion project. Without this step, none of the other steps should be considered.