Thursday, 18 June 2015

Text Capture and Optical Character Recognition 101

This blog will introduce text capture by describing the different methods with a focus upon historical documents. I will introduce the basics of OCR and rekeying with discussion of handwriting and voice recognition.

What is text capture

Text capture is a process rather than a single technology. It is the means by which textual content that resides within physical artefacts (such as in books, manuscripts, journals, reports, correspondence etc) may be transferred from that medium into a machine readable format. My focus here is on the capture of text from digital images that have been rendered from physical artefacts. Such digital images may be made via scanners or digital cameras and stored as digital page images for later access and use.

In some cases digital images of text content are sufficient to satisfy the end users information needs and provide access to the resource in an electronic format that can be shared online. This sort of digital presentation of text resources is very useful for documents where transcribing the content would be difficult, such as for handwritten letters or personal notes, but it can be done for any type of text. The reader must make their own recognition of the text to render it meaningful and may have an easy time doing this or not. However, to enable other computer based ways to use the text content – such as for indexing, searching, data mining, copying and pasting – then the text must be rendered machine readable.

Machine readable text may be gained through the various text capture processes listed below. Rather than presenting the end user with an electronic ‘photocopy’ of the page, the user is presented with a resource that includes machine readable text. This makes additional computing functions possible and the most important of these is the ability to index and search text. Without machine readable text the computer will not find that document on Chaucer or the pages that contain the words “Wife of Bath”. As a digital repository of text becomes bigger then the only efficient way to navigate through it is with search tools supported by good indexing and this is something that machine readable text makes practicable.

Thus text capture is a process that should be designed to add value to the text resource. Inherent within the concept of adding value is an assessment of whether the cost of delivering the benefit was equitable with the value added. Thus the more automated the capture process the easier it seems to justify the cost for the benefit across a large corpus.

The main methods for text capture are (in order popularity of use):
  • Optical Character Recognition (OCR) – sometimes also known as Intelligent Character Recognition (ICR)
  • Rekeying
  • Handwriting Recognition (HR)
  • Voice or speech recognition

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a type of document image analysis where a scanned digital image that contains either machine printed or handwritten script is input into an OCR software engine and translated into a machine readable digital text format (like ASCII text).

OCR works by first pre-processing the digital page image into its smallest component parts with layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics, photographs etc are recognised and discarded.

The character blocks are then further broken down into components parts, pattern recognized and compared to the OCR engines large dictionary of characters from various fonts and languages. Once a likely match is made then this is recorded and a set of characters in the word block are recognized until all likely characters have been found for the word block. The word is then compared to the OCR engine’s large dictionary of complete words that exist for that language.


These factors of characters and words recognised are the key to OCR accuracy – by combining them the OCR engine can deliver much higher levels of accuracy. Modern OCR engines extend this accuracy through more sophisticated pre-processing of source digital images and better algorithms for fuzzy matching, sounds-like matching and grammatical measurements to more accurately establish word accuracy.

Gaining character accuracies of greater than 1 in 5000 characters (99.98%) with fully automated OCR is usually only possible with post-1950’s printed text (and not that frequently even then). Gaining accuracies of greater than 95% (5 in 100 characters wrong) is more usual for post-1900 and pre-1950’s text and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong). Thus OCR for historical materials is usually hampered by the expensive and time consuming need to manually or semi-automated proofreading and correction of the text to gain as near to 100% as possible.

Optical Character Recognition as a technology is deeply affected by the following factors:
Examples of factors that cause OCR problems

  • Scanning methods possible
  • Nature of original paper
  • Nature of printing
    • Uniformity
    • Language
    • Text alignment
    • Complexity of alignment
    • Lines, graphics and pictures
    • Handwriting
  • Nature of document
  • Nature of output requirements

OCR Accuracy Example

If our processes should orientate to the intellectual and user aims desired from that resource, then surely our means of measuring success should also be defined by whether those aims are actually achieved? This means we have to escape from the mantra of character accuracy and explore the potential benefits of measuring success in terms of words – and not just any words but those that have more significance for the user searching the resource. When we look at the number of words that are incorrect, rather than the number of characters, the suppliers' accuracy statistics seem a lot less impressive.

For example, given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of "significant words" rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.

The key consideration for newspaper digitization utilizing OCR is the usefulness of the text output for indexing and retrieval purposes. If it were possible to achieve 90% character accuracy and still get 90% word accuracy, then most search engines utilizing fuzzy logic would get in excess of 98% retrieval rate for straightforward prose text. In these cases the OCR accuracy may be of less interest than the potential retrieval rate for the resource (especially as the user will not usually see the OCRed text to notice it isn't perfect). The potential retrieval rate for a resource depends upon the OCR engine's accuracy with significant words, that is, those content words for which users might be interested in searching, not the very common function words such as "the", "he", "it", etc. In newspapers, significant words including proper names and place names are usually repeated if they are important to the news story. This further improves the chances of retrieval – as there are more opportunities for the OCR engine to correctly capture at least one instance of the repeated word or phrase. This can enable high retrieval rates even for OCR accuracies measuring lower than 90%.

So, when we assess the accuracy of OCR output, it is vital we do not focus purely on a statistical measure but also consider the functionality enabled through the OCR's output such as:
  • Search accuracy
  • Volume of hits returned
  • Ability to structure searches and results
  • Accuracy of result ranking
  • Amount of correction required to achieve the required performance
See my paper in DLib for more information.

Rekeying

Rekeying is a process by which text content in digital images is keyed by a human directly via keyboard. This is differentiated from copy typing by the automation and industrialization of the process. Rekeying tends to be offered by commercial companies in offshore locations with India being the leading outsourcing supplier to the United Kingdom.

Rekeying is usually offered in three forms:
  • Double rekeying
  • Triple rekeying
  • OCR with rekeyed correction
In rekeying, the digital image of the text is viewed at high magnification in one window and keyed into a separate window (usually in specially designed software). Double and triple rekeying are variations on a quality assurance method. In double rekeying the same digital image of text is keyed by two different keying operators with no conferring over their interpretation of the text. This output is then automatically compared by computer and a third person (usually in a supervisory role) is shown the two texts overlaid with with the differences highlighted. The third operator makes the casting vote in deciding which version is right for each difference. In this way high levels of accuracy can be reached by reducing keying errors introduced by human error. Triple rekeying is an even more accurate method which uses three keying operators instead of two.

Triple rekeying would expect to achieve accuracy levels of 1 in 10,000 characters incorrect (99.99% character accuracy) whilst double rekeying should achieve between 1 in 2500 and 1 in 5000 characters incorrect (99.96 – 99.98% character accuracy).

In the latter option of OCR with rekeyed correction then the text is OCR'd and where the confidence of the engine is lower than a set parameter or if words do not appear in a dictionary then these are inspected by a rekeying operator and corrected manually as required. This normally will deliver in the 99.9% or above accuracy level depending on the nature of the text. For instance, this mode will generally not be as successful for numeric or tabular texts as these are problematic formats for OCR throwing so many potential corrections at the operator that double/triple rekeying would be more efficient.


Two main issues with rekeying

  1. First, it appears relatively expensive because every character carries a conversion cost and thus the direct costs of capture are very apparent. However, the accuracy reached is very high and generally rekeying is cheaper than OCR when the same accuracy is expected. This is because correcting and proofreading OCR is most often more costly than rekeying with cheap offshore services.
  2. The second and more problematic issue is the need for an extremely clear specification that accounts for all the variations and inconsistencies in the originals. This is to avoid the rekeyers having to make judgments or interpretations of the text. Tanner’s First Law of Rekeying states “operators should only key what they actually see, not what they think”. This is to avoid assumptions, guesses and to avoid misspelt words in the historic original being ‘corrected’ by the rekeying and thus removing the veracity of the text. As Claude Monet said "to see we must forget the name of the thing we are looking at" and for rekeying this is a challenge that is overcome by a detailed specification that removes from the keying operator the need to understand any context or language but just to key what characters they see. Detailed specifications are hard to write and require a large commitment of time and effort before the project has even got underway.

Handwriting recognition

The specialist conversion of handwriting to machine-readable text is referred to as handwriting recognition (HR). The type of HR used in tablets or your smartphone is comparatively accurate because the computer can monitor the characters whilst they are being formed. The form of HR which converts handwriting from retrospective digital images is much more analogous to OCR in looking for word and character blocks. It is thus less accurate due to the sheer number of handwriting styles and the variation within even one person’s style of writing. Most HR is done on forms (e.g. tax or medical forms) specially designed to control the variation in a person’s handwriting by using boxes for each letter and requiring upper case.

I am yet to see a commercial or open source software for automatic transcription of, or the creation of searchable indexes from, handwritten historical documents that is really technically efficient and provides a significant cost efficiency over rekeying. I'm willing to stand corrected so add comments if you know better.

Voice recognition

Voice recognition is a form of transcription where the human operator talks into a microphone connected to their computer and the software translates this into machine readable text. This is relatively inefficient as a human generally talks more slowly than a fast copy typist and there is also a high rate of inaccuracy for any word not normally in the dictionary, such as place names. This may be of use to academics with small amounts of text to transcribe. I merely record the existence of this method for completeness but discard it as generally too inefficient for historical documents.

Deciding on the appropriate method for your needs

Deciding upon a suitable text capture method for a project will be defined by available time and resource balanced against project goals. For instance, it is quite acceptable to deliver just an image of a page of text and allow the end user to read it for themselves. This is cheap and relatively easy to deliver, but lacks a lot of functionality that having machine readable text can offer.

Some benefits of machine readable text are:
  • reuse, editing and reformatting of content;
  • full text retrieval provides ease of access;
  • indexing of content may be automated;
  • metadata extraction and interaction with other systems;
  • enabling mark-up into XML or HTML;
  • accessibility to text content for visually impaired users; and
  • delivering content with a much lower bandwidth requirement
Therefore projects need to first decide what sort of text content and delivery they require before deciding upon a text capture method. We could for instance consider a possible digitisation chain of events with various Stop Points along the way for a typical document collection.  Each Stop Point represents a place in the chain where a project could stop and state that their information goals have been delivered.

Please note though that this is just one chain, there are a number of possible combinations. The Stop Points also give a broad indication of the level of activity and effort required to achieve them – the further down the tree the more effort (and thus more potentially costly) the Stop Point is likely to be. Projects that have decided upon which Stop Point they are aiming for can then choose the most appropriate text capture method.
  • Stop Point One: – represents just delivering the content via digital images. No text capture required.
  • Stop Point Two: Indexing – the text is imported to a search engine and used as the basis for full text searching of the information resource. OCR only.
  • Stop Point Three: Full text representation – in this option the text is shown to the end user as a representation of the original document. This requires a much higher level of accuracy to be acceptable and thus assumes validation of captured text. OCR or rekeying.
  • Stop Point Four: Metadata – all of the above Stop Points would benefit from additional metadata to help describe and manage the resource. Manually added to the above resources with some automatic extraction possible.
  • Stop Point Five: Mark-up – content is presented to the end user with layout, structure or metadata added via XML mark-up.
Choosing one of the above Stop Points obviously makes choosing the appropriate text capture method easier.

Another feature of the above flow is that included in all stages are the collection assessment, preparation and scanning. It is essential to assess the collection to identify its unique characteristics.  These may be physical or content driven characteristics, but these unique features will drive the digitisation mechanism and help define the required provision and access routes to the electronic version.

There are no standard digitisation projects and the defining the nature of the original materials to be digitised is the essential first step of any conversion project. Without the steps then none of the other steps should be considered.

17 comments:

  1. What digital libraries in the world go as far as Stop Point Five? I only know of Wikisource.org.

    ReplyDelete
    Replies
    1. Old Bailey Session Papers would be a good example. Also pretty much any TEI related project you could think of. Some examples of when it might be used below:
      http://www.oldbaileyonline.org/ admittedly rekeyed & OCR'd content
      http://www.finerollshenry3.org.uk/home.html structured data

      Delete
  2. Simon, we use OCR in BoB (www,bobnational.com) to convert the subtitles, which are broadcast as bitmaps, into text. This machine readable text is time stamped and is searchable as time based metadata.

    ReplyDelete
    Replies
    1. Brilliant example, thank you Tony.
      Big fan also of BUFVC. We experimented with speech to text from video and film and it was sometimes good, sometimes quite poor according to the context of the video (news and sports commentary was good - voxpops on the street, drama etc = bad).

      Delete
  3. Great post! Our project Purposeful Gaming and BHL (http://biodivlib.wikispaces.com/Purposeful+Gaming) is combining both OCR and rekeying of text. We create 2 OCR outputs from different software for a single page and compare the differences. Those differences are then sent to 2 online games where the public crowdsources the correct words for us. This article has given me some thought about prioritizing the differences rather than sending them all to the game. e.g. the common function words probably don't need verification as much as words users will actually search on. Thanks for the food for thought!

    ReplyDelete
    Replies
    1. Glad it could help - yes prioritisation would give you a much better efficiency. Also the process you describe (whilst you are using crowdsourcing to decide the correct words) is exactly what the OCR engines are doing inside themselves. The most expensive OCR packages are lining up 5, 7 or even 15 algorithms that then vote on the outcome based on probability - a kind of internal crowd.

      Delete
  4. The Lincoln Archives Digital Project has been using good old fashioned keying by hand and sometimes using Natural Dragon, since everything is in cursive.

    ReplyDelete
    Replies
    1. All hail the Lincoln Archives Digital Project!
      I did consultancy for you a while ago (goes and checks records... crickey) back in 2009 when it was all kicking off big time. You've done great work.

      Delete
  5. see http://trove.nla.gov.au/

    ReplyDelete
    Replies
    1. Trove is a classic - what is most interesting to me is augmenting the OCR with corwdsourcing. See for instance:
      http://www.nla.gov.au/our-publications/staff-papers/trove-crowdsourcing-behaviour

      Delete
  6. Thank you for the post. Have you any experience with Transkribus? It's a text recognition programme especially for handwritten documents. I haven't used it myself, but it looks interesting.
    https://transkribus.eu/Transkribus/

    ReplyDelete
    Replies
    1. Handwritten documents are particularly hard for obvious reasons. Any of these transcription tools depend upon having a large corpus with steady consistent handwriting or the same hand throughout. Otherwise variation gets in the way and rekeying becomes cheaper than correcting the recognition. The problem of pattern recognition is getting better but it is still mainly a task of comparison to a known source - for instance most OCR engines will work on over 2,000 exemplar instances (i.e. 2000 x of each letter x in each font). How do we build up such a set of comparable patterns in handwriting. There are some great examples but there is still much work to be done. See for example http://www.digipal.eu/

      Delete
  7. I have been working with transcripts created from the OCR in CONTENTdm and rekeying the corrections by hand. Most of the text is in table and spreadsheet format. At present, I am copying the text into Excel, cleaning up the data and formatting, then copying back into CONTENTdm. Very tedious. I need to find a better, faster way to clean up this text. Any suggestions?

    ReplyDelete
    Replies
    1. Hi Eileen
      I'm assuming you've seen the ContentDM help pages here:
      http://www.contentdm.org/help6/projectclient/ocr.asp

      If the text original is table or spreadsheet format then you are in a world of OCR pain without many reasonable solutions that don't involve a lot of person time. I'm assuming you have more numeric or date or name based content if in tables - these are hard to deal with and more likely to have inaccuracies.

      I would try getting to the original images if possible and running them through a professional OCR package such as ABBYY Finereader Pro (not the limited edition). That would give you more control.

      You could also look at this as a volumteer opportunity and consider whether volunteers would be willing to do this type of task or even consider crowdsourcing (cheaper than it seems). Handballing the text is probably going to remain the best option I'm afraid.

      Delete
  8. I find a free online ocr, it's using tesseract ocr 3.02.

    ReplyDelete
  9. Great post I would like to thank you for the efforts you have made in writing this interesting and knowledgeable article.
    Wormaxio

    ReplyDelete