Text Capture and Optical Character Recognition 101

June 18, 2015

This blog post introduces text capture by describing the different methods, with a focus upon historical documents. I will cover the basics of OCR and rekeying, with some discussion of handwriting and voice recognition.

What is text capture

Text capture is a process rather than a single technology. It is the means by which textual content that resides within physical artefacts (such as in books, manuscripts, journals, reports, correspondence etc) may be transferred from that medium into a machine readable format. My focus here is on the capture of text from digital images that have been rendered from physical artefacts. Such digital images may be made via scanners or digital cameras and stored as digital page images for later access and use.

In some cases digital images of text content are sufficient to satisfy the end user's information needs and provide access to the resource in an electronic format that can be shared online. This sort of digital presentation of text resources is very useful for documents where transcribing the content would be difficult, such as handwritten letters or personal notes, but it can be done for any type of text. The reader must then recognise the text for themselves to render it meaningful, which may or may not be easy. However, to enable other computer based ways to use the text content – such as indexing, searching, data mining, copying and pasting – the text must be rendered machine readable.

Machine readable text may be gained through the various text capture processes listed below. Rather than presenting the end user with an electronic ‘photocopy’ of the page, the user is presented with a resource that includes machine readable text. This makes additional computing functions possible and the most important of these is the ability to index and search text. Without machine readable text the computer will not find that document on Chaucer or the pages that contain the words “Wife of Bath”. As a digital repository of text becomes bigger then the only efficient way to navigate through it is with search tools supported by good indexing and this is something that machine readable text makes practicable.
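To make the indexing point concrete, here is a minimal sketch of a word index with phrase search. The page texts, the page ids and the function names are all made up for illustration; a real search tool would do far more (stemming, ranking, fuzzy matching).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids on which it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search_phrase(pages, index, phrase):
    """Return ids of pages that contain the phrase as consecutive words."""
    words = tokenize(phrase)
    if not words:
        return []
    # Candidate pages must contain every word of the phrase.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    hits = []
    n = len(words)
    for page_id in sorted(candidates):
        tokens = tokenize(pages[page_id])
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(page_id)
    return hits


# Hypothetical pages: p3 is an image-only page with no captured text.
pages = {
    "p1": "The Wife of Bath is one of Chaucer's pilgrims.",
    "p2": "A bath was drawn for the merchant's wife.",
    "p3": "",
}
index = build_index(pages)
print(search_phrase(pages, index, "Wife of Bath"))
```

The image-only page can never be returned, however relevant its content, because it contributes nothing to the index.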

Thus text capture is a process that should be designed to add value to the text resource. Inherent within the concept of adding value is an assessment of whether the cost of delivering the benefit was commensurate with the value added. Thus the more automated the capture process, the easier it seems to justify the cost for the benefit across a large corpus.

The main methods for text capture are (in order of popularity of use):

Optical Character Recognition (OCR) – sometimes also known as Intelligent Character Recognition (ICR)
Rekeying
Handwriting Recognition (HR)
Voice or speech recognition

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a type of document image analysis where a scanned digital image that contains either machine printed or handwritten script is input into an OCR software engine and translated into a machine readable digital text format (like ASCII text).

OCR works by first pre-processing the digital page image into its smallest component parts with layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics, photographs etc are recognised and discarded.

The character blocks are then further broken down into component parts, pattern recognized and compared to the OCR engine's large dictionary of characters from various fonts and languages. Once a likely match is made it is recorded, and the characters in the word block are recognized in turn until all likely characters have been found for that word block. The word is then compared to the OCR engine's large dictionary of complete words that exist for that language.

These factors of characters and words recognised are the key to OCR accuracy – by combining them the OCR engine can deliver much higher levels of accuracy. Modern OCR engines extend this accuracy through more sophisticated pre-processing of source digital images and better algorithms for fuzzy matching, sounds-like matching and grammatical measurements to more accurately establish word accuracy.
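The word-level check can be sketched roughly as follows. This is only an illustration, not how any particular OCR engine works: the tiny lexicon, the `correct_word` function and the 0.7 cutoff are assumptions, and Python's standard `difflib` stands in for the engine's own fuzzy matching.

```python
import difflib

# Hypothetical miniature lexicon; a real engine would use a far larger one.
LEXICON = ["bath", "chaucer", "pilgrim", "tale", "wife"]


def correct_word(candidate, lexicon=LEXICON, cutoff=0.7):
    """Return the closest lexicon word, or the candidate unchanged if none is close."""
    word = candidate.lower()
    if word in lexicon:
        return word
    # get_close_matches ranks lexicon entries by similarity to the candidate.
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("Chauccr"))  # character-level misreading of "Chaucer"
print(correct_word("wlfe"))     # "i" misread as "l"
print(correct_word("xyzzy"))    # nothing similar in the lexicon
```

Note that for historical texts a modern dictionary can work against you: a genuine period spelling may be "corrected" to a modern word, so the cutoff and the lexicon both need care.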

Achieving an error rate better than 1 in 5,000 characters (99.98% accuracy) with fully automated OCR is usually only possible with post-1950s printed text (and not that frequently even then). Accuracies of greater than 95% (fewer than 5 in 100 characters wrong) are more usual for text printed between 1900 and 1950, and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong). Thus OCR for historical materials is usually hampered by the expensive and time-consuming need for manual or semi-automated proofreading and correction of the text to get as near to 100% as possible.
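For readers who want to check such figures on their own material, one common way to measure character accuracy is to compare the OCR output against a hand-checked transcription using edit distance. The sketch below assumes that approach; it is not taken from any supplier's method, and the sample strings and function names are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Character accuracy as a percentage of the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 100.0 * (1 - errors / len(ground_truth)))


def accuracy_from_error_ratio(one_in_n):
    """Convert '1 error in N characters' into a percentage accuracy."""
    return 100.0 * (1 - 1 / one_in_n)


truth = "The Wife of Bath"
ocr = "Tbe Wife of Batli"  # hypothetical OCR misreadings
print(edit_distance(truth, ocr))
print(round(character_accuracy(truth, ocr), 2))
print(round(accuracy_from_error_ratio(5000), 2))
```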

Optical Character Recognition as a technology is deeply affected by the following factors:

Examples of factors that cause OCR problems

Scanning methods possible
Nature of original paper
Nature of printing

Uniformity
Language
Text alignment
Complexity of alignment
Lines, graphics and pictures
Handwriting

Nature of document
Nature of output requirements

OCR Accuracy Example

If our processes should orientate to the intellectual and user aims desired from that resource, then surely our means of measuring success should also be defined by whether those aims are actually achieved? This means we have to escape from the mantra of character accuracy and explore the potential benefits of measuring success in terms of words – and not just any words but those that have more significance for the user searching the resource. When we look at the number of words that are incorrect, rather than the number of characters, the suppliers' accuracy statistics seem a lot less impressive.

For example, given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of "significant words" rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
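The two extremes in this example can be reproduced with a few lines of arithmetic. The function below is a sketch that assumes, as the example does, that every word has the same length; the function name is my own.

```python
import math


def word_accuracy_bounds(total_words, total_chars, char_accuracy):
    """Return (worst, best) counts of correct words for a given character accuracy.

    Assumes every word has the same (average) length.
    """
    avg_word_len = total_chars / total_words
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: the errors are packed into as few words as possible.
    fewest_wrong_words = math.ceil(wrong_chars / avg_word_len)
    # Worst case: every error lands in a different word.
    most_wrong_words = min(total_words, wrong_chars)
    return total_words - most_wrong_words, total_words - fewest_wrong_words


worst, best = word_accuracy_bounds(1000, 5000, 0.90)
print(worst, best)
```

For the page above this gives 500 correct words in the worst case and 900 in the best, matching the 50% and 90% word accuracy figures.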

The key consideration for newspaper digitization utilizing OCR is the usefulness of the text output for indexing and retrieval purposes. If it were possible to achieve 90% character accuracy and still get 90% word accuracy, then most search engines utilizing fuzzy logic would get in excess of 98% retrieval rate for straightforward prose text. In these cases the OCR accuracy may be of less interest than the potential retrieval rate for the resource (especially as the user will not usually see the OCRed text to notice it isn't perfect). The potential retrieval rate for a resource depends upon the OCR engine's accuracy with significant words, that is, those content words for which users might be interested in searching, not the very common function words such as "the", "he", "it", etc. In newspapers, significant words including proper names and place names are usually repeated if they are important to the news story. This further improves the chances of retrieval – as there are more opportunities for the OCR engine to correctly capture at least one instance of the repeated word or phrase. This can enable high retrieval rates even for OCR accuracies measuring lower than 90%.
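The effect of repetition can be sketched with simple probability. This assumes each occurrence of a word is recognised independently of the others, which is a simplification (a damaged region of a page can spoil several neighbouring words at once); the 70% figure is purely illustrative.

```python
def chance_of_retrieval(word_accuracy, occurrences):
    """Probability that at least one of several occurrences is captured correctly.

    Assumes each occurrence is recognised independently of the others.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    return 1 - (1 - word_accuracy) ** occurrences


# Illustrative only: a name recognised correctly 70% of the time.
for repeats in (1, 2, 3, 5):
    print(repeats, round(chance_of_retrieval(0.70, repeats), 3))
```

Under these assumptions a name that is only recognised correctly 70% of the time, but appears three times in a story, has a better than 97% chance of being findable.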

So, when we assess the accuracy of OCR output, it is vital we do not focus purely on a statistical measure but also consider the functionality enabled through the OCR's output such as:

Search accuracy
Volume of hits returned
Ability to structure searches and results
Accuracy of result ranking
Amount of correction required to achieve the required performance

See my paper in D-Lib Magazine for more information.

Rekeying

Rekeying is a process by which text content in digital images is keyed by a human directly via keyboard. This is differentiated from copy typing by the automation and industrialization of the process. Rekeying tends to be offered by commercial companies in offshore locations with India being the leading outsourcing supplier to the United Kingdom.

Rekeying is usually offered in three forms:

Double rekeying
Triple rekeying
OCR with rekeyed correction

In rekeying, the digital image of the text is viewed at high magnification in one window and keyed into a separate window (usually in specially designed software). Double and triple rekeying are variations on a quality assurance method. In double rekeying the same digital image of text is keyed by two different keying operators with no conferring over their interpretation of the text. The two outputs are then automatically compared by computer and a third person (usually in a supervisory role) is shown the two texts overlaid with the differences highlighted. This third operator has the casting vote in deciding which version is right for each difference. In this way high levels of accuracy can be reached by reducing keying errors introduced by human error. Triple rekeying is an even more accurate method which uses three keying operators instead of two.
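The automatic comparison step in double rekeying can be sketched with Python's standard `difflib`. This is an illustration only; commercial keying software will have its own comparison tools, and the two keyings below are invented.

```python
import difflib


def keying_differences(first_keying, second_keying):
    """List the spans where two independent keyings of the same text disagree."""
    matcher = difflib.SequenceMatcher(None, first_keying, second_keying, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append({
                "position": i1,                    # offset in the first keying
                "first": first_keying[i1:i2],      # what operator A keyed
                "second": second_keying[j1:j2],    # what operator B keyed
            })
    return differences


# Hypothetical keyings of one line by two operators.
operator_a = "Whan that Aprille with his shoures soote"
operator_b = "Whan that Aprille with his shoures sote"

for diff in keying_differences(operator_a, operator_b):
    print(diff)
```

Each reported difference is what the supervising operator would be shown; where the two keyings agree, nothing needs to be reviewed.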

Triple rekeying can be expected to achieve accuracy levels of 1 in 10,000 characters incorrect (99.99% character accuracy), whilst double rekeying should achieve between 1 in 2,500 and 1 in 5,000 characters incorrect (99.96 – 99.98% character accuracy).

In the last option, OCR with rekeyed correction, the text is first OCR'd. Wherever the confidence of the engine is lower than a set parameter, or a word does not appear in a dictionary, a rekeying operator inspects the word and corrects it manually as required. This will normally deliver accuracy of 99.9% or above, depending on the nature of the text. For instance, this mode will generally not be as successful for numeric or tabular texts, as these are problematic formats for OCR and throw so many potential corrections at the operator that double/triple rekeying would be more efficient.
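The selection rule described above (low engine confidence, or a word missing from the dictionary) might look something like this in outline. The word list, confidence values, dictionary and 0.80 threshold are all hypothetical; real OCR engines report confidence in their own formats.

```python
# Hypothetical OCR output: (word, engine confidence between 0 and 1).
ocr_words = [("The", 0.99), ("Wyf", 0.95), ("of", 0.98), ("Bathe", 0.62), ("tale", 0.97)]

# Hypothetical dictionary and threshold; both would be set per project.
DICTIONARY = {"the", "of", "tale", "wife", "bath"}
CONFIDENCE_THRESHOLD = 0.80


def words_for_review(words, dictionary=DICTIONARY, threshold=CONFIDENCE_THRESHOLD):
    """Return (index, word, reasons) for each word an operator should inspect."""
    flagged = []
    for index, (word, confidence) in enumerate(words):
        reasons = []
        if confidence < threshold:
            reasons.append("low confidence")
        if word.lower() not in dictionary:
            reasons.append("not in dictionary")
        if reasons:
            flagged.append((index, word, reasons))
    return flagged


for item in words_for_review(ocr_words):
    print(item)
```

Note how a period spelling such as "Wyf" is flagged even though the engine was confident. That is the right outcome: the operator confirms what is on the page rather than the software silently modernising it.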

Two main issues with rekeying

First, it appears relatively expensive because every character carries a conversion cost and thus the direct costs of capture are very apparent. However, the accuracy reached is very high and generally rekeying is cheaper than OCR when the same accuracy is expected. This is because correcting and proofreading OCR is most often more costly than rekeying with cheap offshore services.
The second and more problematic issue is the need for an extremely clear specification that accounts for all the variations and inconsistencies in the originals. This is to avoid the rekeyers having to make judgments or interpretations of the text. Tanner’s First Law of Rekeying states “operators should only key what they actually see, not what they think”. This is to avoid assumptions, guesses and to avoid misspelt words in the historic original being ‘corrected’ by the rekeying and thus removing the veracity of the text. As Claude Monet said "to see we must forget the name of the thing we are looking at" and for rekeying this is a challenge that is overcome by a detailed specification that removes from the keying operator the need to understand any context or language but just to key what characters they see. Detailed specifications are hard to write and require a large commitment of time and effort before the project has even got underway.

Handwriting recognition

The specialist conversion of handwriting to machine-readable text is referred to as handwriting recognition (HR). The type of HR used in tablets or your smartphone is comparatively accurate because the computer can monitor the characters whilst they are being formed. The form of HR which converts handwriting from retrospective digital images is much more analogous to OCR in looking for word and character blocks. It is thus less accurate due to the sheer number of handwriting styles and the variation within even one person’s style of writing. Most HR is done on forms (e.g. tax or medical forms) specially designed to control the variation in a person’s handwriting by using boxes for each letter and requiring upper case.

I have yet to see commercial or open source software for automatic transcription of, or the creation of searchable indexes from, handwritten historical documents that is really technically efficient and provides a significant cost efficiency over rekeying. I'm willing to stand corrected, so add comments if you know better.

Voice recognition

Voice recognition is a form of transcription where the human operator talks into a microphone connected to their computer and the software translates this into machine readable text. This is relatively inefficient as a human generally talks more slowly than a fast copy typist and there is also a high rate of inaccuracy for any word not normally in the dictionary, such as place names. This may be of use to academics with small amounts of text to transcribe. I merely record the existence of this method for completeness but discard it as generally too inefficient for historical documents.

Deciding on the appropriate method for your needs

Deciding upon a suitable text capture method for a project will be defined by available time and resource balanced against project goals. For instance, it is quite acceptable to deliver just an image of a page of text and allow the end user to read it for themselves. This is cheap and relatively easy to deliver, but lacks a lot of functionality that having machine readable text can offer.

Some benefits of machine readable text are:

reuse, editing and reformatting of content;
full text retrieval provides ease of access;
indexing of content may be automated;
metadata extraction and interaction with other systems;
enabling mark-up into XML or HTML;
accessibility to text content for visually impaired users; and
delivering content with a much lower bandwidth requirement

Therefore projects need to first decide what sort of text content and delivery they require before deciding upon a text capture method. We could for instance consider a possible digitisation chain of events with various Stop Points along the way for a typical document collection. Each Stop Point represents a place in the chain where a project could stop and state that their information goals have been delivered.

Please note though that this is just one chain; there are a number of possible combinations. The Stop Points also give a broad indication of the level of activity and effort required to achieve them – the further down the chain, the more effort (and thus the more potentially costly) the Stop Point is likely to be. Projects that have decided upon which Stop Point they are aiming for can then choose the most appropriate text capture method.

Stop Point One: Images – represents just delivering the content via digital images. No text capture required.
Stop Point Two: Indexing – the text is imported to a search engine and used as the basis for full text searching of the information resource. OCR only.
Stop Point Three: Full text representation – in this option the text is shown to the end user as a representation of the original document. This requires a much higher level of accuracy to be acceptable and thus assumes validation of captured text. OCR or rekeying.
Stop Point Four: Metadata – all of the above Stop Points would benefit from additional metadata to help describe and manage the resource. Manually added to the above resources with some automatic extraction possible.
Stop Point Five: Mark-up – content is presented to the end user with layout, structure or metadata added via XML mark-up.

Choosing one of the above Stop Points obviously makes choosing the appropriate text capture method easier.
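As a small illustration of what Stop Point Five involves, the sketch below wraps captured text in a very simple XML structure using Python's standard library. The element names (`page`, `title`, `body`, `p`) are made up for this example; a real project would follow an established schema chosen for its materials.

```python
import xml.etree.ElementTree as ET


def mark_up_page(title, paragraphs, page_number):
    """Wrap captured text in a simple, made-up XML structure."""
    page = ET.Element("page", attrib={"n": str(page_number)})
    ET.SubElement(page, "title").text = title
    body = ET.SubElement(page, "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "p").text = paragraph
    return ET.tostring(page, encoding="unicode")


xml_text = mark_up_page(
    "The Wife of Bath's Tale",
    ["First captured paragraph.", "Second captured paragraph."],
    12,
)
print(xml_text)
```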

Another feature of the above flow is that every stage includes collection assessment, preparation and scanning. It is essential to assess the collection to identify its unique characteristics. These may be physical or content driven characteristics, but these unique features will drive the digitisation mechanism and help define the required provision and access routes to the electronic version.

There are no standard digitisation projects, and defining the nature of the original materials to be digitised is the essential first step of any conversion project. Without this step, none of the other steps should be considered.

Comments

Federico, 18 June 2015 at 18:23
What digital libraries in the world go as far as Stop Point Five? I only know of Wikisource.org.

Tony Blake, 19 June 2015 at 10:56
Simon, we use OCR in BoB (www.bobnational.com) to convert the subtitles, which are broadcast as bitmaps, into text. This machine readable text is time stamped and is searchable as time based metadata.

trosesandler, 19 June 2015 at 16:07
Great post! Our project Purposeful Gaming and BHL (http://biodivlib.wikispaces.com/Purposeful+Gaming) is combining both OCR and rekeying of text. We create 2 OCR outputs from different software for a single page and compare the differences. Those differences are then sent to 2 online games where the public crowdsources the correct words for us. This article has given me some thought about prioritizing the differences rather than sending them all to the game, e.g. the common function words probably don't need verification as much as words users will actually search on. Thanks for the food for thought!

Lincolnarchives, 19 June 2015 at 21:41
The Lincoln Archives Digital Project has been using good old fashioned keying by hand and sometimes using Natural Dragon, since everything is in cursive.

Anonymous, 22 June 2015 at 01:22
see http://trove.nla.gov.au/

K, 26 June 2015 at 06:20
Thank you for the post. Have you any experience with Transkribus? It's a text recognition programme especially for handwritten documents. I haven't used it myself, but it looks interesting.
https://transkribus.eu/Transkribus/

Unknown, 16 July 2015 at 20:21
I have been working with transcripts created from the OCR in CONTENTdm and rekeying the corrections by hand. Most of the text is in table and spreadsheet format. At present, I am copying the text into Excel, cleaning up the data and formatting, then copying back into CONTENTdm. Very tedious. I need to find a better, faster way to clean up this text. Any suggestions?

DomFilk, 22 October 2015 at 13:16
I found a free online OCR service; it uses Tesseract OCR 3.02.

Addmengroups, 21 December 2015 at 07:52
You can include Intelligent Character Recognition Software too.

Thomas T. Vanover, 26 October 2017 at 19:32
Awesome article. Thanks

Unknown, 17 January 2018 at 09:57
Good to know your writing.

Shampa, 9 June 2023 at 16:50
Hi Simon Tanner,
Firstly, I want to thank you for this knowledgeable article. You described many processes of conversion in a very nice way. Nowadays many persons use OCR process. But I think to convert image to text for bulk of pages always take service from good data entry service provider if you cannot know the process of conversion by proper training of characters. After that to achieve accuracy, you have to cross check always with the help of good image to text proofreading software or manually. But I think to use the software is best. Voice recognition works good now if you pronounce properly but little bit tough. Thank you again for this valuable post.
