OCR Frequently Asked Questions
Maintained by Omid E. Kia
Document Processing Group, Center for Automation Research
University of Maryland, College Park, Md. 20742
E-mail: kia@cfar.umd.edu
This document answers the most frequently asked questions about OCR and its derivatives. This FAQ is mostly applicable to the document analysis newsgroups, where there has been a large number of requests for general information. It is organized as a FAQ with an emphasis on generality rather than specialization. This document is also available in PostScript format at ****. For any questions, deletions, or additions, kindly send your request to kia@cfar.umd.edu. I hope that this document will be helpful to all its readers.
Contents
OCR FAQ Basics
  What is the reason for this FAQ?
  Who should use this FAQ?
  What is covered in this FAQ?
  What is not covered in this FAQ?
What are OCR, ICR and OMR?
  What does OCR stand for?
  What does OMR stand for?
  What does ICR stand for?
Sources of Information
  Where can I get general information, preferably in one place?
Applications and Binaries
  Are there any freeware / shareware programs available?
  Is there a list of companies?
Optical Input Devices
  What are the minimum requirements?
  How are flatbed scanners?
  How are handheld scanners?
  What about pen scanners?
Comparisons
  Specific application performance comparisons?
  Pointers to performance comparisons?
  Are there any comparisons of OCR methods?
Sources of on-line publications?
  Technical report sources?
  Centers with on-line information?
  Orderable sources?
  Bibliography sources?
Databases for testing purposes?
  Are there document databases?
  Are there handwritten character databases?
  Are there forms databases?
  Anything else?
Internet FTP Sites
Disclaimer
OCR FAQ Basics
What is the reason for this FAQ?
The reason for this FAQ (Frequently Asked Questions) is to provide quick and simple answers to questions which are presented in the newsgroups "comp.ai.doc-analysis.misc" and "comp.ai.doc-analysis.ocr". The availability and usage of this FAQ should expedite OCR efforts and open a more in-depth dialogue to promote usage, research, and integration of OCR in automation.
Who should use this FAQ?
A person who is new to the field should read this FAQ to get a feel for available material on the subject and also to grasp the extent of the field. There are a number of information sources within this document along with some mention of accepted facts and figures.
What is covered in this FAQ?
For the most part this FAQ includes a number of pointers to Internet sites which house information, databases, or actual packages which are available for various OCR problems. These pointers range from WWW home pages to FTP sites and some points of contact.
Other accepted practices, setups, and taboos which are directly related to OCR tasks are also mentioned.
What is not covered in this FAQ?
Specific questions which pertain to only a small number of users are avoided. Every effort is made to produce answers to a wide range of problems, but application specific solutions are left out mainly due to limited resources.
What are OCR, ICR and OMR?
What does OCR stand for?
OCR stands for Optical Character Recognition. This term is typically used for general character recognition, which includes the transformation of anything humanly readable to a machine manipulatable representation. In this context the task of character recognition involves understanding of both machine printed characters and handwritten characters. There have been significant advances in the former, and the bulk of software available for automation is geared towards the recognition of machine printed characters.
What does OMR stand for?
OMR stands for Optical Mark Reader or Recognition. OMR is used to read forms which have variable size field entries, where an entry can be in the form of a check mark, a cross, or some scribble. These forms include but are not restricted to automated entry sheets where the user needs to fill in an oval shape with a number 2 pencil. OMR usually needs a blank form and a definition of entry zones; the OMR then detects marks in the scanned image and classifies them accordingly.
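The mark detection step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any particular OMR product: the images are assumed to be already registered bi-level arrays (1 for black), the zone coordinates and the names fill_ratio, detect_marks and the 0.25 threshold are hypothetical, and a zone counts as marked when it holds clearly more ink than the same zone on the blank form.

```python
from typing import Dict, List, Tuple

# A bi-level image as rows of 0 (white) and 1 (black) pixels.
Image = List[List[int]]
# An entry zone as (top, left, bottom, right); bottom and right are exclusive.
Zone = Tuple[int, int, int, int]


def fill_ratio(image: Image, zone: Zone) -> float:
    """Return the fraction of black pixels inside one entry zone."""
    top, left, bottom, right = zone
    area = (bottom - top) * (right - left)
    if area <= 0:
        raise ValueError("zone must have positive area")
    black = sum(sum(row[left:right]) for row in image[top:bottom])
    return black / area


def detect_marks(filled: Image, blank: Image, zones: Dict[str, Zone],
                 threshold: float = 0.25) -> Dict[str, bool]:
    """Flag a zone as marked when it has clearly more ink than the blank form."""
    results = {}
    for name, zone in zones.items():
        added_ink = fill_ratio(filled, zone) - fill_ratio(blank, zone)
        results[name] = added_ink >= threshold
    return results


# Tiny 4x8 example with two 4x4 zones; the blank form is all white.
blank_form = [[0] * 8 for _ in range(4)]
scanned_form = [
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
zones = {"A": (0, 0, 4, 4), "B": (0, 4, 4, 8)}
marks = detect_marks(scanned_form, blank_form, zones)
```

Here zone A is reported as marked and zone B is not, since the single stray pixel in B stays below the threshold. Subtracting the blank form's ink keeps pre-printed outlines (such as the oval itself) from being counted as a mark.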
What does ICR stand for?
ICR stands for Intelligent Character Recognition. ICR is usually considered to be the extension of OCR which explicitly includes handwritten characters. There are a large number of research efforts in this field, and they do not seem to be ending anytime soon. Most working testbeds work on limited vocabulary and clean handwriting. A robust mechanism does not exist to perform this task effectively. However, there are a large number of publications and test images which are available for further research.
Sources of Information
Where can I get general information, preferably in one place?
Every effort has been taken to identify sources of information for OCR, OMR, and ICR. However, we encourage any additions or changes from the readers. In this section of the FAQ particular attention has been given to sources of general information. Specific applications and resources are covered later in the FAQ. The Document Understanding site below tries to maintain an up-to-date set of information for all OCR products and related items:
http://documents.cfar.umd.edu/
Mitek Systems are also very active in OCR / ICR development and have created a nice compilation of information which can be accessed at:
http://www.miteksys.com/ocr_info/
There is also a technology site which addresses OCR. There is a large amount of useful starter information at this site, and it is located at:
http://documents.cfar.umd.edu/cgi-bin/groups
http://www.cs.jcu.edu.au/~michael/document_sites.html
http://www.miteksys.com/ocr_info/research.html
Applications and Binaries
Are there any freeware / shareware programs available?
Yes, there are a number of programs available which are either free to copy and use (freeware), or for which a small fee is paid to get a full-blown copy (shareware).
Wocar: A Microsoft Windows based OCR package developed by Cyril Cambien. It is TWAIN compliant and also reads TIFF images. This software is available at:
http://www.simtel.net/pub/simtelnet/win95/graphics/wocar25.zip
Ocrchie: Originally developed by Kathey Marsden at U.C. Berkeley as part of a senior project. It is available with supporting documentation at:
http://http.cs.berkeley.edu/~fateman/kathey/ocrchie.html
Cal Poly OCR: Developed as a class project at Cal Poly SLO (California Polytechnic State University, San Luis Obispo). A pointer exists at
http://documents.cfar.umd.edu/ocr
and the actual location is at
ftp://ftp.csc.calpoly.edu/pub/ocr/
Xocr: Written by martin_bauer@s2.maus.de (Martin Bauer), this is freeware which is accessible at the following:
ftp://sunsite.unc.edu/pub/linux/x11/xapps/graphics/
ftp://ftp.cdrom.com/pub/linux/sunsite/x11/xapps/graphics/
There are two versions of this software, English and Deutsch (German).
OCR for Atari: Optical Character Recognition for the Atari ST / STE / TT / Falcon
http://www.ensta.fr/internet/atari/ocr.html
ftp://ftp.uni-kl.de/pub/atari/misc
ftp://ftp.isbiel.ch/atari/diverrses
ftp://ftp.cnam.fr/pub/atari/text
ftp://atari.archive.umich.edu/atari/applications/other
Newsgroup discussion at: comp.sys.atari.st
Comcom: Public Domain OCR Software for the PC
http://www.comcomsystems.com/image.html
ftp://oak.oakland.edu/pub3/simtel-win3/fax/elaicr19.zip
ftp://ftp.intnet.net/pub/windows/ocr.software/elaicr19.exe
Is there a list of companies?
A list is kept at:
http://www.miteksys.com/ocr_info/companies.html
http://documents.cfar.umd.edu/resources/products/
Optical Input Devices
What are the minimum requirements?
The bare minimum requirements for image scanning to be used for OCR are bi-level at 200 dpi. The assumption is that the letters will have to be sufficiently large to be effectively extracted; small fonts and small print will probably not work very well at this resolution. It is more widely accepted that 300 dpi is readable by humans and should also be used for scanning. Since humanly unreadable documents are usually not presented for OCR, it suffices to consider a 300 dpi scanner. While documents are bi-level images for the most part, illumination plays a big role in gathering a good scan, and sometimes there is a need to perform adaptive thresholding to arrive at a clean bi-level document image. To be able to do this, the scanner must be able to retrieve greyscale scans. A scanner able to gather a 256 level greyscale scan is sufficient, mostly due to the fact that this greyscale resolution is beyond the capabilities of the human eye. So for a robust system a scanner with 300 dpi and 256 greyscale levels is preferred.
Higher resolution is always nice but does not improve OCR performance by a large amount, and use of color does not improve performance unless the text image to be scanned is in a multi-color format. If the print is in one color, not necessarily black, there should be sufficient contrast that a greyscale scanner will be able to extract the character components, but if multi-color fonts and multi-color backgrounds are used extensively, the proper choice would be to use a color scanner. Note that adaptive thresholding still needs to be done in order to convert to bi-level for input to an appropriate OCR package.
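As a rough illustration of the adaptive thresholding step mentioned above, the sketch below converts a 256 level greyscale scan to bi-level by comparing each pixel with the mean of its neighborhood. Local mean thresholding is only one common form of adaptive thresholding and is an assumption here, as are the function name adaptive_threshold and the window and offset values; a real scan would normally use a much larger window than this toy example.

```python
from typing import List

Gray = List[List[int]]  # greyscale pixels, 0 (black) to 255 (white)


def adaptive_threshold(gray: Gray, window: int = 15, offset: int = 10) -> List[List[int]]:
    """Return a bi-level image: 1 where a pixel is darker than its local mean minus offset."""
    height, width = len(gray), len(gray[0])
    # Integral image with a zero border, so any window sum costs four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            count = (y1 - y0) * (x1 - x0)
            # Same test as pixel < mean - offset, kept in integers.
            if gray[y][x] * count < total - offset * count:
                out[y][x] = 1
    return out


# Toy page: illumination falls off from left (200) to right (110),
# with two ink pixels that are 80 levels darker than their background.
page = [[200 - 10 * x for x in range(10)] for _ in range(5)]
page[2][2] -= 80
page[2][7] -= 80

adaptive = adaptive_threshold(page, window=3, offset=10)
fixed = [[1 if p < 128 else 0 for p in row] for row in page]  # single global cut

adaptive_ink = [(y, x) for y in range(5) for x in range(10) if adaptive[y][x]]
fixed_ink_count = sum(map(sum, fixed))
```

On this toy page the local threshold keeps only the two ink pixels, while the single global cut at 128 also turns the dimly lit right-hand columns black. That difference is the reason a greyscale scan is wanted before converting to bi-level.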
How are flatbed scanners?
Flatbed scanners are by far the most preferred mode of scanning. For automated entry, some flatbed scanners are equipped with automatic sheet feeders, and their automatic scanning methodology results in consistent performance. The only drawback is the effect of the flat screen while scanning a book and the artifacts which arise from this situation. There is really no alternative for scanning automatically while retaining this feature. The list of scanner manufacturers is truly immense and is beyond the scope of this FAQ; readers are kindly referred to http://documents.cfar.umd.edu/resources/products/ for further information. Available scanners are able to scan in excess of 1200 dpi with 24 bit color scanning ability.
How are handheld scanners?
An alternative mode of scanning is via a cheap handheld scanner. While these scanners are cheaper than their flatbed cousins, they are harder to handle. These scanners utilize a roller which advances the pixel counter in order to scan a row. At higher resolutions, it is hard to keep a sufficiently low speed to perform an accurate scan, and often rows of pixels are dropped from the scan. Various contraptions are available on the market which can be retrofitted to the handheld scanner in order to avoid this problem. Also, due to their small size it is often hard to capture an entire document. These scanners are sometimes bundled with OCR software.
What about pen scanners?
To date there is only one reference to a pen scanner. The IRIS Datapen has built-in OCR which recognizes up to 100 characters per second. This product is made by Primax Electronic, which can be reached at 800-338-3693. This product is perfect for scanning in a piece of an article, memo, or ibi[...] this product is not known at the time of this publication.
Comparisons
I think that this has been done? Mitek's comparison page.
This section is included to promote communication in the OCR community on the performance issues of various OCR applications. The performance statistics mentioned in this section are the views of the author and may be different from publicized statistics, and any implied responsibility is waived. This information source is from a disinterested party and should serve as a basis for considering an OCR package.
Specific Application Performance Comparisons?
The number of applications which are available is open to debate here. This section is compiled from user feedback, and every effort is made to keep it objective. Strong and weak points of the associated applications are discussed along with their usability. If the reader has any input for this section, please send an e-mail to kia@cfar.umd.edu with reference to a specific application, its trade name and manufacturer, along with the version number and any comments. Input and feedback are strongly encouraged.
Pointers to Performance Comparisons?
Some companies provide information on their products in terms of accuracy and performance. A list is provided which points to these locations. For any additions or changes please forward an e-mail to kia@cfar.umd.edu.
Caere Products: Functional description without any physical performance numbers.
http://www.caere.com/live/content/products/products.htm
Mitek Products: Functional as well as accuracy and speed specifications.
http://www.miteksys.com/products/qs/index.html
Olduvai Products: Mostly functional description with few performance specifications.
http://www.shadow.net/~olduvai/pressreleases.html
Xerox Products: Functional description with performance and accuracy specifications.
http://www.xerox.com/products/xis/contents.html
Are there any comparisons of OCR methods?
There are only a small number of publications which address performance issues. The published papers are:
Anigbogu, J.C. et al. Performance evaluation of an HMM based OCR system. Proceedings of the International Conference on Pattern Recognition, 1992. pp. 565-568.
Cushman, W.H. et al. Usable OCR: What are the minimum performance requirements? Proceedings of the ACM Conference on Computer Human Interaction, 1990. pp. 145-151.
Nagy, G. At the Frontiers of OCR. Proceedings of the IEEE. (80), 1992. pp. 1093-1100.
Chen, S. et al. Performance Evaluation of Two OCR Systems. Proceedings of the Symposium on Document Analysis and Image Retrieval, 1994. pp. 299-318.
Sources of on-line publications?
Technical report sources?
There are a number of research centers around the world which provide an extensive list of their publications. Since there is a large time lag from original work to publication in a journal, several centers opt for publication of technical reports. Most centers have a database of published papers and technical reports. Some additional sites are:
http://www-white.media.mit.edu/vismod/cgi-bin/tr_pagemaker
Centers with on-line information?
Look at Appendix A for the list. This information is also kept at:
http://documents.cfar.umd.edu/cgi-bin/groups
http://www.miteksys.com/ocr_info/research.html
Orderable Sources?
Bibliography sources?
There are a number of on-line bibliography sources geared towards artificial intelligence, machine vision, and documents. Some locations are:
http://documents.cfar.umd.edu/biblio/docbibonline.html
http://www.ph.tn.tudelft.nl/bibliographic.html
http://iris.usc.edu/vision-notes/rosenfeld/contents.html
http://cosmos.kaist.ac.kr/pub/bibliographies/graphics/index.html
The first site is also available for bibliography searches by e-mail.
Databases for testing purposes?
Most of this information can be found in the previously mentioned web pages.
Are there document databases?
University of Washington Databases:
(206) 685-4974
haralick@ee.washington.edu
UW English Document Image Database I
UW-II English / Japanese Document Image Database
Images of documents in various forms, i.e. noiseless, noisy, warped for photocopying, along with various ground truth information.
State University of New York at Buffalo, CEDAR:
Ajay Shekhawat, CEDAR
State University of New York at Buffalo, UB Commons
520 Lee Entrance, Suite 202
Amherst, NY 14228-2567 (USA)
ajay@cs.buffalo.edu
Japanese Character Image Database: Machine printed Japanese character images extracted from a large variety of sources. Ground truth is also included.
National Institute of Standards and Technology:
301-975-2208
srdata@enh.nist.gov (Order)
patrick@magi.ncsl.nist.gov (Technical Info.)
NIST Special Database 20: Scientific and Technical Document Database; high resolution scans of technical journals, books, and articles. The images contain a rich set of graphic elements such as graphs, tables, equations and other such elements found in technical publications.
National Institute of Standards and Technology:
R. Allen Wilkinson
301-975-3383
URT@magi.ncsl.nist.gov
NIST Special Database 8: Machine-Print Database of Gray Scale and Binary Images; digitized pages with originating text are included. These images use limited font and point size for comparisons of different OCR engines.
Environmental Research Institute of Michigan:
Steven G. Schlosser
313-994-1200 x2339
schlosser@erim.org
ERIM Arabic Document Database: A database created from machine printed Arabic documents. The images are extracted from books and magazines and scanned at 300 dpi. A test image is available at ftp://ftp.erim.org/outgoing/arabic-db... TAR.GZ (5.3MB), and character truth along with ligatures are given.
Are there handwritten character databases?
State University of New York at Buffalo, CEDAR:
Ajay Shekhawat, CEDAR
State University of New York at Buffalo, UB Commons
520 Lee Entrance, Suite 202
Amherst, NY 14228-2567 (USA)
ajay@cs.buffalo.edu
CEDAR CDROM-1: Handwritten words and ZIP Codes, extracted from the U.S. Post Office. Greyscale and binary images are included along with ground truth.
National Institute of Standards and Technology:
301-975-2208
srdata@enh.nist.gov (Order)
patrick@magi.ncsl.nist.gov (Technical Info.)
NIST Special Database 19: Handprinted Forms and Characters Database. This database supersedes Special Databases 3 and 7 and includes all material from NIST on this subject. It has pre-segmented field entries for characters and numerals, along with free text fields. Ground truth is also included.
National Institute of Standards and Technology:
Michael D. Garris
301-975-2208
mdg@magi.ncsl.nist.gov
NIST Special Database 1: NIST Binary Images of Printed Digits, Alphas, and Text; images of full page data, numbers with 2-6 digits, alphabets and unconstrained text are included.
Andrew Senior's Handwritten Words:
Andrew Senior
aws@watson.ibm.com
Anonymous ftp://svr-ftp.eng.cam.ac.uk/pub/data/
Ground truthed handwritten word images: Off-line cursive words, single writer with capitalization and punctuation, clearly spaced, with segmentation label file.
Are there forms databases?
National Institute of Standards and Technology:
Darrin L. Dimmick
301-975-4147
dld@magi.ncsl.nist.gov
NIST Special Database 2: NIST Structured Forms Reference Set of Binary Images. Images of simulated tax forms completed using machine print. The answer files which were used to generate the simulated data entry on the forms by typewriter are also included.
NIST Special Database 6: NIST Structured Forms Reference Set 2 of Binary Images; an extension to Special Database 2.
Anything else?
University of Maryland, Document Processing Group:
David Doermann
301-405-1767
doermann@cfar.umd.edu
Anonymous ftp://documents.cfar.umd.edu/pub/contrib/databases/
UMDLOGO_DATABASE.TAR
UMD logo database: Images of corporate logos scanned at 256 level greyscale.
Internet FTP Sites
The Internet FTP sites which are mentioned in this document are archived here:
Disclaimer:
This article is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained in this article, the author / maintainer / contributors
kia@cfar.umd.edu Wed Mar 26 16:00:11 EST 1997