Review on Text Extraction in Complex Image using Five Different Web OCR

Sari Dewi Budiwati; Dahliar Ananda; Siska Komala Sari

doi:10.14419/ijet.v8i1.9.26399


2019-01-26

https://doi.org/10.14419/ijet.v8i1.9.26399
Optical Character Recognition (OCR), web OCR, text extraction, modern market catalogs, SBPH algorithm
Consumers need to compare product prices in order to find the cheapest one. In this research, we build a system that compares prices across several modern markets. The input is product catalogs, since consumers often receive them from modern markets. We upload 75 grayscale and 75 color input images to five web OCR services. We compare the results based on a characteristic parameter and a segmentation parameter. We assign 0.5 or 1 point when a web OCR recognizes the product price and/or name from the product catalogs. The characteristic parameter describes whether the product price and name are marked with a box line or an empty line. The segmentation parameter uses image properties such as image dimension, dots per inch (dpi), and bit depth. On both parameters, OCR4 gives a better result than the other services, as it can recognize 50.67% of grayscale images and 62.67% of color images. Although OCR4 recognizes these images, some of its results contain data noise. To remove the noise, we propose using the Sparse Binary Polynomial Hashing (SBPH) algorithm with 5-8 letter combinations. As a result, some of the text could be recognized well enough to compare prices, while text with heavy data noise could not.
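The abstract does not spell out how SBPH is applied to the OCR output, so the following is only a minimal sketch of the general SBPH feature-generation idea: slide a fixed-size window over a token stream and emit every sub-pattern of the window that keeps the newest token. The function name `sbph_features`, the `<skip>` placeholder, the per-character tokens, and the window size of 5 (chosen from the lower end of the 5-8 range mentioned above) are assumptions made for illustration, and the hashing and classification steps are omitted.

```python
from itertools import product


def sbph_features(tokens, window=5, skip="<skip>"):
    """Return SBPH-style features for a token sequence (illustrative sketch).

    For each position, take up to `window` most recent tokens and emit every
    pattern that keeps the newest token while each older token is either
    kept or replaced by the `skip` placeholder.
    """
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older = tokens[start:end]   # tokens preceding the newest one
        newest = tokens[end]
        # Each older token is independently kept (True) or skipped (False).
        for mask in product((False, True), repeat=len(older)):
            pattern = [tok if keep else skip for tok, keep in zip(older, mask)]
            pattern.append(newest)
            features.append(tuple(pattern))
    return features


if __name__ == "__main__":
    # Hypothetical OCR output for a price, split into characters.
    for feature in sbph_features(list("12900"), window=5)[-4:]:
        print(feature)
```

With a full window of five tokens, each position contributes 2^4 = 16 patterns, because the four older tokens can each be kept or skipped. In a real pipeline these patterns would typically be hashed and matched against patterns learned from clean product names and prices; that part depends on details the abstract does not give.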
How to Cite
Dewi Budiwati, S., Ananda, D., & Komala Sari, S. (2019). Review on Text Extraction in Complex Image using Five Different Web OCR. International Journal of Engineering & Technology, 8(1.9), 199-204. https://doi.org/10.14419/ijet.v8i1.9.26399