Review on Text Extraction in Complex Image using Five Different Web OCR

  • Authors

    • Sari Dewi Budiwati
    • Dahliar Ananda
    • Siska Komala Sari
    2019-01-26
    https://doi.org/10.14419/ijet.v8i1.9.26399
  • Optical Character Recognition (OCR), web OCR, text extraction, modern market catalogs, SBPH algorithm
  • Consumers need product price comparison in order to find the cheapest price. In this research, we build a system that compares prices across several modern markets. The input is product catalogs, since consumers often receive them from modern markets. We upload 75 grayscale and 75 color input images to five web OCR services and compare the results based on characteristic and segmentation parameters. We assign 0.5 or 1 point when a web OCR recognizes the product price and/or name in a product catalog. The characteristic parameter describes whether the product price and name are identified by a box line or an empty line. For the segmentation parameter, we use image properties such as image dimension, dots per inch (dpi), and bit depth. On both parameters, OCR4 gives a better result, recognizing 50.67% of the grayscale images and 62.67% of the color images. Even so, some of the OCR4 results contain data noise. To remove the noise, we propose using the Sparse Binary Polynomial Hashing (SBPH) algorithm with 5-8 letter combinations. As a result, some of the text could be recognized well enough to compare prices, while text with heavy data noise could not.
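The point scheme in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: that an image earns 1 point when both the price and the name are recognized and 0.5 point when only one of them is, and that the reported percentage is the point total divided by the number of images. The names `OcrResult`, `score`, and `recognition_rate`, as well as the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    """Outcome of one web OCR run on one catalog image (hypothetical type)."""
    price_recognized: bool
    name_recognized: bool


def score(result: OcrResult) -> float:
    """Assumed reading of the scheme: 0.5 point per recognized item,
    so 1.0 for price and name, 0.5 for exactly one, 0.0 for neither."""
    return 0.5 * (int(result.price_recognized) + int(result.name_recognized))


def recognition_rate(results: list[OcrResult]) -> float:
    """Total points as a percentage of the number of images (assumed formula)."""
    if not results:
        return 0.0
    return 100.0 * sum(score(r) for r in results) / len(results)


# Made-up breakdown of 75 images, chosen only so the total is 38 points:
# 30 full matches (30 points) + 16 partial matches (8 points) + 29 misses.
sample = (
    [OcrResult(True, True)] * 30
    + [OcrResult(True, False)] * 16
    + [OcrResult(False, False)] * 29
)

print(f"{recognition_rate(sample):.2f}%")  # 38 / 75 -> 50.67%
```

Under this reading, 38 points out of 75 images gives 50.67% and 47 points gives 62.67%, which is consistent with the figures reported for OCR4, although the actual split between full and partial matches is not stated in the abstract.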

     

  • How to Cite

    Dewi Budiwati, S., Ananda, D., & Komala Sari, S. (2019). Review on Text Extraction in Complex Image using Five Different Web OCR. International Journal of Engineering & Technology, 8(1.9), 199-204. https://doi.org/10.14419/ijet.v8i1.9.26399