Extraction of Meaningful Information from the Web: a Brief Survey

  • Abstract

    The explosive growth of information on the Internet makes extracting relevant data from diverse sources a difficult task for users. Information Extraction (IE) systems are therefore needed to transform Web pages into databases: they extract the relevant information from Web documents and present it in a structured format.

    Information extraction techniques can be applied to structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user’s perspective.
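
    As a rough illustration of what turning a Web page into database-style records can look like, the sketch below runs a simple delimiter-based wrapper over a semi-structured HTML fragment. It is not taken from any of the surveyed tools: the sample page, the field names, the delimiter strings, and the function extract_records are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical sample page; the markup and the data are illustrative only.
PAGE = (
    "<html><body>"
    "<li><b>Congo</b> <i>242</i></li>"
    "<li><b>Egypt</b> <i>20</i></li>"
    "<li><b>Spain</b> <i>34</i></li>"
    "</body></html>"
)

# One (field name, left delimiter, right delimiter) triple per field,
# listed in the order the fields appear inside each record.
DELIMITERS: List[Tuple[str, str, str]] = [
    ("country", "<b>", "</b>"),
    ("code", "<i>", "</i>"),
]


def extract_records(
    page: str, delimiters: List[Tuple[str, str, str]]
) -> List[Dict[str, str]]:
    """Scan the page left to right and collect one dict per complete record."""
    records: List[Dict[str, str]] = []
    pos = 0
    while True:
        record: Dict[str, str] = {}
        for name, left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                # No further field found: drop any partial record and stop.
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                # Unterminated field: treat the rest of the page as noise.
                return records
            record[name] = page[start:end].strip()
            pos = end + len(right)
        records.append(record)


if __name__ == "__main__":
    for row in extract_records(PAGE, DELIMITERS):
        print(row)
```

    A wrapper of this kind only works while the page keeps the same markup around each field. That brittleness is one reason wrapper generation and wrapper induction tools aim to derive such rules from examples rather than rely on hand-written delimiters.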



  • Keywords

    Information Extraction; Web Mining; Wrapper Generation; Wrapper Induction.





Article ID: 28283
DOI: 10.14419/ijet.v7i4.19.28283

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.