CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

Authors

  • Fernando Pech-May, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)
  • Alicia Martínez-Rebollar, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)
  • Hugo Estrada-Esquivel, Fondo de información y documentación para la Industria, INFOTEC
  • Eduardo Pedroza-Landa, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

DOI:

https://doi.org/10.21501/21454086.1365

Keywords:

Crawler, hidden web, surface web.

Abstract

The web is the most widely used information source in academic, scientific, and industrial settings. Its explosive growth has generated billions of pages, whose information may be categorized as surface web, composed of static pages that are indexed, and hidden web, accessible only through search templates. This paper presents the development of a crawler that supports searching, querying, and analyzing information in both the surface web and the hidden web within specific domains.

Author Biographies

Fernando Pech-May, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

Doctoral student in Computer Science

Alicia Martínez-Rebollar, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

Research Professor, CENIDET

Hugo Estrada-Esquivel, Fondo de información y documentación para la Industria, INFOTEC

New Products and Services Development Management, INFOTEC

Published

2015-01-01

How to Cite

Pech-May, F., Martínez-Rebollar, A., Estrada-Esquivel, H., & Pedroza-Landa, E. (2015). CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web. Lámpsakos, (13), 39–50. https://doi.org/10.21501/21454086.1365

Section

Articles of scientific and technological research