CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

Fernando Pech-May; Alicia Martínez-Rebollar; Hugo Estrada-Esquivel; Eduardo Pedroza-Landa

doi:10.21501/21454086.1365

Authors

Fernando Pech-May Centro Nacional de Investigación Científica y Tecnológica (CENIDET)
Alicia Martínez-Rebollar Centro Nacional de Investigación Científica y Tecnológica (CENIDET)
Hugo Estrada-Esquivel Fondo de información y documentación para la Industria, INFOTEC
Eduardo Pedroza-Landa Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

Keywords:

Crawler, hidden web, surface web.

Abstract

The web is the most widely used information source in academic, scientific, and industrial settings. Its explosive growth has generated billions of pages of information, which may be categorized as the surface web, composed of static pages that are indexed, and the hidden web, accessible only through search templates. This paper presents the development of a crawler that supports searching, querying, and analyzing information on both the surface web and the hidden web in specific domains.

Author Biographies

Fernando Pech-May, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

Doctoral student in Computer Science

Alicia Martínez-Rebollar, Centro Nacional de Investigación Científica y Tecnológica (CENIDET)

Professor-Researcher, CENIDET

Hugo Estrada-Esquivel, Fondo de información y documentación para la Industria, INFOTEC

New Products and Services Development Management, INFOTEC

Revista Lámpsakos. Issue No. 13 (January-June 2015)

Published

2015-01-01

How to Cite

Pech-May, F., Martínez-Rebollar, A., Estrada-Esquivel, H., & Pedroza-Landa, E. (2015). CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web. Lámpsakos, (13), 39–50. https://doi.org/10.21501/21454086.1365

Issue

No. 13 (2015): Issue 13 (January-June 2015): Contributions to Knowledge in Engineering

Section

Articles of scientific and technological research

License

In accordance with national and international copyright law, as well as the publishing policies of "Fundación Universitaria Luis Amigó" and its Journal "Lámpsakos" (indexed with ISSN: 2145-4086), I (we) hereby declare:
1. The desire to participate as authors and to submit to the rules established by the journal's publishers.
2. The commitment not to withdraw the manuscript until the journal finishes the editing process of the ongoing issue.
3. That the article is original and unpublished and has not been submitted simultaneously to another journal; therefore, the rights to the article under evaluation have not been assigned in advance, and no lien or limitation on its use weighs upon them.
4. The absence of any conflict of interest with a commercial institution or association of any kind.
5. The incorporation of quotes and references from other authors, in order to avoid plagiarism. Accordingly, the author affirms that the paper being published does not violate the copyright, intellectual property, or privacy rights of third parties. Moreover, if necessary, the respective permissions from the original copyright holders can be demonstrated for aspects or elements taken from other documents, such as texts of more than 500 words, tables, and graphs, among others. In the event of any claim or action by a third party regarding copyright on the article, the author(s) will assume full responsibility and come out in defense of the rights herein assigned. Therefore, for all purposes, the Journal "Lámpsakos" of the "Fundación Universitaria Luis Amigó" acts as a third party in good faith.
6. In the event of the publication of the article, the authors assign, free of charge and on an exclusive basis, the entirety of the economic rights and the right to print, reprint, and reproduce the article in any form and medium, without any territorial limitation, in favor of the Journal "Lámpsakos" of the "Fundación Universitaria Luis Amigó".