CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web
DOI:
https://doi.org/10.21501/21454086.1365
Keywords:
Crawler, hidden web, surface web.
Abstract
The web is the most widely used information source in academic, scientific, and industrial settings. Its explosive growth has produced billions of pages, whose information may be categorized as either surface web, composed of static pages that are indexed, or hidden web, accessible only through search templates. This paper presents the development of a crawler that supports searching, querying, and analyzing information in specific domains of both the surface web and the hidden web.
License
In accordance with national and international copyright law, as well as the publishing policies of "Fundación Universitaria Luis Amigó" and its journal "Lámpsakos" (indexed with ISSN: 2145-4086), I (we) hereby declare:
1. The desire to participate as authors and to submit to the rules established by the journal's publishers.
2. The commitment not to withdraw the manuscript until the journal finishes the editing process of the ongoing issue.
3. That the article is original and unpublished and has not been submitted simultaneously to another journal; therefore, the rights to the article under evaluation have not been assigned in advance and are not subject to any lien or limitation on use.
4. The absence of any conflict of interest with a commercial institution or association of any kind.
5. The inclusion of quotations and references from other authors, in order to avoid plagiarism. Accordingly, the author affirms that the paper being published does not violate the copyright, intellectual property, or privacy rights of third parties. Moreover, if necessary, the author can demonstrate the respective permissions from the original copyright holders for elements taken from other documents, such as texts of more than 500 words, tables, and graphs, among others. In the event of any claim or action by a third party regarding copyright on the article, the author(s) will assume full responsibility and defend the rights herein assigned. Therefore, for all purposes, the journal "Lámpsakos" of the "Fundación Universitaria Luis Amigó" acts as a third party in good faith.
6. In the event that the article is published, the authors transfer, free of charge and on an exclusive basis, all economic rights, including the right to print, reprint, and reproduce the article in any form and medium, without any territorial limitation, to the journal "Lámpsakos" of the "Fundación Universitaria Luis Amigó".