DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT

Authors

  • Stefan Dlugolinsky, Institute of Informatics, Slovak Academy of Sciences, Bratislava
  • Martin Seleng, Institute of Informatics, Slovak Academy of Sciences, Bratislava
  • Michal Laclavik, Institute of Informatics, Slovak Academy of Sciences, Bratislava
  • Ladislav Hluchy, Institute of Informatics, Slovak Academy of Sciences, Bratislava

DOI:

https://doi.org/10.7494/csci.2012.13.4.5

Keywords:

distributed web crawling, information extraction, information retrieval, semantic search, geocoding, spatial search

Abstract

In this paper, we describe our work in progress on web-scale information
extraction and information retrieval using distributed computing. We
present a distributed architecture built on top of the MapReduce paradigm for
information retrieval, information processing and intelligent search supported
by spatial capabilities. The proposed architecture focuses on crawling documents
in several different formats, extracting information, lightweight semantic
annotation of the extracted information, indexing the extracted information and,
finally, indexing documents based on the geo-spatial information found
in them. We demonstrate the architecture on two use cases: the
first is search in job offers retrieved from the LinkedIn portal and the second is
search in BBC news feeds. We also discuss several problems we had to face during
the implementation, as well as spatial search applications for both use cases,
since both LinkedIn job offer pages and BBC news feeds contain a lot of spatial
information to extract and process.
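
To make the indexing step more concrete, the following is a minimal single-process sketch of how an inverted index can be built in the MapReduce style. It is an illustration under stated assumptions, not the authors' implementation: the function names (`map_document`, `shuffle`, `reduce_postings`, `build_index`), the whitespace tokenizer and the sample documents are all hypothetical, and a real deployment would run the map and reduce steps on a distributed framework rather than in one process.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct lowercase token."""
    # Assumption: naive whitespace tokenization, for illustration only.
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(term, doc_ids):
    """Reduce step: turn the grouped document ids into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle and reduce in sequence over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_document(doc_id, text)
    ]
    return dict(
        reduce_postings(term, ids) for term, ids in shuffle(pairs).items()
    )


# Hypothetical sample documents standing in for a job offer and a news item.
docs = {
    "job-1": "Java developer Bratislava",
    "news-1": "Flood warning issued in Bratislava",
}
index = build_index(docs)
```

Looking up `index["bratislava"]` then returns the posting list of both sample documents. In this sketch a place name is indexed like any other term; geocoding it to coordinates for spatial search would be a separate step that is not shown here.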

Published

2012-12-13

How to Cite

Dlugolinsky, S., Seleng, M., Laclavik, M., & Hluchy, L. (2012). DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT. Computer Science, 13(4), 5. https://doi.org/10.7494/csci.2012.13.4.5

Section

Articles