CLUO: Web-Scale Text Mining System for Open Source Intelligence Purposes

Przemyslaw Maciolek, Grzegorz Dobrowolski


The amount of textual information published on the Internet is estimated to be in the billions of web pages, blog posts, comments, social media updates, and other items. Analyzing such quantities of data requires a high level of distribution of both data and computation. This is especially true of the complex algorithms often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system that extracts and analyzes significant quantities of openly available information.


Text Mining, Big Data, OSINT, Natural Language Processing, monitoring


