CLUO: Web-Scale Text Mining System for Open Source Intelligence Purposes

Przemyslaw Maciolek; Grzegorz Dobrowolski

doi:10.7494/csci.2013.14.1.45

Keywords:

Text Mining, Big Data, OSINT, Natural Language Processing, monitoring

Abstract

The amount of textual information published on the Internet is considered to amount to billions of web pages, blog posts, comments, social media updates, and other items. Analyzing such quantities of data requires a high level of distribution, of both data and computation. This is especially true for the complex algorithms often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system that extracts and analyzes significant quantities of openly available information.

