Adapting a Constituency Parser to User-Generated Content in Polish Opinion Mining


  • Agnieszka Pluwak, Institute of Slavic Studies, Polish Academy of Sciences, Warsaw; Fido Intelligence, Gdansk
  • Wojciech Korczynski, AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Department of Computer Science, Krakow
  • Marek Kisiel-Dorohinicki, AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Department of Computer Science, Krakow



user-generated content, text normalization, parsing, sentiment analysis


The paper focuses on adapting NLP tools for Polish (e.g., morphological analyzers and parsers) to user-generated content (UGC). The authors discuss two rule-based techniques applied to improve the efficiency of these tools: pre-processing (text normalization) and parser adaptation (modified segmentation and parsing rules). A new solution for handling out-of-vocabulary words (OOVs), based on inflectional translation, is also proposed.
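
To give a rough idea of what rule-based text normalization as a pre-processing step can look like, here is a minimal Python sketch. It does not reproduce the authors' rules: the abbreviation lexicon, the regular expressions, and the function name `normalize` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation lexicon; illustrative entries only,
# not taken from the paper.
ABBREVIATIONS = {
    "pzdr": "pozdrawiam",
    "wg": "według",
    "np": "na przykład",
}

# Three or more repetitions of the same letter, e.g. "suuuuper".
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")
# A few common emoticons such as :) ;-) :( :D
EMOTICON = re.compile(r"[:;]-?[()DP]")


def normalize(text: str) -> str:
    """Apply a few simple normalization rules to user-generated text."""
    # Rule 1: drop emoticons (a real system might map them to sentiment tags).
    text = EMOTICON.sub(" ", text)
    # Rule 2: collapse expressive letter lengthening to a single letter.
    # Assumption: this is naive and would also damage strings such as "www".
    text = REPEATED_LETTER.sub(r"\1", text)
    # Rule 3: expand known abbreviations, ignoring case and a trailing period.
    tokens = []
    for token in text.split():
        key = token.lower().rstrip(".")
        tokens.append(ABBREVIATIONS.get(key, token))
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("suuuuper telefon :) pzdr"))
```

A pipeline along these lines would run before morphological analysis, so that the analyzer and the parser see fewer out-of-vocabulary tokens. The order of the rules matters: emoticons are removed before whitespace splitting so that they never reach the lexicon lookup.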








How to Cite

Pluwak, A., Korczynski, W., & Kisiel-Dorohinicki, M. (2016). Adapting a Constituency Parser to User-Generated Content in Polish Opinion Mining. Computer Science, 17(1), 23.



