Adapting a Constituency Parser to User-Generated Content in Polish Opinion Mining

Agnieszka Pluwak, Wojciech Korczynski, Marek Kisiel-Dorohinicki

Abstract


The paper focuses on adapting NLP tools for Polish, such as morphological analyzers and parsers, to user-generated content (UGC). The authors discuss two rule-based techniques applied to improve the tools' efficiency: pre-processing (text normalization) and parser adaptation (modified segmentation and parsing rules). A new solution for handling out-of-vocabulary words (OOVs), based on inflectional translation, is also offered.
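
As a rough illustration of what rule-based text normalization can look like as a pre-processing step, the sketch below applies two simple rules to a Polish review fragment. It is not the authors' implementation: the function name `normalize`, the abbreviation table, and both rules are assumptions chosen only for illustration.

```python
import re

# Hypothetical rule table (illustrative only, not taken from the paper):
# informal abbreviations mapped to their standard Polish forms.
ABBREVIATIONS = {
    "wg": "według",
    "nt": "na temat",
    "bdb": "bardzo dobry",
}


def normalize(text: str) -> str:
    """Apply two toy normalization rules to user-generated text."""
    # Rule 1: collapse runs of three or more identical characters to one,
    # e.g. "suuuper" -> "super". Note this also shortens "..." to ".".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Rule 2: expand known abbreviations, matching whole
    # whitespace-separated tokens case-insensitively.
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("suuuper telefon, bdb aparat"))
```

A normalized string of this kind would then be passed on to the morphological analyzer and parser; real rule sets would need to be far larger and sensitive to context.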

Keywords


user-generated content; text normalization; parsing; sentiment analysis



DOI: http://dx.doi.org/10.7494/csci.2016.17.1.23
