Adapting a Constituency Parser to User-Generated Content in Polish Opinion Mining
DOI:
https://doi.org/10.7494/csci.2016.17.1.23
Keywords:
user generated content, text normalization, parsing, sentiment analysis
Abstract
The paper focuses on adapting NLP tools for Polish, such as morphological analyzers and parsers, to user-generated content (UGC). The authors discuss two rule-based techniques applied to improve the tools' efficiency: pre-processing (text normalization) and parser adaptation (modified segmentation and parsing rules). A new solution for handling out-of-vocabulary words (OOVs), based on inflectional translation, is also offered.