Building semantic user profile for Polish web news portal
DOI:
https://doi.org/10.7494/csci.2018.19.3.2753
Keywords:
user profiling, word embeddings, topic modeling, natural language processing, gender prediction
Abstract
We present research conducted at Onet, the largest Polish news portal, aimed at constructing meaningful user profiles that best describe users' interests in the context of the media content they browse.
We used two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We trained our models on corpora of Polish-language articles and compared them with a baseline model built on a general-language corpus.
We compared the performance of the algorithms on two distinct tasks: similar-article retrieval and user gender classification. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the user profiling task, the best performance was obtained with a combination of features: topics from the article text and word embeddings from the title.
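The two tasks described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors, the topic distributions, the function names, and the choice to aggregate by simple averaging are all assumptions made for illustration. It shows title comparison by cosine similarity of averaged word vectors, and a user profile built by concatenating topic features from article text with embedding features from titles.

```python
import math

# Toy vectors standing in for trained Word2Vec embeddings (hypothetical values).
WORD_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "league": [0.85, 0.15, 0.05],
    "fashion": [0.1, 0.9, 0.2],
    "makeup": [0.05, 0.95, 0.1],
}


def embed_title(title, vectors=WORD_VECTORS):
    """Average the vectors of known words; zero vector if no word is known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in title.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def most_similar(query_title, candidate_titles):
    """Return the candidate title closest to the query in embedding space."""
    query = embed_title(query_title)
    return max(candidate_titles, key=lambda t: cosine(query, embed_title(t)))


def user_profile(read_articles):
    """Build a profile from (title, topic_distribution) pairs.

    Assumption: the profile is the mean topic distribution of the article
    texts concatenated with the mean title embedding.
    """
    if not read_articles:
        raise ValueError("at least one article is required")
    n = len(read_articles)
    topic_part = [sum(col) / n for col in zip(*(t for _, t in read_articles))]
    title_part = [
        sum(col) / n
        for col in zip(*(embed_title(title) for title, _ in read_articles))
    ]
    return topic_part + title_part
```

In this sketch, the vector returned by `user_profile` would serve as the input to a gender classifier, and `most_similar` would rank candidate articles by title. In practice the toy dictionary would be replaced by embeddings and topic distributions from trained models.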