Building semantic user profile for Polish web news portal

Joanna Misztal-Radecka


We present research conducted at Onet, the largest Polish news portal, aimed at constructing meaningful user profiles that best describe users' interests in the context of the media content they browse.
We used two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We trained our models on corpora of Polish articles and compared them with a baseline model built on a general-language corpus.
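As a rough illustration of how word embeddings can yield a numerical representation of a whole text, the sketch below averages the vectors of a text's words and compares two texts by cosine similarity. Averaging is an assumption made here for illustration (the abstract does not say how word vectors were aggregated), and `TOY_VECTORS`, `mean_embedding` and `cosine` are hypothetical names with invented values, not part of the original work.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration;
# a trained Word2Vec model would supply real ones.
TOY_VECTORS = {
    "sejm": [0.9, 0.1, 0.0],
    "premier": [0.8, 0.2, 0.1],
    "mecz": [0.0, 0.9, 0.2],
    "liga": [0.1, 0.8, 0.3],
}


def mean_embedding(tokens, vectors):
    """Average the vectors of all known tokens; return None if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    # zip(*known) iterates over the vector dimensions (columns).
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


politics = mean_embedding(["sejm", "premier"], TOY_VECTORS)
sport = mean_embedding(["mecz", "liga"], TOY_VECTORS)
```

With these toy values, `cosine(politics, sport)` is much lower than the similarity between `politics` and the vector for `"sejm"` alone, which is the kind of signal a text-comparison task relies on.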

We compared the performance of the algorithms on two distinct tasks: similar-article retrieval and user gender classification. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the user profiling task, the best performance was obtained with a combination of features: topics from the article text and word embeddings from the title.
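The combination of features mentioned above could be assembled roughly as follows. This is a minimal sketch, assuming each article a user has read already carries a topic distribution for its text and an embedding for its title, and assuming the per-user profile is a simple mean over articles followed by concatenation. The field names `text_topics` and `title_embedding`, the function names and all numbers are hypothetical; the aggregation and the classifier actually used are not described in the abstract.

```python
def average(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def user_profile(read_articles):
    """Concatenate the mean topic vector (from article text) with the
    mean title embedding over all articles a user has read."""
    topics = average([a["text_topics"] for a in read_articles])
    titles = average([a["title_embedding"] for a in read_articles])
    return topics + titles  # list concatenation, not element-wise addition


# Hypothetical input: two articles, each with 3 topic weights
# and a 2-dimensional title embedding.
articles = [
    {"text_topics": [0.7, 0.2, 0.1], "title_embedding": [0.5, -0.1]},
    {"text_topics": [0.1, 0.8, 0.1], "title_embedding": [0.3, 0.3]},
]
profile = user_profile(articles)  # 3 topic features followed by 2 title features
```

A vector like `profile` could then be passed to any standard classifier to predict an attribute such as gender.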


user profiling, word embeddings, topic modeling, natural language processing, gender prediction




January 11, 2018 str. 18/21


