A DEEP LEARNING DRIVEN TEXT CLASSIFICATION APPROACH WITH NAMED ENTITY RECOGNITION
DOI: https://doi.org/10.7494/csci.2026.27.1.6738
Abstract
Natural language processing of text data underpins much research in Artificial Intelligence, covering areas such as semantics and natural language generation and, above all, classification problems. This study analyzes how detected named entities affect text classification performance, with the aim of making the text preprocessing stage more effective. To reduce analysis time and improve performance, words were filtered with Named Entity Recognition (NER) after the classical preprocessing stage, using thresholds set at 5% and 10%. The analysis was carried out with several machine learning and deep learning algorithms and with Bidirectional Encoder Representations from Transformers (BERT), and the results are discussed in the last part of the study. On the task of classifying 50,000 news texts, the success rate was 93% with the Support Vector Machine (SVM) algorithm for statistical classification with machine learning, 87% with Long Short-Term Memory (LSTM), and 83% with BERT. Although the LSTM and BERT models scored numerically lower, they were observed to preserve semantic integrity better in text classification, and their success generally increased after NER filtering. These results suggest that passing the dataset through the NER filter at the chosen threshold values has a positive effect on the models in terms of both time and performance.
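The abstract does not say how the 5% and 10% thresholds are applied, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that a named-entity token is kept only if it appears in at least the threshold share of documents, and that ordinary words pass through unchanged. The names `entity_document_share` and `ner_filter`, the tagged-token input format, and the toy corpus are all hypothetical; in practice the entity labels would come from an NER tagger.

```python
from collections import Counter
from typing import Optional

# Hypothetical input format: each token is (text, entity_label),
# where entity_label is None for words that are not named entities.
TaggedToken = tuple[str, Optional[str]]


def entity_document_share(docs: list[list[TaggedToken]]) -> dict[str, float]:
    """Return, for each named-entity token, the fraction of documents containing it."""
    counts: Counter = Counter()
    for doc in docs:
        # A set ensures each entity token is counted once per document.
        counts.update({text.lower() for text, label in doc if label is not None})
    total = len(docs)
    return {text: n / total for text, n in counts.items()} if total else {}


def ner_filter(docs: list[list[TaggedToken]], threshold: float) -> list[list[str]]:
    """Drop entity tokens whose document share is below `threshold` (an assumed rule)."""
    share = entity_document_share(docs)
    filtered = []
    for doc in docs:
        kept = [
            text
            for text, label in doc
            if label is None or share[text.lower()] >= threshold
        ]
        filtered.append(kept)
    return filtered


# Toy corpus of 25 documents: "Apple" occurs in 20% of them,
# "Zurich" in 8%, and "Oslo" in 4%.
docs = (
    [[("Apple", "ORG"), ("shares", None)]] * 5
    + [[("Zurich", "GPE"), ("summit", None)]] * 2
    + [[("Oslo", "GPE"), ("weather", None)]]
    + [[("markets", None), ("rally", None)]] * 17
)

filtered_5 = ner_filter(docs, 0.05)   # removes only "Oslo"
filtered_10 = ner_filter(docs, 0.10)  # removes "Oslo" and "Zurich"
```

Under this assumed rule, raising the threshold from 5% to 10% removes more rare entities and shrinks the vocabulary passed to the classifier, which is one way such a filter could shorten training time.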
License
Copyright (c) 2026 Computer Science

This work is licensed under a Creative Commons Attribution 4.0 International License.