
JOURNAL OF INTELLIGENT SYSTEMS WITH APPLICATIONS
JOISwA
E-ISSN: 2667-6893
This work is licensed under a Creative Commons Attribution 4.0 International License.

Effect of Text Preprocessing Methods on the Performance of Social Media Posts Classification

How to cite: ATAYOLU, Y., & KUTLU, Y. Effect of text preprocessing methods on the performance of social media posts classification. Akıllı Sistemler ve Uygulamaları Dergisi (Journal of Intelligent Systems with Applications) 2024; 7(1): 1-6.




Abstract: This study investigates the classification performance of RoBERTa, BERT, XLNet, and T5 on mental health-related tasks and evaluates the impact of various text preprocessing steps. Initially, the classification results of these models were obtained by applying a minimal preprocessing step, specifically the removal of repeated expressions. Subsequent analysis assessed the effects of additional preprocessing techniques, including the removal of stop words, technical stop words, URLs, and punctuation, as well as text normalization steps such as lemmatization and conversion to lowercase. RoBERTa achieved the highest classification accuracy and F1 score, particularly excelling in the detection of depression and suicide tendencies. All preprocessing steps, apart from removing repeated expressions, reduced overall classification accuracy and performance. An exception was the depression class, for which converting text to lowercase improved performance; in most other cases, performance decreased as preprocessing intensity increased. The results underscore the need to tailor preprocessing steps carefully to the task and dataset.
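The preprocessing steps named in the abstract can be sketched as a configurable pipeline. This is a minimal illustration only: the stop-word list, lemma map, and exact matching rules below are placeholder assumptions, not the paper's actual implementation.

```python
import re

# Placeholder resources for illustration; the study's actual stop-word
# lists and lemmatizer are not specified here.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "i", "am", "so"}
LEMMAS = {"feeling": "feel", "felt": "feel", "thoughts": "thought"}

def remove_repeats(text: str) -> str:
    """Collapse immediately repeated words, e.g. 'so so sad' -> 'so sad'."""
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

def preprocess(text: str, *, lowercase=True, strip_urls=True,
               strip_punct=True, drop_stops=True, lemmatize=True) -> str:
    # The study's minimal baseline: only repeated expressions are removed.
    text = remove_repeats(text)
    if strip_urls:
        text = re.sub(r"https?://\S+", "", text)
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    if drop_stops:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if lemmatize:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return " ".join(tokens)

print(preprocess("I am so so sad today... https://t.co/xyz"))  # -> sad today
```

Toggling the keyword flags reproduces the study's ablation idea: each flag corresponds to one preprocessing step whose effect on downstream classification can be measured independently.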

Keywords: Text preprocessing, mental health classification, large language models


Bibliography:
  • Bibring, E. (1953). The mechanism of depression. In P. Greenacre (Ed.), Affective disorders; psychoanalytic contributions to their study (pp. 13–48). International Universities Press.
  • De Choudhury, M., Gamon, M., Counts, S., & Horvitz, E. (2013). Predicting depression via social media. In Proceedings of the international AAAI conference on web and social media (Vol. 7, No. 1, pp. 128-137).
  • Leenaars, A. A. (2010). Edwin S. Shneidman on suicide. Suicidology online, 1(1), 5-18.
  • Homan, S., Gabi, M., Klee, N., Bachmann, S., Moser, A. M., Michel, S., ... & Kleim, B. (2022). Linguistic features of suicidal thoughts and behaviors: A systematic review. Clinical psychology review, 95, 102161.
  • Craske, M. G., Rauch, S. L., Ursano, R., Prenoveau, J., Pine, D. S., & Zinbarg, R. E. (2011). What is an anxiety disorder?. Focus, 9(3), 369-388.
  • Wang, T., & Bashir, M. (2020). Does social media behaviors reflect users' anxiety? A case study of Twitter activities.
  • Gruebner, O., Sykora, M., Lowe, S. R., Shankardass, K., Galea, S., & Subramanian, S. V. (2017). Big data opportunities for social behavioral and mental health research.
  • Jackendoff, R. (1996). How language helps us think. Pragmatics & Cognition, 4(1), 1-34.
  • Voleti, R., Liss, J. M., & Berisha, V. (2019). A review of automated speech and language features for assessment of cognitive and thought disorders. IEEE journal of selected topics in signal processing, 14(2), 282-298.
  • Chai, C. P. (2023). Comparison of text preprocessing methods. Natural Language Engineering, 29(3), 509-553.
  • Sarica, S., & Luo, J. (2021). Stopwords in technical language processing. Plos one, 16(8), e0254937.
  • Naseem, U., Razzak, I., & Eklund, P. W. (2021). A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools and Applications, 80, 35239-35266.
  • Samad, M. D., Khounviengxay, N. D., & Witherow, M. A. (2020). Effect of text processing steps on twitter sentiment classification using word embedding. arXiv preprint arXiv:2007.13027.
  • Ladani, D. J., & Desai, N. P. (2020, March). Stopword identification and removal techniques on tc and ir applications: A survey. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (pp. 466-472). IEEE.
  • Jefriyanto, J., Ainun, N., & Al Ardha, M. A. (2023). Application of Naïve Bayes Classification to Analyze Performance Using Stopwords. Journal of Information System, Technology and Engineering, 1(2), 49-53.
  • Rahimi, Z., & Homayounpour, M. M. (2023). The impact of preprocessing on word embedding quality: A comparative study. Language Resources and Evaluation, 57(1), 257-291.
  • HaCohen-Kerner, Y., Miller, D., & Yigal, Y. (2020). The influence of preprocessing on text classification using a bag-of-words representation. PloS one, 15(5), e0232525.
  • Jung, G., Shin, J., & Lee, S. (2023). Impact of preprocessing and word embedding on extreme multi-label patent classification tasks. Applied Intelligence, 53(4), 4047-4062.
  • Ameer, I., Arif, M., Sidorov, G., Gómez-Adorno, H., & Gelbukh, A. (2022). Mental illness classification on social media texts using deep learning and transfer learning. arXiv preprint arXiv:2207.01012.
  • Novikova, J., & Shkaruta, K. (2022). DECK: Behavioral tests to improve interpretability and generalizability of BERT models detecting depression from text. arXiv preprint arXiv:2209.05286.
  • Cabral, R. C., Han, S. C., Poon, J., & Nenadic, G. (2024). MM-EMOG: Multi-Label Emotion Graph Representation for Mental Health Classification on Social Media. Robotics, 13(3), 53.
  • Wang, G., Liu, X., Ying, Z., Yang, G., Chen, Z., Liu, Z., ... & Chen, Y. (2023). Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nature Medicine, 29(10), 2633-2642.
  • Ji, S., Zhang, T., Ansari, L., Fu, J., Tiwari, P., & Cambria, E. (2021). MentalBERT: Publicly available pre-trained language models for mental healthcare. arXiv preprint arXiv:2110.15621.
  • Kabir, M., Ahmed, T., Hasan, M. B., Laskar, M. T. R., Joarder, T. K., Mahmud, H., & Hasan, K. (2023). DEPTWEET: A typology for social media texts to detect depression severities. Computers in Human Behavior, 139, 107503.
  • Thakur, N. (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis. arXiv preprint arXiv:2409.05292.
  • Ramírez-Cifuentes, D., Freire, A., Baeza-Yates, R., Puntí, J., Medina-Bravo, P., Velazquez, D. A., ... & Gonzàlez, J. (2020). Detection of suicidal ideation on social media: multimodal, relational, and behavioral analysis. Journal of medical internet research, 22(7), e17758.
  • Mahmud, S. A. (n.d.). Suicidal Tweet Detection Dataset. Retrieved October 19, 2024, from https://www.kaggle.com/datasets/aunanya875/suicidaltweet-detection-dataset
  • Internet. (n.d.). Twitter Suicidal Data. Retrieved October 19, 2024, from https://github.com/laxmimerit/twitter-suicidal-intention-dataset/blob/master/twitter-suicidal_data.csv
  • Yadav, A. (n.d.). Twitter Suicide Data. Retrieved October 19, 2024, from https://github.com/warriorwizard/suicidal-ideation-detection/tree/main/Dataset
  • Helmy, A. (2024). Depression dataset for English tweets classified binary (V1) [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/ISZCSA
  • Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 364.
  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140), 1-67.
  • Shneidman, E. S. (1993). Commentary: Suicide as psychache. The Journal of nervous and mental disease, 181(3), 145-147.