How to cite: ATAYOLU, Y., KUTLU, Y. Effect of text preprocessing methods on the performance of social media posts classification. Akıllı Sistemler ve Uygulamaları Dergisi (Journal of Intelligent Systems with Applications) 2024; 7(1): 1-6.
Title: Effect of Text Preprocessing Methods on the Performance of Social Media Posts Classification
Abstract: This study investigates the classification performance of RoBERTa, BERT, XLNet, and T5 on mental health-related tasks and evaluates the impact of various text preprocessing steps. Baseline classification results were first obtained for each model with minimal preprocessing, namely the removal of repeated expressions. Subsequent analysis assessed the effects of additional preprocessing techniques: the removal of stop words, technical stop words, URLs, and punctuation, and text normalization steps such as lemmatization and conversion to lowercase. RoBERTa achieved the highest classification accuracy and F1 score, performing particularly well in detecting depression and suicidal tendencies. All preprocessing steps, apart from removing repeated expressions, reduced overall classification accuracy and performance. For the depression class, converting text to lowercase also had a positive effect; in most other cases, performance declined as preprocessing intensity increased. The results underscore the need to tailor preprocessing steps carefully to the task and dataset.
Keywords: Text preprocessing, mental health classification, large language models
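The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch that applies them selectively, so that a minimal configuration can be compared against heavier ones. It is not the authors' implementation: the function names, the tiny stop-word list, and the reading of "repeated expressions" as consecutively repeated words are all assumptions made for illustration. Lemmatization and technical stop words are omitted because they need an external library or a domain-specific list.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; the study's list is not given here).
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and", "i"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Collapses a word repeated consecutively ("so so so" -> "so"). This is only one
# possible reading of "repeated expressions" and is an assumption.
REPEAT_RE = re.compile(r"\b(\w+)(\s+\1\b)+", flags=re.IGNORECASE)


def remove_repeats(text):
    return REPEAT_RE.sub(r"\1", text)


def remove_urls(text):
    return URL_RE.sub("", text)


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def to_lowercase(text):
    return text.lower()


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


# Hypothetical step names mapped to the functions above.
STEPS = {
    "repeats": remove_repeats,
    "urls": remove_urls,
    "punctuation": remove_punctuation,
    "lowercase": to_lowercase,
    "stop_words": remove_stop_words,
}


def preprocess(text, steps=("repeats",)):
    """Apply the named steps in order, then normalize whitespace."""
    for name in steps:
        text = STEPS[name](text)
    return " ".join(text.split())


post = "I am so so so tired, see https://example.com NOW"
minimal = preprocess(post)
heavy = preprocess(post, ("repeats", "urls", "punctuation", "lowercase", "stop_words"))
print(minimal)
print(heavy)
```

In this sketch, URLs are removed before punctuation; reversing that order would strip the `://` and dots first and leave URL fragments behind. Passing different `steps` tuples yields one preprocessed variant of the dataset per configuration, each of which could then be fed to a classifier for comparison.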