Synthetic Minority Over-Sampling Technique and Its Variants for Imbalanced Social Media Text Corpus
Abstract
The volume of social media data that people generate in their daily lives is increasing rapidly. Social media datasets are
valuable for machine learning applications that aim to understand public reactions to specific events or products. By applying
sentiment analysis steps, texts in a dataset can be classified as positive, negative, or neutral, so the
emotions behind shared text data can be analyzed. However, due to the multidimensional and complex structure of these
datasets, an imbalanced class problem may arise, meaning that the samples belonging to one class are greatly
outnumbered by the samples of the other classes. Classification of imbalanced datasets leads to poor generalization
performance among classifiers. To obtain more robust performance measurements, it is advisable to apply resampling
techniques to such datasets. In recent years, many approaches have been proposed to solve the class imbalance problem.
This study compares the Synthetic Minority Over-sampling Technique (SMOTE) and its variants for rebalancing a social
media text corpus in order to obtain better classification results. In this way, a balanced training set is prepared for
future analyses of social media data. Logistic Regression, Support Vector Machines, and Random Forest algorithms are used
as classifiers. Our results show that SMOTE-Tomek Links outperforms SMOTE and its other variants, and the best performance
values are obtained with the Random Forest algorithm.
Keywords - Imbalanced Text Corpus, Class Imbalance, Resampling Techniques, SMOTE, SMOTE Variants.
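As context for the abstract, the core SMOTE idea of generating synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbors can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation; the function name `smote_sample` and its parameters are our own for illustration.

```python
import numpy as np

def smote_sample(X_min, k=3, n_new=10, rng=None):
    """Generate n_new synthetic minority samples.

    For each synthetic point: pick a random minority sample, find its
    k nearest minority neighbors, pick one neighbor at random, and
    interpolate a new point on the line segment between them.
    """
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from the chosen sample to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself (distance 0)
        j = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Variants such as SMOTE-Tomek Links additionally clean the resampled set by removing Tomek-link pairs (nearest-neighbor pairs from opposite classes) after oversampling; in practice the `imbalanced-learn` library provides ready-made implementations of SMOTE and these hybrids.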