A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization
Version 2 2025-07-08, 01:57
Version 1 2023-05-23, 02:56
conference contribution
posted on 2025-07-08, 01:57, authored by KH Lee, J Kay, Byeong Kang, U Rosebrock
Two main research areas in statistical text categorization are similarity-based learning algorithms and their associated thresholding strategies. The combination of these techniques significantly influences overall categorization performance. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm, the keyword association network (KAN), and a new thresholding strategy (RinSCut) to improve performance over existing techniques. Extensive experiments were conducted on the Reuters-21578 and 20-Newsgroups data sets. The experimental results show that the new approaches give better micro-averaged and macro-averaged F1 scores.
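
For readers unfamiliar with the terms in the abstract, the following minimal sketch illustrates rank-based thresholding (RCut, in its usual form of assigning the top t ranked categories to each document) and the difference between micro-averaged and macro-averaged F1. The similarity scores, category names, and gold labels are invented toy values, not data or results from the paper, and the function names are hypothetical. KAN and RinSCut are not shown because the abstract does not define them.

```python
from typing import Dict, List, Set, Tuple


def rcut(scores: Dict[str, float], t: int = 1) -> Set[str]:
    """RCut sketch: assign the t highest-scoring categories to one document."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return set(ranked[:t])


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the given categories."""
    counts = {c: [0, 0, 0] for c in categories}  # [tp, fp, fn] per category
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-category F1 values (every category weighs the same).
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts over all categories, then compute one F1.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1(tp, fp, fn), macro


# Toy example (assumed values): classifier scores for four documents.
categories = ["earn", "acq", "grain"]
doc_scores = [
    {"earn": 0.9, "acq": 0.2, "grain": 0.1},
    {"earn": 0.6, "acq": 0.5, "grain": 0.1},
    {"earn": 0.7, "acq": 0.1, "grain": 0.4},
    {"earn": 0.1, "acq": 0.8, "grain": 0.3},
]
gold = [{"earn"}, {"acq"}, {"grain"}, {"acq"}]

pred = [rcut(s, t=1) for s in doc_scores]
micro, macro = micro_macro_f1(gold, pred, categories)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy run the rare category "grain" is never predicted, which pulls the macro average well below the micro average. This is one reason both scores are commonly reported together.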