University of Tasmania

A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

Version 2 2025-07-08, 01:57
Version 1 2023-05-23, 02:56
conference contribution
posted on 2025-07-08, 01:57 authored by KH Lee, J Kay, Byeong Kang, U Rosebrock
Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to improve performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 and 20-Newsgroups data sets. The experimental results show that our new approaches give better results for both micro-averaged F1 and macro-averaged F1 scores.
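
As a rough illustration of two of the existing thresholding techniques named above and of the two evaluation measures, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used definitions: RCut assigns each document its t top-ranked categories, SCut applies a separate score threshold per category, micro-averaged F1 pools the counts over all categories, and macro-averaged F1 averages the per-category F1 values. All function names, the score format, and the parameters are hypothetical; KAN and RinSCut are not shown because the abstract does not describe them.

```python
from typing import Dict, List, Set, Tuple

# Assumed input format: one dict per document, mapping category -> similarity score.
Scores = Dict[str, float]


def rcut(scores: Scores, t: int) -> Set[str]:
    """RCut (rank-based): assign the t highest-scoring categories."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return set(ranked[:t])


def scut(scores: Scores, thresholds: Dict[str, float]) -> Set[str]:
    """SCut (score-based): assign each category whose score reaches
    that category's own threshold (thresholds are assumed to be tuned elsewhere)."""
    return {c for c, s in scores.items() if s >= thresholds[c]}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    predicted: List[Set[str]],
    actual: List[Set[str]],
    categories: List[str],
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the given categories."""
    # Per-category [true positives, false positives, false negatives].
    counts = {c: [0, 0, 0] for c in categories}
    for pred, gold in zip(predicted, actual):
        for c in categories:
            if c in pred and c in gold:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in gold:
                counts[c][2] += 1
    # Micro: pool the counts first, then compute a single F1.
    totals = [sum(v[i] for v in counts.values()) for i in range(3)]
    micro = f1(totals[0], totals[1], totals[2])
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(v[0], v[1], v[2]) for v in counts.values()) / len(categories)
    return micro, macro
```

Because macro-averaging weights every category equally, it is more sensitive to performance on rare categories, while micro-averaging is dominated by frequent ones; this is why a comparison of thresholding strategies typically reports both.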

History

Publication title

Proceedings / PRICAI 2002

Volume

2417

Editors

Jaime G Carbonell & Jorg Siekmann

Pagination

444-453

ISBN

3-540-44038-0

Department/School

Information and Communication Technology

Publisher

Springer-Verlag Berlin Heidelberg

Publication status

  • Published

Place of publication

Germany

Event title

7th Pacific Rim International Conference on Artificial Intelligence

Event Venue

Tokyo, Japan

Date of Event (Start Date)

2002-08-18

Date of Event (End Date)

2002-08-22

Socio-economic Objectives

220499 Information systems, technologies and services not elsewhere classified
