Text noise filtering methods for web information management
Thesis posted on 2023-05-26, 19:04, authored by Kim, YS
As people increasingly use Web information as their main knowledge resource, the development of computerized Web information management systems is becoming one of the major streams of Internet research. There are three major problems in this development. The first is the ambiguity of target documents, the so-called 'ontology problem'; text mining and ontology research mainly focus on this aspect. The second is that it is not easy to find where information is located. This has been a well-known problem since the early stages of Web technology, and many people focus on push-style information delivery technology to replace the current pull style, for example RSS (Really Simple Syndication) and automated Web information monitoring systems. The third is the complexity of the information on a Web page. This has received less attention in Web research, but people are now starting to recognize it as a crucial problem in real-world applications. This thesis focuses on the third problem. The goal of the research is to identify the core information within heterogeneous Web page content. Core information is the material that publishers want to impart to users. However, Web pages also contain 'noisy information', such as redundant information and functional information. Whereas core information helps knowledge management, noisy information may impede it. The noisy text filtering method consists of four filtering modules: a phrase-length-based filter, a tag-based filter, a redundant word elimination filter, and a redundant phrase elimination filter. Extensive comparative experiments have been conducted with real-world data sets collected from online news Web sites (ABC, BBC, and CNN). Experimental results show that this approach works efficiently and effectively.
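To make the filtering idea concrete, the following is a minimal Python sketch of how some of the filtering modules named in the abstract might be combined. It is not the thesis's implementation: the abstract gives no algorithmic detail, so every name and threshold here is an assumption for illustration (`SKIPPED_TAGS`, `PhraseExtractor`, `extract_phrases`, `filter_pages`, `min_words`, `max_page_share`). The sketch treats the text between tags as a "phrase", drops text inside link, script, and style elements as a stand-in for tag-based filtering, drops short phrases, and drops phrases that recur across many pages. It omits the redundant word elimination filter, since the abstract does not say how that filter works.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: text inside these elements is treated as functional noise.
SKIPPED_TAGS = {"script", "style", "a"}


class PhraseExtractor(HTMLParser):
    """Collects text runs between tags, skipping text inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.phrases = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self._skip_depth == 0:
            self.phrases.append(text)


def extract_phrases(html):
    """Tag-based step (assumed form): return the kept text runs of one page."""
    parser = PhraseExtractor()
    parser.feed(html)
    parser.close()
    return parser.phrases


def filter_pages(pages, min_words=5, max_page_share=0.5):
    """Apply the length and redundant-phrase steps to a list of HTML pages.

    min_words and max_page_share are illustrative thresholds, not values
    taken from the thesis.
    """
    per_page = [extract_phrases(page) for page in pages]
    # Count in how many pages each distinct phrase occurs.
    page_counts = Counter(
        phrase for phrases in per_page for phrase in set(phrases)
    )
    limit = max_page_share * len(pages)
    return [
        [
            phrase
            for phrase in phrases
            if len(phrase.split()) >= min_words  # phrase-length step
            and page_counts[phrase] <= limit     # redundant-phrase step
        ]
        for phrases in per_page
    ]
```

Calling `filter_pages` on a list of HTML strings from one site returns, for each page, the phrases that survive all three steps. Counting redundancy per page rather than per occurrence is a design choice of this sketch: it targets text such as navigation labels and copyright notices that repeat across a site's pages.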
Rights statement: Copyright 2004 the author - The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s). Thesis (MComp)--University of Tasmania, 2004. Includes bibliographical references.