
Unsupervised deep learning approach for information extraction

thesis
posted on 2024-04-18, 03:30 authored by Israel Fianyi

Unsupervised Information Extraction (UIE) refers to the task of automatically extracting relevant information (entities, the relations between entities, and their attributes) from unstructured, unannotated user-generated textual sources. Information extraction is a subfield of Natural Language Processing (NLP) within Artificial Intelligence, encompassing problems such as Relation Extraction, Named Entity Recognition, Event Extraction, Named Entity Linking, Coreference Resolution, Text Segmentation, Terminology Extraction, Question Answering, and many more. Despite the extensive progress of deep learning approaches across NLP tasks, unsupervised representation learning remains a challenge. The traditional approach to purely unsupervised learning in information extraction requires large unlabelled training datasets, and it is widely reported that unsupervised deep learning algorithms can only learn quality representations when large datasets are available to train the neural network. However, there are scenarios where only small or insufficient datasets are available and large datasets simply do not exist. Consequently, most existing algorithms do not work properly and cannot yield the required performance with small datasets. Most researchers have proposed collecting more data to mitigate the challenge of small datasets; nevertheless, collecting data for every task is a daunting and expensive practice. Andrew Ng, an artificial intelligence pioneer, recently argued in IEEE Spectrum for smart-size, data-centric solutions to address the big issues in our society.
Therefore, the ability to extract adequate and relevant information from platforms and domains with only small datasets is critical. How much data deep learning needs for a high-performing natural language processing task remains an active area of study.
This thesis explores several unsupervised deep representation learning techniques to generate the requisite representations for information extraction from small unlabelled data. The overarching question is whether small datasets can be leveraged for effective NLP tasks with unsupervised learning algorithms. Learning word representations for a natural language task from a small dataset requires a different skill set for applying the existing algorithms. The thesis investigates these challenges by developing fundamental unsupervised deep representation learning models for various information extraction tasks and evaluating them experimentally.
First, the thesis investigates various unsupervised word embedding techniques to generate semantic representations for Joint Entity and Relation Extraction by training deep learning models on a small dataset, comparing several methods of obtaining word representations for the proposed information extraction task. The study also introduces a zero-shot approach with word embedding techniques to augment small datasets, which generalises well across multi-domain datasets. Second, the thesis applies different unsupervised pretraining approaches to learn salient representations from small datasets for an Open Relation Extraction (ORE) task, while investigating how the number of hidden layers influences unsupervised pretraining in a deep learning network; the study found that such pretraining helps generate the requisite representations from small datasets. The investigation also proposes a novel Attention-sequence Bidirectional Long Short-Term Memory model with reconstruction loss techniques, applied after pretraining on small unlabelled data, for an unsupervised Relation Extraction task, a pattern sketched below.
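As a rough illustration of that pretraining pattern (not the thesis's actual architecture; every module, dimension, and variable name here is an assumption for the sketch), a bidirectional LSTM encoder with additive attention can be pretrained as a sequence autoencoder on unlabelled text, with a reconstruction loss over the input tokens:

# Hypothetical PyTorch sketch: unsupervised pretraining of a BiLSTM encoder
# with additive attention via a reconstruction loss. Illustrative only.
import torch
import torch.nn as nn

class BiLSTMAutoencoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # additive attention scores
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, seq)
        states, _ = self.encoder(self.embedding(tokens))    # (batch, seq, 2H)
        weights = torch.softmax(self.attn(states), dim=1)   # (batch, seq, 1)
        context = (weights * states).sum(dim=1)             # (batch, 2H)
        # Feed the attended sentence encoding at every decoder step.
        repeated = context.unsqueeze(1).repeat(1, tokens.size(1), 1)
        decoded, _ = self.decoder(repeated)                 # (batch, seq, H)
        return self.out(decoded)                            # (batch, seq, vocab)

model = BiLSTMAutoencoder(vocab_size=5000)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, 5000, (8, 20))    # stand-in for an unlabelled batch
loss = loss_fn(model(batch).view(-1, 5000), batch.view(-1))  # reconstruction
loss.backward()
optimiser.step()

The reconstruction loss gives the encoder a training signal without any labels, which is what makes this kind of pretraining usable on small unannotated corpora.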
Third, the thesis investigates unsupervised transfer learning techniques, preceded by unsupervised pretraining on a small unlabelled dataset, for an unsupervised Named Entity Linking (NEL) task. The transfer learning introduced is premised on learning prior knowledge from source-domain data and applying it to target-domain data without relying on external knowledge bases.
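Under the same illustrative assumptions as the sketch above (not the thesis's actual code), the pretrain-then-transfer idea amounts to pretraining an encoder on unlabelled source-domain text and reusing its weights to initialise the target-domain model, with no external knowledge base involved:

# Hypothetical pretrain-then-transfer pattern, reusing the autoencoder
# sketched earlier; assumes the two domains share a vocabulary.
source_model = BiLSTMAutoencoder(vocab_size=5000)
# ... pretrain source_model on unlabelled source-domain batches as above ...

target_model = BiLSTMAutoencoder(vocab_size=5000)
target_model.embedding.load_state_dict(source_model.embedding.state_dict())
target_model.encoder.load_state_dict(source_model.encoder.state_dict())

# Optionally freeze the transferred encoder while the target model adapts.
for param in target_model.encoder.parameters():
    param.requires_grad = False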
Finally, the proposed method helps overcome some of the critical existing challenges in unsupervised representation learning. This is possible because of the application of unsupervised pretraining for word representation learning, characterised by a unique amalgamation of techniques that synergise pretrained vectors with an embedding layer for transfer learning. The proposed approach generates the requisite representations from small datasets, facilitating reduced training time, better optimisation, and better generalisation of the network, compared with state-of-the-art methods on clustering accuracy.
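A minimal sketch of coupling pretrained vectors with an embedding layer, assuming the word vectors have first been learned on the small unlabelled corpus (for example with gensim's Word2Vec; the corpus and all identifiers are illustrative, not the thesis's pipeline):

# Hypothetical sketch: initialise a trainable embedding layer from vectors
# pretrained on the small unlabelled corpus.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = [["entities", "and", "relations"], ["small", "unlabelled", "data"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1, epochs=50)

# Stack the learned vectors in vocabulary order (gensim >= 4.0 API).
weights = np.stack([w2v.wv[word] for word in w2v.wv.index_to_key])

# freeze=False keeps the vectors trainable, so they continue adapting
# during downstream training rather than staying fixed.
embedding = nn.Embedding.from_pretrained(torch.tensor(weights), freeze=False)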
The effectiveness of the proposed models is evaluated on various unlabelled datasets for three different information extraction problems: named entity recognition (joint entity recognition), open relation extraction, and named entity linking. Overall, the investigation shows that unsupervised pretraining helps deep learning models learn quality representations from small unlabelled datasets for unsupervised information extraction. The outcomes of these investigations can be applied to a range of problems involving unsupervised representation learning and the lack of large datasets, as well as scenarios where only small unlabelled datasets are available.

History

Sub-type

  • PhD Thesis

Pagination

xv, 207 pages

Department/School

School of Information and Communication Technology

Event title

Graduation

Date of Event (Start Date)

2023-02-24

Rights statement

Copyright 2023 the author
