Synthesising Web Search Queries from Example Text Documents
Thesis posted on 2023-05-26, 08:14, authored by Patro, S
The huge number of documents available on the Web represents a challenge for information retrieval (IR) systems. The explosive growth in the number of documents published on the Web has made web search engines the main means of initiating interaction with the Internet. There are many good search engines, but users are often not satisfied with the results they provide. In many cases, the answers returned are not relevant to the user's information need, forcing the user to sift manually through a long list to locate the desired documents.

Often, when using a search engine, users need to refine their query repeatedly because they do not have enough domain knowledge to formulate it precisely. Although average users know what kind of information they want, they find it difficult to express it in a way that the search engine can interpret effectively. The specification of such a query is limited by the user's vocabulary and knowledge of the search domain. Even when disjunctions or conjunctions of keywords are used to express the search goal, as existing search engines support, the user may not know which set of keywords defines the collection of desired documents precisely. Good query formulation requires that a user can somehow predict which terms appear in documents relevant to the information need. Accurate term prediction requires extensive knowledge about the document collection. Such knowledge may be hard to obtain, especially in large document collections.

In the field of information retrieval, it has been recognised that, although users have difficulty expressing exactly the information they require, they can judge retrieved documents as relevant or irrelevant based on their information need.
This led to the notion of relevance feedback: users mark documents as relevant to their needs and present this information to the information retrieval system. The system can use this information to retrieve more documents like the relevant ones by a process known as query expansion.

This research explores the use of relevance feedback techniques to automatically discover words related to a query from the contents of the user-identified relevant documents. With this set of words, it gives an algorithm to synthesise the user query in the form of a Boolean expression. The basic idea is that a synthesised query providing a richer representation of the user's query would increase the number of relevant documents retrieved when used as a query to a search engine. The three objectives for the algorithm are to ensure that the synthesised query has good recall, has good precision and, not least, is of a form and size acceptable to the intended search engine.

The query synthesis algorithm starts by submitting a first-cut search query to a search engine. The search engine returns a set of documents. Assuming that the documents found on the Web are text documents, the user labels each document as Relevant or Irrelevant based on their information need. From these two sets of documents the algorithm creates a Boolean search query in the following five steps:

1. The Boolean query construction begins with the construction of a CNF (Conjunctive Normal Form) Boolean expression of terms that selects every document in set Relevant and rejects every document in set Irrelevant. However, the expressions so constructed are often too large to be acceptable to a search engine.

2. The CNF expression is transformed into an equivalent DNF (Disjunctive Normal Form) expression. Redundant minterms are removed from further consideration and the set of non-redundant minterms is referred to as Mset.

3. 
A Boolean expression Query is constructed by selecting minterms from Mset. The goal is to select a small set of minterms that selects each document in set Relevant. The constructed query is then written in a form suitable to the search engine.

4. The process stops if the Query is acceptable to the search engine. The Boolean expression Query becomes the required synthesised query. Otherwise, the Query needs modification in step 5.

5. Minterms are modified to create a new minterm set Mset and the process repeats from step 3.

In this research, Google is used as the prime example of a search engine because of its popularity and its cached link features on the Web. To confirm the success of the proposed query synthesis algorithm, a survey was organised with day-to-day users of a general-purpose search engine such as Google. To conduct the survey, a list of topics in diverse domains was chosen to collect data from the Web, and a set of queries was generated by applying the proposed algorithm to these data sets. The participants were then asked to create queries for these host topics consistent with the information need. No constraint was placed on them regarding the time, number of tries or quality of their query. The aim was to compare the quality of the human-generated queries with the synthesised queries using evaluation measures known as precision and coverage, and their combination, called the F1 measure.

The traditional precision and coverage measurements collected during the survey show that the synthesised queries overwhelmingly perform better than the user queries. The F1 measure is employed as the main evaluation metric as it combines both precision and recall into a single metric and favours a balanced performance of the two. It resolves anomalous situations in which a query with large coverage but low precision may not be considered as satisfactory as one with modest coverage but high precision.
The number of relevant documents among the first 10 and 20 retrieved links is used as a measure of precision. Due to the difficulties of calculating recall, a new measure called coverage is used.

The results show that the synthesised queries can provide better values of the F1 measure than the queries generated from a user's best effort. Higher values of the F1 measure indicate the high values of precision and coverage obtained by the synthesised queries. Besides achieving the above goals, the proposed algorithm is able to synthesise queries in a form and size acceptable to the search engine.

To verify that the outcome of the survey did not arise by chance, a statistical procedure known as the paired t-test was applied to the data obtained from the survey. The results of this test suggest that the synthesised queries provide better results than human-generated queries, and that the difference is statistically significant (P-value < 0.00001).

The data obtained from the user survey have also been used to provide insights into the quality of human queries as a function of their syntactical and other characteristics.
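
As an illustration of steps 1 to 3 of the synthesis procedure described above, the following is a minimal Python sketch, not the thesis's implementation. It rests on several assumptions that the abstract does not state: documents are modelled as plain sets of terms, the CNF uses positive terms only (one clause per irrelevant document), and minterms are selected greedily. The function names `build_cnf`, `cnf_to_dnf`, `select_minterms` and `to_query`, and the toy documents, are hypothetical. Steps 4 and 5 are omitted because the acceptability criteria of the search engine are not given here.

```python
from itertools import product


def matches(minterm, doc):
    """A minterm (an AND of terms) selects a document containing all its terms."""
    return minterm <= doc


def build_cnf(relevant, irrelevant):
    """Step 1 (assumed construction): one OR-clause per irrelevant document,
    made of the relevant-document terms that the irrelevant document lacks."""
    vocab = frozenset().union(*relevant)
    cnf = []
    for bad in irrelevant:
        clause = vocab - bad
        # With positive terms only, a relevant document whose terms all occur
        # in an irrelevant document cannot be separated from it.
        if any(not (doc & clause) for doc in relevant):
            raise ValueError("a relevant document cannot be separated")
        cnf.append(clause)
    return cnf


def cnf_to_dnf(cnf):
    """Step 2: distribute AND over OR, then drop absorbed (redundant) minterms."""
    minterms = {frozenset(choice) for choice in product(*cnf)}
    kept = [m for m in minterms if not any(other < m for other in minterms)]
    # Sort so that the result does not depend on set iteration order.
    return sorted(kept, key=lambda m: (len(m), sorted(m)))


def select_minterms(mset, relevant):
    """Step 3 (greedy cover, an assumption): repeatedly pick the minterm that
    selects the most relevant documents not yet covered."""
    uncovered = list(range(len(relevant)))
    chosen = []
    while uncovered:
        best = max(mset, key=lambda m: sum(matches(m, relevant[i]) for i in uncovered))
        chosen.append(best)
        uncovered = [i for i in uncovered if not matches(best, relevant[i])]
    return chosen


def to_query(minterms):
    """Write the selected minterms as a Boolean query string (hypothetical syntax)."""
    return " OR ".join("(" + " AND ".join(sorted(m)) + ")" for m in minterms)


# Toy data: each document is reduced to a set of terms.
relevant = [{"jaguar", "car", "engine"}, {"jaguar", "car", "dealer"}]
irrelevant = [{"jaguar", "cat", "jungle"}, {"car", "rental"}]

mset = cnf_to_dnf(build_cnf(relevant, irrelevant))
chosen = select_minterms(mset, relevant)
print(to_query(chosen))  # prints "(car AND jaguar)"
```

The CNF-to-DNF expansion in this sketch is exponential in the number of clauses, which is consistent with the remark above that the constructed expressions are often too large, and is one reason a real implementation would need the minterm modification loop of steps 4 and 5.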