RIC: Research Interest Comparator
Harold R. Garner, Ph.D.
Professor
Affiliation: UTSW
Office: not provided
Phone #: not provided
Fax: not provided
Email: garner@swmed.edu
Home Page: not provided
Lab: not provided
Results - NEW THIS MONTH:
No matching results
Abstract:
We implemented a variety of Information Retrieval algorithms described in the literature, and rated their relative performance in a variety of similarity search tasks on both Medline abstracts and TREC. Our aim is to discover which of several techniques performs best in a natural language query of the Medline database. We discuss each of the algorithms we implemented, and describe its relative weaknesses and merits. Results from our study are measured by a variety of standards, and indicate that for our task some algorithms are clearly superior to others.
1. Introduction
It is a cliché by now to point out the importance of automatic and intelligent information retrieval. The need is clear, and we will not labor the point here. The type of system most effective in filling this need is the topic of this paper. There are two basic approaches to the problem of getting the right text (i.e., a journal article) into the hands of the person who wants or needs it. These approaches are sometimes known as the Boolean Approach and the Similarity Approach [Shatkay 2000]. NCBI’s PubMed offers a widely known Boolean-type search service that allows a user to query a database using simple keywords. This is the most common form of literature search, and almost everyone is comfortable and familiar with it. Boolean searches generally provide operators such as AND and OR, or quoted phrases, to help users narrow a query and return more specific results. In the Similarity Approach, the user provides some ordinary (natural language) text, and the program returns a set of “relevant” works, with relevance judged by the “similarity” of two documents.
Each of these approaches to text retrieval has its advantages and disadvantages, which are discussed in [Shatkay 2000], among other places. Briefly, the failing of the Boolean Approach is that it may not be restrictive enough. For example, searching Medline for “gene AND cancer” may return millions of documents. On the other hand, the Similarity Approach can fail in various ways depending on how “similarity” is measured. Additionally, algorithms implementing some flavor of the Similarity Approach can be slow and resource-intensive. It is this latter Similarity Approach with which we will be concerned for the remainder of this work.
We implement a number of similarity search algorithms and compare their performance on Medline and TREC. The goal of this brief survey of natural language similarity algorithms is to see if one particular algorithm recommends itself above the rest as a clear-cut “best” similarity search algorithm. In Section 2 we describe each of the algorithms we chose to implement. Section 3 lays out our strategy for evaluating performance and the assumptions we made. Section 4 presents our results. We end in Section 5 with analysis and conclusions, and the future direction of this work.
2. Techniques for Text Comparison
In this section, we will discuss the text comparison techniques we chose to implement, and explain them in some detail.
Text similarity can be described and measured in a variety of ways. We can categorize these methods roughly into word level, phrase level, and sentence level. Word level techniques measure similarity by the number of important words two pieces of text have in common. Phrase level techniques go one step deeper by looking for common strings of words. Not only are common words important, but their order also comes into play. This added level of sophistication is useful because it allows much finer distinctions between target concepts. Unfortunately, it carries the cost of increased complexity and computing time. More complex still are sentence level comparisons. These techniques can still measure similarity based on the presence or absence of certain words or strings of words, but may also include advanced Artificial Intelligence or Natural Language Processing techniques that move beyond the realm of simple Information Retrieval. Techniques of this complexity may rely on domain-specific knowledge bases, can be cumbersome to implement, and may not be readily adaptable to other domains of knowledge.
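The difference between word level and phrase level comparison can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the study: the stop list, the tokenizer, the function names, and the choice of two-word phrases are all assumptions made for illustration.

```python
import re

# Hypothetical stop list for illustration only; not the list used in the study.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_overlap(a, b):
    """Word level: count distinct non-stop words shared by both texts."""
    words_a = set(tokenize(a)) - STOP_WORDS
    words_b = set(tokenize(b)) - STOP_WORDS
    return len(words_a & words_b)

def phrase_overlap(a, b, n=2):
    """Phrase level: count distinct n-word sequences shared by both texts."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams(tokenize(a)) & ngrams(tokenize(b)))

doc1 = "breast cancer gene expression"
doc2 = "gene expression in breast cancer"
doc3 = "cancer expression breast gene"

# doc2 and doc3 contain the same important words as doc1, but only doc2
# preserves any of the word order.
print(word_overlap(doc1, doc2), phrase_overlap(doc1, doc2))
print(word_overlap(doc1, doc3), phrase_overlap(doc1, doc3))
```

In this toy example the word level score cannot tell doc2 from doc3, while the phrase level score can, at the cost of building and intersecting a larger set of n-word sequences.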
RIC Statistics:
Extraction Method: Keyword Count with Lexical Variants Added
Eliminated words list: MedlinePlus List
Similarity Method: Weighted Keyword Count
Weighting Method: Term Frequency * Inverse Document Frequency
Database: Medline Updates from current month
Publication Type: All
Score Calculation Method: Cosine Similarity Method
Sort by: Score
Results computed on: 6/9/2006
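The settings listed above combine Term Frequency * Inverse Document Frequency weighting with a cosine similarity score. A minimal Python sketch of that combination follows. It is not RIC's actual code: the page does not say which TF or IDF variant is used, so raw term counts and idf = log(N / df) are assumptions, as are the function names and the toy documents.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    Assumes raw counts for tf and log(N / df) for idf; other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Toy keyword lists standing in for extracted keywords (hypothetical data).
docs = [
    ["gene", "cancer", "therapy"],
    ["gene", "cancer", "mutation"],
    ["plasma", "physics", "fusion"],
]
v0, v1, v2 = tf_idf_vectors(docs)

print(cosine(v0, v1))  # shares two terms with the first document
print(cosine(v0, v2))  # shares no terms with the first document
```

Sorting candidate documents by this score in descending order would correspond to the "Sort by: Score" setting.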