RIC: Research Interest Comparator

Chi, Robert, Student
Affiliation: Academia Sinica
Email: cnchi@iis.sinica.edu.tw
Home Page: http://blog.robertchi.com

Abstract:
BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.
RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.
CONCLUSION: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
Keywords extracted from the abstract:
Count Word
2.773 abstracts
3.949 accuracy
6.441 algorithm
3.893 ambiguities
3.436 ambiguity
6.475 ambiguous
2.071 applications
1.396 approach
3.064 automatically
1.470 background
1.224 biological
3.847 bottleneck
1.378 combined
2.830 comprises
1.407 conclusion
1.478 containing
0.847 data
2.393 databases
2.305 decisions
3.991 denote
1.308 developed
17.966 disambiguation
3.682 discovering
3.293 disparate
2.920 enabling
1.708 extent
2.033 fast
3.873 gene
6.000 gene-symbol
1.856 generated
3.114 genes
1.174 genetic
1.704 great
0.994 high
3.074 holds
5.345 homonym
1.614 human
1.237 important
1.327 including
2.771 information
1.702 knowledge
1.583 literature
1.709 little
1.299 major
4.847 massive
7.008 meanings
2.730 medline
2.804 mesh
6.038 mining
0.902 more
1.061 most
2.731 multiple
11.263 non-gene
3.426 omim
1.994 only
2.907 operate
0.926 other
1.551 overall
1.752 particularly
1.756 problem
2.728 promise
1.743 proposed
3.098 refer
2.615 relating
3.633 resolves
0.741 results
4.309 scalable
5.049 set
1.714 simple
4.209 substantial
8.934 symbol
14.968 symbols
1.721 taken
5.381 text
9.146 thesaurus
6.000 thesaurus-based
1.680 training
RIC Statistics:
Extraction Method: Keyword Count with Lexical Variants Added
Eliminated words list: MedlinePlus List
Similarity Method: Weighted keyword count
Weighting Method: Term Frequency * Inverse Document Frequency
Database: Medline abstracts (1967 - Present)
Publication Type: All
Score Calculation Method: Cosine Similarity Method
Sort by: Score
Submission date and time: 2-3-2007, 10:16:31
Computation time: 00:00:04
Last updated: Saturday, 03-Feb-2007 10:16:35 CST
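
The settings above name TF*IDF weighting and cosine similarity scoring. As a rough illustration only, the sketch below shows one common way these two steps fit together. It is not RIC's actual implementation: the IDF variant (natural log of N/df), the function names, and the toy document frequencies are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(tokens, doc_freq, n_docs):
    """Weight each term by raw term frequency times log(n_docs / document frequency).

    doc_freq maps a term to the number of documents containing it
    (hypothetical values here; RIC's real statistics are not shown on this page).
    Terms missing from doc_freq are skipped.
    """
    counts = Counter(tokens)
    return {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus statistics (assumed, for illustration only).
n_docs = 1000
doc_freq = {"gene": 100, "symbol": 10, "disambiguation": 5, "protein": 200}

query = tf_idf(["gene", "symbol", "disambiguation", "symbol"], doc_freq, n_docs)
candidate = tf_idf(["gene", "protein"], doc_freq, n_docs)
score = cosine_similarity(query, candidate)
```

Under these assumptions, rare terms such as "disambiguation" receive larger weights than common ones such as "gene", and candidates would be sorted by `score`, in line with the "Sort by: Score" setting.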