TY - GEN
T1 - Annotating genes using textual patterns
AU - Cakmak, Ali
AU - Ozsoyoglu, Gultekin
PY - 2007
Y1 - 2007
N2 - Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant human effort, is very costly, and cannot keep up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system that automatically infers new GO annotations for genes from biomedical papers, based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) constructs, from the extracted terms, textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through pattern crosswalks, (iv) employs semantic rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the quality of the matched pattern to the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision of 78% at a recall of 57%.
AB - Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant human effort, is very costly, and cannot keep up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system that automatically infers new GO annotations for genes from biomedical papers, based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) constructs, from the extracted terms, textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through pattern crosswalks, (iv) employs semantic rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the quality of the matched pattern to the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision of 78% at a recall of 57%.
UR - http://www.scopus.com/inward/record.url?scp=38449120403&partnerID=8YFLogxK
M3 - Conference contribution
C2 - 17990494
AN - SCOPUS:38449120403
SN - 9812704175
SN - 9789812704177
T3 - Pacific Symposium on Biocomputing 2007, PSB 2007
SP - 221
EP - 232
BT - Pacific Symposium on Biocomputing 2007, PSB 2007
T2 - Pacific Symposium on Biocomputing, PSB 2007
Y2 - 3 January 2007 through 7 January 2007
ER -