Electronic Theses and Dissertations

Identifier

757

Author

Sujoy Roy

Date

2012

Document Type

Dissertation (Access Restricted)

Degree Name

Doctor of Philosophy

Major

Computer Science

Committee Chair

Ramin Homayouni

Committee Member

Sajjan Shiva

Committee Member

Vasile Rus

Abstract

Rapid growth of biomedical literature related to genes and molecular pathways presents a serious challenge for the interpretation of genomic data. This dissertation investigates two approaches, utilizing matrix and tensor factorizations respectively, for mining gene regulatory information from biomedical text. In the first approach, latent semantic indexing (LSI), which is based on singular value decomposition (SVD), is used for identification and ranking of putative regulatory transcription factors (TFs) from microarray-derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to derive pairwise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated as TFs in the TRANSFAC database. The LSI-derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. Evaluation was conducted using five different publicly available microarray datasets. Sets of gold standard TFs known to be functionally relevant to each study were constructed. Receiver Operating Characteristic (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence-based method for four out of five datasets, and also outperformed motif searching approaches, in identifying putative TFs. For the second study, the utility of nonnegative tensor factorization (NTF) is explored to extract keywords describing semantic relationships between genes and the TFs that regulate them. A 3-mode tensor was generated for 70,859 terms (keywords), 7,672 genes and 992 TFs extracted from shared MEDLINE abstracts. Multiway clusters of terms, genes and TFs were evaluated at various tensor factorization approximation ranks.
Examination of several clusters, using gene pathway and category databases, revealed that NTF accurately identified cohorts of genes and TFs in well-established signaling pathways and simultaneously annotated them functionally. In addition, the method revealed functionally cohesive term-gene-TF clusters that were not well documented, perhaps pointing to new discoveries. Taken together, this body of work provides proof of concept that matrix and tensor factorization methods could be useful in the interpretation of genomic data.
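
As an illustration of the first approach described in the abstract, the following is a minimal sketch of log-entropy term weighting followed by a truncated SVD and cosine similarity between gene vectors. It is not the dissertation's implementation. The toy count matrix, the function names, the rank of 2, and the treatment of each column as one gene's collected abstracts are all assumptions made for illustration, and the weighting shown is the common textbook form of log-entropy, which may differ from the variant actually used.

```python
import numpy as np


def log_entropy_weight(counts):
    """Apply log-entropy weighting to a terms x documents count matrix.

    Local weight: log(1 + tf). Global weight: 1 + sum_j p_ij*log(p_ij)/log(n),
    where p_ij is the share of term i's occurrences found in document j.
    Assumes at least two documents (n > 1).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Treat 0 * log(0) as 0 when accumulating the entropy term.
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]


def lsi_doc_vectors(weighted, rank):
    """Return documents x rank coordinates from a rank-truncated SVD."""
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return vt[:rank].T * s[:rank]


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)
    return unit @ unit.T


# Hypothetical toy data: 5 terms (rows) x 4 gene documents (columns).
# Genes 0 and 1 share vocabulary, as do genes 2 and 3; the last term is
# spread evenly over all genes and therefore carries no information.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 1, 1, 1],
])

weighted = log_entropy_weight(counts)
gene_vectors = lsi_doc_vectors(weighted, rank=2)
sim = cosine_similarity_matrix(gene_vectors)
print(np.round(sim, 3))
```

In this sketch the evenly distributed term receives a global weight of about zero, so it does not contribute to gene similarity, which is the motivation usually given for entropy-based weighting over raw term frequency.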

Comments

Data is provided by the student.

Library Comment

Dissertation or thesis originally submitted to the local University of Memphis Electronic Theses & Dissertations (ETD) Repository.
