Identifying gene-environment interactions for prognosis using a robust approach


For many complex diseases, prognosis is of essential importance. It has been shown that, beyond the main effects of genetic (G) and environmental (E) risk factors, gene-environment (G × E) interactions also play a critical role. In practical data analysis, part of the prognosis outcome data can have a distribution different from that of the rest of the data because of contamination or a mixture of subtypes. The literature has shown that data contamination, as well as a mixture of distributions, can lead to severely biased model estimation if not properly accounted for. In this study, we describe prognosis using an accelerated failure time (AFT) model. An exponential squared loss is proposed to accommodate data contamination or a mixture of distributions. A penalization approach is adopted for regularized estimation and marker selection. The proposed method is realized using an effective algorithm that combines coordinate descent (CD) and minorization-maximization (MM). The estimation and identification consistency properties are rigorously established. Simulations show that, without contamination or mixture, the proposed method has performance comparable to or better than the nonrobust alternative. With contamination or mixture, it outperforms the nonrobust alternative and, under certain scenarios, is superior to the robust method based on quantile regression. The proposed method is applied to the analysis of TCGA (The Cancer Genome Atlas) lung cancer data. It identifies interactions different from those identified by the alternatives. The identified markers have important implications and satisfactory stability.
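
To illustrate why an exponential squared loss can accommodate contamination, the following is a minimal sketch that assumes the form common in the robust-regression literature, 1 - exp(-r^2 / gamma), where r is a residual and gamma is a tuning parameter. The function names, the value of gamma, and the residuals are hypothetical, and the abstract does not state the exact parameterization, the handling of censoring, or the penalty used in the proposed method.

```python
import numpy as np

def exp_squared_loss(residuals, gamma=1.0):
    """Assumed form of the exponential squared loss: 1 - exp(-r^2 / gamma).

    The loss is bounded above by 1, so a single extreme residual
    cannot dominate the objective. gamma (hypothetical default 1.0)
    controls how quickly large residuals are down-weighted.
    """
    r = np.asarray(residuals, dtype=float)
    return 1.0 - np.exp(-r**2 / gamma)

def squared_loss(residuals):
    """Ordinary (nonrobust) squared loss, shown for comparison."""
    r = np.asarray(residuals, dtype=float)
    return r**2

# Illustrative residuals; the last one mimics a contaminated observation.
residuals = np.array([0.0, 0.5, 1.0, 10.0])
robust = exp_squared_loss(residuals, gamma=1.0)
nonrobust = squared_loss(residuals)

for r, a, b in zip(residuals, robust, nonrobust):
    print(f"residual={r:5.1f}  exp-squared={a:.4f}  squared={b:.1f}")
```

Under this assumed form, the contaminated residual contributes at most 1 to the robust objective, while its contribution to the squared loss grows without bound. This is the intuition behind the robustness claim, not a reproduction of the estimator itself.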

Publication Title

Econometrics and Statistics