AI for colorectal cancer: towards the improved CMS classification and interpretability

Description

Access to large, complex biomedical datasets now allows scientists to take full advantage of AI-driven approaches in a variety of applications with high societal impact. One such application is precision medicine, which is gradually becoming a reality for some cancers. Unfortunately, for colorectal cancer (CRC), predictive biomarkers are scarce and are more effective at identifying non-responders than patients who may benefit from treatment. To select CRC patients who might respond well to a certain type of therapy, clinicians use the consensus molecular subtypes (CMS) to classify CRC tumors. The standard CMS classification contains four CRC subtypes, which were inferred by a Random Forest (RF) classifier based on gene expression data from CRC tumor samples. However, a significant percentage of CRC tumors (up to 37%) currently cannot be classified based on RNA-seq expression data. Moreover, one of the subtypes appears to constitute a "rest" class, which contains tumors that do not fit into any of the other three subtypes and have no apparent common biological characteristics. Thus, the current CMS classification requires improvement, both to include more samples and to be more useful in clinical settings for CRC prognosis and for predicting patients' response to treatment. Here we propose to improve the CMS tumor classification for CRC patients. In addition to the RF algorithm originally used for CMS classification, we will apply deep learning (DL), which should allow detection of more complex data patterns, especially since our learning algorithms will integrate two molecular data types. Specifically, in addition to the gene expression data, we will also use genome-wide data on synonymous mutations, which have been largely underappreciated until now.
Increasing evidence suggests that synonymous mutations (i.e., nucleotide mutations that preserve the encoded amino acid) are not always "silent", as is typically thought, but can affect gene expression, alternative splicing, protein folding, and other processes, and have been linked to diseases such as cancers, including CRC. In particular, in colorectal cancer synonymous mutations were shown to be over-represented in ion transport channels, which play a significant role in many cell processes and are considered targets for cancer treatment. While both RF and DL are typically used as black boxes, this work will also contribute towards the interpretability of the CMS classification produced by the RF classifier, the DL classifier, or an ensemble classifier system including both. This will be achieved by scrutinising the role of synonymous mutations and their relative influence on CMS predictions in CRC patients in combination with gene expression data. The goal is to discover new combinations of features that are most important for each CMS. This could be done by integrating an attention layer into a DL model, so that the molecular features associated with a specific CMS can be examined using their network weights as a proxy for their relative significance. These features may be associated with certain morphological characteristics of a CRC tumor as observed on histopathological images. Consequently, this analysis will contribute to our understanding of the biological basis of the CMS subtypes in CRC, and will therefore help to improve CRC prognosis and treatment.
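To make the attention idea above more concrete, the following is a minimal sketch in plain Python of feature-level attention over a combined input of gene expression and synonymous-mutation features. It is an illustration only, not the project's model: the feature names, the toy samples, and the scoring parameters `score_w` and `score_b` (stand-ins for values a trained network would learn) are all hypothetical assumptions. The sketch shows how per-feature attention weights could be averaged within each subtype to rank features by relative significance.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, score_w, score_b):
    """One attention score per input feature, normalised so the weights sum to 1."""
    scores = [w * x + b for x, w, b in zip(features, score_w, score_b)]
    return softmax(scores)


def mean_attention_by_subtype(samples, labels, feature_names, score_w, score_b):
    """Average the attention weights of all samples sharing a subtype label."""
    sums = defaultdict(lambda: [0.0] * len(feature_names))
    counts = defaultdict(int)
    for x, label in zip(samples, labels):
        alpha = attention_weights(x, score_w, score_b)
        sums[label] = [s + a for s, a in zip(sums[label], alpha)]
        counts[label] += 1
    return {
        label: dict(zip(feature_names, [s / counts[label] for s in total]))
        for label, total in sums.items()
    }


# Hypothetical features: two expression values and two synonymous-mutation flags.
feature_names = ["expr_GENE_A", "expr_GENE_B", "syn_mut_GENE_C", "syn_mut_GENE_D"]

# Toy samples (assumed values, not real patient data).
samples = [
    [2.0, 0.1, 1.0, 0.0],
    [1.8, 0.3, 1.0, 0.0],
    [0.2, 2.1, 0.0, 1.0],
    [0.1, 1.9, 0.0, 1.0],
]
labels = ["CMS1", "CMS1", "CMS2", "CMS2"]

# Stand-ins for learned attention parameters (assumption: fixed by hand here).
score_w = [1.0, 1.0, 1.5, 1.5]
score_b = [0.0, 0.0, 0.0, 0.0]

profile = mean_attention_by_subtype(samples, labels, feature_names, score_w, score_b)
for subtype, weights in sorted(profile.items()):
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    print(subtype, [(name, round(w, 3)) for name, w in ranked])
```

In a real DL model the scoring parameters would be trained jointly with the classifier, and attention weights would be only one of several possible proxies for feature relevance; the per-subtype averaging shown here is one simple way to summarise them.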