Citation

Reference
Jiarui Chen, Hong-Hin Cheong, and Shirley Weng In Siu.
xDeep-AcPEP: Deep Learning Method for Anticancer Peptide Activity Prediction based on Convolutional Neural Network and Multi-Task Learning.

J. Chem. Inf. Model. 2021, 61, 8, 3789-3803.

Reference
Hong-Hin Cheong, Weimin Zuo, Jiarui Chen, Chon-Wai Un, Yain-Whar Si, Koon Ho Wong, Hang Fai Kwok, Shirley Weng In Siu.
In Silico Identification and In Vitro Validation of Anticancer Peptides for Colorectal Cancer by a Multi-step Artificial Intelligence Screening Workflow.

Submitted, 2022.

What are Anticancer Peptides (ACP)?

Cancer is one of the leading causes of death worldwide. Conventional cancer treatment relies on radiotherapy and chemotherapy, but both bring severe side effects to patients because they damage normal cells as well as cancer cells. Anticancer peptides (ACPs) are a promising alternative: therapeutic agents that are efficient and selective against tumor cells. Several mechanisms of action are known for ACPs. They attack cancer cells by disrupting their cell membranes. They penetrate the mitochondria, causing release of cytochrome c and apoptosis. They may also target certain membrane receptors, modulating signal transduction and the cell cycle.

AcPEP: Method to Classify ACPs and non-ACPs

The development of the ACP classifier is presented in Figure 3. It comprises three steps. Feature extraction (step 1): each sample sequence was converted into numerical feature vectors of 32 feature groups using iFeature. Initial feature group selection (step 2): the feature groups were evaluated for target relevance by comparing the performance of their Random Forest models. Model selection (step 3): the features of the best-performing feature groups were concatenated and subjected to extensive evaluation using combinations of feature preprocessing, feature selection, and learning algorithms. We considered 9 feature preprocessing options (8 methods plus the raw features), 7 feature selection options (6 methods plus the raw features), and 5 learning algorithms, as shown in Figure 3. In total, 9 × 7 × 5 = 315 different model training procedures were compared by 10-fold cross-validation to identify the optimal one.

Figure 3. The development of the ACP classifier.
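
The size of the search in step 3 can be checked with a short enumeration. This is only a sketch: the text gives the number of options in each stage but not their names, so the labels below (scaler_1, selector_1, learner_1, and so on) are hypothetical placeholders.

```python
from itertools import product

# Placeholder labels: only the counts come from the text, not the method names.
scalers = ["raw"] + [f"scaler_{i}" for i in range(1, 9)]      # 8 methods + raw
selectors = ["raw"] + [f"selector_{i}" for i in range(1, 7)]  # 6 methods + raw
learners = [f"learner_{i}" for i in range(1, 6)]              # 5 algorithms

# Every (preprocessing, selection, learner) triple is one training procedure.
procedures = list(product(scalers, selectors, learners))
print(len(procedures))  # 315
```

Each triple would then be scored by 10-fold cross-validation, which is why the number of combinations matters for the cost of the search.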

Model Performance

In 10-fold cross-validation (CV) with different combinations of feature scaling, feature selection, and machine learning algorithms, the top 5 models were all based on SVC. As shown in Table 1, the CV accuracy of each of these models is within one standard deviation of the top-ranked model. Specifically, Standard:RFE-SVC:SVC outperformed the others, while Standard:ANOVA:SVC achieved similar performance with greatly reduced dimensionality (742 versus 1601 features). On an independent test set, our models were compared with 11 previously published ACP classification methods and showed a 1-2% improvement in accuracy (Figure 2). For better generalization, we chose Standard:ANOVA:SVC as our final model and named it AcPEP.

Scaler    Selection  Model  Dims  ACC     MCC     AUCROC  F1      SEN     SPC     PRC     CV(mean)  CV(std)
Standard  RFE-SVC    SVC    1601  0.9471  0.8983  0.9934  0.9445  0.9     0.9943  0.9937  0.9471    0.0205
Power     RFE-SVC    SVC    1601  0.9407  0.885   0.9924  0.9379  0.8957  0.9857  0.9843  0.9407    0.0181
Standard  ANOVA      SVC    742   0.9307  0.8635  0.9783  0.9282  0.8957  0.9657  0.9631  0.9307    0.0128
Standard  RF         SVC    1601  0.9279  0.8584  0.9816  0.9249  0.8886  0.9671  0.9643  0.9279    0.0203
Power     RF         SVC    1601  0.9271  0.857   0.9786  0.9241  0.8871  0.9671  0.9643  0.9271    0.0188
Table 1: The 10-fold CV performance of the best 5 ACP classifiers developed in this work
Figure 2: Performance comparison with 11 online ACP prediction methods on the independent test set.
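
For readers unfamiliar with the metric abbreviations in Table 1, the sketch below computes them from a binary confusion matrix. Two assumptions are made here and are not stated in the source: PRC is read as precision, and the counts (630, 70, 696, 4) are hypothetical values for a balanced set of 1,400 sequences, chosen only because they happen to reproduce the first row of the table. AUCROC is omitted because it needs ranked scores, not counts.

```python
from math import sqrt

def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),   # accuracy
        "SEN": tp / (tp + fn),                    # sensitivity (recall)
        "SPC": tn / (tn + fp),                    # specificity
        "PRC": tp / (tp + fp),                    # precision (assumed meaning)
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts, not taken from the source.
metrics = binary_metrics(tp=630, fn=70, tn=696, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC is the most informative single number in the table because it uses all four cells of the confusion matrix, which is why it sits well below ACC when sensitivity and specificity differ.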

xDeep-AcPEP: Method to Predict the Biological Activity of ACPs against Cancers

xDeep-AcPEP is a novel regression method based on a convolutional neural network and multi-task learning to predict the bioactivity of anticancer peptides. A set of cancer-specific models was trained on the CancerPPD data sets to predict activity against six tumor types: breast, colon, cervix, lung, skin, and prostate.

As shown in the workflow figure (Figure 1), we chose the following four descriptors to encode a sequence in numerical form: AAINDEX (AAI), BLOSUM62 (BLO), Z-scale descriptor (ZSC), and binary profile (BIN). The encoder contains two 1D-convolutional layers with ReLU, two average pooling layers, two batch normalization layers, and one max pooling layer. The regressor contains three fully connected layers with one final output neuron. We define the applicability domain (AD) of each model to allow estimation of the uncertainty in the prediction for an unknown instance. The Euclidean distance between an instance and the centroid of the training data in the feature space is measured. If this distance is within a pre-defined cutoff (Z), the prediction can be made with confidence.
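
The applicability-domain check can be sketched in a few lines. The text specifies only a Euclidean distance to the training centroid and a cutoff Z. How Z is converted into a distance threshold is not stated here, so this sketch assumes threshold = mean training distance + Z × standard deviation of training distances. The function names fit_domain and in_domain are hypothetical, and the toy 2-D points stand in for the learned feature vectors.

```python
from math import dist
from statistics import mean, pstdev

def fit_domain(train_features, z=1.0):
    """Return (centroid, cutoff) for a set of training feature vectors."""
    n = len(train_features)
    centroid = [sum(column) / n for column in zip(*train_features)]
    distances = [dist(x, centroid) for x in train_features]
    # Assumed form of the threshold; the source only names the cutoff Z.
    cutoff = mean(distances) + z * pstdev(distances)
    return centroid, cutoff

def in_domain(x, centroid, cutoff):
    """True if x lies within the applicability domain."""
    return dist(x, centroid) <= cutoff

# Toy training features (made up for illustration).
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]]
centroid, cutoff = fit_domain(train, z=1.0)

print(in_domain([2.0, 2.0], centroid, cutoff))  # inside the domain
print(in_domain([3.0, 3.0], centroid, cutoff))  # outside the domain
```

Under this assumed form, a smaller Z tightens the threshold, so fewer instances fall inside the domain.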

Figure 1. The development workflow of xDeep-AcPEP.
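
Of the four descriptors in the workflow, the binary profile is the simplest to illustrate: each residue becomes a 20-dimensional one-hot vector. The sketch below is an illustration only. The residue ordering, the zero-padding to a fixed length, and the name binary_profile are assumptions, not details taken from the source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is assumed

def binary_profile(sequence, max_len):
    """One-hot encode a peptide, zero-padded or truncated to max_len rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence[:max_len]:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    # Pad short peptides with all-zero rows so every input has the same shape.
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(max_len - len(rows)))
    return rows

matrix = binary_profile("ACK", max_len=5)
print(len(matrix), len(matrix[0]))  # 5 20
```

A fixed-size matrix of this kind is the sort of input a 1D-convolutional encoder can slide over along the sequence axis.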

Model Performance

Using repeated five-fold cross-validation, we assessed the performance of our models over a range of AD cutoffs (Z = 0.5 to 2.0), i.e., four domains with incrementally larger coverage were defined. The results in Figure 2 show:

  1. For all tissue types, model performance tends to improve as the scope of the AD shrinks (decreasing Z).

  2. As the AD shrinks, a large amount of data is dropped, which may make the resulting model unstable (increasing standard deviation).

When switching from Z = 1.0 to Z = 0.5, a large amount of data is dropped, which leads to a substantial change in the resulting model. We want to find a balance between data coverage and model performance, i.e., to include as much data as possible while reducing the noisy data or outliers that degrade performance. Because the AD models using Z = 0.5 performed unstably, we selected 1.0 as the default Z value.

Overall, the optimal models with Z = 1.0 achieve an average MSE of 0.24 (-log M) and an average PCC of 0.74.
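
The two reported metrics, mean squared error (MSE) and Pearson correlation coefficient (PCC), can be computed as in the sketch below. The activity values are made-up toy numbers on a -log M scale and do not come from the source.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean squared error between measured and predicted activities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)

# Toy values (hypothetical), in -log M units.
measured = [4.0, 5.0, 6.0]
predicted = [4.5, 5.0, 5.5]

print(round(mse(measured, predicted), 4))
print(round(pcc(measured, predicted), 4))
```

The toy example shows why both metrics are reported: the predictions are perfectly correlated with the measurements yet still carry a nonzero error, because PCC ignores scale and offset while MSE does not.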

Figure 2. Multi-task models with different AD cutoffs.