Citation
J. Chem. Inf. Model. 2021, 61, 8, 3789-3803.
Submitted, 2022.
What are Anticancer Peptides (ACPs)?
Cancer is one of the leading causes of death worldwide. Conventional cancer treatment relies on radiotherapy and chemotherapy, but both cause severe side effects because they damage normal cells as well as cancer cells. Anticancer peptides (ACPs) are a promising alternative: therapeutic agents that are efficient and selective against tumor cells. Several mechanisms of action are known. ACPs can disrupt the membranes of cancer cells. They can penetrate the mitochondria, causing release of cytochrome c and apoptosis. They may also target certain membrane receptors, modulating signal transduction and the cell cycle.
AcPEP: Method to Classify ACPs and non-ACPs
The development of the ACP classifier is presented in the figure and consists of three steps. Feature extraction (step 1): each sample sequence was converted into numerical feature vectors from 32 feature groups using iFeature. Initial feature group selection (step 2): the feature groups were evaluated for target relevance by comparing the performance of their Random Forest models. Model selection (step 3): the features of the best-performing feature groups were concatenated and evaluated extensively with combinations of feature preprocessing, feature selection and learning algorithms to build prediction models. As shown in the figure, there were 8 feature preprocessing methods plus the raw (unprocessed) option, 6 feature selection methods plus the raw (no selection) option, and 5 learning algorithms. In total, 9 × 7 × 5 = 315 different model training procedures were compared by 10-fold cross-validation to identify the optimal one.
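The size of the search space in step 3 can be reproduced with a short enumeration. This is a minimal sketch, not the authors' code: the text gives only the counts of methods, so the names below (`scaler_1`, `selector_1`, `learner_1`, and the helper `enumerate_procedures`) are hypothetical placeholders. The `scaler:selector:learner` label format follows the model names used on this page.

```python
from itertools import product

# Placeholder labels: the source states only how many methods were tried,
# not which ones, so these names are hypothetical.
scalers = ["raw"] + [f"scaler_{i}" for i in range(1, 9)]      # 8 methods + raw
selectors = ["raw"] + [f"selector_{i}" for i in range(1, 7)]  # 6 methods + raw
learners = [f"learner_{i}" for i in range(1, 6)]              # 5 algorithms


def enumerate_procedures(scalers, selectors, learners):
    """Return one 'scaler:selector:learner' label per training procedure."""
    return [f"{s}:{f}:{m}" for s, f, m in product(scalers, selectors, learners)]


procedures = enumerate_procedures(scalers, selectors, learners)
print(len(procedures))  # 9 * 7 * 5 = 315
```

Each label would then correspond to one pipeline that is scored by 10-fold cross-validation.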
Model Performance
In 10-fold cross-validation (CV) over different combinations of feature scaling, feature selection, and machine learning algorithms, the top 5 models were all based on SVC. As shown in Table 1, the CV accuracy of each of these models is within one standard deviation of that of the top-ranked model. Specifically, Standard:RFE-SVC:SVC outperformed the others, while Standard:ANOVA:SVC achieved similar performance with less than half the dimensionality (742 versus 1601 features). On an independent test set, our models were compared with 11 previously published ACP classification methods and showed a 1-2% improvement in accuracy (Figure 2). Expecting the lower-dimensional model to generalize better, we chose Standard:ANOVA:SVC as our final model and named it AcPEP.
Scaler | Selection | Model | Dims | ACC | MCC | AUCROC | F1 | SEN | SPC | PRC | CV(mean) | CV(std)
Standard | RFE-SVC | SVC | 1601 | 0.9471 | 0.8983 | 0.9934 | 0.9445 | 0.9 | 0.9943 | 0.9937 | 0.9471 | 0.0205
Power | RFE-SVC | SVC | 1601 | 0.9407 | 0.885 | 0.9924 | 0.9379 | 0.8957 | 0.9857 | 0.9843 | 0.9407 | 0.0181
Standard | ANOVA | SVC | 742 | 0.9307 | 0.8635 | 0.9783 | 0.9282 | 0.8957 | 0.9657 | 0.9631 | 0.9307 | 0.0128
Standard | RF | SVC | 1601 | 0.9279 | 0.8584 | 0.9816 | 0.9249 | 0.8886 | 0.9671 | 0.9643 | 0.9279 | 0.0203
Power | RF | SVC | 1601 | 0.9271 | 0.857 | 0.9786 | 0.9241 | 0.8871 | 0.9671 | 0.9643 | 0.9271 | 0.0188
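The one-standard-deviation statement can be checked directly from the CV(mean) and CV(std) columns of the table. The sketch below copies those values and tests each model against the lower bound given by the best model's mean minus its standard deviation. The helper name `within_one_std` is an assumption introduced for illustration.

```python
# CV(mean) and CV(std) copied from the table, keyed as Scaler:Selection:Model.
cv_results = {
    "Standard:RFE-SVC:SVC": (0.9471, 0.0205),
    "Power:RFE-SVC:SVC": (0.9407, 0.0181),
    "Standard:ANOVA:SVC": (0.9307, 0.0128),
    "Standard:RF:SVC": (0.9279, 0.0203),
    "Power:RF:SVC": (0.9271, 0.0188),
}


def within_one_std(results):
    """Flag models whose CV mean is at least (best mean - best std)."""
    best = max(results, key=lambda name: results[name][0])
    best_mean, best_std = results[best]
    lower_bound = best_mean - best_std
    flags = {name: mean >= lower_bound for name, (mean, _) in results.items()}
    return best, lower_bound, flags


best, lower_bound, flags = within_one_std(cv_results)
for name, ok in flags.items():
    print(f"{name}: within one std of {best}? {ok}")
```

With these numbers the lower bound is about 0.9266, so all five models pass, including the 742-dimensional Standard:ANOVA:SVC.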
xDeep-AcPEP: Method to Predict the Biological Activity of ACPs against Cancers
xDeep-AcPEP is a novel regression method based on a convolutional neural network and multi-task learning that predicts the bioactivity of anticancer peptides. A set of cancer-specific models was trained on the CancerPPD data sets to predict activity against six tumor types: breast, colon, cervix, lung, skin, and prostate.
As shown in the workflow figure (Figure 1), we chose four descriptors to encode each sequence in numerical form: AAINDEX (AAI), BLOSUM62 (BLO), Z-scale descriptor (ZSC) and binary profile (BIN). The encoder contains two 1D convolutional layers with ReLU, two average pooling layers, two batch normalization layers and one max pooling layer. The regressor contains three fully connected layers with one final output neuron. We define the applicability domain (AD) of each model so that the uncertainty of the prediction for an unknown instance can be estimated. The Euclidean distance between an instance and the centroid of the training data in the feature space is measured. If this distance is within a predefined cutoff (Z), the prediction can be made with confidence.
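The applicability domain check can be illustrated with a small distance-to-centroid test. This is a sketch under stated assumptions, not the published implementation: the text does not say how Z is converted into a distance threshold, so the code assumes the threshold is the mean training distance plus Z times its standard deviation. The function names and the toy two-dimensional feature vectors are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_domain(train, z=1.0):
    """Return (centroid, cutoff) for the training set.

    ASSUMPTION: cutoff = mean + z * std of the training distances to the
    centroid. The source only states that a predefined cutoff Z is used.
    """
    c = centroid(train)
    dists = [euclidean(row, c) for row in train]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return c, mean + z * std


def in_domain(x, c, cutoff):
    """True if x lies within the cutoff distance of the centroid."""
    return euclidean(x, c) <= cutoff


# Toy feature vectors (hypothetical, for illustration only).
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]]
c, cutoff = fit_domain(train, z=1.0)
print(in_domain([1.5, 1.5], c, cutoff))  # close to the centroid
print(in_domain([5.0, 5.0], c, cutoff))  # far from the training data
```

Under this assumption a smaller Z gives a tighter domain, so fewer instances receive a prediction.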
Model Performance
Using repeated five-fold cross-validation, we assessed the performance of our models over a range of AD cutoffs (Z = 0.5 to 2.0), i.e. four domains with increasing coverage were defined. The results in Figure 2 show:
For all tissue types, model performance tends to improve as the scope of the AD shrinks (decreasing Z).
As the AD shrinks, a large amount of data is dropped, which may make the resulting model unstable (increasing standard deviation).
Going from Z = 1.0 to Z = 0.5, so much data is dropped that the resulting model changes substantially. We want to balance data coverage and model performance, i.e. include as much data as possible while reducing the noisy data or outliers that degrade performance. Because the AD models with Z = 0.5 performed unstably, we selected 1.0 as the default Z value.
Overall, the optimal models with the AD cutoff Z = 1.0 achieve an average MSE of 0.24 (-log M) and an average PCC of 0.74.
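For reference, the two reported metrics, mean squared error (MSE) and the Pearson correlation coefficient (PCC), can be computed as follows. This is a plain-Python sketch of the standard definitions. The activity values in the example are made up for illustration and are not taken from CancerPPD.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted activities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical activities in -log M units, for illustration only.
observed = [4.2, 5.1, 4.8, 5.6, 4.0]
predicted = [4.5, 4.9, 5.0, 5.2, 4.3]
print(f"MSE = {mse(observed, predicted):.3f}")
print(f"PCC = {pcc(observed, predicted):.3f}")
```

A lower MSE and a PCC closer to 1 both indicate better agreement between predicted and measured activity.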