acppred

Screening anticancer peptides from amino acid sequence data


Keywords
anticancer, peptide, sequence, data, deep, learning
Install
pip install acppred==0.0.2

Documentation

Contrastive learning for enhancing feature extraction in anticancer peptides

This is a deep learning model that screens anticancer peptides (ACPs) from peptide sequences alone. A contrastive learning technique is applied to enhance model performance, yielding better results than a model trained solely on a binary classification loss. In place of the data augmentation commonly used in contrastive learning, two independent encoders are employed.
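
The idea of contrasting the outputs of two independent encoders can be sketched in plain Python. This is only an illustration, assuming a generic InfoNCE-style loss with cosine similarity; the exact contrastive loss used by this model is not stated here, and the names `info_nce`, `encoder_a_out` and `encoder_b_out` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a, views_b, temperature=0.1):
    """Mean InfoNCE-style loss (assumed form, not the authors' exact loss).

    views_a[i] and views_b[i] are embeddings of the same peptide produced
    by two different encoders; every other pairing counts as a negative.
    """
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for the outputs of the two encoders.
encoder_a_out = [[1.0, 0.0], [0.0, 1.0]]
encoder_b_out = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(encoder_a_out, encoder_b_out)        # matching pairs agree
swapped = info_nce(encoder_a_out, encoder_b_out[::-1])  # matching pairs disagree
print(aligned < swapped)
```

The loss is small when both encoders place the same peptide close together and large when they do not, which is the signal a data-augmentation pair would otherwise provide.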

Dependencies

  • pytorch>=2.0.1
  • numpy>=1.25.2
  • biopython

Datasets

The six benchmark datasets used for model training were obtained from ACPred-LAF:

  • ACP-Mixed-80
  • ACP2.0 main
  • ACP2.0 alternative
  • ACP500+ACP164
  • ACP500+ACP2710
  • LEE+Independent

For more detailed information, refer to this research article.

Model training

Use the following command to start model training:

python train.py --model_info {model_info} --batch_size {batch_size} --dropout_rate {dropout_rate}
                --lr {learning_rate} --epoch {maximum_training_epochs} --dataset {benchmark_dataset}
                --alpha {alpha} --beta {beta} --temp {temperature} --gpu {gpu_number}
  • model_info: Choose an encoder architecture from the ./model/model_params directory for model training. For example, --model_info ./model/model_params/cnn1.json.

  • batch_size: Batch size used during model training

  • dropout_rate: Dropout rate applied during model training

  • learning_rate: Learning rate set for model training.

  • maximum_training_epochs: Maximum number of training epochs.

  • benchmark_dataset: Select one dataset from the six available benchmark datasets for model training.

    • Options
      • ACP_Mixed_80: ACP-Mixed-80 dataset
      • ACP2_main: ACP2.0 main dataset
      • ACP2_alter: ACP2.0 alternative dataset
      • ACP500_ACP164: ACP500+ACP164 dataset
      • ACP500_ACP2710: ACP500+ACP2710 dataset
      • LEE_Indep: LEE+Independent dataset
  • alpha: Adjusts the balance between cross-entropy and contrastive loss components. Range: 0.0 to 1.0.

  • beta: Balances the two types of cross-entropy losses (cross-entropy loss 1 and 2).

    • Options
      • 0: Only cross-entropy loss 1 is used for model training.
      • 0.5: Both cross-entropy loss 1 and 2 are used for model training.
      • 1: Only cross-entropy loss 2 is used for model training.
  • temperature: Temperature parameter in contrastive loss calculation.

  • gpu: GPU number to be used for model training, as identified by the nvidia-smi command. Use -1 for CPU training.
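
One way to read the alpha and beta options is as weights in a combined loss. The sketch below is an assumption, not the code in train.py: it treats beta as a linear interpolation between the two cross-entropy losses (matching the documented behaviour at 0, 0.5 and 1) and assumes that alpha weights the cross-entropy term while 1 - alpha weights the contrastive term. The actual direction of alpha may be the opposite, and the function name `combined_loss` is hypothetical.

```python
def combined_loss(ce1, ce2, contrastive, alpha, beta):
    """Combine scalar loss values under the assumptions stated above."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0.0, 1.0]")
    # beta = 0 keeps only cross-entropy loss 1, beta = 1 only loss 2.
    cross_entropy = (1.0 - beta) * ce1 + beta * ce2
    # ASSUMPTION: alpha scales the cross-entropy part, (1 - alpha) the
    # contrastive part. The real script may define alpha the other way round.
    return alpha * cross_entropy + (1.0 - alpha) * contrastive


# Example with made-up loss values.
print(combined_loss(ce1=2.0, ce2=4.0, contrastive=10.0, alpha=0.5, beta=0.5))
```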

Inference (Predicting ACPs)

To predict Anticancer Peptides (ACPs) using only peptide sequences, prepare your peptide sequence list in the FASTA format. For more detailed information about the FASTA format, refer to this link.
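
As a rough illustration of the expected input, the snippet below writes a small FASTA file and reads it back using only the standard library. The peptide names and sequences are placeholders, not known ACPs, and the helper names `write_fasta` and `read_fasta` are hypothetical. Since biopython is listed as a dependency, `Bio.SeqIO.parse(handle, "fasta")` is an alternative way to read such a file.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs as '>name' header lines followed by the sequence."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        lines.append(sequence)
    Path(path).write_text("\n".join(lines) + "\n")


def read_fasta(path):
    """Read a FASTA file into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Placeholder peptides for illustration only.
records = [("peptide_1", "ACDEFGHIKLMNPQRSTVWY"), ("peptide_2", "KKLLKKLLKK")]

with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = Path(tmp_dir) / "peptides.fasta"
    write_fasta(records, fasta_path)
    loaded = read_fasta(fasta_path)

print(loaded)
```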

Use the following command to run the inference:

python inf.py --batch_size {batch_size} --model_type {model_type}
              --device {device} --output {output_file}
  • batch_size: The batch size used during inference
  • model_type: Specifies the type of optimized model. There are six optimized models available for predicting ACPs, each trained on one of the six benchmark datasets. The default recommended option is ACP_Mixed_80.
    • Options
      • ACP_Mixed_80: The optimized model that was trained using the ACP-Mixed-80 benchmark dataset.
      • ACP2_main: The optimized model that was trained using the ACP2.0 main benchmark dataset.
      • ACP2_alter: The optimized model that was trained using the ACP2.0 alternative benchmark dataset.
      • ACP500_ACP164: The optimized model that was trained using the ACP500+ACP164 benchmark dataset.
      • ACP500_ACP2710: The optimized model that was trained using the ACP500+ACP2710 benchmark dataset.
      • LEE_Indep: The optimized model that was trained using the LEE+Independent benchmark dataset.
  • device: The device used for predicting ACPs
    • Options
      • cpu
      • gpu
  • output_file: The file where prediction results will be saved.
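
When inference is launched from another script, the options above can be checked before the command is assembled. The helper below is a sketch that uses only the flags and option values listed above; the function name `build_inference_command`, the default batch size of 32 and the output file name are illustrative assumptions.

```python
MODEL_TYPES = (
    "ACP_Mixed_80",
    "ACP2_main",
    "ACP2_alter",
    "ACP500_ACP164",
    "ACP500_ACP2710",
    "LEE_Indep",
)
DEVICES = ("cpu", "gpu")


def build_inference_command(output_file, model_type="ACP_Mixed_80",
                            device="cpu", batch_size=32):
    """Return the inf.py argument list after validating the documented options."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device!r}")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        "python", "inf.py",
        "--batch_size", str(batch_size),
        "--model_type", model_type,
        "--device", device,
        "--output", str(output_file),
    ]


# The list can be passed to subprocess.run from the directory containing inf.py.
command = build_inference_command("predictions.txt", model_type="LEE_Indep")
print(" ".join(command))
```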

Note: Because the maximum peptide sequence length differs across the benchmark datasets, each model type restricts the maximum length of its input peptide sequences.

Model Type        Maximum Number of Amino Acid Residues
ACP2_main         50
ACP2_alter        50
LEE_Indep         95
ACP500_ACP164     206
ACP500_ACP2710    206
ACP_Mixed_80      207
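
The limits in the table can be applied as a pre-check before running inference. The sketch below copies the values from the table; the helper name `split_by_length` and the example peptides are hypothetical, and how inf.py itself handles over-length sequences is not covered here.

```python
# Maximum number of amino acid residues per model type (from the table above).
MAX_RESIDUES = {
    "ACP2_main": 50,
    "ACP2_alter": 50,
    "LEE_Indep": 95,
    "ACP500_ACP164": 206,
    "ACP500_ACP2710": 206,
    "ACP_Mixed_80": 207,
}


def split_by_length(records, model_type):
    """Split (name, sequence) pairs into those within and over the length limit."""
    limit = MAX_RESIDUES[model_type]
    accepted = [(name, seq) for name, seq in records if len(seq) <= limit]
    rejected = [(name, seq) for name, seq in records if len(seq) > limit]
    return accepted, rejected


# Placeholder peptides: one short, one with 60 residues.
peptides = [("short", "KKLLKKLLKK"), ("long", "A" * 60)]

accepted, rejected = split_by_length(peptides, "ACP2_main")
print([name for name, _ in accepted], [name for name, _ in rejected])
```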