DeepPurpose documentation!¶
Welcome! This is the documentation for DeepPurpose, a deep learning based drug repurposing and virtual screening toolkit built on PyTorch. It is designed to be easy to use: non-computational domain researchers can obtain a list of potential drugs with a single line of code, while method researchers get a flexible framework (fewer than 10 lines of code) and baselines for deep learning research on this topic. The GitHub repository is located at https://github.com/kexinhuang12345/DeepPurpose.
Features of DeepPurpose¶
Features¶
- For non-computational researchers: ONE line of code takes raw data to a drug repurposing/virtual screening result, so that wet-lab biochemists can leverage the power of deep learning. The result is ensembled from five pretrained deep learning models.
- For computational researchers: 15+ encodings for drugs and proteins, ranging from deep neural networks on classic cheminformatics fingerprints to CNNs, transformers, and message passing graph neural networks, giving 50+ combined models. Most of these encoding combinations have not yet appeared in existing work. All of this takes fewer than 10 lines of code while remaining flexible, and switching encodings is as simple as changing the encoding names.
- Realistic and user-friendly design:
  - automatic identification of the task: drug-target binding affinity (regression) or drug-target interaction prediction (binary).
  - support for cold target and cold drug settings for robust model evaluation, and for single-target high throughput sequencing assay data setups.
  - many dataset loading/downloading/unzipping scripts to ease tedious preprocessing, including antiviral drugs, COVID-19 targets, BindingDB, DAVIS, KIBA, …
  - many pretrained checkpoints.
  - easy monitoring of the training process with detailed training metrics output, such as test set figures (AUCs) and tables; early stopping is also supported.
  - detailed output records, such as a ranked list for the repurposing result.
  - various evaluation metrics: ROC-AUC, PR-AUC, and F1 for binary tasks; MSE, R-squared, and Concordance Index for regression tasks.
  - label unit conversion for skewed label distributions such as Kd.
  - time reference for computationally expensive encodings.
  - PyTorch based, with support for CPU, GPU, and multi-GPU.
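Of the regression metrics listed above, the Concordance Index is probably the least familiar. The following is a minimal pure-Python sketch of the usual pairwise definition. It is not DeepPurpose's own implementation; the function name `concordance_index` and the toy values are assumptions chosen for illustration.

```python
from itertools import combinations

def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true labels are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # no defined order for this pair
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (t1 < t2) == (p1 < p2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy binding scores (hypothetical values): one of the six pairs is mis-ordered.
y_true = [5.0, 6.2, 7.1, 8.3]
y_pred = [5.1, 6.0, 8.0, 7.5]
ci = concordance_index(y_true, y_pred)
print(ci)
```

A value of 1.0 means every pair is ranked correctly, and 0.5 is what random predictions would give on average. This quadratic loop is fine for small sets; large test sets would call for a faster implementation.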
What is drug repurposing, virtual screening and drug-target interaction prediction?¶
Drug Repurposing¶
Drug repurposing aims to find a new therapeutic use for an existing drug.
Virtual Screening¶
Virtual screening uses computer software to automatically screen a huge space of potential drug-target pairs and obtain a predicted binding score for each pair.
Drug-Target Interaction¶
Both of these tasks can save cost and time and facilitate drug discovery. Deep learning has shown strong performance in repurposing and screening, and both rely on the accurate and fast prediction of a fundamental task: drug-target interaction (DTI) prediction. The DTI prediction task aims to predict the interaction probability or binding score of an input drug-target pair. Given a powerful DTI model that generalizes to a new, unseen dataset, we can extend it to repurposing and screening. For repurposing, given a new target of interest, we first pair it with every drug in a repurposing drug library. This list of drug-target pairs is then fed into the trained DTI model, which outputs the predicted binding scores. Similarly, for virtual screening, given a list of drug-target pairs to screen, the DTI model outputs the predicted binding scores. We can then rank the pairs by their predicted binding scores and, after manual inspection, test the top-k options in the wet lab. DeepPurpose automates this process: with only one line of code, it aggregates five pretrained deep learning models and returns a ranked list of potential outcomes.
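The aggregate-then-rank step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function `rank_candidates`, the drug names, and the scores are hypothetical, a plain mean is used as the aggregation rule, and a lower score is taken to mean stronger binding (as for a Kd-like value). DeepPurpose's actual aggregation may differ.

```python
from statistics import mean

def rank_candidates(drug_names, per_model_scores, top_k=3):
    """Average each drug's score over all models and return the top_k drugs.

    per_model_scores holds one list of scores per model, aligned with drug_names.
    Lower score = stronger predicted binding in this sketch.
    """
    # zip(*...) regroups the scores so that each tuple belongs to one drug.
    averaged = [mean(scores) for scores in zip(*per_model_scores)]
    ranked = sorted(zip(drug_names, averaged), key=lambda pair: pair[1])
    return ranked[:top_k]

# Hypothetical predictions from three models for four drugs against one target.
drug_names = ["drug_A", "drug_B", "drug_C", "drug_D"]
per_model_scores = [
    [120.0, 15.0, 300.0, 42.0],
    [100.0, 25.0, 280.0, 38.0],
    [110.0, 20.0, 320.0, 40.0],
]
top = rank_candidates(drug_names, per_model_scores, top_k=2)
print(top)
```

The ranked list is what a wet-lab researcher would inspect manually before choosing candidates to test.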
Identifying drug-target interactions (DTI) greatly narrows the search space of candidate medications and thus can play a pivotal role in drug discovery. Drugs usually interact with one or more proteins to achieve their functions. Discovering novel interactions between drugs and target proteins is therefore crucial for the development of new drugs, since the aberrant expression of proteins may cause side effects of drugs.
Considering that in vitro experiments are extremely costly and time-consuming, high efficiency computational prediction methods could serve as promising strategies for drug-target interaction (DTI) prediction. In this project, our goal is to focus on deep learning approaches for drug-target interaction (DTI) prediction.
Download Code & Install¶
Download Code¶
$ # Download the code repository
$ git clone https://github.com/kexinhuang12345/DeepPurpose.git
$ # Change directory to DeepPurpose
$ cd DeepPurpose
First time usage: setup conda environment¶
$ # Build a virtual environment with all packages installed using conda
$ conda env create -f environment.yml
$ # Activate the conda environment
$ conda activate DeepPurpose
$ # Exit the environment when done
$ conda deactivate
Second time and later¶
$ # Activate the conda environment
$ conda activate DeepPurpose
$ # Exit the environment when done
$ conda deactivate
Case Study¶
- 1a. Antiviral Drugs Repurposing for SARS-CoV2 3CLPro, using One Line.
Given a new target sequence (e.g. SARS-CoV2 3CL Protease), retrieve a list of repurposing drugs from a curated library of 81 antiviral drugs. The binding score is the Kd value. Results are aggregated from five models pretrained on the BindingDB dataset.
from DeepPurpose import oneliner
from DeepPurpose.dataset import *
oneliner.repurpose(*load_SARS_CoV2_Protease_3CL(), *load_antiviral_drugs())
- 1b. New Target Repurposing using Broad Drug Repurposing Hub, with One Line.
Given a new target sequence (e.g. MMP9), retrieve a list of repurposing drugs from the Broad Drug Repurposing Hub, which is the default library. Results are also aggregated from five pretrained models. Note that the drug name here is the PubChem CID, since some drug names in Broad are too long.
from DeepPurpose import oneliner
from DeepPurpose.dataset import *
oneliner.repurpose(*load_MMP9())
- 2. Repurposing using Customized training data, with One Line.
Given a new target sequence (e.g. SARS-CoV 3CL Pro), train on new data (AID1706 Bioassay) and then retrieve a list of repurposing drugs from a proprietary library (e.g. antiviral drugs). The model can be trained from scratch or finetuned from a pretrained checkpoint.
from DeepPurpose import oneliner
from DeepPurpose.dataset import *
oneliner.repurpose(*load_SARS_CoV_Protease_3CL(), *load_antiviral_drugs(no_cid = True), *load_AID1706_SARS_CoV_3CL(),
                   split='HTS', convert_y = False, frac=[0.8,0.1,0.1], pretrained = False, agg = 'max_effect')
- 3. A Framework for Drug Target Interaction Prediction, with fewer than 10 lines of code.
This is what happens under the hood when one model is trained from scratch; it gives method researchers a flexible framework:
from DeepPurpose import models
from DeepPurpose.utils import *
from DeepPurpose.dataset import *
# Load data: an array of SMILES strings for the drugs,
# an array of amino acid sequences for the targets,
# and an array of binding values or 0/1 labels,
# e.g. ['Cc1ccc(CNS(=O)(=O)c2ccc(s2)S(N)(=O)=O)cc1', ...],
#      ['MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTH...', ...],
#      [0.46, 0.49, ...]
# In this example, BindingDB with the Kd binding score is used.
X_drug, X_target, y = process_BindingDB(download_BindingDB(SAVE_PATH),
                                        y = 'Kd',
                                        binary = False,
                                        convert_to_log = True)
# Type in the encoding names for drug/protein.
drug_encoding, target_encoding = 'MPNN', 'Transformer'
# Data processing; here we select the cold protein split setup.
train, val, test = data_process(X_drug, X_target, y,
                                drug_encoding, target_encoding,
                                split_method='cold_protein',
                                frac=[0.7,0.1,0.2])
# Generate a new model using default parameters;
# model tuning is also possible via input parameters.
config = generate_config(drug_encoding, target_encoding,
                         transformer_n_layer_target = 8)
net = models.model_initialize(**config)
# Train the new model.
# Detailed output, including a tidy table storing
# validation loss and metrics, AUC curve figures, etc.,
# is stored in the ./result folder.
net.train(train, val, test)
# Or simply load a pretrained model from a model directory path
# or a reproduced model name such as DeepDTA.
net = models.model_pretrained(MODEL_PATH_DIR or MODEL_NAME)
# Repurpose using the trained model or pretrained model.
# In this example, the repurposing dataset is loaded from the
# Broad Repurposing Hub, with SARS-CoV 3CL Protease as the target.
X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)
target, target_name = load_SARS_CoV_Protease_3CL()
_ = models.repurpose(X_repurpose, target, net, drug_name, target_name)
# Virtual screening using the trained model or pretrained model.
X_repurpose, drug_name, target, target_name = \
    ['CCCCCCCOc1cccc(c1)C([O-])=O', ...], ['16007391', ...], \
    ['MLARRKPVLPALTINPTIAEGPSPTSEGASEANLVDLQKKLEEL...', ...], \
    ['P36896', 'P00374']
_ = models.virtual_screening(X_repurpose, target, net, drug_name, target_name)
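The framework example above selects a cold protein split. The idea behind such a split is that no target sequence seen during training appears in the validation or test folds. The following standalone sketch illustrates the idea; the function `cold_protein_split`, its rounding rule, and the toy data are assumptions for illustration and are not DeepPurpose's `data_process` implementation.

```python
import random

def cold_protein_split(drugs, targets, labels, frac=(0.7, 0.1, 0.2), seed=1):
    """Split samples so that no target sequence appears in more than one fold."""
    unique_targets = sorted(set(targets))
    rng = random.Random(seed)
    rng.shuffle(unique_targets)

    # The fractions apply to the number of distinct targets, not to samples.
    n = len(unique_targets)
    n_train = round(frac[0] * n)
    n_val = round(frac[1] * n)
    fold_of = {}
    for i, t in enumerate(unique_targets):
        if i < n_train:
            fold_of[t] = "train"
        elif i < n_train + n_val:
            fold_of[t] = "val"
        else:
            fold_of[t] = "test"

    # Every sample follows its target into that target's fold.
    folds = {"train": [], "val": [], "test": []}
    for d, t, y in zip(drugs, targets, labels):
        folds[fold_of[t]].append((d, t, y))
    return folds["train"], folds["val"], folds["test"]

# Toy data: 50 samples spread over 10 placeholder target sequences.
targets = [f"SEQ{i % 10}" for i in range(50)]
drugs = [f"SMILES{i}" for i in range(50)]
labels = [float(i) for i in range(50)]
train, val, test = cold_protein_split(drugs, targets, labels)
print(len(train), len(val), len(test))
```

A cold drug split follows the same pattern with the roles of drugs and targets swapped.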
- Virtual Screening with Customized Training Data with One Line
Given a list of new drug-target pairs to be screened, retrieve a list of drug-target pairs with top predicted binding scores.
from DeepPurpose import oneliner
oneliner.virtual_screening(['MKK...LIDL', ...], ['CC1=C...C4)N', ...])
DeepPurpose.models¶
Classifier¶
class DeepPurpose.models.Classifier(nn.Sequential)
Classifier (Source) makes the prediction for DBTA; it serves as a basic component of the DBTA class.
The constructor creates an instance of Classifier.
__init__(self, model_drug, model_protein, **config)
- model_drug (DeepPurpose.models.XX) - Encoder model for the drug. XX can be “transformer”, “MPNN”, “CNN”, “CNN_RNN”, …
- model_protein (DeepPurpose.models.XX) - Encoder model for the protein. XX can be “transformer”, “CNN”, “CNN_RNN”, …
- config (kwargs, keyword arguments) - parameters of the classifier.
Calling the instance runs the feedforward procedure of Classifier.
forward(self, v_D, v_P)
- v_D (many types) - input feature for the drug encoder model; its type depends on the encoder in use, e.g. “DeepPurpose.models.transformer”, “DeepPurpose.models.CNN”, “DeepPurpose.models.CNN_RNN”, “DeepPurpose.models.MPNN”.
- v_P (many types) - input feature for the protein encoder model; its type depends on the encoder in use, e.g. “DeepPurpose.models.transformer”, “DeepPurpose.models.CNN”, “DeepPurpose.models.CNN_RNN”.
Drug Target Binding Affinity (DTBA) Model¶
class DeepPurpose.models.DBTA
Drug Target Binding Affinity (DBTA) (Source) includes all components: the drug encoder, the target encoder, and the classifier/regressor.
The constructor creates an instance of DBTA.
__init__(self, **config)
- config (kwargs, keyword arguments) - parameters of DBTA.
- drug_encoding (str) - Encoder mode for the drug. It can be “transformer”, “MPNN”, “CNN”, “CNN_RNN”, …
- target_encoding (str) - Encoder mode for the protein. It can be “transformer”, “CNN”, “CNN_RNN”, …
- result_folder (str) - directory that stores the learning log/results.
- concrete parameters for the encoder models (repeated)
test_ includes the whole test procedure.
test_(self, data_generator, model, repurposing_mode = False, test = False)
- data_generator (iterator) - iterator of torch.utils.data.DataLoader. It can be test data or validation data.
- model (DeepPurpose.models.Classifier) - model of DBTA.
- repurposing_mode (bool) - If repurposing_mode is True, run repurposing. Otherwise, compute the accuracy metrics (including AUC score).
- test (bool) - If test is True, plot the ROC-AUC and PR-AUC curves. Otherwise, skip plotting.
train includes the whole training procedure.
train(self, train, val, test = None, verbose = True)
- train (torch.utils.data.dataloader) - training data loader.
- val (torch.utils.data.dataloader) - validation data loader.
- test (torch.utils.data.dataloader) - test data loader.
- verbose (bool) - If verbose is True, print the training record every 100 iterations.
predict includes the whole inference procedure.
predict(self, df_data)
- df_data (pd.DataFrame) - the data to predict on.
save_model saves the trained model to a specific directory.
save_model(self, path_dir)
- path_dir (str, a directory) - the path where the model is saved.
load_pretrained loads a trained model so that we can run inference directly without training a model from scratch.
load_pretrained(self, path)
- path (str, a directory) - the path from which the model is loaded.
Transformer¶
class DeepPurpose.models.transformer(nn.Sequential)
Transformer (Source) can be used to encode both drugs (SMILES) and proteins (amino acid sequences).
The constructor creates an instance of Transformer.
__init__(self, encoding, **config)
- encoding (string, “drug” or “protein”) - specify the input type of the model, “drug” or “protein”.
- config (kwargs, keyword arguments) - parameters of the transformer. The keys include
- transformer_dropout_rate (float) - dropout rate of the transformer.
- input_dim_drug (int) - input dimension when encoding drug.
- transformer_emb_size_drug (int) - dimension of the embedding in the input layer when encoding drug.
- transformer_n_layer_drug (int) - number of layers in the transformer when encoding drug.
- todo
Calling the instance runs the feedforward procedure of Transformer.
forward(self, v)
- v (tuple of length 2) - input feature of the transformer. v[0] (np.array) is the index of atoms. v[1] (np.array) is the corresponding mask.
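To make the (index, mask) input pair above more concrete, here is a minimal sketch of how a token sequence could be turned into padded indices and a matching mask. It is an illustration only: the function `encode_with_mask`, the character-level vocabulary, and the maximum length are all assumptions, and the tokenization DeepPurpose actually uses for its Transformer encoder differs.

```python
def encode_with_mask(tokens, vocab, max_len):
    """Map tokens to integer indices, then pad or truncate to max_len.

    Returns (indices, mask), where mask is 1 for real tokens and 0 for padding.
    Index 0 is reserved for padding and unknown tokens in this sketch.
    """
    indices = [vocab.get(tok, 0) for tok in tokens][:max_len]
    n_real = len(indices)
    indices = indices + [0] * (max_len - n_real)
    mask = [1] * n_real + [0] * (max_len - n_real)
    return indices, mask

# Hypothetical character-level vocabulary for a short SMILES string.
vocab = {"C": 1, "N": 2, "O": 3, "=": 4, "(": 5, ")": 6}
indices, mask = encode_with_mask(list("CC(=O)N"), vocab, max_len=10)
print(indices)  # [1, 1, 5, 4, 3, 6, 2, 0, 0, 0]
print(mask)     # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The mask lets the attention layers ignore padded positions, so sequences of different lengths can share one batch.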
Message Passing Neural Network (MPNN)¶
class DeepPurpose.models.MPNN(nn.Sequential)
Message Passing Neural Network (MPNN) (Source) encodes a drug in its graph representation.
The constructor creates an instance of the MPNN class.
__init__(self, mpnn_hidden_size, mpnn_depth)
- mpnn_hidden_size (int) - dimension of the hidden layer in MPNN, e.g., mpnn_hidden_size = 256.
- mpnn_depth (int) - depth of MPNN, e.g., mpnn_depth = 3.
Calling the instance runs the feedforward procedure of MPNN.
forward(self, feature)
- feature (tuple of length 5)
  - feature[0] (torch.Tensor) - atom-level features
  - feature[1] (torch.Tensor) - bond-level features
  - feature[2] (torch.Tensor) - neighbor information of every atom
  - feature[3] (torch.Tensor) - neighbor information of every bond
  - feature[4] (torch.Tensor) - number of atoms and bonds for each molecule in a batch
CNN+RNN¶
class DeepPurpose.models.CNN_RNN(nn.Sequential)
CNN_RNN (Source) is a GRU/LSTM on top of a CNN, applied to a drug SMILES string or a target sequence.
The constructor creates an instance of CNN_RNN.
__init__(self, encoding, **config)
- encoding (string, “drug” or “protein”) - specify the input type, “drug” or “protein”.
- config (kwargs, keyword arguments) - parameters of CNN_RNN. The keys include
- cnn_drug_filters (list, each element is int) - number of filters of each CNN layer when encoding drug, e.g., cnn_drug_filters = [32,64,96].
- cnn_drug_kernels (list, each element is int) - kernel size of each CNN layer when encoding drug, e.g., cnn_drug_kernels = [4,6,8].
- rnn_drug_hid_dim (int) - hidden dimension of the RNN when encoding drug, e.g., rnn_drug_hid_dim = 64.
- rnn_drug_n_layers (int) - number of layers in the RNN when encoding drug, e.g., rnn_drug_n_layers = 2.
- rnn_drug_bidirectional (bool) - whether the RNN is bidirectional when encoding drug, e.g., rnn_drug_bidirectional = True.
- hidden_dim_drug (int) - hidden dimension when encoding drug, e.g., hidden_dim_drug = 256.
- cnn_target_filters (list, each element is int) - number of filters of each CNN layer when encoding protein, e.g., cnn_target_filters = [32,64,96].
- cnn_target_kernels (list, each element is int) - kernel size of each CNN layer when encoding protein, e.g., cnn_target_kernels = [4,8,12].
- hidden_dim_protein (int) - hidden dimension when encoding protein, e.g., hidden_dim_protein = 256.
- rnn_target_hid_dim (int) - hidden dimension of the RNN when encoding protein, e.g., rnn_target_hid_dim = 64.
- rnn_target_n_layers (int) - number of layers in the RNN when encoding protein, e.g., rnn_target_n_layers = 2.
- rnn_target_bidirectional (bool) - whether the RNN is bidirectional when encoding protein, e.g., rnn_target_bidirectional = True.
Calling the instance runs the feedforward procedure of CNN_RNN.
forward(self, v)
- v (torch.Tensor) - input feature of CNN_RNN.
CNN¶
class DeepPurpose.models.CNN(nn.Sequential)
CNN (Convolutional Neural Network) (Source) can be used to encode both drugs (SMILES) and proteins (amino acid sequences).
The constructor creates an instance of CNN.
__init__(self, encoding, **config)
- encoding (string, “drug” or “protein”) - specify the input type of the model, “drug” or “protein”.
- config (kwargs, keyword arguments) - parameters of CNN. The keys include
- cnn_drug_filters (list, each element is int) - number of filters of each CNN layer when encoding drug, e.g., cnn_drug_filters = [32,64,96].
- cnn_drug_kernels (list, each element is int) - kernel size of each CNN layer when encoding drug, e.g., cnn_drug_kernels = [4,6,8].
- hidden_dim_drug (int) - hidden dimension when encoding drug, e.g., hidden_dim_drug = 256.
- cnn_target_filters (list, each element is int) - number of filters of each CNN layer when encoding protein, e.g., cnn_target_filters = [32,64,96].
- cnn_target_kernels (list, each element is int) - kernel size of each CNN layer when encoding protein, e.g., cnn_target_kernels = [4,8,12].
- hidden_dim_protein (int) - hidden dimension when encoding protein, e.g., hidden_dim_protein = 256.
Calling the instance runs the feedforward procedure of CNN.
forward(self, v)
- v (torch.Tensor) - input feature of CNN.
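The kernel lists above determine how much each convolution layer shortens the sequence. A small sketch of that arithmetic follows, assuming stride 1 and no padding (an assumption about the layer setup, not something this page states); the helper name `conv_stack_output_length` and the input lengths 100 and 1000 are likewise illustrative.

```python
def conv_stack_output_length(seq_len, kernels):
    """Length of a sequence after a stack of 1D convolutions.

    Assumes stride 1 and no padding, so each layer shortens the
    sequence by (kernel_size - 1).
    """
    for k in kernels:
        if k > seq_len:
            raise ValueError(f"kernel {k} is longer than sequence {seq_len}")
        seq_len = seq_len - k + 1
    return seq_len

# Example kernel lists taken from the parameter descriptions above.
drug_len = conv_stack_output_length(100, [4, 6, 8])
protein_len = conv_stack_output_length(1000, [4, 8, 12])
print(drug_len)     # 85
print(protein_len)  # 979
```

Larger kernels see more neighboring characters per position, at the cost of a shorter output sequence.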
MLP¶
class DeepPurpose.models.MLP(nn.Sequential)
Multi-Layer Perceptron (MLP) (Source) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.
The constructor creates an instance of MLP.
__init__(self, input_dim, hidden_dim, hidden_dims)
- input_dim (int) - dimension of the input feature.
- hidden_dim (int) - dimension of the hidden layer.
Calling the instance runs the feedforward procedure of MLP.
forward(self, v)
- v (torch.Tensor) - input feature of MLP.
DeepPurpose.dataset¶
Most of the small datasets are already included in the repository.
read_file_training_dataset_bioassay¶
read_file_training_dataset_bioassay loads a bioassay dataset, with one target sequence and multiple drugs together with their interaction scores with the target.
dataset.read_file_training_dataset_bioassay(path)
- path (str, a file path) - the path of the bioassay dataset file. The file must follow this format: the first line is the target sequence; from the 2nd line to the n-th line, each line holds a SMILES string and its interaction score with the target sequence. Example: ./toy_data/AID1706.txt
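The file layout described above (target sequence on the first line, then one SMILES string and score per line) can be parsed with a few lines of standard Python. This is a sketch of the format only, not the library's reader: the function `parse_bioassay_text`, the whitespace separator, and the sample content are assumptions.

```python
def parse_bioassay_text(text):
    """Parse 'target sequence on line 1, then one SMILES and score per line'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    target = lines[0]
    drugs, scores = [], []
    for ln in lines[1:]:
        # Assumes the SMILES string and the score are separated by whitespace.
        smiles, score = ln.split()
        drugs.append(smiles)
        scores.append(float(score))
    return drugs, target, scores

# Hypothetical file content in the described layout.
sample = """MSHHWGYGKHNGPEHW
CC(=O)N 0.46
c1ccccc1 0.49
"""
drugs, target, scores = parse_bioassay_text(sample)
print(target)
print(drugs)
print(scores)
```

Checking a file against this layout before training can catch formatting problems early.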
read_file_training_dataset_drug_target_pairs¶
read_file_training_dataset_drug_target_pairs loads a drug-target pairs dataset. The file must follow this format: each line contains a drug SMILES, a target sequence, and their interaction score. Example: ./toy_data/dti.txt
dataset.read_file_training_dataset_drug_target_pairs(path)
- path (str, a file path) - the path of the drug-target pairs dataset file, in the format described above. Example: ./toy_data/dti.txt
read_file_virtual_screening_drug_target_pairs¶
read_file_virtual_screening_drug_target_pairs loads a virtual screening drug-target pairs dataset. The file must follow this format: each line contains a drug SMILES and a target sequence. Example: ./toy_data/dti.txt
dataset.read_file_virtual_screening_drug_target_pairs(path)
- path (str, a file path) - the path of the virtual screening drug-target pairs dataset file.
read_file_repurposing_library¶
read_file_repurposing_library loads a drug repurposing dataset. The file must follow this format: each line contains a drug SMILES and its name. Example: ./toy_data/??
dataset.read_file_repurposing_library(path)
- path (str, a file path) - the path of the drug repurposing dataset file.
read_file_target_sequence¶
read_file_target_sequence loads a target sequence. The file must follow this format: it has only one line, which contains the target name and the target sequence. Example: ./toy_data/??
dataset.read_file_target_sequence(path)
- path (str, a file path) - the path of the target sequence dataset file.
download_DrugTargetCommons¶
download_DrugTargetCommons downloads the DrugTargetCommons dataset and saves it to a specific path. If the path doesn’t exist, the folder is created.
dataset.download_DrugTargetCommons(path)
- path (str, a directory) - the path where the DrugTargetCommons dataset file is saved. Example: “./data”.
process_BindingDB¶
process_BindingDB processes the BindingDB dataset.
dataset.process_BindingDB(path = None, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30)
- path (str, a file path) - the path where the BindingDB dataset file is saved. Example: “./data/BindingDB_All.tsv”.
- df (pandas.DataFrame) - DataFrame that contains the input data; it is used if the first parameter “path” is None.
- y (str; can be “Kd”, “Ki”, “IC50” or “EC50”) - specify the binding score.
- binary (bool) - If binary is True, formulate the prediction task as a binary classification task. Otherwise, formulate the prediction task as a regression task.
- convert_to_log (bool) - If True, convert the target score to log space for easier regression.
- threshold (float) - The threshold that select target score ??
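The two label options above, log conversion for regression and thresholding for binary classification, can be illustrated with a short sketch. Everything here is an assumption made for illustration: the values are taken to be Kd in nM, the log conversion follows the common pKd convention (negative base-10 logarithm of the molar value), and a value below the threshold is treated as a positive label. The helper names `kd_nm_to_p` and `binarize` are hypothetical, and the library's exact formula and threshold direction may differ.

```python
import math

def kd_nm_to_p(kd_nm):
    """Convert a Kd given in nM to pKd = -log10(Kd in M)."""
    return -math.log10(kd_nm * 1e-9)

def binarize(kd_nm, threshold=30):
    """Label 1 if Kd (nM) is below the threshold, else 0 (assumed direction)."""
    return 1 if kd_nm < threshold else 0

# Raw Kd values span several orders of magnitude; the log scale compresses them.
kd_values = [1.0, 25.0, 1000.0, 10000.0]
p_values = [kd_nm_to_p(v) for v in kd_values]
binary_labels = [binarize(v) for v in kd_values]
print([round(p, 2) for p in p_values])
print(binary_labels)
```

On the log scale the four toy values fall between 5 and 9, which is much easier for a regression loss to handle than the raw range of 1 to 10000.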
load_process_DAVIS¶
load_process_DAVIS loads the DAVIS dataset.
dataset.load_process_DAVIS(path = './data', binary = False, convert_to_log = True, threshold = 30)
- path (str, a directory) - the path where the DAVIS dataset file is saved. Example: “./data”.
- binary (bool) - If binary is True, formulate the prediction task as a binary classification task. Otherwise, formulate the prediction task as a regression task.
- convert_to_log (bool) - If True, convert the target score to log space for easier regression.
- threshold (float) - The threshold that select target score ??
load_process_KIBA¶
load_process_KIBA loads the KIBA dataset.
dataset.load_process_KIBA(path = './data', binary = False, threshold = 9)
- path (str, a directory) - the path where the KIBA dataset file is saved. Example: “./data”.
- binary (bool) - If binary is True, formulate the prediction task as a binary classification task. Otherwise, formulate the prediction task as a regression task.
- threshold (float) - The threshold that select target score ??
load_AID1706_txt_file¶
load_AID1706_txt_file loads the AID1706 dataset.
dataset.load_AID1706_txt_file(path = './data')
- path (str, a directory) - the path where the AID1706 dataset file is saved. Example: “./data”.
load_AID1706_SARS_CoV_3CL¶
load_AID1706_SARS_CoV_3CL loads the AID1706_SARS_CoV_3CL dataset.
dataset.load_AID1706_SARS_CoV_3CL(path = './data', binary = True, threshold = 15, balanced = True, oversample_num = 30, seed = 1)
- path (str, a directory) - the path where the AID1706_SARS_CoV_3CL dataset file is saved. Example: “./data”.
- binary (bool) - If binary is True, formulate the prediction task as a binary classification task. Otherwise, formulate the prediction task as a regression task.
- threshold (float) - The threshold that select target score ??
- balanced (bool) - If True, do oversampling to make number of positive and negative samples equal.
- oversample_num (int) - control the oversample rate.
- seed (int) - random seed in oversample.
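The balancing option above relies on oversampling. The sketch below shows one simple way to do that, drawing random copies of positive samples until both classes are the same size. It assumes positives (label 1) are the minority class; the function `oversample_minority` and the toy data are hypothetical, and the library's own procedure (including the role of oversample_num) may differ.

```python
import random

def oversample_minority(samples, labels, seed=1):
    """Duplicate random positive samples until both classes have equal size.

    Assumes positives (label 1) are the minority class.
    """
    rng = random.Random(seed)
    positives = [s for s, y in zip(samples, labels) if y == 1]
    negatives = [s for s, y in zip(samples, labels) if y == 0]
    if not positives:
        raise ValueError("no positive samples to oversample")
    # Draw with replacement to fill the gap between the two classes.
    n_extra = max(0, len(negatives) - len(positives))
    extra = [rng.choice(positives) for _ in range(n_extra)]
    balanced = [(s, 1) for s in positives + extra] + [(s, 0) for s in negatives]
    rng.shuffle(balanced)
    return balanced

# Toy screen: 2 active compounds among 10 (placeholder names).
samples = [f"drug_{i}" for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced = oversample_minority(samples, labels, seed=1)
n_pos = sum(y for _, y in balanced)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)
```

Fixing the seed makes the duplicated samples reproducible across runs. Oversampling should only be applied to training data, otherwise duplicated positives inflate evaluation metrics.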
load_antiviral_drugs¶
load_antiviral_drugs loads the antiviral drugs dataset.
dataset.load_antiviral_drugs(path = './data', no_cid = False)
- path (str, a directory) - the path where the antiviral drugs dataset file is saved. Example: “./data”.
- no_cid (bool) - If False, the “Pubchem CID” is included in the output.
load_broad_repurposing_hub¶
load_broad_repurposing_hub loads the Broad Repurposing Hub dataset.
dataset.load_broad_repurposing_hub(path = './data')
- path (str, a directory) - the path where the repurposing dataset file is saved. Example: “./data”.
DeepPurpose.chemutils¶
DeepPurpose.chemutils.onek_encoding_unk¶
Given an atom and an allowable atom set that contains a special symbol for unknown atoms as its last element, onek_encoding_unk transforms the atom into a one-hot vector. If the atom does not exist in the allowable atom set, it is labeled as the unknown atom.
def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return list(map(lambda s: x == s, allowable_set))
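A short usage example may help here. The function body is repeated from above so the snippet runs on its own; the element list is a made-up stand-in for the allowable set, with "unknown" as its last entry.

```python
def onek_encoding_unk(x, allowable_set):
    # Fall back to the last entry, which stands for "unknown".
    if x not in allowable_set:
        x = allowable_set[-1]
    return list(map(lambda s: x == s, allowable_set))

# Hypothetical allowable set; the last element marks unknown atoms.
elements = ["C", "N", "O", "unknown"]
carbon = onek_encoding_unk("C", elements)
selenium = onek_encoding_unk("Se", elements)
print(carbon)    # [True, False, False, False]
print(selenium)  # [False, False, False, True]
```

Note that the result is a list of booleans rather than 0/1 integers; the booleans are converted to numbers when the list is later wrapped in a tensor.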
DeepPurpose.chemutils.atom_features¶
Given an atom in a molecular graph, return its feature vector based on the atom's element, degree, formal charge, chirality, and aromaticity.
def atom_features(atom):
    return torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
                        + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
                        + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
                        + onek_encoding_unk(int(atom.GetChiralTag()), [0,1,2,3])
                        + [atom.GetIsAromatic()])
DeepPurpose.chemutils.bond_features¶
Given a bond in a molecular graph, return its feature vector based on the bond type, ring membership, and stereo configuration.
def bond_features(bond):
    bt = bond.GetBondType()
    stereo = int(bond.GetStereo())
    fbond = [bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]
    fstereo = onek_encoding_unk(stereo, [0,1,2,3,4,5])
    return torch.Tensor(fbond + fstereo)
DeepPurpose.oneliner¶
DeepPurpose.oneliner.repurpose¶
text:todo
def repurpose(target, target_name = None,
              X_repurpose = None,
              drug_names = None,
              train_drug = None,
              train_target = None,
              train_y = None,
              save_dir = './save_folder',
              pretrained_dir = None,
              finetune_epochs = 10,
              finetune_LR = 0.001,
              finetune_batch_size = 32,
              convert_y = True,
              subsample_frac = 1,
              pretrained = True,
              split = 'random',
              frac = [0.7,0.1,0.2],
              agg = 'agg_mean_max',
              output_len = 30):
DeepPurpose.oneliner.virtual_screening¶
text:todo
def virtual_screening(target,
                      X_repurpose = None,
                      target_name = None,
                      drug_names = None,
                      train_drug = None,
                      train_target = None,
                      train_y = None,
                      save_dir = './save_folder',
                      pretrained_dir = None,
                      finetune_epochs = 10,
                      finetune_LR = 0.01,
                      finetune_batch_size = 32,
                      convert_y = True,
                      subsample_frac = 1,
                      pretrained = True,
                      split = 'random',
                      frac = [0.7,0.1,0.2],
                      agg = 'agg_mean_max',
                      output_len = 30):
DeepPurpose.model_helper¶
todo
DeepPurpose.utils¶
Drug Target Binding Affinity (DTBA) Model¶
Drug/Target Encoder¶
Drug encoding¶
Drug Encodings | Description |
---|---|
Morgan | Extended-Connectivity Fingerprints |
Pubchem | Pubchem Substructure-based Fingerprints |
Daylight | Daylight-type fingerprints |
rdkit_2d_normalized | Normalized Descriptastorus |
CNN | Convolutional Neural Network on SMILES |
CNN_RNN | A GRU/LSTM on top of a CNN on SMILES |
Transformer | Transformer Encoder on ESPF |
MPNN | Message-passing neural network |
Target encoding¶
Target Encodings | Description |
---|---|
AAC | Amino acid composition up to 3-mers |
PseudoAAC | Pseudo amino acid composition |
Conjoint_triad | Conjoint triad features |
Quasi-seq | Quasi-sequence order descriptor |
CNN | Convolutional Neural Network on target seq |
CNN_RNN | A GRU/LSTM on top of a CNN on target seq |
Transformer | Transformer Encoder on ESPF |
Encoder Model¶
Encoder Model | Description |
---|---|
CNN | Convolutional Neural Network on SMILES |
CNN_RNN | A GRU/LSTM on top of a CNN on SMILES |
Transformer | Transformer Encoder on SMILES |
MPNN | Message Passing Neural Network on Molecular Graph |
MLP | MultiLayer Perceptron on fix-dim feature vector |
Technical Details¶
First, we describe the common modules we import in DeepPurpose.
import torch
from torch.autograd import Variable
import torch.nn.functional as F
from torch import nn
import numpy as np
import pandas as pd
Processing Data¶
Most of the small datasets are already included in the repository.
Drug-Target Binding Benchmark Dataset¶
The public drug-target binding benchmark datasets supported by DeepPurpose are listed below, with the corresponding downloading and processing functions.
Dataset | downloading and processing Function |
---|---|
BindingDB | download_BindingDB() to download the data and process_BindingDB() to process the data |
DAVIS | load_process_DAVIS() to download and process the data |
KIBA | load_process_KIBA() to download and process the data |
Repurposing Dataset¶
The public repurposing datasets supported by DeepPurpose are listed below, with the corresponding downloading and processing functions.
Dataset | downloading and processing Function |
---|---|
Curated Antiviral Drugs Library | load_antiviral_drugs() to load and process the data |
Broad Repurposing Hub | load_broad_repurposing_hub() to download and process the data |
Bioassay Data for COVID-19¶
Dataset | downloading and processing Function |
---|---|
AID1706 | load_AID1706_SARS_CoV_3CL() to load and process |
COVID-19 Targets¶
Dataset | downloading and processing Function |
---|---|
SARS-CoV 3CL Protease | load_SARS_CoV_Protease_3CL() |
SARS-CoV2 3CL Protease | load_SARS_CoV2_Protease_3CL() |
SARS-CoV2 RNA Polymerase | load_SARS_CoV2_RNA_polymerase() |
SARS-CoV2 Helicase | load_SARS_CoV2_Helicase() |
SARS-CoV2 3to5_exonuclease | load_SARS_CoV2_3to5_exonuclease() |
SARS-CoV2 endoRNAse | load_SARS_CoV2_endoRNAse() |
Configuration¶
generate_config generates the full configuration used in learning and inference.
utils.generate_config(
    drug_encoding,
    target_encoding,
    result_folder = "./result/",
    input_dim_drug = 1024,
    input_dim_protein = 8420,
    hidden_dim_drug = 256,
    hidden_dim_protein = 256,
    cls_hidden_dims = [1024, 1024, 512],
    mlp_hidden_dims_drug = [1024, 256, 64],
    mlp_hidden_dims_target = [1024, 256, 64],
    batch_size = 256,
    train_epoch = 10,
    test_every_X_epoch = 20,
    LR = 1e-4,
    transformer_emb_size_drug = 128,
    transformer_intermediate_size_drug = 512,
    transformer_num_attention_heads_drug = 8,
    transformer_n_layer_drug = 8,
    transformer_emb_size_target = 128,
    transformer_intermediate_size_target = 512,
    transformer_num_attention_heads_target = 8,
    transformer_n_layer_target = 4,
    transformer_dropout_rate = 0.1,
    transformer_attention_probs_dropout = 0.1,
    transformer_hidden_dropout_rate = 0.1,
    mpnn_hidden_size = 50,
    mpnn_depth = 3,
    cnn_drug_filters = [32,64,96],
    cnn_drug_kernels = [4,6,8],
    cnn_target_filters = [32,64,96],
    cnn_target_kernels = [4,8,12],
    rnn_Use_GRU_LSTM_drug = 'GRU',
    rnn_drug_hid_dim = 64,
    rnn_drug_n_layers = 2,
    rnn_drug_bidirectional = True,
    rnn_Use_GRU_LSTM_target = 'GRU',
    rnn_target_hid_dim = 64,
    rnn_target_n_layers = 2,
    rnn_target_bidirectional = True
)
- drug_encoding (str) - Encoder mode for the drug. It can be “transformer”, “MPNN”, “CNN”, “CNN_RNN”, …
- target_encoding (str) - Encoder mode for the protein. It can be “transformer”, “CNN”, “CNN_RNN”, …
- input_dim_drug (int) - dimension of the input drug feature.
- input_dim_protein (int) - dimension of the input protein feature.
- hidden_dim_drug (int) - dimension of the hidden layer of the drug feature.
- hidden_dim_protein (int) - dimension of the hidden layer of the protein feature.
- batch_size (int) - batch size.
- train_epoch (int) - number of training epochs.
- test_every_X_epoch (int) - run the test every X epochs.
- LR (float) - learning rate.
- cls_hidden_dims (list of int) - hidden dimensions of the classifier.
- mlp_hidden_dims_drug (list of int) - hidden dimensions of the MLP when encoding drug.
- mlp_hidden_dims_target (list of int) - hidden dimensions of the MLP when encoding protein.
- transformer_emb_size_drug (int) - embedding size of the transformer when encoding drug.
- transformer_intermediate_size_drug (int) - size of the intermediate (feed-forward) layer of the transformer when encoding drug.
- transformer_num_attention_heads_drug (int) - number of attention heads of the transformer when encoding drug.
- transformer_n_layer_drug (int) - number of layers in the transformer when encoding drug.
- transformer_emb_size_target (int) - embedding size of the transformer when encoding protein.
- transformer_intermediate_size_target (int) - size of the intermediate (feed-forward) layer of the transformer when encoding protein.
- transformer_num_attention_heads_target (int) - number of attention heads of the transformer when encoding protein.
- transformer_n_layer_target (int) - number of layers in the transformer when encoding protein.
- transformer_dropout_rate (float) - dropout rate of the transformer.
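The pattern behind a configuration helper of this kind, a dictionary of defaults that keyword arguments can override, can be sketched as follows. This is a simplified stand-in rather than the real generate_config: the function `make_config` is hypothetical, it carries only a handful of the keys listed above, and rejecting unknown keys is a choice made for this sketch.

```python
def make_config(drug_encoding, target_encoding, **overrides):
    """Build a config dict from defaults, then apply keyword overrides."""
    config = {
        "drug_encoding": drug_encoding,
        "target_encoding": target_encoding,
        "batch_size": 256,
        "train_epoch": 10,
        "LR": 1e-4,
        "transformer_n_layer_target": 4,
    }
    # Catch misspelled keys early instead of silently ignoring them.
    unknown = set(overrides) - set(config)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    config.update(overrides)
    return config

# Override a single default, as in the framework example earlier on this page.
config = make_config("MPNN", "Transformer", transformer_n_layer_target=8)
print(config["transformer_n_layer_target"])
print(config["batch_size"])
```

Because the result is a plain dictionary, it can be unpacked into a model constructor with `**config`, which is how the framework example passes its configuration to model_initialize.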