vignettes/Example_Workflow.Rmd

---
title: "Example Workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sample_Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
### Introduction
This vignette goes through an example workflow using the `epictope` package. `epictope` facilitates the identification of optimal tagging locations for proteins of interest. Optimal locations are identified by considering four features; sequence conservation, secondary structure, solvent accessibility, and disordered binding regions. These four features are factored into a scoring function that calculates a cumulative score for each position in the protein. The weight of each feature and how they contribute to the final score can be defined by the user. Below, we work through a simple example of this workflow for a protein of interest.

To calculate the multiple sequence alignment and secondary characteristics, `epictope` relies on local installs of BLAST, MUSCLE, and DSSP. The recommended way to run Epictope is through R or RStudio inside a conda environment. 


#### Verify installation
To check if `epictope` has been installed correctly, we can attempt the load the package using "library()". If successful, the command should run without any further messages to the console. Once `epictope` has been loaded, we can use the provided ".FindExecutable()" function to determine if the dependencies have been successfully installed and are identifiable by R. If successful, the output of each .FindExecutable command should list the file location of the program. 

```{r, eval = FALSE}
library(epictope) # check if epictope is installed and loadable
epictope::.findExecutable("muscle") # looks for muscle executable
# ~/hchung/epictope/bin/muscle" 
epictope::.findExecutable("blastp") # looks for blast executable
# ~/hchung/epictope/bin/blastp" 
epictope::.findExecutable("mkdssp") # looks for dssp executable
# ~/hchung/epictope/bin/mkdssp" 
```

### Workflow

Once we have verified Epictope and its dependencies are installed, we can use the package functions to identify optimal tagging locations as follows. To start, we will need to download the protein sequences of the organisms we want to compare against for a multiple sequence alignment. Afterwards, the `data/CDS` folder should be populated with ".gz" files for each of the organisms used in the multiple sequence alignment.


```{r, eval = FALSE}
setup_files()
check_config()
# search for the default species specified with the config file 
protein_links <- ftp_search(species)
# download the sequences
lapply(protein_links, ftp_download)

# we can also specify the animals we want to use at the time of search 
#species <- c("Mus_musculus", "Bos_taurus")
#protein_links <- ftp_search(species)
#lapply(protein_links, ftp_download)
```

Next, we will need to specify the *UniprotID* of the protein you are interested in. [UniProt](https://www.uniprot.org/) is a freely accessible database of protein sequence and functional information. A single gene can encode multiple protein isoforms through  alternative splicing. Starting with the UniprotID allows to know which specific protein isoform we are interested in and working with. 

```{r, eval = FALSE}
query <- "Q9W7E7" # The UniprotID of the protein we are looking for. In this example, we use the Smad5 gene for zebrafish.
uniprot_fields <- c("accession", "id", "gene_names", "xref_alphafolddb", "sequence", "organism_name", "organism_id")
uniprot_data <- query_uniProt(query = query, fields = uniprot_fields)
```

Using the Alphafold2ID provided by Uniprot, we can download the Protein Data Bank file or the AlphaFold2 predicted structure for the protein of interest. We then run DSSP to define the secondary structures from the Alphafold2 PDB file and parse out the results to a dataframe.

```{r, eval = FALSE}
# download associated alphafold2 pdb
alphafold_file <- fetch_alphafold(gsub(";", "", uniprot_data$AlphaFoldDB))
# calculate dssp on alphafold2 pdb file
dssp_res <- dssp_command(alphafold_file)
# parse dssp and convert to dataframe
dssp_df <- parse_dssp(dssp_res)
```

The UniprotID is also used for IUPRED2A. IUPred2A is a web interface that allows to identify disordered protein regions using IUPred2 and disordered binding regions using ANCHOR2. From the IUPRED2A output, we take the ANCHOR2 score.
```{r, eval = FALSE}
iupred_df <- iupredAnchor(query) # dataframe for iupred results
```

To perform the multiple sequence alignment (MSA). We must identify homologous proteins in our selected species. To do this, we will BLAST our query protein against the proteome of the selected species. These species can be defined by the user with the "config.R" configuration file, or using the default set of animals. We will sort the BLAST results by their E value, and select highest match by the lowest E value. We will then run muscle to perform the MSA. Using the MSA result, we can calculate the shannnon entropy for each position. 

```{r, eval = FALSE}
# blast query aa sequence
seq <- Biostrings::AAStringSet(uniprot_data$Sequence, start=NA, end=NA, width=NA, use.names=TRUE) # Use the sequence identified by the UniprotID

# list of proteome files to blast against
aa_files <-  list.files(cds_folder, pattern = paste0(species, ".*\\.all.fa$", collapse = "|"), ignore.case = TRUE, full.names = TRUE, recursive = TRUE)
names(aa_files) <- aa_files

# blast
blast_results <- lapply(aa_files, function(.x){protein_blast(seq, .x)})

# take the highest blast match according to E score 
find_best_match <- function(.x){head(.x[base::order(.x$E),], 1)} # sort by E value and take best match
blast_best_match <- lapply(blast_results, find_best_match) # find best match
blast_seqs <- lapply(blast_best_match, fetch_sequences) # retrieve the sequence for the best match

# Assign the sequence of our query protein to the MSA
blast_seqs[[query]] <- seq
blast_stringset  <- Biostrings::AAStringSet(unlist(lapply(blast_seqs, function(.x){.x[[1]]})))

# multiple sequence alignment
msa_res <-  muscle(blast_stringset)
# shannon entropy calculation
shannon_df <- shannon_reshape(msa_res, query)

```

The final step is to combine the shannon entropy, secondary structure, and disordered binding region into a single dataframe. From there, we can calculate the solvent accessibility and tagging score for each position. The final output will be written to a file in the outputs folder.

```{r, eval = FALSE}
# join tagging features in dataframe
features_df <- Reduce(function(x, y) merge(x, y, all=TRUE), list(shannon_df, dssp_df, iupred_df), accumulate=FALSE)
colnames(features_df)
# normalize features and calculate tagging score
norm_feats_df <- calculate_scores(features_df)

# write to file.
res_df <- merge(norm_feats_df, features_df)
write.csv(apply(res_df,2,as.character), file = paste0(outputFolder, "/", query, "_score.csv"), row.names = FALSE)
```