--- title: "Pteridium_amplicon_pipeline" author: "Carol Rowe" date: "5/4/2017" output: html_document --- This is my pipeline for taking Illumnia fastq reads from MiSeq 2x300 of amplicons of Pteridium and processing through cutadapt and then assembling similar sequences. ### OUTLINE: Summary of amplicons and samples. Step 1: Cutadapt: trim seqs by quality and check for read-through (-q 10) OUTPUT: trimmed\*.fastq Step 2: Pandaseq: merge paired-end reads (threshold = 0.90) OUTPUT: trimmed_pandaseq_merged_reads (contains .fasta and .log files) Step 3: Summary table of number of merged reads Step 4: vsearch to cluster reads within each sample/amplicon (threshold = 1.0) Step 2: Sort .fastq sample files into respective amplicons using exact primer match Step 3: Summary table of sorted files (number of reads for each sample/amplicon) Step 4: Use pandaseq to merge paired-end reads (threshold = 0.9) ### Summary of Amplicons and Samples: #### Amplicons: 4 nuclear genes: ApPEFP, CRY, GAPC, and SQD 2 chloroplast genes: rpl16, and rps4 Since we can not sequence through the entire length of each gene with MiSeq 2x300, we divided the genes into pieces (except for CRY and GAPC which are <600bp). ```{r echo=FALSE} primerTable <- read.csv('/Users/carolrowe/Dropbox/Pteridium_phylogenetics/CR_Pteridium_Amplicon/Primer_overview.csv', header=TRUE) primerTable ``` Here are the primer sequences: ```{r echo=FALSE} PrimerseqTable <- read.csv('/Users/carolrowe/Dropbox/Pteridium_phylogenetics/CR_Pteridium_Amplicon/Primer_seqs.csv', header=TRUE) PrimerseqTable ``` #### Samples: Also, we have a total of 100 Pteridium individuals. This includes the 12 individuals run previously in a test run. For sample names, see Step 5: Pteridium_Samples.xls Two of the samples are repeats: **MD22r = MD22VE, and ME21r = ME21VN** ### Step 1: Cutadapt Use the cutadapt software for quality trimming and to remove any sequences that read-thru to other primer/adapter: Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1):10-12, May 2011. DOI: http://dx.doi.org/10.14806/ej.17.1.200 url = {http://journal.embnet.org/index.php/embnetjournal/article/view/200} Before running the code for cutadapt, you'll need two input files: ADAPTER_FWD.fasta and ADAPTER_REV.fasta, which have the reverse-complement primer seqs, since we are looking for the primers at the 3' ends of the reads in case of a mishap short read. 
ADAPTER_FWD.fasta

```
>rpl16aRrc
GCACCAAGTAATATCTGAAC
>rpl16bRrc
ATTTCGTAARCAACATNGSGGA
>rps4aRrc
AGTGAATCGCATTGCTAGCC
>rps4bRrc
GAACCCTTTCGCTCTTCT
>trnScRrc
GAGGGATTCGAACCCTCGGTAA
>SQDaRrc
CATCTYAGCAAAGTGCAYGAYTCACGC
>SQDbRrc
CACCCCCTTACAGTTTATGGCAAAGG
>ApPEFPaRrc
TGTGGTTGTTTCAGCATTGC
>ApPEFPbRrc
GTTTCTGCTCTAGTATTTTGCACTTC
>ApPEFPcRrc
TTGTTAGCTCGCACCTTTGA
>CRY2Rrc
GGGGTATGAARTATTTCTGGGAYAC
>GAPCRrc
TGAAGGGCATTCTGGGTTAC
```

ADAPTER_REV.fasta

```
>rpl16arc
AACGAGTCGCACACTAAGCGT
>rpl16brc
CGGGAGTTGAGTGGTTAACC
>rps4arc
AGGTCCTCGRTAACGSGACAT
>rps4brc
GGTATCTTTTCCCCCAAAAGAGA
>rps4crc
GCTAGTTTTTCAGTAATCTG
>SQDarc
CCTATGATCATDACCTTDGTACCCTTGC
>SQDbrc
CTCAAAATCACATATATCACCCACA
>ApPEFParc
AGCATTTTCCGCAGCTGTAT
>ApPEFPbrc
TCAGTGCCCAATAGTGTGTGT
>ApPEFPcrc
TGTGTACTGAGTTTCTGTCAAACTAGT
>CRYrc
CATTGCTRCCTTTCTCCARYTCATCCT
>GAPCrc
GCAGCACCAGTTGARCTTGGRAT
```

NOTE: If you have \*.fastq.gz files, unzip them (gunzip \*.fastq.gz) before running the following Python code.

From the command line: python ./cutadaptPter.py

```{python eval=FALSE}
import glob
import os

# This is for quality trimming in cutadapt
# Here's the format:
# cutadapt -q 10 -a file:ADAPTER_FWD.fasta -A file:ADAPTER_REV.fasta -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq

pathway = '/path/to/.fastq/files/'

# get list of fastq files in the 'pathway' directory
# glob.glob reads each file path, while os.path.basename gets just the filename
myfiles = [os.path.basename(x) for x in glob.glob(pathway + '*.fastq')]
print myfiles
print len(myfiles)

# Get names of the forward and reverse read files
# (file names are identical except for R1 (forward) and R2 (reverse))
forwardReads = [f for f in myfiles if "_R1_" in f]
reverseReads = [r for r in myfiles if "_R2_" in r]
#print forwardReads
#print reverseReads

if len(forwardReads) != len(reverseReads):
    print "ERROR: Unequal number of reverse and forward read files!"
else:
    print "Great! Equal number of reverse and forward read files."

newlist = []
for forname in forwardReads:
    revname = forname.replace('_R1_', '_R2_')
    # get sample name from first part of file name before the 1st "_"
    samplename = forname.partition("_")[0]
    out1 = "trimmed" + samplename + "_R1_.fastq"
    out2 = "trimmed" + samplename + "_R2_.fastq"
    # the following is the format needed for the cutadapt quality-trim command
    output1 = "cutadapt -q 10 -a file:ADAPTER_FWD.fasta -A file:ADAPTER_REV.fasta -o {} -p {} {} {}".format(out1, out2, forname, revname)
    newlist.append(output1)
#print newlist

# write the commands to a new file; let's call it cutadapt_q_commands.txt
with open('cutadapt_q_commands.txt', 'w') as file:
    for item in newlist:
        print>>file, item
```

OUTPUT = cutadapt_q_commands.txt

Here's an example of a single line from that output file:

cutadapt -q 10 -a file:ADAPTER_FWD.fasta -A file:ADAPTER_REV.fasta -o trimmed104MOGJ_R1_.fastq -p trimmed104MOGJ_R2_.fastq 104MOGJ_S60_L001_R1_001.fastq 104MOGJ_S60_L001_R2_001.fastq

Note that the output files are labeled trimmed\*.fastq so as not to confuse them with the original .fastq files.

Now we need to run each line from the cutadapt_q_commands.txt file. Here's the script for running the cutadapt commands. From the command line: ./cutadaptloop.sh

```{bash eval=FALSE}
#!/bin/bash
while read line; do
    $line
done < 'cutadapt_q_commands.txt'
```

OUTPUT = trimmed\*.fastq
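Before merging, it's worth confirming that each trimmed R1 file still has the same number of reads as its R2 mate (running cutadapt with -o/-p keeps the pairs synchronized, so a mismatch would flag a problem). A minimal sketch of that check, assuming the trimmed files sit in the current directory:

```{python eval=FALSE}
import glob

# A fastq record is 4 lines, so read count = line count / 4
def count_reads(fname):
    with open(fname) as fh:
        return sum(1 for line in fh) / 4

for r1 in sorted(glob.glob('trimmed*_R1_.fastq')):
    r2 = r1.replace('_R1_', '_R2_')
    n1, n2 = count_reads(r1), count_reads(r2)
    status = "OK" if n1 == n2 else "MISMATCH"
    print "{}\t{}\t{}\t{}".format(r1, n1, n2, status)
```

### Step 2: PANDAseq to merge reads

Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown and Josh D Neufeld. PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics 2012, 13:31.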
http://www.biomedcentral.com/1471-2105/13/31

[Here's a good webpage for pandaseq basics.](http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html)

Before running pandaseq, create a primer file to use: PrimerSeqSize.csv

```{r echo=FALSE}
primerTable <- read.csv('/Users/carolrowe/Dropbox/Pteridium_phylogenetics/CR_Pteridium_Amplicon/PrimerSeqSize.csv', header=TRUE)
primerTable
```

NOTE: When running screen, use screen -L (this creates a log file).

Similar to cutadapt, we'll make a .txt file with the individual commands to run for pandaseq. From the command line (you may want to run this in screen, depending on how fast your computer is): python trim_90_PterPandas.py

```{python eval=FALSE}
import pandas as pd
import glob
import os

# goal is to generate a list of pandaseq commands that I can just loop over in a shell script
# Need the forward and reverse file names, as well as the forward and reverse primer seqs,
# the min/max length of reads, and the output (.fasta and .log) file names

pathway = '/path/to/directory/'
PrimerTable = pd.read_csv(pathway + 'PrimerSeqSize.csv')
#print PrimerTable

# get list of fastq files in the 'pathway' directory
# glob.glob reads each file path, while os.path.basename gets just the filename
myfiles = [os.path.basename(x) for x in glob.glob(pathway + 'trimmed*.fastq')]
print myfiles
print len(myfiles)

# Get names of the forward and reverse read files
# (file names are identical except for R1 (forward) and R2 (reverse))
forwardReads = [f for f in myfiles if "_R1_" in f]
reverseReads = [r for r in myfiles if "_R2_" in r]
#print forwardReads
#print reverseReads

if len(forwardReads) != len(reverseReads):
    print "ERROR: Unequal number of reverse and forward read files!"
else:
    print "Great! Equal number of reverse and forward read files."

newlist = []
for forname in forwardReads:
    revname = forname.replace('_R1_', '_R2_')
    # get sample name from first part of file name before the 1st "_"
    samplename = forname.partition("_")[0]
    # Now loop through the primer table to get each primer for each of the above samples
    for index, row in PrimerTable.iterrows():
        primername = row['Primer_Name']
        forprimer = row['IUPAC_Primer_seq_FORWARD']
        revprimer = row['IUPAC_Primer_seq_REVERSE']
        fastaoutput = "trim90_pandaseq_merged_reads/" + samplename + "_" + primername + ".fasta"
        logoutput = "trim90_pandaseq_merged_reads/" + samplename + "_" + primername + ".log"
        minlength = row['set_min_length']
        maxlength = row['set_max_length']
        #minoverlap = row['min_overlap']
        #maxoverlap = row['max_overlap']
        # the following is the format needed for the pandaseq command
        output = "pandaseq -f {} -r {} -p {} -q {} -w {} -g {} -l {} -L {} -A simple_bayesian -t 0.90".format(forname, revname, forprimer, revprimer, fastaoutput, logoutput, minlength, maxlength)
        newlist.append(output)
#print newlist

# write the commands to a new file; let's call it trim90pandaseqcommands.txt
with open('trim90pandaseqcommands.txt', 'w') as file:
    for item in newlist:
        print>>file, item
```

OUTPUT = trim90pandaseqcommands.txt

Here's an example of one line from trim90pandaseqcommands.txt:

pandaseq -f trimmed362HKSL_R1_.fastq -r trimmed362HKSL_R2_.fastq -p ACGCTTAGTGTGCGACTCGTT -q GTTCAGATATTACTTGGTGC -w trim90_pandaseq_merged_reads/trimmed362HKSL_rpl16a.fasta -g trim90_pandaseq_merged_reads/trimmed362HKSL_rpl16a.log -l 496 -L 559 -A simple_bayesian -t 0.90
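As a cross-check on PrimerSeqSize.csv: per the anatomy notes below, set_min_length is the expected amplicon length minus 50 bp, and set_max_length is 600 bp minus the combined primer lengths (for rpl16a above: 600 - 21 - 20 = 559). Here is a minimal sketch of that calculation, assuming a hypothetical Expected_length column alongside the primer columns used above:

```{python eval=FALSE}
import pandas as pd

# Derive the pandaseq -l/-L values from expected amplicon sizes.
# 'Expected_length' is a hypothetical column name; the primer columns
# match those used in trim_90_PterPandas.py.
PrimerTable = pd.read_csv('PrimerSeqSize.csv')
for index, row in PrimerTable.iterrows():
    primerlen = len(row['IUPAC_Primer_seq_FORWARD']) + len(row['IUPAC_Primer_seq_REVERSE'])
    minlength = row['Expected_length'] - 50  # expected length minus 50 bp
    maxlength = 600 - primerlen              # 2x300 span minus both primers
    print "{}\t-l {}\t-L {}".format(row['Primer_Name'], minlength, maxlength)
```

Let's look at the anatomy of the pandaseq command:

* pandaseq calls the pandaseq program.
* -f trimmed362HKSL_R1_.fastq tells the script where to find the forward reads.
* -r trimmed362HKSL_R2_.fastq tells the script where to find the matching reverse reads.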
* -p ACGCTTAGTGTGCGACTCGTT looks for the forward primer and strips it out, along with any seq before it.
* -q GTTCAGATATTACTTGGTGC looks for the reverse primer and strips it out, along with any seq after it.
* -w trim90_pandaseq_merged_reads/trimmed362HKSL_rpl16a.fasta directs the script to make a new fasta file of the assembled reads and to put it in the "trim90_pandaseq_merged_reads" directory.
* -g trim90_pandaseq_merged_reads/trimmed362HKSL_rpl16a.log creates a log file with details of how well the merging went.
* -l is the minimum length of the resulting seq AFTER primers are removed (I just took the expected length and subtracted 50 bp).
* -L is the maximum length of the resulting seq AFTER primers are removed (I opted to calculate this for each locus as 600 bp minus the combined primer lengths).
* -A sets the algorithm used for assembly; simple_bayesian uses the formula described in the original paper (Masella 2012).
* -t 0.90 is a threshold score (between 0 and 1) that each seq must meet to be kept in the final output. After experimenting, I think this is ideal; it could be set a bit lower if needed. See below.

Other parameters I ended up NOT using:

* -o specifies the minimum bp of overlap between forward and reverse seqs
* -O specifies the maximum bp of overlap between forward and reverse seqs

#### Notes on threshold value in pandaseq:

-t is a quality score between 0 and 1 that each sequence must meet to be kept in the final output. From the PANDAseq paper:

"...setting the quality threshold higher than 0.9 would demand that reads have fewer errors than data known to be perfectly correct and is, in effect, demanding the underlying read quality be better than is necessary to reconstruct the sequence."

"Relaxing the quality threshold increased sequence recoveries substantially. When the quality threshold was reduced to 0.6, the total number of sequences increased by 50% and the number of sequences in the most abundant clusters increased by 85%. Even if the quality threshold was lowered below 0.6, no new OTU sequences were assembled by PANDAseq."

[See the PANDAseq paper...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-31)

For the following, run in screen mode! Now, run each line in trim90pandaseqcommands.txt as a separate command. From the command line: ./pandaseqloop90.sh

```{bash eval=FALSE}
#!/bin/bash
# -p checks to see if the directory already exists
mkdir -p trim90_pandaseq_merged_reads
while read line; do
    $line
done < 'trim90pandaseqcommands.txt'
```

OUTPUT = directory with the resulting fasta and log files: trim90_pandaseq_merged_reads

### Summary count of merged reads:

```{bash eval=FALSE}
# run inside trim90_pandaseq_merged_reads
# -c returns the count of fasta headers (merged reads) per file
grep -c '^>' *.fasta >> trim90counts.txt
# sum the per-file counts for a grand total
grep -c '^>' *.fasta | awk -F':' '{sum += $2} END {print sum}' >> trim90counts.txt
```

Results: lots that did NOT merge. The -t 0.90 run has a total of 2,029,678 merged reads; the -t 0.99 run had a total of only 198,285.
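For a per-sample/per-amplicon view of the same counts, the merged fasta files can be tabulated directly, since the script above names each one sample_amplicon.fasta. A minimal sketch (the output file name is just a suggestion):

```{python eval=FALSE}
import glob
import os

# Tabulate merged-read counts per sample/amplicon from the pandaseq fasta files.
# File names look like trimmed362HKSL_rpl16a.fasta (sample, then amplicon).
with open('trim90_counts_by_amplicon.txt', 'w') as out:
    for fasta in sorted(glob.glob('trim90_pandaseq_merged_reads/*.fasta')):
        base = os.path.basename(fasta).replace('.fasta', '')
        sample, _, amplicon = base.partition('_')
        with open(fasta) as fh:
            count = sum(1 for line in fh if line.startswith('>'))
        print>>out, "{}\t{}\t{}".format(sample, amplicon, count)
```

### Summary of trim90_pandaseq_merged_reads - before I do within-sample/amplicon clustering with vsearch:

Here's what this step does:

* Creates a directory, trim90_uniq_seqs, with .txt files of the unique sequence reads.
* Creates a table, trim90_Pter_amplicon_table.txt, with counts of the top 6 haplotypes, etc.
* Headers are: 'Sample', 'Amplicon', 'Num_Contigs', 'Num_Uniq_Contigs', 'Num_Hap_1', 'Num_Hap_2', 'Num_Hap_3', 'Num_Hap_4', 'Num_Hap_5', 'Num_Hap_6', and 'Num_Singletons'.
* Haplotypes are arranged in order of the most common/numerous sequences first (the number is the number of reads with an identical sequence) and hence are called Hap_1 through Hap_6. Singletons are the number of unique sequences with ONLY 1 read contributing to them.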
This does NOT take into account differing read lengths; I look into this later!
* Then merge tables (see the sketch after the script below):
    + Use 'trim90_Pter_amplicon_table.txt' + 'Pteridium_Samples.xls' + 'Pteridium_current_names.xls' to add the following columns to the table: 'species_ID', 'Genus', 'species', and 'subsp'.
* NOTE: The trim90_uniq_seqs files at this point do NOT account for seqs of different lengths. That is taken care of in a later step!!!

From the command line: ./trim90_AmpliconTable.sh

```{bash eval=FALSE}
#!/bin/bash
# This script takes the fasta files of contigs formed in the Pteridium amplicon
# project and makes a table with sample name, amplicon,
# number of assembled contigs, number of unique contigs, how many reads
# generated each of the first 6 most common uniq contigs,
# and the number of uniq contigs generated from a single read (singletons).
# Also creates a directory, trim90_uniq_seqs, with .txt files of the unique seqs.
NEWDIR='/path/to/trim90_uniq_seqs'
mkdir -p $NEWDIR
tablepath='/path/to/output_table/'
pathway='/path/to/trim90_pandaseq_merged_reads/'
cd $pathway
format=trimmed*.fasta
for i in $format; do
    # Get just the sample name, which comes before the 1st _
    name=$(echo $i | awk -F'_' '{print $1}')
    # Get just the amplicon name (comes after the 1st _ and before the .)
    amplicon=$(echo $i | awk -F'[_.]' '{print $2}')
    # How many contigs in the fasta file?
    num_contigs=$(grep "^>" $i | wc -l)
    # Create a file name in which to put just the unique fasta contigs
    newname=${name}_$amplicon
    newname=${newname}_uniq.txt
    echo $newname
    # Get the unique seq contigs (taking lines not starting with > since this is
    # a fasta file) and put the count (number of replicates of each unique read)
    # and the actual unique sequence into a new file
    grep -v "^>" $i | sort | uniq -c | sort -nr >> $NEWDIR/$newname
    # Count the number of unique contigs
    uniqcontigs=$(wc -l $NEWDIR/$newname | awk '{print $1}')
    echo $uniqcontigs
    # Get the first 6 most common contigs, but return only the number of reads/copies
    # that contributed to each of these 6 unique contigs
    mostcommon=$(awk '{print $1}' $NEWDIR/$newname | awk 'FNR <= 6' | tr '\n' '\t')
    # In case there are fewer than 6 lines, pad with zeros:
    counter=$(echo $mostcommon | awk -F' ' '{print NF}')
    if [ $counter -lt 6 ]; then
        while [ $counter -lt 6 ]; do
            mostcommon=$(echo -e $mostcommon'\t0')
            let counter=$counter+1
        done
    fi
    # How many contigs are there with no duplicate reads (singletons)?
    # (count lines whose read count, field 1, is exactly 1)
    justone=$(awk '$1 == 1' $NEWDIR/$newname | wc -l)
    # now let's print this all out as one line for each file
    outputline="$name $amplicon $num_contigs $uniqcontigs $mostcommon $justone"
    # and append the output to the table
    echo $outputline >> $tablepath/trim90_Pter_amplicon_table.txt
done
```

OUTPUT = trim90_Pter_amplicon_table.txt, and directory: trim90_uniq_seqs
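The table-merge described in the notes above is not scripted in this document. Here is a minimal pandas sketch of the idea, assuming the amplicon table and both spreadsheets share a 'Sample' column as the join key; the join key, the spreadsheet layouts, and the annotated output name are all assumptions to adjust:

```{python eval=FALSE}
import pandas as pd

# Load the space-separated table produced by trim90_AmpliconTable.sh
cols = ['Sample', 'Amplicon', 'Num_Contigs', 'Num_Uniq_Contigs',
        'Num_Hap_1', 'Num_Hap_2', 'Num_Hap_3', 'Num_Hap_4',
        'Num_Hap_5', 'Num_Hap_6', 'Num_Singletons']
ampTable = pd.read_csv('trim90_Pter_amplicon_table.txt',
                       delim_whitespace=True, names=cols)

# Strip the "trimmed" prefix added in Step 1 so sample names
# match the spreadsheets (assumption about spreadsheet naming)
ampTable['Sample'] = ampTable['Sample'].str[len('trimmed'):]

# Sample spreadsheets; assumed to carry a matching 'Sample' column
samples = pd.read_excel('Pteridium_Samples.xls')
currentnames = pd.read_excel('Pteridium_current_names.xls')

# Left joins keep every row of the amplicon table while adding
# 'species_ID', 'Genus', 'species', and 'subsp' from the spreadsheets
merged = ampTable.merge(samples, on='Sample', how='left')
merged = merged.merge(currentnames, on='Sample', how='left')
merged.to_csv('trim90_Pter_amplicon_table_annotated.csv', index=False)
```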