Samples

Sample Information: Note: Samples from population 8543 were excluded from the final study due to low read counts.

Software versions

Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

rmarkdown_1.4
knitr_1.15.1
htmltools_0.3.5
jupyter (1.0.0)

ipyrad (v.0.5.15)

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
(Note: in R you can get this version info by running sessionInfo())
library("ade4") # ade4_1.7-5
library("adegenet") # adegenet_2.0.1
library("ape") # ape_4.1
library("genetics") # genetics_1.3.8.1
library("ggplot2") # ggplot2_2.2.1
library("MASS") # MASS_7.3-45
library("pegas") # pegas_0.9
library("poppr") # poppr_2.4.1
library("seqinr") # seqinr_3.3-3

Python 3.6.2
import pandas as pd # version: 0.20.3
import seaborn as sns # version 0.8.1
import numpy as np # version 1.13.1
from geopy.distance import vincenty # geopy (1.11.0)
import re # 2.2.1
import scipy.stats as stats # scipy (0.19.1)

FOR MAP-MAKING ONLY:
Python 2.7.13
from mpl_toolkits.basemap import Basemap # mpl_toolkits '1.0.7'
import matplotlib.pyplot as plt # matplotlib '1.5.1'
from PIL import Image # PIL '1.1.7'
from matplotlib.collections import PatchCollection # matplotlib '1.5.1'
import matplotlib.patches as mpatches # matplotlib '1.5.1'

Barcodes file used for ipyrad

03_02 ATTCTGGAA
05_03 AAGTTGCTAA
07_04 AATCCAGA
10_02 AGCAATGAA
21_05 ATAGTCGTAA
08_02 CCTAGAGA
04_05 ACCTCGTCCA
17_10 CCAGTCTA
01_05 TAACGGTAA
18_06 CGAGGAAGCA
20_05 ATATAGTA
02_04 AATCCTTAA
17_05 AACGCCAA
03_05 CCAGAACCAA
05_01 CGGTCCAA
10_03 CTGGATGAA
03_01 TGCAACTTAA
02_05 GGAATAGA
16_02 CAACTTGAA
14_04 AAGCCAACCA
16_05 TTCGTAGA
08_01 AAGAGGTA
21_03 AGGCGCAGCA
17_01 TTCAATTA
09_03 CATCTCAGCA
06_01 CCGCGTTA
10_01 GGTTAACCA
03_03 TGAAGCCAA
17_09 AGACGCAA
13_03 AACCGCCAA
11_03 TCGCGACCAA
14_05 CTTCAATAA
09_01 CTTGCCGA
13_01 AAGGAATAA
02_03 CGCTCAACCA
19_03 GACCGCGA
19_05 TTCTCATAA
13_05 AATTGGAGCA
15_02 CCTCCAAC
12_02 GTCAGACCA
20_02 CAGGAACGCA
18_08 TTGCGAAC
20_04 AGACGACCA
06_03 ATCGGACCAA
05_02 TAGATCAA
01_01 CTGATCCAA
11_05 CATTGACCAA
07_02 ATAAGAACCA
15_04 ACGTTCGA
15_03 GATAGATAA
09_05 CCTCGAACCA
03_04 TATAAGGA
13_02 CCAGGATAA
10_05 AACTCTCGCA
05_04 CCTATACCA
07_03 AATGCAGGCA
21_02 ACCTGAAC
02_01 CTTGAGCCA
11_03 TTGGACGGCA
18_02 CTGATACCAA
07_01 AATAAGCAA
16_01 ACCAAGCCAA
10_02 CAGCAGAA
01_01 GTCCTAACCA
16_01 CTGATGGA
04_01 GCGCTATAA
18_09 CTGCAGACCA
16_03 TGGACTGA
10_04 AGATTATAA
17_04 AATATAAC
17_02 ACTCCGCCA
18_07 TTAATTGGCA
12_01 GCGGTAAC
19_01 GGAGCGCCA
17_08 CATAATTGCA
11_01 TGAGAGAA
05_05 TCGGAGCAA
08_02 TAGGAGCCAA
07_04 GTCTCGAA
11_05 CTTATTACCA
04_02 ACTGAATA
01_02 GTAACCTAA
20_03 ACCGTTACCA
18_05 TAGTAATA
16_04 CAGTCCTAA
01_04 CTTAAGAC
18_10 CGGAGGCCA
14_02 AGAACTTGCA
17_03 GACGAGAC
12_03 ATAATGCCA
14_01 CAATTAATCA
15_01 GGTAGGAA
18_01 CCAACGCAA
04_02 GATACGCCAA
02_02 CCATGGAA
08_04 CAGAGAGCCA
08_05 GTAGCATA
11_02 ACCTAGTAA
06_02 CATTACGCCA
01_03 ATCCGATA
06_04 TGGACGTAA
21_01 AGGACGAC
19_04 GCGTTGCCA
21_04 TTATGCATCA
11_04 TATTGGAC
14_03 TAGACTCCA
04_04 TATTCTATCA
18_03 GGCCATAA
04_03 ATCGCGCAA
05_01 TTAACTCCAA
21_03 ATGGATAA
13_04 CTATCGGCCA
08_03 GCCTTATA
06_05 GCTGCGTAA
12_05 TGAGCTGCCA
17_06 TGCTGCTA
20_01 CTCAGGTAA
09_02 CGCTTGAC
07_05 GATCGTCCA
15_05 GCTTGCCTCA
02_02 TAGGCTAC
18_04 CTCCTTCCA
19_02 AGCATGCTCA
12_04 CATACTAA
17_07 CCTTATCAA
01_03 TGCGAAGCAA
04_05 TTACCTAA
8542_1A GCCGGCTCA
8544_3C CTCTGATCGA
8550_9A CTATGCTCA
8545_4A ATTGGAAGGA
8542_1B GAGATCTCA
8550_9B GGCAAGTCA
8542_1C ATATGCTGGA
8550_9C GGAATCTGGA
8545_4C TAGCATATGA
8551_10A GGCTCAAGA
8546_5B TTATATCTGA
8549_8A GAGGTAAGA
8551_10C TGCGGCAATA
8544_3A GCCGACGATA
8548_7C TATAACTATA
8544_3B CCGCCTACTA
8547_6A GGCGGCAG
8549_8C AACTTCAGA
8548_7C TGAGTACCTA

ipyrad params file

------- ipyrad params file (v.0.5.15)-------------------------------------------
AllenrFINAL      ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./     ## [1] [project_dir]: Project dir (made in curdir if not present)
./Allenr_01_S4_L008_R1_001.fastq.gz     ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
./AllenrFINALbars.txt     ## [3] [barcodes_path]: Location of barcodes file
     ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo     ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
     ## [6] [reference_sequence]: Location of reference sequence file
ddrad     ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
CAATTC,GTAA     ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
4     ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33     ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6     ## [11] [mindepth_statistical]: Min depth for statistical base calling
6     ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
1000     ## [13] [maxdepth]: Max cluster depth within samples
0.90     ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0     ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2     ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
40     ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2     ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
2, 2     ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
4, 4     ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
40     ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20     ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
5, 5     ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5     ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0     ## [25] [edit_cutsites]: Edit cut-sites (R1, R2) (see docs)
0, 0, 0, 0     ## [26] [trim_overhang]: Trim overhang (see docs) (R1>, <R1, R2>, <R2)
*     ## [27] [output_formats]: Output formats (see docs)
     ## [28] [pop_assign_file]: Path to population assignment file
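
For reference, a typical invocation to run all seven assembly steps with this params file would look like the sketch below; the params file name follows ipyrad's default params-<assembly_name>.txt convention, and the core count is an example rather than a recorded setting.

# Sketch only: run ipyrad steps 1-7 with the params file above.
# "params-AllenrFINAL.txt" and "-c 8" are assumptions, not recorded settings.
ipyrad -p params-AllenrFINAL.txt -s 1234567 -c 8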

Loci selection from ipyrad .ustr file

Due to varying read lengths, randomly selecting a single SNP from each locus does not always preserve the "Min # samples per locus for output" threshold (parameter [21] of the ipyrad params file). To check this, and to raise the threshold to a minimum of 100 samples per locus, I ran the following Python script.

NOTE: As seen from the script below, even though the ipyrad results suggest I should have 28,899 loci present in a minimum of 40 samples, I actually have 26,604 loci meeting that min_samples_locus threshold, a difference of 2,295 loci.

Note: There are 132 samples in this dataset. The .ustr file from ipyrad contains 28,899 loci.
Python code:

import pandas as pd

# IMPORT .ustr FILE:
# There seem to be odd combinations of spaces and tabs as separators in the .ustr file,
# so use the generic delim_whitespace=True to split on any kind of whitespace
my_ustr = pd.read_csv('AllenrFINAL.ustr', delim_whitespace=True, header=None, na_values="-9")
#print(my_ustr.iloc[0:5, 0:9]) # File imported into pandas correctly.
#print(my_ustr.shape) # (264, 28900) rows by columns - as expected

# Counts of non-missing data by column
my_counts = my_ustr.count()
# Since there are 2 lines (or rows) per sample:
my_counts_per_sample = my_counts/2

# Don't forget that column 0 holds the sample names, so it should have a count of 264 (132 samples)
#print(len(my_counts_per_sample))
#print(my_counts[0:10])
#print(my_counts_per_sample[0:10])


# CREATE PLOTS TO LOOK AT YOUR DATA:
import seaborn as sns
import numpy as np
# Can change the width of bins by changing the number: bins=np.arange(min, max, bin_width)
# Or you could tell it how many bins you want, for example: bins=80
sns_plot = sns.distplot(my_counts_per_sample, bins=np.arange(0,132, 2))
sns_plot.set(xlabel='Number Samples', ylabel='Proportion of Total Loci')
fig = sns_plot.get_figure()
fig.savefig("proportion_loci_per_sample.png") # see plot below

# SELECT FOR LOCI WHERE DATA EXISTS IN AT LEAST 100 SAMPLES AND SAVE TO NEW FILE:
# count will be >= 200 as there are two lines per individual
# Make a new .ustr based upon a cut-off number of existing data
new_ustr = my_ustr.loc[:,my_ustr.count() >= 200]
#print(new_ustr.shape) # (264, 1384)
#print(new_ustr.iloc[0:5,0:5])

# Just for curiosity. Supposed to have 28,899 loci in a min. of 40 samples...
ustr_40 = my_ustr.loc[:, my_ustr.count() >= 80]
print(ustr_40.shape) # (264, 26605)
# Hence, although ipyrad reports 28,899 loci, after subtracting 1 column for the sample names
# there are really only 26,604 loci meeting min_samples_locus. A difference of 2,295 loci.

# Put NaN to -9
new_ustr = new_ustr.fillna('-9')
#print(new_ustr.iloc[0:5, 0:9]) # Looks correct.
# Make sure that your columns stay as integers because:
# This is a "gotcha" in pandas (Support for integer NA), where integer columns with NaNs are converted to floats.
new_ustr.iloc[:,1:] = new_ustr.iloc[:,1:].astype(int)
new_ustr.to_csv('AllenrFINAL_100samptolocus.ustr', index=False, sep=' ', header=False)

Plot of Proportion of Loci Per Sample

Note the peak at 40 samples, which is what we expect given the settings in ipyrad for a min. of 40 samples per locus.
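
As a quick numerical check of that peak, here is a minimal sketch reusing the my_counts_per_sample series from the Python script above (the variable name n_at_peak is just illustrative, and the resulting count is not reported in this document):

# Sketch: count how many loci are present in exactly 40 samples.
# Column 0 holds the sample names, so it is excluded with .iloc[1:].
n_at_peak = (my_counts_per_sample.iloc[1:] == 40).sum()
print(n_at_peak)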

Genetic Subdivision Analysis

I highly recommend reading:
Jombart, T. and Collins, C. (2017) A tutorial for Discriminant Analysis of Principal Components (DAPC) using adegenet 2.1.0. Imperial College London. MRC Centre for Outbreak Analysis and Modelling
Jombart and Collins link

Read the *.str file into a genind object in R. (Note that read.structure expects a file with the .stru extension, hence the file name used below.) To do this, you first need to know how many individuals are in your file (n.ind), the number of loci (n.loc), and which column contains the individual labels (col.lab).

If you want to look at the first 12 rows (head -n 12) and the first 12 columns in your file:
head -n 12 AllenrFINAL_100samptolocus.ustr | cut -d' ' -f1-12

From editing the .ustr file, we know that we have 132 samples and 1383 loci. You can double-check this from the command line as follows (or with the short R check shown after this list):

n.ind = 132
Get this by typing the following from command line:
wc -l AllenrFINAL_100samptolocus.ustr
>> returns 264. There are 2 lines per individual: 264/2 = 132 individuals

n.loc = 1383
Get this by typing the following from command line:
head -n 1 AllenrFINAL_100samptolocus.ustr | awk '{print NF}'
>> returns 1384. Subtract 1 for the row name = 1383. (Don't forget to subtract more IF you have population info columns, etc.)
( NOTE: you could also run this command: awk '{print NF; exit}' AllenrFINAL_100samptolocus.ustr )

col.lab = 1 Since column 1 contains the sample names
col.pop = 0 Since I do not have a column containing a priori population info. Otherwise give the column number
row.marknames = 0 Since there is no row of marker names (header) in my .ustr file
onerowperind = FALSE Because I have two rows per individual
NA.char = '-9' Since -9 is how our missing data are coded.
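
Alternatively, here is a minimal R sketch for the same dimension check; it assumes the whitespace-delimited AllenrFINAL_100samptolocus.ustr file written above, and the variable name tab is just illustrative:

# Sketch: confirm the number of individuals and loci directly in R.
tab <- read.table("AllenrFINAL_100samptolocus.ustr", header = FALSE)
nrow(tab) / 2   # expect 132 individuals (two rows per individual)
ncol(tab) - 1   # expect 1383 loci (the first column holds the sample names)
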
R code

library("ape")
library("genetics")
library("pegas")
library("seqinr")
library("ggplot2")
library("adegenet")

######### READ IN DATA  ############################
Allen_obj1 <- read.structure("AllenrFINAL_100samptolocus.stru", n.ind = 132, n.loc = 1383, col.lab = 1, col.pop = 0, row.marknames = 0, onerowperind = FALSE, NA.char = '-9')

# For kicks, you can run the following two lines just for partial confirmation of proper importation of file:
indNames(Allen_obj1) # should return a list of the 132 sample names
ploidy(Allen_obj1) # should return 2 since we gave it 2 alleles for each marker
############# END READING IN DATA ###################

IDENTIFY OPTIMAL NUMBER OF CLUSTERS TO BEST EXPLAIN THE DATA
R code

# ?find.clusters
# Start with a zoomed-out perspective (a large max.n.clust).
# When you run the command below, it will ask how many PCs to retain and then how many clusters.
# I retained 200 PCs and 30 clusters (I have 30 populations and 132 samples, so I figure this will be ample).
BIG_grps <- find.clusters(x=Allen_obj1, stat="BIC", choose.n.clust=TRUE, max.n.clust=80, n.iter=100000, n.start=100, scale=FALSE, truenames=TRUE)
# Based upon the BIG_grps results, we can narrow the scale:
# I am "zooming in" on the initial dip from the above output - around 0-10 clusters
grps <- find.clusters(x=Allen_obj1, stat="BIC", choose.n.clust=TRUE, max.n.clust=10, n.iter=100000, n.start=100, scale=FALSE, truenames=TRUE) # retained 200 PCs; chose 3 clusters at the prompt

Note: The BIC vs. # clusters plot from the zoomed-out run is inset within the "zoomed-in" version showing the plot with a maximum of 10 clusters.
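
If you want to re-draw or save that BIC curve outside the interactive prompt, here is a minimal sketch using the Kstat element returned by find.clusters (see names(grps) below); the output file name is only an example:

# Sketch: re-plot the BIC values stored in the find.clusters result.
# "BIC_vs_nclust.png" is an example file name, not from the original analysis.
png("BIC_vs_nclust.png")
plot(grps$Kstat, type = "b", xlab = "Number of clusters (K)", ylab = "BIC")
dev.off()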

Let’s look at the actual assignment of samples to the optimal 3 clusters:
R code

# To see what info we can access:
names(grps)
# "Kstat" "stat"  "grp"   "size" 
grps$size # tells how many individuals in each cluster
# 42 34 56 
grps$grp # gives list of individuals assigned to each cluster
# output is list of sample names and what group they belong to
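
To keep these assignments for downstream use, here is a minimal sketch that writes them to a table; the file name group_assignments.csv is an example, not part of the original workflow:

# Sketch: save each sample's cluster assignment to a CSV file.
# "group_assignments.csv" is an example file name.
assignments <- data.frame(sample = names(grps$grp), cluster = as.integer(grps$grp))
write.csv(assignments, "group_assignments.csv", row.names = FALSE)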

GROUP ASSIGNMENTS: