Category Archives: Data Analysis

Making conStruct Input Files

As part of my postdoc with Gideon Bradburd, I’m using his new software package conStruct (bioRxiv; GitHub) to analyze dozens of genomic datasets.  conStruct requires three input files: 1) genetic data, 2) coordinate data (longitude in the first column, latitude in the second), and 3) a pairwise distance matrix with the same number of sites as in the coordinate data.  Files 2 and 3 are straight forward; but it took me a little time to be able to go from a regular STRUCTURE file to a conStruct file.  So below is the R code I’m using to do this conversion.

Now if your data is not already in STRUCTURE two row format (i.e. two rows per sample), then you’ll need to get there as a starting place. I used PGDSpider to make the STRUCTURE files WITH a header row and a column to denote sampling site.
(PLINK, bless its heart, makes 1 row 1 column STRUCTURE files, and I’m not coding out of that.) Remember you want to denote sampling sites for conStruct and not putative populations. I then replaced the two blank headers for columns one and two with “SampleID” and “PopID.”

conStruct can take data as counts or frequencies. The code below makes a table of frequencies for one allele (doesn’t matter major or minor, derived or ancestral) for each sampling site for each locus.

I have written this as a loop to process multiple input files at once. You can remove the for loop and start at “str <- read.table()” if you only have one file.

setwd("Enter the path to your working directory")
files <- list.files(pattern = "*.str",full.names=T)
newnames <- paste(sep="",sub('.str', '',files),"-Processed.str")

#Loop over all files and make the processed files needed for conStruct
for(i in 1:length(files)){

#Read data file and convert missing data to NA
str <- read.table(files[i],header=T)
str[str == "-9"] <- NA                          
str <- str[ order(str$PopID,str$SampleID),]

#Count number of samples
SampleID <- as.character(unique(str$SampleID))

#Looping over all loci, create a frequency table of alleles (0,1,2)
#Jacob Burkhart wrote this loop
count <- data.frame(SampleID)
for(loci in 3:dim(str)[2]){   
  temp <- table(str$SampleID, str[,loci])           
  colnames(temp) <- paste0(colnames(str)[loci], "-", colnames(temp)) 
  temp <- data.frame(unclass(temp)) 
  
#If there are no alleles, recode the row as -9
  for(j in 1:dim(temp)[1]){
    if(sum(temp[j,]) == 0) {temp[j,] <- NA} 
  }
#Test if a monomorphic locus slipped through your data processing
#If so, column bind data to sample ID and any previous datasets
#If not (as expected), then the column bind will be applied to the 2nd allele
#Why the 2nd allele?  Because any loci with missing data will result in data being added to the table
  count <- as.matrix(cbind(count,if(length(temp)==1){temp[,1]} else{temp[,2]}))
}

#Create a vector of the sampling site information for each sample
pop.vec <- as.vector(str[,2])
pop.vec <- pop.vec[c(seq(from=1, to=nrow(str), by=2))]

#Make variables to utilize below
n.pops <- length(unique(pop.vec))
table.pops <- data.frame(table(pop.vec))

#Make a file of individual sample allele frequencies
#If you only have one sample per sampling site, then you could stop here
freq <- matrix(as.numeric(count[,-1])/2,nrow(count),ncol(count)-1)
f <- matrix(as.numeric(freq),nrow(freq),ncol(freq))

#Empty matrix for sampling site level calculations
admix.props <- matrix(NA, n.pops,ncol(f))

#Calculate frequency (of 2nd allele) per sampling site
#The last line tests if there is a sampling site with n=1
#If so, prints vector because frequency has already been calculated (0, 0.5, or 1)
#If not, then calculates mean across samples from that site
for(m in 1:length(table.pops$pop.vec)){
  t<-as.factor(unique(pop.vec))[m]
  admix.props[m,] <- if(table.pops[table.pops$pop.vec == t,2] == 1){f[which(pop.vec==t),]} else{colMeans(f[which(pop.vec==t),],na.rm=T)}
  }

#Export conStruct file and save in working directory
write.table(admix.props, newnames[i],quote=F,sep="\t",row.names=F,col.names=F)
}

As I noted in the code, my friend Jake Burkhart wrote the internal for loop that makes the frequency table. He originally wrote the loop to make pseudo-SNP datasets out of microsatellite data. Which means, if you want to run conStruct on a microsatellite dataset, you can print all of the loci (instead of just one of the biallelic SNPs), then keep processing the frequencies at each sampling site.  Note, conStruct will throw an error if there are fewer loci than samples, which shows up more readily when using pseudo-SNP data from (even highly polymorphic) microsatellites.

BayesAss for RADseq Data

I want to use BayesAss on a large SNP dataset generated with RADseq.  But I found out when I went to convert the data into the .immaq format that my favorite converter, PGDspider, would only convert the first 40 loci.  I didn’t get 10,000s of loci for nothing, so that wasn’t going to work.  But a second problem was that BayesAss 3.0.3 only allows 240 SNPs anyways.

Obviously I’m not the only person with this problem.  And thanks to Steve Mussmann there’s a solution.  Steve re-wrote BayesAss 3.0.4 to be able to handle large SNP datasets, as well as a program to convert the STRUCTURE files from pyRAD into .immaq input files for BayesAss.

Since I have STRUCTURE files from STACKS and not pyRAD, I had to do a little conversion.  My messy code is below, but leave a comment if you are a wiz with an elegant solution.

Turning STRUCTURE Output from STACKS into STRUCTURE Output from pyRAD
First, output data from STACKS in STRUCTURE format (.str).  Remove the first two rows from this output (STACKS header and loci identifiers).  Then, print the first column (sample names), insert five empty columns to match pyRAD.  Do not print the second column from the STACKS STRUCTURE output because that is the population code from your Population Map input into STACKS.

awk '{print $1 "\t" "\t" "\t" "\t" "\t" "\t"}' data.str > test.out

Next, print out the remainder of your data starting at column 3 (i.e. first locus) in the original dataset (awk code from here).  Then use paste to concatenate the two files into the .str file you will convert.

awk '{for(i=3;i<NF;i++)printf"%s",$i OFS;if(NF)printf"%s",$NF;printf ORS}' data.str | tr ' ' '\t' > test2.out

paste test.out test2.out > data.str

Convert .str to .immac Using a Custom PERL Script
If you already had data from pyRAD and did not have to do the steps above, you can just convert your .str file to an .immac using Steve’s script: str2immaq.pl. However, if you have data from STACKS then you need to modify lines 39-52 in the original script to the following:

# convert structure format to immanc format, and push data into hash
for(my $i = 0; $i < @strlines; $i++){
	my @ima;
	my @temp = split(/\s+/, $strlines[$i]);
	my $name = shift(@temp);
	foreach my $allele(@temp){
		push( @ima, $allele);
	}

Since STACKS exports missing data as 0 (whereas pyRAD exports missing data as -9), this change removes converting missing data from -9 to 0. Steve also wrote this change.  Save the change in the perl script, then use the script to convert data from .str to .immac.  Now you’ve got an input file for BayesAss.

Formatting Microsatellite Data for PCA in EIGENSOFT

At this point, who hasn’t read Patterson et al 2006 about population structure and eigenvector analysis?  It’s a great paper as it introduced the EIGENSOFT package for analyzing genomic data using principal components analysis (PCA).  PCA is a great way to identify both population structure and admixture relationships.  For anyone that works with microsatellites, you no doubt noticed the paragraph that says you can use microsatellite data as the PCA input.  However, the input format is not all that clear for how to convert your data.  This post provides an in-eloquent way of doing the format conversion using Excel tables.

Let’s start with our data in double column format (e.g.- STRUCTURE format).
Let’s call this Tab1:

MicroData

Note that Sample2 is missing data at Locus1.

You need to convert your data so that each allele of a microsatellite in your dataset has its own column.  In the toy example there are four alleles for Locus1, and three alleles for Locus2.  In Excel, I set up a new tab with the paired locus and allele information.  Also remember that it is important to keep the sample order the same as in the two column format.

The next step is to populate the new table with the number of alleles (0, 1, or 2) that each sample has for each locus_allele combination.  While this may be simple enough to do my hand for toy datasets, it is much easier (and less error prone) to use formulas in Excel to populate the new table.  I used an if/then statement.  Specifically, if the allele in the first column for the locus of interest was equal to the allele in the header row, then populate that cell with the value of 1; if the values in Tab 1 and Tab 2 are not equal, then populate the cell with a 0.

Write formulas for each locus_allele combination for the first sample, making sure to lock the right hand of the equation (allele value to compare the data to) using dollar signs before the row and column identifiers.

One_Allele

Once formulas have been written for the first sample, you can drag the equations down to populate the full table.

You should have noticed that when you do this, you are only accounting for the first allele.  Therefore, you can add (+) an additional if/then statement to account for the second allele in the second column, like this:

Two_Alleles

Yay!  There are now counts for all of the alleles.  However, we still have to account for the missing data.  To do this, we can still use the same idea as before by adding another if/then statement, but this time it will evaluate if the cell in the original data (Tab 1) is blank.  Blank cells can be coded in Excel as empty quotes (e.g. “”).  Since missing data should be missing in both columns in the original double column data (Tab 1), we only need to evaluate the first column, and assign it a value of 9 if it fulfills the if/then statement.

Two_Alleles_Missing

Now we see that Locus1 of Sample2 has been coded as missing data (9) at all alleles.

Finally, you need to make the three files (.eigenstratgeno, .snp, and .ind) that serve as input in EIGENSOFT.  To make the .eigenstratgeno file, copy your Excel table, then paste the values to remove the formulas.  (If you want to reference the formulas in the future, paste the values in a new tab.)  Now copy the table (with values not formulas) again and use the transpose function.  The toy data now looks like this:

EigenstratgenoFormat

Copy just the values (no sample or locus names) and paste into a text file, then remove the tabs between each column that Excel inserts.

FinalData

Finally, you may be asking why go through all of this trouble to use EIGENSOFT for PCA when you could also make one using the R package adegenet?  I like EIGENSOFT because you can analyze the output with Tracy-Widom statistics (within the EIGENSOFT package) to identify which principal components are significant versus less rigorous ways such as observing the plateau of eigenvalues.

Script for ChromoPainter

I’m new to Perl so this may not be the most elegant script. The script converts fastPHASE output into a ChromoPainter input file. The for loop makes an input file for each scaffold for which my data maps to (in my case 284 scaffolds of the polar bear genome).

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my($infile, $phasefile, $Chromofile, $constant, $i);
foreach my $chr (1..284){ #1 to 284 b/c my data maps to scaffolds that are named numerically and range from 1 to 284, non exclusive)
$constant = "0 0\n";
$infile="batch_1.scaffold".$chr.".phase.inp"; #Exported fastPHASE input file (.inp) for each scaffold from STACKS
$phasefile="Scaf".$chr."._hapguess_switch.out"; #_hapguess_switch.out is output from fastPHASE
$Chromofile="Scaf".$chr.".phase";
if (-e $infile && $phasefile) { #Using -e flag on infile and phasefile names to skip over any scaffold numbers (between 1-284) where my data does not map to; ex: no data on scaffold 100
open (INFILE, '+<', "$infile");
open (PHASEFILE, '+<', "$phasefile"); open (CHROMOFILE, '>>', "$Chromofile");
print CHROMOFILE ($constant); #Prints a 0 on the first line of the Chromofile, then moves to next line
my @infile = ;
my @phasefile = ;
print CHROMOFILE @infile[0..3]; #Prints first four lines of fastPHASE input file (num samples, num loci, loci bp position on scaffold, and a row of "S" * num loci)
print CHROMOFILE @phasefile[0..187]; #I have 94 samples (diploid); therefore, 188 haplotype lines when -Z flag used in fastPHASE. This prints all lines of the file fastPHASE output to ChromoPainter input file.
close INFILE;
close PHASEFILE;
close CHROMOFILE;
}
}