Tag Archives: Bioinformatics

Making conStruct Input Files

As part of my postdoc with Gideon Bradburd, I’m using his new software package conStruct (bioRxiv; GitHub) to analyze dozens of genomic datasets.  conStruct requires three input files: 1) genetic data, 2) coordinate data (longitude in the first column, latitude in the second), and 3) a pairwise distance matrix with the same number of sites as in the coordinate data.  Files 2 and 3 are straight forward; but it took me a little time to be able to go from a regular STRUCTURE file to a conStruct file.  So below is the R code I’m using to do this conversion.

Now if your data is not already in STRUCTURE two row format (i.e. two rows per sample), then you’ll need to get there as a starting place. I used PGDSpider to make the STRUCTURE files WITH a header row and a column to denote sampling site.
(PLINK, bless its heart, makes 1 row 1 column STRUCTURE files, and I’m not coding out of that.) Remember you want to denote sampling sites for conStruct and not putative populations. I then replaced the two blank headers for columns one and two with “SampleID” and “PopID.”

conStruct can take data as counts or frequencies. The code below makes a table of frequencies for one allele (doesn’t matter major or minor, derived or ancestral) for each sampling site for each locus.

I have written this as a loop to process multiple input files at once. You can remove the for loop and start at “str <- read.table()” if you only have one file.

setwd("Enter the path to your working directory")
files <- list.files(pattern = "*.str",full.names=T)
newnames <- paste(sep="",sub('.str', '',files),"-Processed.str")

#Loop over all files and make the processed files needed for conStruct
for(i in 1:length(files)){

#Read data file and convert missing data to NA
str <- read.table(files[i],header=T)
str[str == "-9"] <- NA                          
str <- str[ order(str$PopID,str$SampleID),]

#Count number of samples
SampleID <- as.character(unique(str$SampleID))

#Looping over all loci, create a frequency table of alleles (0,1,2)
#Jacob Burkhart wrote this loop
count <- data.frame(SampleID)
for(loci in 3:dim(str)[2]){   
  temp <- table(str$SampleID, str[,loci])           
  colnames(temp) <- paste0(colnames(str)[loci], "-", colnames(temp)) 
  temp <- data.frame(unclass(temp)) 
  
#If there are no alleles, recode the row as -9
  for(j in 1:dim(temp)[1]){
    if(sum(temp[j,]) == 0) {temp[j,] <- NA} 
  }
#Test if a monomorphic locus slipped through your data processing
#If so, column bind data to sample ID and any previous datasets
#If not (as expected), then the column bind will be applied to the 2nd allele
#Why the 2nd allele?  Because any loci with missing data will result in data being added to the table
  count <- as.matrix(cbind(count,if(length(temp)==1){temp[,1]} else{temp[,2]}))
}

#Create a vector of the sampling site information for each sample
pop.vec <- as.vector(str[,2])
pop.vec <- pop.vec[c(seq(from=1, to=nrow(str), by=2))]

#Make variables to utilize below
n.pops <- length(unique(pop.vec))
table.pops <- data.frame(table(pop.vec))

#Make a file of individual sample allele frequencies
#If you only have one sample per sampling site, then you could stop here
freq <- matrix(as.numeric(count[,-1])/2,nrow(count),ncol(count)-1)
f <- matrix(as.numeric(freq),nrow(freq),ncol(freq))

#Empty matrix for sampling site level calculations
admix.props <- matrix(NA, n.pops,ncol(f))

#Calculate frequency (of 2nd allele) per sampling site
#The last line tests if there is a sampling site with n=1
#If so, prints vector because frequency has already been calculated (0, 0.5, or 1)
#If not, then calculates mean across samples from that site
for(m in 1:length(table.pops$pop.vec)){
  t<-as.factor(unique(pop.vec))[m]
  admix.props[m,] <- if(table.pops[table.pops$pop.vec == t,2] == 1){f[which(pop.vec==t),]} else{colMeans(f[which(pop.vec==t),],na.rm=T)}
  }

#Export conStruct file and save in working directory
write.table(admix.props, newnames[i],quote=F,sep="\t",row.names=F,col.names=F)
}

As I noted in the code, my friend Jake Burkhart wrote the internal for loop that makes the frequency table. He originally wrote the loop to make pseudo-SNP datasets out of microsatellite data. Which means, if you want to run conStruct on a microsatellite dataset, you can print all of the loci (instead of just one of the biallelic SNPs), then keep processing the frequencies at each sampling site.  Note, conStruct will throw an error if there are fewer loci than samples, which shows up more readily when using pseudo-SNP data from (even highly polymorphic) microsatellites.

Script for ChromoPainter

I’m new to Perl so this may not be the most elegant script. The script converts fastPHASE output into a ChromoPainter input file. The for loop makes an input file for each scaffold for which my data maps to (in my case 284 scaffolds of the polar bear genome).

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my($infile, $phasefile, $Chromofile, $constant, $i);
foreach my $chr (1..284){ #1 to 284 b/c my data maps to scaffolds that are named numerically and range from 1 to 284, non exclusive)
$constant = "0 0\n";
$infile="batch_1.scaffold".$chr.".phase.inp"; #Exported fastPHASE input file (.inp) for each scaffold from STACKS
$phasefile="Scaf".$chr."._hapguess_switch.out"; #_hapguess_switch.out is output from fastPHASE
$Chromofile="Scaf".$chr.".phase";
if (-e $infile && $phasefile) { #Using -e flag on infile and phasefile names to skip over any scaffold numbers (between 1-284) where my data does not map to; ex: no data on scaffold 100
open (INFILE, '+<', "$infile");
open (PHASEFILE, '+<', "$phasefile"); open (CHROMOFILE, '>>', "$Chromofile");
print CHROMOFILE ($constant); #Prints a 0 on the first line of the Chromofile, then moves to next line
my @infile = ;
my @phasefile = ;
print CHROMOFILE @infile[0..3]; #Prints first four lines of fastPHASE input file (num samples, num loci, loci bp position on scaffold, and a row of "S" * num loci)
print CHROMOFILE @phasefile[0..187]; #I have 94 samples (diploid); therefore, 188 haplotype lines when -Z flag used in fastPHASE. This prints all lines of the file fastPHASE output to ChromoPainter input file.
close INFILE;
close PHASEFILE;
close CHROMOFILE;
}
}