BayesAss for RADseq Data

I want to use BayesAss on a large SNP dataset generated with RADseq.  But I found out when I went to convert the data into the .immaq format that my favorite converter, PGDspider, would only convert the first 40 loci.  I didn’t get 10,000s of loci for nothing, so that wasn’t going to work.  But a second problem was that BayesAss 3.0.3 only allows 240 SNPs anyways.

Obviously I’m not the only person with this problem.  And thanks to Steve Mussmann there’s a solution.  Steve re-wrote BayesAss 3.0.4 to be able to handle large SNP datasets, as well as a program to convert the STRUCTURE files from pyRAD into .immaq input files for BayesAss.

Since I have STRUCTURE files from STACKS and not pyRAD, I had to do a little conversion.  My messy code is below, but leave a comment if you are a wiz with an elegant solution.

Turning STRUCTURE Output from STACKS into STRUCTURE Output from pyRAD
First, output data from STACKS in STRUCTURE format (.str).  Remove the first two rows from this output (STACKS header and loci identifiers).  Then, print the first column (sample names), insert five empty columns to match pyRAD.  Do not print the second column from the STACKS STRUCTURE output because that is the population code from your Population Map input into STACKS.

awk '{print $1 "\t" "\t" "\t" "\t" "\t" "\t"}' data.str > test.out

Next, print out the remainder of your data starting at column 3 (i.e. first locus) in the original dataset (awk code from here).  Then use paste to concatenate the two files into the .str file you will convert.

awk '{for(i=3;i<NF;i++)printf"%s",$i OFS;if(NF)printf"%s",$NF;printf ORS}' data.str | tr ' ' '\t' > test2.out

paste test.out test2.out > data.str

Convert .str to .immac Using a Custom PERL Script
If you already had data from pyRAD and did not have to do the steps above, you can just convert your .str file to an .immac using Steve’s script: str2immaq.pl. However, if you have data from STACKS then you need to modify lines 39-52 in the original script to the following:

# convert structure format to immanc format, and push data into hash
for(my $i = 0; $i < @strlines; $i++){
	my @ima;
	my @temp = split(/\s+/, $strlines[$i]);
	my $name = shift(@temp);
	foreach my $allele(@temp){
		push( @ima, $allele);
	}

Since STACKS exports missing data as 0 (whereas pyRAD exports missing data as -9), this change removes converting missing data from -9 to 0. Steve also wrote this change.  Save the change in the perl script, then use the script to convert data from .str to .immac.  Now you’ve got an input file for BayesAss.

3 thoughts on “BayesAss for RADseq Data

  1. Lina Maria Valencia

    Hi Emily, Thanks for this post! It has been super useful. However when I run this new version of Bayesass I still get the error “Segmentation fault”, which if I am not wrong refers to the maximum number of SNPs. Did you encounter this error?
    Thanks for your help!

    Reply
    1. Sarah

      Hi Emily!
      Great post! I am having the same issue as Lina. Did you have any issues with getting an error (Segmentation fault) when running the new version?
      Thank you,
      Sarah

      Reply
      1. EEPuckett Post author

        Hi Sarah and Lina,

        My apologies for not writing sooner. I have not had a segmentation fault when I run it with 28,000+ loci. Since segmentation faults are usually related to memory issues, I ran a test with my full dataset but limited the number of iterations it ran. On my MacPro it used 7.15Gb of memory to run in the terminal. So a few suggestions: first, check how much memory your computer has for running programs. I believe BayesAss stores in memory all of the loci through all of the MCMCs, so it is quite the memory hog. Second, reboot your computer and when ready only run BayesAss so it is has all available resources to run. Third, if you can get the program installed on a university super computer so you have access to more memory. Fourth, Bruce Rannala recommended (on the BayesAss message board: https://sourceforge.net/p/bayesass/discussion/bayesass/) breaking loci down to chromosomes or contigs and individually running the program per chromosome. However, I do not remember how he recommended combining the results if you take this approach.

        Bests~ Emily

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s