BayesAss for RADseq Data

I want to use BayesAss on a large SNP dataset generated with RADseq.  But when I went to convert the data into the .immanc format, I found that my favorite converter, PGDSpider, would only convert the first 40 loci.  I didn’t get 10,000s of loci for nothing, so that wasn’t going to work.  A second problem was that BayesAss 3.0.3 only allows 240 SNPs anyway.

Obviously I’m not the only person with this problem, and thanks to Steve Mussmann there’s a solution.  Steve rewrote BayesAss 3.0.4 to handle large SNP datasets, and he also wrote a program that converts the STRUCTURE files from pyRAD into .immanc input files for BayesAss.

Since I have STRUCTURE files from STACKS and not pyRAD, I had to do a little conversion.  My messy code is below, but leave a comment if you are a wiz with an elegant solution.

Turning STRUCTURE Output from STACKS into STRUCTURE Output from pyRAD
First, output data from STACKS in STRUCTURE format (.str).  Remove the first two rows from this output (the STACKS header and the locus identifiers).  Then print the first column (sample names) and insert five empty columns to match pyRAD.  Do not print the second column from the STACKS STRUCTURE output, because that is the population code from the Population Map you gave STACKS.

awk '{print $1 "\t" "\t" "\t" "\t" "\t" "\t"}' data.str > test.out

Next, print out the remainder of your data starting at column 3 (i.e. first locus) in the original dataset (awk code from here).  Then use paste to concatenate the two files into the .str file you will convert.

awk '{for(i=3;i<NF;i++)printf"%s",$i OFS;if(NF)printf"%s",$NF;printf ORS}' data.str | tr ' ' '\t' > test2.out

paste test.out test2.out > data.str
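
For anyone who would rather do the whole reshaping in one pass, here is a minimal Python sketch of the same steps. The function name `stacks_to_pyrad_str` and its parameters are made up for illustration. It assumes a whitespace-delimited STACKS file with exactly two header rows, and it takes the number of empty columns (five) from the description above. The conversion script shown further down splits each line on whitespace, so the exact number of empty columns should not matter downstream.

```python
def stacks_to_pyrad_str(lines, n_header=2, n_empty=5):
    """Reshape STACKS STRUCTURE lines so they resemble pyRAD's layout.

    lines    : list of text lines from the STACKS .str file
    n_header : rows to skip (STACKS header + locus identifiers); assumed to be 2
    n_empty  : empty columns to insert after the sample name; assumed to be 5
    """
    out = []
    for line in lines[n_header:]:
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        name = fields[0]        # column 1: sample name
        genotypes = fields[2:]  # column 2 (population code) is dropped
        out.append("\t".join([name] + [""] * n_empty + genotypes))
    return out


if __name__ == "__main__":
    # Hypothetical toy input: two header rows, then two rows per sample.
    toy = [
        "# Stacks header",
        "\t\t10_5\t11_7",
        "s1\t1\t1\t3",
        "s1\t1\t2\t3",
    ]
    for row in stacks_to_pyrad_str(toy):
        print(row)
```

Unlike the awk version, this writes nothing to disk, so the original data.str is not overwritten; save the returned lines under a new file name.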

Convert .str to .immanc Using a Custom Perl Script
If you already had data from pyRAD and did not have to do the steps above, you can just convert your .str file to an .immanc file using Steve’s script.  However, if you have data from STACKS, then you need to change lines 39-52 in the original script to the following:

# convert structure format to immanc format, and push data into hash
for(my $i = 0; $i < @strlines; $i++){
	my @ima;
	my @temp = split(/\s+/, $strlines[$i]);
	my $name = shift(@temp);
	foreach my $allele(@temp){
		push( @ima, $allele);

Since STACKS exports missing data as 0 (whereas pyRAD exports missing data as -9), this change removes the step that converts missing data from -9 to 0.  Steve also wrote this change.  Save the change in the Perl script, then use the script to convert data from .str to .immanc.  Now you’ve got an input file for BayesAss.
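
To make the missing-data point concrete, here is a rough Python sketch of what such a conversion does. It is not Steve’s script. It assumes two rows per sample in the .str file, a hypothetical `pop_of` dictionary mapping sample names to population labels, made-up locus names (`loc1`, `loc2`, ...), and an output layout of one line per sample and locus (individual, population, locus, allele 1, allele 2), which is how I understand the BayesAss input format; check the BayesAss manual before relying on it.

```python
def str_to_immanc(str_lines, pop_of, missing_in="0", missing_out="0"):
    """Pair the two rows of each sample and emit one row per locus.

    str_lines   : lines of a two-row-per-sample STRUCTURE file (no header rows)
    pop_of      : dict of sample name -> population label (hypothetical input)
    missing_in  : missing-data code in the .str file ("0" for STACKS, "-9" for pyRAD)
    missing_out : missing-data code to write (assumed to be "0")
    """
    rows = [line.split() for line in str_lines if line.strip()]
    out = []
    for first, second in zip(rows[0::2], rows[1::2]):
        name = first[0]
        if second[0] != name:
            raise ValueError(f"rows for {name} and {second[0]} are not paired")
        for locus, (a1, a2) in enumerate(zip(first[1:], second[1:]), start=1):
            # Recode missing alleles; a no-op when both codes are the same.
            a1 = missing_out if a1 == missing_in else a1
            a2 = missing_out if a2 == missing_in else a2
            out.append(f"{name} {pop_of[name]} loc{locus} {a1} {a2}")
    return out


if __name__ == "__main__":
    toy = ["s1\t\t\t\t\t\t1\t3", "s1\t\t\t\t\t\t2\t0"]
    for row in str_to_immanc(toy, {"s1": "popA"}):
        print(row)
```

With STACKS data both codes are 0, so the recoding does nothing, which is the point of the modification above; with pyRAD data you would pass `missing_in="-9"`.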

Not on the Postdoc Market- Round 2

In 2008 I graduated with my MS from Larry Smart’s Lab at SUNY-ESF.  Larry gives his students a personalized graduation gift, something that reflects the rapport he had with each student.  Mine included a hunter green sweatshirt, a hunter green picnic blanket, and a green water bottle because, as he said, “she needs more green stuff.”  So yes, SUNY-ESF’s school colors are green and gold, but I’m pretty sure he had Michigan State University in mind with my hunter green graduation gifts.  Larry went to MSU for his PhD, and I went to NC State, whose school colors are red and white, for undergrad.  During my 2nd year in his lab, NCSU and MSU played each other in the ACC-Big Ten Challenge, and Larry and I bet on our respective teams: loser bakes the winner dessert in school spirit colors.  I made cupcakes with bright green frosting.  But apparently all that hunter green apparel was just getting me ready for 2017…

I started a postdoctoral position in the Bradburd Lab at MSU.  I will work on spatial and temporal population genomics.  I’m really looking forward to learning new modeling skills.

Not on the Postdoc Market

Today is the first day of my postdoc with Jason Munshi-South at Fordham University! I’m super excited about starting this position, and not just because I get to go to work every day in a mansion (see below). The Munshi-South Lab focuses on adaptation to the urban environment. I think this is a really unique way to think about the forces that shape selection, particularly rapid adaptation on shorter time scales. In 2009 the United Nations estimated that half the world’s human population lived in urban areas and that the percentage will continue to rise. By studying species with both shorter generation times and closer contact with the urban environment (e.g. pollutants, artificial lighting, linearized environments), we hope to gain an understanding of how selection responds to urbanization. This may provide insight into how humans are also adapting to urbanization.

The project that I will work on focuses on brown rats (Rattus norvegicus). Brown rats are not native to North America and were introduced via ships coming from Europe in the 1700s-1800s. So before we can understand how selection has acted upon their genomes, we must first understand where their genomes came from, as we hypothesize substantial admixture in North American populations. (Hmm, phylogeography and admixture, that sounds like something I know about.)

Beyond the project, I’m excited to work with new lab mates, always a fun part of science!

Calder Hall at the Louis Calder Center, Fordham University

NSF Museum Collections Postdoc 2015 Summary

Last year I applied for NSF’s Postdoc utilizing museum collections, the first year for that particular competition. My colleagues have started asking for advice on applying, so I went to NSF’s awards website and saw the types of projects they funded. I summarize a few results below.

In 2015, there were 56 proposals submitted and 27 funded (48% funding rate). The breakdown by priority was:

  • High: 7
  • Medium: 10
  • Low: 22
  • Do Not Fund: 17

I think there is some good news in here for us young researchers who may not understand how funding works. Specifically, it was not only the High Priority proposals that were funded; with 27 awards and only 7 High Priority proposals, Medium and Low Priority proposals must also have been funded.
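
The arithmetic behind that statement can be checked in a few lines of Python. This is only a sketch using the counts listed above; the lower bounds assume that higher-priority proposals were funded first and that no Do Not Fund proposal was funded, which the awards website does not state.

```python
# Counts from the 2015 summary above.
priority = {"High": 7, "Medium": 10, "Low": 22, "Do Not Fund": 17}
funded = 27

submitted = sum(priority.values())
funding_rate = funded / submitted

# If every High Priority proposal was funded, the remaining awards
# must have gone to lower-priority proposals.
min_not_high = funded - priority["High"]

# If every High and Medium Priority proposal was funded, the remainder
# must have come from the Low Priority group (ASSUMPTION: none from Do Not Fund).
min_low = funded - priority["High"] - priority["Medium"]

print(f"{submitted} submitted, {funding_rate:.0%} funded")
print(f"at least {min_not_high} funded proposals were not High Priority")
print(f"at least {min_low} funded proposals were Low Priority")
```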

Of the funded proposals, 18 focused on vertebrates (birds: 8; mammals: 4), 4 on invertebrates, and 4 on plants; for 1, the focal species was very difficult to tell.

Since museum collections lend themselves to both genomic approaches and studies of morphometrics, I counted the number of proposals that mentioned each technique. Based on the abstracts, 54% will use genomics, and 15% will collect morphometric data. I would say these are minimum estimates, as some abstracts were unclear.

Finally, I counted the number of museums that each proposal would partner with. On average, funded proposals will work with 2.7 museums (range 1-6), although seven abstracts did not report the number of partner museums. This is critical in the writing stage, as you will need a letter of support from each museum partner.

Of course none of these stats suggest how to frame an interesting and relevant question that utilizes museum collections. I present them simply as a snapshot of what was funded the first year of this unique postdoc.

Formatting Microsatellite Data for PCA in EIGENSOFT

At this point, who hasn’t read Patterson et al. 2006 about population structure and eigenvector analysis?  It’s a great paper, as it introduced the EIGENSOFT package for analyzing genomic data using principal components analysis (PCA).  PCA is a great way to identify both population structure and admixture relationships.  For anyone who works with microsatellites, you no doubt noticed the paragraph that says you can use microsatellite data as the PCA input.  However, it is not all that clear how to convert your data into the input format.  This post provides an inelegant way of doing the format conversion using Excel tables.

Let’s start with our data in double column format (e.g. STRUCTURE format).
Let’s call this Tab1:


Note that Sample2 is missing data at Locus1.

You need to convert your data so that each allele of a microsatellite in your dataset has its own column.  In the toy example there are four alleles for Locus1, and three alleles for Locus2.  In Excel, I set up a new tab with the paired locus and allele information.  Also remember that it is important to keep the sample order the same as in the two column format.

The next step is to populate the new table with the number of alleles (0, 1, or 2) that each sample has for each locus_allele combination.  While this may be simple enough to do by hand for toy datasets, it is much easier (and less error prone) to use formulas in Excel to populate the new table.  I used an if/then statement.  Specifically, if the allele in the first column for the locus of interest is equal to the allele in the header row, then populate that cell with the value of 1; if the values in Tab 1 and Tab 2 are not equal, then populate the cell with a 0.

Write formulas for each locus_allele combination for the first sample, making sure to lock the right hand of the equation (allele value to compare the data to) using dollar signs before the row and column identifiers.


Once formulas have been written for the first sample, you can drag the equations down to populate the full table.

You should have noticed that when you do this, you are only accounting for the first allele.  Therefore, you can add (+) an additional if/then statement to account for the second allele in the second column, like this:


Yay!  There are now counts for all of the alleles.  However, we still have to account for the missing data.  To do this, we can still use the same idea as before by adding another if/then statement, but this time it will evaluate if the cell in the original data (Tab 1) is blank.  Blank cells can be coded in Excel as empty quotes (e.g. "").  Since missing data should be missing in both columns in the original double column data (Tab 1), we only need to evaluate the first column, and assign it a value of 9 if it fulfills the if/then statement.


Now we see that Locus1 of Sample2 has been coded as missing data (9) at all alleles.
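
For readers who would rather not build the formulas in Excel, the same logic can be sketched in Python. The function name `allele_counts` and the allele sizes in the example are hypothetical (they are not the toy data from the tables above); the rule is the one described in the text: add 1 for each of a sample's two alleles that matches the column's allele, and write 9 in every column of a locus when the first allele is blank.

```python
MISSING = 9  # missing-data code, as used in the text above


def allele_counts(genotype, alleles):
    """Count copies (0, 1, or 2) of each allele for one sample at one locus.

    genotype : pair (allele1, allele2) from the double column format
    alleles  : all alleles observed at this locus, in header order
    """
    a1, a2 = genotype
    if a1 in ("", None):
        # Missing in the first column is treated as missing for the whole locus.
        return [MISSING] * len(alleles)
    return [int(a1 == a) + int(a2 == a) for a in alleles]


if __name__ == "__main__":
    locus1_alleles = [100, 104, 108, 112]             # hypothetical allele sizes
    print(allele_counts((100, 104), locus1_alleles))  # heterozygote
    print(allele_counts((108, 108), locus1_alleles))  # homozygote
    print(allele_counts(("", ""), locus1_alleles))    # missing data
```

As in the Excel version, the order of samples and of locus_allele columns has to stay fixed, because the later files are matched by position.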

Finally, you need to make the three files (.eigenstratgeno, .snp, and .ind) that serve as input in EIGENSOFT.  To make the .eigenstratgeno file, copy your Excel table, then paste the values to remove the formulas.  (If you want to reference the formulas in the future, paste the values in a new tab.)  Now copy the table (with values not formulas) again and use the transpose function.  The toy data now looks like this:


Copy just the values (no sample or locus names) and paste into a text file, then remove the tabs between each column that Excel inserts.
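
The transpose-and-strip-tabs step can also be sketched in Python. This assumes the count table is held as a list of rows (one row per sample, one column per locus_allele); the function name `to_eigenstrat_geno` is made up for illustration. It returns one string per locus_allele with one character per sample and no delimiters, which is what the copy, transpose, and tab-removal steps above produce.

```python
def to_eigenstrat_geno(table):
    """Transpose a sample-by-allele count table into genotype lines.

    table : list of rows, one per sample; each row holds 0, 1, 2, or 9
            for every locus_allele column, in a fixed order
    Returns one string per locus_allele column, one character per sample.
    """
    n_cols = len(table[0])
    if any(len(row) != n_cols for row in table):
        raise ValueError("all samples need the same number of columns")
    return ["".join(str(value) for value in column) for column in zip(*table)]


if __name__ == "__main__":
    # Hypothetical counts for three samples and three locus_allele columns.
    counts = [
        [1, 1, 0],
        [9, 9, 9],
        [0, 2, 0],
    ]
    for line in to_eigenstrat_geno(counts):
        print(line)
```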


Finally, you may be asking why go through all of this trouble to use EIGENSOFT for PCA when you could also make one using the R package adegenet?  I like EIGENSOFT because you can analyze the output with Tracy-Widom statistics (within the EIGENSOFT package) to identify which principal components are significant versus less rigorous ways such as observing the plateau of eigenvalues.

Script for ChromoPainter

I’m new to Perl, so this may not be the most elegant script. The script converts fastPHASE output into a ChromoPainter input file. The for loop makes an input file for each scaffold to which my data map (in my case, 284 scaffolds of the polar bear genome).

use strict;
use warnings;
use diagnostics;

my $constant = "0 0\n";
foreach my $chr (1..284) { # 1 to 284 b/c my data maps to scaffolds that are named numerically and range from 1 to 284 (not every number has data)
    my $infile     = "batch_1.scaffold" . $chr . ".phase.inp"; # Exported fastPHASE input file (.inp) for each scaffold from STACKS
    my $phasefile  = "Scaf" . $chr . "._hapguess_switch.out";  # _hapguess_switch.out is output from fastPHASE
    my $Chromofile = "Scaf" . $chr . ".chromopainter.inp";     # ASSUMPTION: the output name was never set in the original; change it to suit
    # Use the -e test on both file names to skip any scaffold numbers (between 1-284) where my data does not map; ex: no data on scaffold 100
    next unless (-e $infile && -e $phasefile);
    open(my $in,     '<',  $infile)     or die "Cannot open $infile: $!";
    open(my $phase,  '<',  $phasefile)  or die "Cannot open $phasefile: $!";
    open(my $chromo, '>>', $Chromofile) or die "Cannot open $Chromofile: $!";
    print $chromo $constant; # Prints the constant on the first line of the Chromofile, then moves to next line
    my @infile    = <$in>;
    my @phasefile = <$phase>;
    print $chromo @infile[0..3];      # Prints first four lines of fastPHASE input file (num samples, num loci, loci bp position on scaffold, and a row of "S" * num loci)
    print $chromo @phasefile[0..187]; # I have 94 samples (diploid); therefore, 188 haplotype lines when -Z flag used in fastPHASE. This prints all lines of the fastPHASE output to the ChromoPainter input file.
    close $in;
    close $phase;
    close $chromo;
}

Dance Your PhD 2014 Submission

I wanted to enter the Dance Your PhD contest since before I was accepted into a PhD program!  Although never formally trained in dance, I did dance on my high school’s team and generally enjoyed the process.  So Dance Your PhD seemed totally up my alley combining art and science.

I had the vision for my dance mapped out in the second year of my PhD program.  While preparing an NSF DDIG and really thinking about both my hypotheses and the conservation implications of my work, the dance simply storyboarded itself in my mind.  Specifically, I would combine my dissertation chapters on American black bear (Ursus americanus) phylogeography with its implications for wildlife trade.  Black bears experience a low amount of trade for traditional Asian medicine, where the gallbladders and paws are thought to be medicinally useful.  By understanding genetic diversity throughout the range (as informed by the evolution of bear lineages), we can predict where an illegal sample came from, since the sample should have a genetic signature similar to its natal population. Even though I had the story mapped out, I didn’t want to dance my hypotheses; I wanted to dance my data. So I sat on my dance until this year, as I’m finishing up my dissertation.

Knowing the parts of my dissertation I wanted to convey led easily into the story of the interpretive dance. In the first scene, a bear (that’s me) is illegally hunted. In the second scene, I show the results of the phylogeography study. The appearance of the bears on screen signifies their speciation in North America. Over time the species expands its range and population size, all the while accumulating genetic variation, which you can see slowly emerge between the dancers in the east (stage right) and west (stage left). The bears are pushed south into glacial refugia. The group dance is meant to convey how, even though genetic differentiation has already accumulated, they are still a single species. As the third scene begins, we come back to our illegally hunted bear, who dances her genetic signature to try to find which broad region she belongs to. After that broad lineage is identified, she is finally able to find her natal area within the whole species range.

The title of the piece, “No Way Home,” is a reference to Dave Wilcove’s book of the same name. The book describes multiple ways in which human-induced landscape changes have affected animal migration routes. I use the title two ways. First, to recognize the impact of ongoing habitat fragmentation in limiting dispersal and gene flow in many species, not just bears. Second, to tie in the use of natal origin identification for wildlife trade application; although we can use genetics in this way, an illegally harvested animal no longer contributes its genes to the population. Therefore, through habitat fragmentation or illegal harvest, there is “No Way Home” for many animals.

I first heard “Wapusk” by Kathleen Edwards when she opened for Bon Iver in 2011. She told the audience the story of being an artist invited on an expedition to Wapusk National Park in Manitoba, Canada, where she saw polar bears (Ursus maritimus) for the first time. Immediately, I sat up and took note, because it was a song about bears! Two hours later, as I walked to the car (after an amazing set by Bon Iver), I started singing Wapusk, and that’s when I knew that it HAD to be the song for my Dance Your PhD submission.

A final note about my dance: I dedicated it to the staff (past, present, and future) of the UNCG All Arts and Sciences Camp, and specifically to the camp’s longtime Executive Director, Bob Prout. That was the summer camp I attended and then worked at when I was younger. Camp was my absolute favorite thing/place/week, and I looked forward to it all year long. The dedication is just my tiny way of saying thank you for all you do to create fun and formative experiences for us art and science nerds. I cannot fully express how very much camp means to me, still to this day.

Evolution2014 Talk- The talk that should have been

I organize a lot of things.  Like A LOT of things!  I chaired my department’s seminar organizing committee for three years.  I’ve organized the rangewide sampling effort for my dissertation project with 32 organizations. I founded and organize schedules for a blog.  The list goes on; I’m organized.

So imagine my surprise when I opened the Evolution2014 program and didn’t find my name. Apparently, the one thing I didn’t organize this year (and perhaps the most important since I’m looking for a postdoc) was signing up to give a talk. I have memories of signing up for a talk so can only imagine I got distracted and closed out without finishing; I truly thought I signed up.

Since life handed me lemons, I decided to make lemonade, where the “lemonade” was really just making the slides for the talk I would have given. Feel free to download as a pdf: PuckettEvol2014; then find me at the meeting or tweet to me, as I’m happy to discuss further!

I’m on the postdoc market

I created this website in large part because I am looking for postdoctoral research opportunities.  I anticipate graduating in May 2015 and starting a new position shortly thereafter.  My current research interests are below; additionally, you can find links to my publications here. I would be happy to discuss project ideas or university specific fellowship opportunities that may fund my time in your lab.  Please contact me: EEPuckett at mizzou dot edu, or via Twitter.

Research Interests
I am broadly interested in both the spatial and temporal distribution of genetic variation in functional and neutral loci. Regarding the spatial distribution, I am interested in geographic barriers leading to within-species lineage diversification and in range expansion processes, particularly the adaptive potential of range expansion for widespread species. Regarding the temporal distribution, I am interested in the timing of either de novo mutations or changes in standing allele frequency variation as an adaptive response to climatic change. I am also interested in incomplete lineage sorting.