Formatting Microsatellite Data for PCA in EIGENSOFT

At this point, who hasn’t read Patterson et al 2006 about population structure and eigenvector analysis?  It’s a great paper as it introduced the EIGENSOFT package for analyzing genomic data using principal components analysis (PCA).  PCA is a great way to identify both population structure and admixture relationships.  For anyone that works with microsatellites, you no doubt noticed the paragraph that says you can use microsatellite data as the PCA input.  However, the input format is not all that clear for how to convert your data.  This post provides an in-eloquent way of doing the format conversion using Excel tables.

Let’s start with our data in double column format (e.g.- STRUCTURE format).
Let’s call this Tab1:

MicroData

Note that Sample2 is missing data at Locus1.

You need to convert your data so that each allele of a microsatellite in your dataset has its own column.  In the toy example there are four alleles for Locus1, and three alleles for Locus2.  In Excel, I set up a new tab with the paired locus and allele information.  Also remember that it is important to keep the sample order the same as in the two column format.

The next step is to populate the new table with the number of alleles (0, 1, or 2) that each sample has for each locus_allele combination.  While this may be simple enough to do my hand for toy datasets, it is much easier (and less error prone) to use formulas in Excel to populate the new table.  I used an if/then statement.  Specifically, if the allele in the first column for the locus of interest was equal to the allele in the header row, then populate that cell with the value of 1; if the values in Tab 1 and Tab 2 are not equal, then populate the cell with a 0.

Write formulas for each locus_allele combination for the first sample, making sure to lock the right hand of the equation (allele value to compare the data to) using dollar signs before the row and column identifiers.

One_Allele

Once formulas have been written for the first sample, you can drag the equations down to populate the full table.

You should have noticed that when you do this, you are only accounting for the first allele.  Therefore, you can add (+) an additional if/then statement to account for the second allele in the second column, like this:

Two_Alleles

Yay!  There are now counts for all of the alleles.  However, we still have to account for the missing data.  To do this, we can still use the same idea as before by adding another if/then statement, but this time it will evaluate if the cell in the original data (Tab 1) is blank.  Blank cells can be coded in Excel as empty quotes (e.g. “”).  Since missing data should be missing in both columns in the original double column data (Tab 1), we only need to evaluate the first column, and assign it a value of 9 if it fulfills the if/then statement.

Two_Alleles_Missing

Now we see that Locus1 of Sample2 has been coded as missing data (9) at all alleles.

Finally, you need to make the three files (.eigenstratgeno, .snp, and .ind) that serve as input in EIGENSOFT.  To make the .eigenstratgeno file, copy your Excel table, then paste the values to remove the formulas.  (If you want to reference the formulas in the future, paste the values in a new tab.)  Now copy the table (with values not formulas) again and use the transpose function.  The toy data now looks like this:

EigenstratgenoFormat

Copy just the values (no sample or locus names) and paste into a text file, then remove the tabs between each column that Excel inserts.

FinalData

Finally, you may be asking why go through all of this trouble to use EIGENSOFT for PCA when you could also make one using the R package adegenet?  I like EIGENSOFT because you can analyze the output with Tracy-Widom statistics (within the EIGENSOFT package) to identify which principal components are significant versus less rigorous ways such as observing the plateau of eigenvalues.

Categories: Data Analysis

6 Comments on “Formatting Microsatellite Data for PCA in EIGENSOFT

  1. Hey Emily, nice post!

    I just got a reviewer who told me to add a PCA or an FCA to strengthen my STRUCTURE analysis for publication. I have about 300 individuals genotyped at 29 microsats.

    I never used EIGENSOFT to be honest, but I am quite handy with R although I haven’t tried running PCAs with ADEGENET. I guess I have 2 questions.

    Do you know if EIGENSOFT can run FCAs (Factorial Correspondence Analyses)?
    You would still recommend EIGENSOFT over ADEGENET package?

    Thanks

    1. Thanks Giuliano!
      EIGENSOFT does not do FCA. I like EIGENSOFT because I know how to use it and have access to a Linux cluster to run it on. You may want to look into the R package LEA (http://membres-timc.imag.fr/Olivier.Francois/LEA/index.htm) which also does PCA. My preference for EIGENSOFT is not solely because I’ve used it before; I really like the Tracy-Widom statistics for understanding how many PCs are significant. LEA implements this as well. LEA also gives instructions for converting microsats into the format they need, and I must imagine it is easier than my ganky Excel workaround.

  2. Hi Emily,

    Thanks for the info on formatting the data!
    Currently I am working with 27microsatellites from human populations (956 individuals)
    I have a question regarding the .snp file. I am not sure how to build it with this newly generated table. Could you please guide me a bit on that? I have seen .ind files for SNP data and it has a different format. Any help would be much appreciated.

    Thanks again,
    Best,

    Evelyn

    1. Hi Evelyn,

      The .snp file has six columns with tabs between each column and an end line (\n) at the end. For SNP data the columns are:
      1- locus name
      2- chromosome
      3- a distance measure (I think- would have to check the manual)
      4- base pair position
      5- ancestral or reference or major allele
      6- derived or minor allele

      So for microsat data I make my own with dummy variables in columns 2, 3, 5, and 6. But I do make sure to vary the locus name (combining my microsat locus + allele), and for column 4 I just make everything different.
      Ex:
      G10L115 1 0 1 A C
      G10L121 1 0 2 A C
      .
      .
      .
      G10L141 1 0 12 A C

      Bests,
      Emily

      1. Hi Emily,
        Great, thank you for your help! I really appreciate it.

        Best,

        Evelyn

Comments are closed.