2011-10-02

Howto: convert GenomeStudio's Final Report files to PLINK format (fixed)

This first Howto is focused on providing solutions to convert GenomeStudio files to PLINK format. I must confess I don't have much experience on Windows, so this is based on UNIX/Linux. I present some differences and alternatives for Windows users, but I report in advance that I haven't tested all of them and somethings might need adaptation.

GenomeStudio has several options of output - from matrix-like to third part files - but it doesn't support PLINK yet. Truth be told, any output format coming from GenomeStudio can be easily parsed and converted to anything you want using scripting.

As we have many options to export data from GenomeStudio, it is highly recommended to find out the most convenient format for you and stick consistent to it, avoiding future problems. My favorite way to store genotypes is by using the wizard report and exporting all SNPs with 1 individual per file. Useful columns are: SNP_name, Sample_ID, Allele1_AB, Allele2_AB and GC_Score - separated by tab.

The Sample_ID column is not a clever peace of information since it is the repetition of the individual ID over the SNP names. Nevertheless it is necessary to identify the file with its ID - especially because file names defined by GenomeStudio are generated by integer auto-incrementing after the chosen report name. For instance, if you save your report as "report", file names will be "report.txt", "report1.txt", "report2.txt" and so on. Also you are getting two extra files: "SNP_Map.txt" and "Sample_Map.txt". The SNP map will be used later on.

I wrote a simple Perl script to rename your files to avoid opening them to figure out which individual's genotypes are there > http://dl.dropbox.com/u/28917337/rename.pl. If you are working on Windows it is likely that you don't have Perl installed by default. Go to http://www.perl.org/get.html and follow the instructions on the website. Back to the script, you will need a text file containing the name of the files you want to rename (one per line). This is easily obtained under the UNIX shell by changing to the target directory and issuing:


% ls -1 report*.txt > filelist.txt

Once you have all the files to be renamed, the filelist and the script in the same folder, use the following command:

% perl rename.pl <list with filenames>

If you get the message "You don't have permission to rename it" on Linux you probably have permission restrictions. You can set all privileges to these files for all users by:

% chmod 777 *

or simply run the script as root

% sudo perl rename.pl <list with filenames>

Now that you have all files renamed, it is time to get a PLINK equivalent. PLINK works with several file formats, but basically they derive from .ped format. Description of the file formats is detailed at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Download the following Perl script > http://dl.dropbox.com/u/28917337/gstudio_to_plink.pl. Now use the command:

% perl gstudio_to_plink.pl <list with filenames> <population (or breed) name> <output name>

This will generate a file named outputname.ped. To run plink you need another file called outputname.map. This is obtained from the SNP_Map.txt file from GenomeStudio.

% cat SNP_Map.txt | awk 'NR > 1 {print $3,$2,$4}' > outputname.map

On Windows, you can do it this way:

PS C:\Scripts> Get-Content .\SNP_Map.txt | %{ $_.Split('\t')[3][2][4]; } > outputname.map

In this case, a header will be printed to the file as well. I don't have a solution for skipping it so you will need to open the file and delete the first row.

Now the files shoud work! Give it a try by calculating MAF for SNPs on chromosome 1:

% plink --file outputname --map3 --compound-genotypes --chr 1 --freq --out outputnamefreq

When it is done you will find a new file called outputnamefreq.frq. I will make a new post on PLINK format variants and known issues very soon. Until there try its documentation, should be fun!

2011-10-01

Must have: PLINK and GenABEL

Recently some toolsets for genome-wide association studies popped out and made everybody's life easier. In the spotlights of this category are PLINK - a software written in C/C++, and GenABEL - a R package. Both programs run on Linux/PC/Mac. This first 'Must have' post is dedicated to present you these powerful tools and help you installing them.

I won't really do a PLINK x GenABEL pros/cons comment because although they were designed with the same objective - to provide comprehensible functions and routines for GWAS - they have interesting features of their own that when properly combined can offer high quality analyses. I strongly recommend going through their documentations > http://pngu.mgh.harvard.edu/~purcell/plink and http://www.genabel.org/tutorials/ABEL-tutorial. It is interesting to make a novel of it, seeking information as you need them.

GenABEL installation
First open R. Installing GenABEL is easy:

>install.packages("GenABEL",dependencies=TRUE)

You will be prompted to select a CRAN mirror - just pick the nearest to your location. That's it.

PLINK installation
This one is a little bit trickier. Go to http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml and download the file that best represents your platform. After downloading, If you are running XP or W7 you should copy the executable file to C:\windows\. All you must to do now is opening a Command Prompt and have fun! PLINK is called by simply typing plink. You will see this:


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web...  OK, v1.07 is current

Writing this text to log file [ plink.log ]
Analysis started: Sat Oct  1 21:30:15 2011

Options in effect:

Before frequency and genotyping pruning, there are 0 SNPs
0 founders and 0 non-founders found
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 0 SNPs

ERROR: Stopping as there are no SNPs left for analysis

If you are not a fan of CLI you can try out gPLINK - a GUI for PLINK. But honestly you will have to deal with the CLI sooner or later.

If you work on a Linux/UNIX system you must pass the pathway to PLINK to your shell through the $PATH variable. Unfortunately the way to do that depends on the shell you are using. I will give support for two shells in this post: bash and zsh. If you are not sure which one is yours just open a terminal and type:

% echo $SHELL

You should get something like:

/bin/zsh
or
/bin/bash

Next step is to define where PLINK will be stored. I like to have all my softwares in a directory called programs on my Desktop - you can unzip the files pretty much wherever you find convenient. Let's say you have a folder like mine. Unzip the files there and open a terminal. Type:

% nano ~/.zshrc - for zsh
% nano ~/.bashrc - for bash

this will open the resources file for your shell with the text processor nano. This is where custom initializations and settings are definied. Type this anywhere:

PATH=$PATH:/home/yourusername/Desktop/programs/PLINK/plink-1.07-x86_64
export PATH

or if you downloaded the i686

PATH=$PATH:/home/yourusername/Desktop/programs/PLINK/plink-1.07-i686
export PATH

After this close nano with cntrl+X and save modifications. Type:

% source ~/.zshrc
or
% source ~/.bashrc

and then try:

% plink

If it works you will see the same standard out as shown before.

That's all for now folks. This post is not helpful in a practical way but before hammering we should make clear where the hammer is. Comments or suggestions please go ahead!

*Not native speaker - sorry any typos!