This post is to divulge the PLINK Input Report Plug-in, which can be obtained from http://www.illumina.com/software/illumina_connect.ilmn. It is a straightforward way to get PLINK format files right from GenomeStudio.
Comparing to the last post (http://livestockbioinfo.blogspot.com/2011/10/howto-convert-genomestudio-files-to.html) this can be interpreted as a enhanced solution, although there are bright sides and down sides from each approach. The plug-in is undeniably more user-friendly, but scripting is nice because of its flexibility and possibility of improvements/adaptations.
Nevertheless, this powerful plug-in allows for exporting PLINK input files for a specific project/chip set or you can create a samplesheet with samples you'd like to add to a PLINK file and load it with GenomeStudio.
Have fun!
2011-11-11
2011-10-02
Howto: convert GenomeStudio's Final Report files to PLINK format (fixed)
This first Howto is focused on providing solutions to convert GenomeStudio files to PLINK format. I must confess I don't have much experience on Windows, so this is based on UNIX/Linux. I present some differences and alternatives for Windows users, but I report in advance that I haven't tested all of them and somethings might need adaptation.
GenomeStudio has several options of output - from matrix-like to third part files - but it doesn't support PLINK yet. Truth be told, any output format coming from GenomeStudio can be easily parsed and converted to anything you want using scripting.
As we have many options to export data from GenomeStudio, it is highly recommended to find out the most convenient format for you and stick consistent to it, avoiding future problems. My favorite way to store genotypes is by using the wizard report and exporting all SNPs with 1 individual per file. Useful columns are: SNP_name, Sample_ID, Allele1_AB, Allele2_AB and GC_Score - separated by tab.
The Sample_ID column is not a clever peace of information since it is the repetition of the individual ID over the SNP names. Nevertheless it is necessary to identify the file with its ID - especially because file names defined by GenomeStudio are generated by integer auto-incrementing after the chosen report name. For instance, if you save your report as "report", file names will be "report.txt", "report1.txt", "report2.txt" and so on. Also you are getting two extra files: "SNP_Map.txt" and "Sample_Map.txt". The SNP map will be used later on.
I wrote a simple Perl script to rename your files to avoid opening them to figure out which individual's genotypes are there > http://dl.dropbox.com/u/28917337/rename.pl. If you are working on Windows it is likely that you don't have Perl installed by default. Go to http://www.perl.org/get.html and follow the instructions on the website. Back to the script, you will need a text file containing the name of the files you want to rename (one per line). This is easily obtained under the UNIX shell by changing to the target directory and issuing:
% ls -1 report*.txt > filelist.txt
Once you have all the files to be renamed, the filelist and the script in the same folder, use the following command:
If you get the message "You don't have permission to rename it" on Linux you probably have permission restrictions. You can set all privileges to these files for all users by:
% chmod 777 *
or simply run the script as root
Now that you have all files renamed, it is time to get a PLINK equivalent. PLINK works with several file formats, but basically they derive from .ped format. Description of the file formats is detailed at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Download the following Perl script > http://dl.dropbox.com/u/28917337/gstudio_to_plink.pl. Now use the command:
This will generate a file named outputname.ped. To run plink you need another file called outputname.map. This is obtained from the SNP_Map.txt file from GenomeStudio.
On Windows, you can do it this way:
In this case, a header will be printed to the file as well. I don't have a solution for skipping it so you will need to open the file and delete the first row.
Now the files shoud work! Give it a try by calculating MAF for SNPs on chromosome 1:
When it is done you will find a new file called outputnamefreq.frq. I will make a new post on PLINK format variants and known issues very soon. Until there try its documentation, should be fun!
GenomeStudio has several options of output - from matrix-like to third part files - but it doesn't support PLINK yet. Truth be told, any output format coming from GenomeStudio can be easily parsed and converted to anything you want using scripting.
As we have many options to export data from GenomeStudio, it is highly recommended to find out the most convenient format for you and stick consistent to it, avoiding future problems. My favorite way to store genotypes is by using the wizard report and exporting all SNPs with 1 individual per file. Useful columns are: SNP_name, Sample_ID, Allele1_AB, Allele2_AB and GC_Score - separated by tab.
The Sample_ID column is not a clever peace of information since it is the repetition of the individual ID over the SNP names. Nevertheless it is necessary to identify the file with its ID - especially because file names defined by GenomeStudio are generated by integer auto-incrementing after the chosen report name. For instance, if you save your report as "report", file names will be "report.txt", "report1.txt", "report2.txt" and so on. Also you are getting two extra files: "SNP_Map.txt" and "Sample_Map.txt". The SNP map will be used later on.
I wrote a simple Perl script to rename your files to avoid opening them to figure out which individual's genotypes are there > http://dl.dropbox.com/u/28917337/rename.pl. If you are working on Windows it is likely that you don't have Perl installed by default. Go to http://www.perl.org/get.html and follow the instructions on the website. Back to the script, you will need a text file containing the name of the files you want to rename (one per line). This is easily obtained under the UNIX shell by changing to the target directory and issuing:
% ls -1 report*.txt > filelist.txt
Once you have all the files to be renamed, the filelist and the script in the same folder, use the following command:
% perl rename.pl <list with filenames>
If you get the message "You don't have permission to rename it" on Linux you probably have permission restrictions. You can set all privileges to these files for all users by:
% chmod 777 *
or simply run the script as root
% sudo perl rename.pl <list with filenames>
Now that you have all files renamed, it is time to get a PLINK equivalent. PLINK works with several file formats, but basically they derive from .ped format. Description of the file formats is detailed at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Download the following Perl script > http://dl.dropbox.com/u/28917337/gstudio_to_plink.pl. Now use the command:
% perl gstudio_to_plink.pl <list with filenames> <population (or breed) name> <output name>
This will generate a file named outputname.ped. To run plink you need another file called outputname.map. This is obtained from the SNP_Map.txt file from GenomeStudio.
% cat SNP_Map.txt | awk 'NR > 1 {print $3,$2,$4}' > outputname.map
On Windows, you can do it this way:
PS C:\Scripts> Get-Content .\SNP_Map.txt | %{ $_.Split('\t')[3][2][4]; } > outputname.map
In this case, a header will be printed to the file as well. I don't have a solution for skipping it so you will need to open the file and delete the first row.
Now the files shoud work! Give it a try by calculating MAF for SNPs on chromosome 1:
% plink --file outputname --map3 --compound-genotypes --chr 1 --freq --out outputnamefreq
When it is done you will find a new file called outputnamefreq.frq. I will make a new post on PLINK format variants and known issues very soon. Until there try its documentation, should be fun!
2011-10-01
Must have: PLINK and GenABEL
Recently some toolsets for genome-wide association studies popped out and made everybody's life easier. In the spotlights of this category are PLINK - a software written in C/C++, and GenABEL - a R package. Both programs run on Linux/PC/Mac. This first 'Must have' post is dedicated to present you these powerful tools and help you installing them.
I won't really do a PLINK x GenABEL pros/cons comment because although they were designed with the same objective - to provide comprehensible functions and routines for GWAS - they have interesting features of their own that when properly combined can offer high quality analyses. I strongly recommend going through their documentations > http://pngu.mgh.harvard.edu/~purcell/plink and http://www.genabel.org/tutorials/ABEL-tutorial. It is interesting to make a novel of it, seeking information as you need them.
GenABEL installation
First open R. Installing GenABEL is easy:
You will be prompted to select a CRAN mirror - just pick the nearest to your location. That's it.
PLINK installation
This one is a little bit trickier. Go to http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml and download the file that best represents your platform. After downloading, If you are running XP or W7 you should copy the executable file to C:\windows\. All you must to do now is opening a Command Prompt and have fun! PLINK is called by simply typing plink. You will see this:
If you are not a fan of CLI you can try out gPLINK - a GUI for PLINK. But honestly you will have to deal with the CLI sooner or later.
If you work on a Linux/UNIX system you must pass the pathway to PLINK to your shell through the $PATH variable. Unfortunately the way to do that depends on the shell you are using. I will give support for two shells in this post: bash and zsh. If you are not sure which one is yours just open a terminal and type:
You should get something like:
Next step is to define where PLINK will be stored. I like to have all my softwares in a directory called programs on my Desktop - you can unzip the files pretty much wherever you find convenient. Let's say you have a folder like mine. Unzip the files there and open a terminal. Type:
% nano ~/.zshrc - for zsh
% nano ~/.bashrc - for bash
this will open the resources file for your shell with the text processor nano. This is where custom initializations and settings are definied. Type this anywhere:
or if you downloaded the i686
After this close nano with cntrl+X and save modifications. Type:
and then try:
If it works you will see the same standard out as shown before.
That's all for now folks. This post is not helpful in a practical way but before hammering we should make clear where the hammer is. Comments or suggestions please go ahead!
*Not native speaker - sorry any typos!
I won't really do a PLINK x GenABEL pros/cons comment because although they were designed with the same objective - to provide comprehensible functions and routines for GWAS - they have interesting features of their own that when properly combined can offer high quality analyses. I strongly recommend going through their documentations > http://pngu.mgh.harvard.edu/~purcell/plink and http://www.genabel.org/tutorials/ABEL-tutorial. It is interesting to make a novel of it, seeking information as you need them.
GenABEL installation
First open R. Installing GenABEL is easy:
>install.packages("GenABEL",dependencies=TRUE)
You will be prompted to select a CRAN mirror - just pick the nearest to your location. That's it.
PLINK installation
This one is a little bit trickier. Go to http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml and download the file that best represents your platform. After downloading, If you are running XP or W7 you should copy the executable file to C:\windows\. All you must to do now is opening a Command Prompt and have fun! PLINK is called by simply typing plink. You will see this:
@----------------------------------------------------------@
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@
Web-based version check ( --noweb to skip )
Connecting to web... OK, v1.07 is current
Writing this text to log file [ plink.log ]
Analysis started: Sat Oct 1 21:30:15 2011
Options in effect:
Before frequency and genotyping pruning, there are 0 SNPs
0 founders and 0 non-founders found
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 0 SNPs
ERROR: Stopping as there are no SNPs left for analysis
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
@----------------------------------------------------------@
Web-based version check ( --noweb to skip )
Connecting to web... OK, v1.07 is current
Writing this text to log file [ plink.log ]
Analysis started: Sat Oct 1 21:30:15 2011
Options in effect:
Before frequency and genotyping pruning, there are 0 SNPs
0 founders and 0 non-founders found
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 0 SNPs
ERROR: Stopping as there are no SNPs left for analysis
If you are not a fan of CLI you can try out gPLINK - a GUI for PLINK. But honestly you will have to deal with the CLI sooner or later.
If you work on a Linux/UNIX system you must pass the pathway to PLINK to your shell through the $PATH variable. Unfortunately the way to do that depends on the shell you are using. I will give support for two shells in this post: bash and zsh. If you are not sure which one is yours just open a terminal and type:
% echo $SHELL
You should get something like:
/bin/zsh
or
/bin/bash
Next step is to define where PLINK will be stored. I like to have all my softwares in a directory called programs on my Desktop - you can unzip the files pretty much wherever you find convenient. Let's say you have a folder like mine. Unzip the files there and open a terminal. Type:
% nano ~/.zshrc - for zsh
% nano ~/.bashrc - for bash
this will open the resources file for your shell with the text processor nano. This is where custom initializations and settings are definied. Type this anywhere:
PATH=$PATH:/home/yourusername/Desktop/programs/PLINK/plink-1.07-x86_64
export PATH
export PATH
or if you downloaded the i686
PATH=$PATH:/home/yourusername/Desktop/programs/PLINK/plink-1.07-i686
export PATH
After this close nano with cntrl+X and save modifications. Type:
% source ~/.zshrc
or
% source ~/.bashrc
and then try:
% plink
If it works you will see the same standard out as shown before.
That's all for now folks. This post is not helpful in a practical way but before hammering we should make clear where the hammer is. Comments or suggestions please go ahead!
*Not native speaker - sorry any typos!
Subscribe to:
Posts (Atom)