2011-10-02

Howto: convert GenomeStudio's Final Report files to PLINK format (fixed)

This first Howto focuses on converting GenomeStudio files to PLINK format. I must confess I don't have much experience with Windows, so it is based on UNIX/Linux. I present some differences and alternatives for Windows users, but I warn in advance that I haven't tested all of them and some things might need adaptation.

GenomeStudio has several output options - from matrix-like reports to third-party formats - but it doesn't support PLINK yet. Truth be told, any output format coming from GenomeStudio can be easily parsed and converted to anything you want using scripting.

As there are many ways to export data from GenomeStudio, it is highly recommended to find the most convenient format for you and stick to it consistently, avoiding future problems. My favorite way to store genotypes is by using the wizard report and exporting all SNPs with 1 individual per file. Useful columns are: SNP_name, Sample_ID, Allele1_AB, Allele2_AB and GC_Score - separated by tab.

The Sample_ID column is not a clever piece of information, since it is just the same individual ID repeated on every SNP row. Nevertheless, it is needed to tell which individual a file belongs to - especially because file names defined by GenomeStudio are generated by integer auto-incrementing after the chosen report name. For instance, if you save your report as "report", file names will be "report.txt", "report1.txt", "report2.txt" and so on. You also get two extra files: "SNP_Map.txt" and "Sample_Map.txt". The SNP map will be used later on.

I wrote a simple Perl script to rename your files, so you don't have to open them to figure out which individual's genotypes are in there > http://dl.dropbox.com/u/28917337/rename.pl. If you are working on Windows it is likely that you don't have Perl installed by default. Go to http://www.perl.org/get.html and follow the instructions on the website. Back to the script: you will need a text file containing the names of the files you want to rename (one per line). This is easily obtained under the UNIX shell by changing to the target directory and issuing:


% ls -1 report*.txt > filelist.txt

Once you have all the files to be renamed, the filelist and the script in the same folder, use the following command:

% perl rename.pl <list with filenames>

If you get the message "You don't have permission to rename it" on Linux, you probably have permission restrictions. You can grant all privileges on these files to all users with:

% chmod 777 *

or simply run the script as root:

% sudo perl rename.pl <list with filenames>
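
For readers who would rather see the idea than download the script, here is a minimal shell sketch of the renaming step. It is not rename.pl itself, whose internals are not shown in this post. It assumes tab-separated reports with one individual per file and a column header named "Sample ID" (or "Sample_ID"), and it assumes that naming each file <Sample ID>.txt is what you want. The function names sample_id_of and rename_reports are made up for the illustration.

```sh
#!/bin/sh
# Sketch only: rename each report file after the Sample ID found inside it.
# Assumptions: tab-separated file, a column header row containing
# "Sample ID" or "Sample_ID", one individual per file.

# Print the sample ID stored in one report file.
sample_id_of() {
    awk -F'\t' '
        { sub(/\r$/, "") }                  # tolerate Windows line endings
        col == 0 {                          # still looking for the column header row
            for (i = 1; i <= NF; i++)
                if ($i == "Sample ID" || $i == "Sample_ID") col = i
            next
        }
        { print $col; exit }                # the first data row is enough
    ' "$1"
}

# Rename every file named in the list ($1, one file name per line).
rename_reports() {
    while IFS= read -r f; do
        id=$(sample_id_of "$f")
        if [ -z "$id" ]; then
            echo "skipping $f: no Sample ID found" >&2
            continue
        fi
        target="$(dirname -- "$f")/$id.txt"
        if [ -e "$target" ]; then           # never overwrite an existing file
            echo "skipping $f: $target already exists" >&2
            continue
        fi
        mv -- "$f" "$target"
    done < "$1"
}
```

With the functions loaded in your shell, calling rename_reports filelist.txt would rename every file in the list and report the ones it had to skip.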

Now that you have all files renamed, it is time to get a PLINK equivalent. PLINK works with several file formats, but basically they all derive from the .ped format. The file formats are described in detail at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Download the following Perl script > http://dl.dropbox.com/u/28917337/gstudio_to_plink.pl. Now use the command:

% perl gstudio_to_plink.pl <list with filenames> <population (or breed) name> <output name>
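
To give an idea of what such a conversion involves, here is a rough shell sketch that turns single-sample reports into PED lines. It is not the Perl script above and may differ from it in details. It assumes tab-separated reports with columns named "Sample ID", "Allele1 - AB" and "Allele2 - AB" (or the underscore variants), that "-" marks a no-call, that the population name is used as the family ID, and that parents, sex and phenotype are unknown. Genotypes are written as two-character compound genotypes, with "00" for missing, which to my understanding is what PLINK expects with --compound-genotypes. The function names are hypothetical.

```sh
#!/bin/sh
# Sketch only: one single-sample report -> one PED line (compound genotypes).
# PED columns: family ID, individual ID, father, mother, sex, phenotype, genotypes.

report_to_ped_line() {
    # $1 = report file, $2 = population (used as family ID)
    awk -F'\t' -v fid="$2" '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        !have_header {                           # look for the column header row
            s = a1 = a2 = 0
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/[ _-]+/, "_", name)        # "Allele1 - AB" -> "Allele1_AB"
                if (name == "Sample_ID")  s  = i
                if (name == "Allele1_AB") a1 = i
                if (name == "Allele2_AB") a2 = i
            }
            if (s && a1 && a2) have_header = 1
            next
        }
        {
            if (iid == "") iid = $s              # ID taken from the first data row
            g = $a1 $a2                          # e.g. "A" and "B" -> "AB"
            if (g ~ /-/) g = "00"                # no-call -> missing genotype
            geno = geno " " g
        }
        END { if (iid != "") print fid, iid, 0, 0, 0, "-9" geno }
    ' "$1"
}

reports_to_ped() {
    # $1 = file list, $2 = population name, $3 = output name (writes $3.ped)
    while IFS= read -r f; do
        report_to_ped_line "$f" "$2"
    done < "$1" > "$3.ped"
}
```

Genotypes come out in the order the SNPs appear in each report, so every report must list the SNPs in the same order as the map file. For real data, stick with the gstudio_to_plink.pl command above.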

This will generate a file named outputname.ped. To run PLINK you need another file called outputname.map. This is obtained from the SNP_Map.txt file from GenomeStudio:

% cat SNP_Map.txt | awk 'NR > 1 {print $3,$2,$4}' > outputname.map

On Windows, you can do it this way:

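PS C:\Scripts> Get-Content .\SNP_Map.txt | Select-Object -Skip 1 | %{ $f = $_ -split "`t"; "$($f[2]) $($f[1]) $($f[3])" } | Set-Content -Encoding ascii outputname.map

Here Select-Object -Skip 1 drops the header row, and the indices are zero-based, so columns 3, 2 and 4 of the awk command become 2, 1 and 3. Set-Content -Encoding ascii avoids the UTF-16 output that a plain > redirect produces in Windows PowerShell, which PLINK may fail to read. Like the other Windows alternatives, treat this one as untested.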
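
Before handing the two files to PLINK, a quick consistency check can save you a confusing error message: every line of the .ped file should carry exactly one genotype per line of the .map file. The sketch below assumes the layout used in this Howto (6 leading columns in the .ped file, followed by one compound genotype per SNP); check_ped_map is a made-up helper name.

```sh
#!/bin/sh
# Sketch only: verify that each PED line has as many genotypes as the map has SNPs.
# Assumes 6 leading PED columns and one compound genotype (e.g. AB) per SNP.

check_ped_map() {
    # $1 = output name shared by $1.ped and $1.map
    n_map=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$1.map")
    awk -v n="$n_map" '
        NF - 6 != n {
            printf "line %d: %d genotypes, expected %d\n", NR, NF - 6, n
            bad = 1
        }
        END { exit bad + 0 }                # non-zero exit status on any mismatch
    ' "$1.ped"
}
```

Calling check_ped_map outputname prints nothing and returns success when the files agree, and lists the offending .ped lines otherwise.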

Now the files should work! Give it a try by calculating MAF for SNPs on chromosome 1:

% plink --file outputname --map3 --compound-genotypes --chr 1 --freq --out outputnamefreq

When it is done you will find a new file called outputnamefreq.frq. I will make a new post on PLINK format variants and known issues very soon. Until then, try its documentation; it should be fun!
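
As a first look at the result, the .frq file is a whitespace-separated table with the columns CHR, SNP, A1, A2, MAF and NCHROBS, so it is easy to query with awk. The sketch below lists SNPs whose MAF falls below a threshold of your choice; low_maf_snps is a hypothetical helper name, and the 0.05 used in the example call is only an illustration, not a recommendation.

```sh
#!/bin/sh
# Sketch only: list SNPs with a minor allele frequency below a threshold.
# Assumes the column order CHR SNP A1 A2 MAF NCHROBS and skips rows where MAF is NA.

low_maf_snps() {
    # $1 = .frq file, $2 = MAF threshold
    awk -v t="$2" 'NR > 1 && $5 != "NA" && $5 + 0 < t + 0 { print $2, $5 }' "$1"
}
```

For example, low_maf_snps outputnamefreq.frq 0.05 prints the name and MAF of every SNP on chromosome 1 with MAF below 0.05.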

2 comments:

  1. Thank you very much, this is really helpful. Actually, I have a GenomeStudio CNV.txt file from the Illumina platform. Is there any way to convert it to a CNV list file for use in PLINK's CNV analysis? I would appreciate your reply. My name is Sal, email (swrdyani@gmail.com). Thanks in advance for the help and have a great day.
