Prerequisites
Before using the GBSpipeline, you will need to install the following programs. I recommend creating a directory specifically for running the pipeline and placing the programs inside it, as this makes it easy to point the pipeline at their locations. If, like me, you prefer to keep a separate directory for programs (mine is called "Tools"), you can download the source or binary files there, compile as necessary, and provide the pipeline with the pathname to each program.
1. Trimmomatic
Download link: http://www.usadellab.org/cms/?page=trimmomatic
I recommend downloading the binary release, as there is no compiling from source: simply place the "trimmomatic-[version].jar" file into the directory of your choice. It is also worth reading up on how Trimmomatic trims paired-end reads, especially the options it accepts, as you can use them to tweak the pipeline's default trimming options if necessary.
2. bowtie2
Download link: http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.3/
Ensure you choose the right binary for your operating system (either Linux or Mac OS X) and, again, simply place the folder into your desired directory. If you would rather compile from source, the instructions are in the bowtie2 manual at http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml. This is also where you'll find the various bowtie2 options, which can be passed into the pipeline.
3. SAMtools and BCFtools
Download link: http://www.htslib.org/download/
Note: SAMtools and BCFtools were previously packaged together but must now be downloaded and installed separately. Also, any versions prior to the 1.0 releases no longer work with the pipeline (i.e. you will need SAMtools 1.0+ and BCFtools 1.0+).
These packages need to be compiled, so open a terminal window, enter the SAMtools directory wherever you placed it, and run the following commands:
cd samtools-1.x/
make
make prefix=. install
and similarly for BCFtools in the directory you unpacked it:
cd bcftools-1.x/
make
make prefix=. install
You should see some output telling you that it compiled successfully (hopefully!). If you want to learn how to use SAMtools directly (not necessary for this pipeline, but it is a great package for handling copious amounts of sequence data), I highly recommend this tutorial: http://biobits.org/samtools_primer.html
Download
Now you are ready to download the GBS pipeline. To ensure you have the latest version, use the following git command to clone it directly from GitHub:
git clone https://github.com/carolyncaron/GBSpipeline.git
Alternatively, if you don't have git installed, you can download the zip file from this site and place it into the directory of your choice, as with the previous dependencies. Once you have the GBS pipeline in its own directory, copy the following files into the same directory. Again, if you choose not to copy all of these files (to save disk space), just remember to pass the pathname of each as a parameter when using the pipeline. The following files are required:
- An index file - A plain text file in one of two formats: either 1) a single-column list of the indices used for this GBS run, or 2) two tab-delimited columns, where the first contains the sample name and the second its associated index. Either format provides enough information for the pipeline to demultiplex the reads; however, the second format is recommended, as the output files will then be named using the sample names rather than the indices, which is far more useful.
- Two FASTQ files containing paired-end read data - these need to be Illumina version 1.8 or higher. Each read in the first file is paired with a read in the second file. Only the first set of reads contains the indices, so it is important to list this file before the file containing the second set of reads when passing parameters to the pipeline.
- A trim file - A FASTA file of adaptor sequences for Trimmomatic to trim from the reads. This is likely provided by your sequencing provider. Your download of Trimmomatic also includes several adaptor files that you can use as your trim file (see the Trimmomatic manual).
- A reference genome in FASTA format
- Reference genome index files (optional) - These should have the same name as the reference genome, with suffixes such as .1.bt2, .2.bt2, and so on. The pipeline looks for them in the same directory as the reference genome; if they are not present, it will generate them (the equivalent of running bowtie2-build on the reference).
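Likewise, a trim file is just an ordinary FASTA file of adaptor sequences. The record names and sequences below are only illustrative placeholders; use the actual adaptors from your sequencing provider or from the files bundled with Trimmomatic:

```
>ExampleAdapter/1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>ExampleAdapter/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
```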
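For illustration, here is a sketch of the recommended two-column index format described above. The sample names and barcode sequences are invented examples (your own file will come from your sequencing design); the file name indices.list matches the example used later in this guide:

```shell
# Create a hypothetical two-column index file: sample name, then its index,
# separated by a single tab (names and barcodes below are made up).
printf 'MtSample01\tACGTA\nMtSample02\tTGCAT\nMtSample03\tGATCC\n' > indices.list

# Quick sanity check: every line should have exactly two tab-delimited fields.
awk -F'\t' 'NF != 2 { bad=1 } END { print (bad ? "malformed" : "ok") }' indices.list   # -> ok
```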
You are now ready to begin analyzing your GBS data!
Approach
This pipeline is broken up into 4 steps, which allows the user to pause and look over the output before proceeding to the next step. As each step is very time-consuming, this also lowers the risk of losing most or all of the progress made up to that point. Each step prints its progress to the console and creates output files, as well as a summary file, within the current directory. To call each step and provide it with the necessary parameters and files, execute the main script in the form:
./GBS_pipeline.pl step [arguments] [options]
The following examples demonstrate how to run each step in sequential order. For full documentation on options and output files, type the following on the command line:
perldoc GBS_pipeline.pl
The first step is to demultiplex the reads into separate files for each sample. For example, suppose we are doing a run with Medicago samples:
./GBS_pipeline.pl demultiplex Medicago indices.list TGCA Mt4.0_read1.fastq Mt4.0_read2.fastq Medicago_GBS
When completed, you should see printed to the console something like:
Calling demultiplex …
[===========================] 100 %
Handling reads without a barcode...
Processed reads located in:
Medicago_GBS/demultiplex/
Summary (open in Excel or use command more): summary_files/Medicago_demultiplex_summary.txt
Completed demultiplex.
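Before moving on, it can be worth confirming that a pair of output files (one per read direction) exists for each sample. A minimal sketch, using the output directory from the example above; the file names and the mkdir/touch lines are only a mock-up so the check is self-contained (on a real run, the pipeline creates these files and you would run only the final command):

```shell
# Mock up the output directory from the example run (skip these two lines
# on a real run -- the pipeline itself creates the demultiplexed files).
mkdir -p Medicago_GBS/demultiplex
touch Medicago_GBS/demultiplex/MtSample01_R1.fastq Medicago_GBS/demultiplex/MtSample01_R2.fastq

# Count the demultiplexed FASTQ files; expect two per sample.
ls Medicago_GBS/demultiplex | wc -l
```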
Next, we trim the reads for quality and adaptor sequences. Continuing with our example:
./GBS_pipeline.pl trim_reads Trimmomatic-0.32/trimmomatic-0.32.jar trimlist.fa
And the following is printed to the console:
Calling trim_reads …
[==========================] 100 %
Processed reads located in:
Medicago_GBS/trim/
Summary (open in Excel or use command more): summary_files/Medicago_trim_summary.txt
Completed trim_reads.
The third step aligns the reads to the reference genome, and will most likely take the longest to complete:
./GBS_pipeline.pl align_reads ../../Tools/bowtie2-2.2.1 ../Mt4.0_reference.fna
With the following output:
Calling align_reads …
Reference genome index files present, proceeding to make alignments.
Running bowtie2 with the following parameters:
--end-to-end --no-mixed --no-discordant --no-sq --no-hd -k 3 -X 11000 -R 5 -p 1
[==========================] 100 %
Processed reads located in:
Medicago_GBS/align/
Summary (open in Excel or use command more): summary_files/Medicago_align_summary.txt
Completed align_reads.
The final step is making variant calls from the alignments:
./GBS_pipeline.pl SNP_calling ../../Tools/samtools-1.x/
With output such as:
Calling SNP_calling …
[==========================] 100 %
Processed reads located in:
Medicago_GBS/variants/
Completed SNP_calling.
That's it! Take a look at your directory using the command ls. You should see your output files from each step (indexed reads, trimmed reads, alignments, and VCF files) neatly organized into the directories demultiplex/, trim/, align/, and variants/, respectively. Additionally, you should see all of your summary files, which can be opened in a plain text editor or in Excel. More information on the summary files can be found at the end of the POD page.
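As a final sanity check on any of the FASTQ outputs, remember that a FASTQ record is always exactly four lines, so dividing a file's line count by four gives its read count. A self-contained sketch with a two-read mock file (substitute one of your own demultiplexed or trimmed files for mock.fastq):

```shell
# Build a tiny mock FASTQ with two records (header, sequence, plus line, quality).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > mock.fastq

# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
echo $(( $(wc -l < mock.fastq) / 4 ))   # prints 2
```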