View on GitHub

GBS Pipeline

A complete set of commands to demultiplex, trim, align and call raw variants on paired-end reads for the purpose of analyzing reads generated by genotyping by sequencing (GBS).

Download this project as a .zip file Download this project as a tar.gz file

Prerequisites

Before using the GBSpipeline, you will need to install the following programs. I recommend creating a directory specifically for running the pipeline and placing the programs in there as this makes it very easy to provide the pipeline with the location of your programs. If you prefer, like me, to have a separate directory for programs (mine is called "Tools"), you can download the source/binary files there and compile as necessary, and provide the pathnames to each program.

1. trimmomatic

Download link: http://www.usadellab.org/cms/?page=trimmomatic

I recommend downloading the binary as there is no compiling from source code; just simply place the "trimmomatic-[version].jar file into the directory of your choice. It is also worth reading up on trimmomatic on how it trims paired-end reads, especially the options you can provide it, as you can use it to tweak the default options in the pipeline if necessary.

2. bowtie2

Download link: http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.3/

Ensure you choose the right binary for your operating system (either Linux or MacOS) and again, simply place the folder into your desired directory. If you would rather compile from source, the instructions are in the bowtie2 manual at: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml. This is also where you'll find the various options when running bowtie2, which can be passed into the pipeline.

3. SAMtools and BCFtools

Download link: http://www.htslib.org/download/

Note: SAMtools and bcftools were previously packaged together but now require you download and install them separately. As well, any versions prior to the 1.0 releases no longer work with the pipeline (ie. you will need SAMtools 1.0+ and BCFtools 1.0+).

These packages need to be compiled; so open up a terminal window, enter the SAMtools directory in the location you've placed it and type 'make' from the command line:

cd samtools-1.x/
make
make prefix=. install

and similarly for BCFtools in the directory you unpacked it:

cd bcftools-1.x/
make
make prefix=. install

You should see some output telling you that it compiled successfully (hopefully!). For a great tutorial on how to use SAMtools (though it won't be necessary to use this pipeline, but it is a great package for handling copious amounts of sequence data), I highly recommend this tutorial: http://biobits.org/samtools_primer.html

Download

Now you are ready to download the GBS Pipeline. To ensure you have the latest version, use the following git command to download it directly from github:

git clone https://github.com/carolyncaron/GBSpipeline.git

Alternatively, if you don't have git installed, you can download the zip file on this site and place it into the directory of your choice as with the previous dependencies. Once you have the GBS pipeline in its own directory, it is time to copy the following files into the same directory. Again, if you choose not to copy all of these files (for memory-saving reasons), just remember to use the pathname for each as parameters when using the pipeline. The following files are required:

You are now ready to begin analyzing your GBS data!

Approach

This pipeline is broken up into 4 steps: this allows the user to pause and look over the output before proceeding to the next step. As each step is very time consuming, this also lowers the risk of losing most or all of the progress made up until that point. Each step-specific function prints to the console as to its progress, and creates output files as well as a summary file within the current directory. To call each step and provide it with the necessary parameters and files, you will need to execute the main script in the form:
./GBS_pipeline.pl step [arguments] [options]
The following lists examples to demonstrate how to run each step in sequential order. For full documentation on options and output files, type the following on the command line: perldoc GBS_pipeline.pl

  1. The first step is to demultiplex the reads into separate files for each sample. For example, suppose we are doing a run with Medicago samples:
    ./GBS_pipeline.pl demultiplex Medicago indices.list TGCA Mt4.0_read1.fastq Mt4.0_read2.fastq Medicago_GBS
    When completed, you should see printed to the console something like:
    Calling demultiplex …
    [===========================] 100 % Handling reads without a barcode...
    Processed reads located in:
    Medicago_GBS/demultiplex/
    Summary (open in Excel or use command more): summary_files/Medicago_demultiplex_summary.txt
    Completed demultiplex.

  2. Next we trim the files for quality and adaptor sequences. Continuing with our example:
    ./GBS_pipeline.pl trim_reads Trimmomatic-0.32/trimmomatic-0.32.jar trimlist.fa
    And the following is printed to the console:
    Calling trim_reads …
    [==========================] 100 %
    Processed reads located in:
    Medicago_GBS/trim/
    Summary (open in Excel or use command more): summary_files/Medicago_trim_summary.txt
    Completed trim_reads.

  3. This step aligns the reads to the reference genome, and is most likely to take the longest time to complete:
    ./GBS_pipeline.pl align_reads ../../Tools/bowtie2-2.2.1 ../Mt4.0_reference.fna
    With the following output:
    Calling align_reads …
    Reference genome index files present, proceeding to make alignments.
    Running bowtie2 with the following parameters:
    --end-to-end --no-mixed --no-discordant --no-sq --no-hd -k 3 -X 11000 -R 5 -p 1
    [==========================] 100 %
    Processed reads located in:
    Medicago_GBS/align/
    Summary (open in Excel or use command more): summary_files/Medicago_trim_summary.txt
    Completed align_reads.

  4. The final step is making variant calls using the alignment:
    ./GBS_pipeline.pl SNP_calling ../../Tools/samtools-0.1.19/
    With output such as:
    Calling SNP_calling …
    [==========================] 100 %
    Processed reads located in:
    Medicago_GBS/variants/
    Completed SNP_calling.

That's it! Take a look at your directory using the command ls. You should see your output files from each step (indexed reads, trimmed reads, alignments, and vcf files) neatly organized into directories called demultiplex/ trim/ align/ variants/, respectively. Additionally, you should see all of your summary files, which can be opened in a plain text editor or in Excel. More information on the summary files can be found at the end of the POD page.