WRGL Panels Pipeline

Part of the WRGL pipeline for NGS analysis.

IRIDIS connection

Establishes a connection to IRIDIS 4 for file transfer. The connection is secure and automatic, authenticating with 2048-bit SSH RSA keys rather than passwords.

There are 3 possible IRIDIS nodes for connection — a, b, and c.

Once the files have been transferred and the analysis started by UploadAndExecute(), the program waits and checks for a run completion file that is generated once the data has been processed by IRIDIS.

UploadAndExecute() runs a series of Bash scripts on the IRIDIS system.

When complete, the analysed report is copied back to the local machine.

PanelPipeline class

  • is passed the run details, sample locations, etc. through the args when initialised.
  • Three WinSCP SessionOptions objects are created with the hostnames for each of the three possible IRIDIS connections.
  • PanelPipeline()
    • Constructor sets up the object from the passed in session and run information and then runs the ExecutePanelPipeline() method.
  • ExecutePanelPipeline()
    • Begins by writing logs and attempting to connect to one of the three possible IRIDIS connections.
    • Adds some additional parameters to each of the SessionOptions, including password and key.
    • Calls WriteVariablesFile() to record the analysis variables
    • Calls UploadAndExecute(), which connects and transfers the FASTQ files to IRIDIS, then runs the pipeline scripts.
    • After calling UploadAndExecute, waits for 6 hours and begins to check for run completion using GetData()
    • If GetData() returns False, waits 10 minutes and tries again, up to 60 attempts (10 hours) before timing out.
    • If True, the data has been copied back to the local computer
    • Calls WritePanelReport() to summarise the run.
  • WriteVariablesFile()
    • Records run details in a file (<sampleID>.variables)
  • UploadAndExecute()
    • uploads the data to IRIDIS and runs the pipeline.
    • Attempts to establish a connection:
      • creates new WinSCP session, and uses session.Open() with the a, b, and c SessionOptions.
      • If connection fails, writes to log and tries next node.
      • If final connection fails, throws an exception.
      • Once connected, attempts to create the working directory on IRIDIS
      • If successful, copies the transcript definition file and the FASTQ files for the run.
        • Includes MD5 hash of files to check transfer completed successfully and files are complete and uncorrupted.
    • When the transfers are complete, constructs and runs a Bash command that changes to the remote folder and runs 1_Nextera_ReadAlignment.sh
  • GetData()
    • establishes a connection to IRIDIS (the previous connection has likely timed out)
    • checks for a file called "complete" in the run folder
    • returns a bool - True if file found else False.
    • if found, it then retrieves the final files from IRIDIS
      • VCF file
      • recalibration plots
      • coverage file
      • BED file
      • etc…
    • If not found, returns false and doesn't try to copy data
  • WritePanelReport()
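The completion-polling logic described above can be sketched in shell. This is purely illustrative: the real implementation is C# using the WinSCP library, and the paths and timings below are stand-ins.

```shell
#!/bin/sh
# Hypothetical sketch of the GetData() polling loop: check for the
# "complete" flag file, retrying up to the 60-attempt limit described above.
RUN_DIR="run_folder"       # local stand-in for the IRIDIS run folder
MAX_ATTEMPTS=60            # times out after 60 failed checks
WAIT_SECONDS=600           # 10-minute wait between checks

mkdir -p "$RUN_DIR"
touch "$RUN_DIR/complete"  # simulate IRIDIS flagging the run as finished

attempt=0
found=no
while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
    if [ -f "$RUN_DIR/complete" ]; then
        found=yes
        break              # run complete - final files would be retrieved here
    fi
    attempt=$((attempt + 1))
    sleep "$WAIT_SECONDS"
done
echo "run complete: $found"
```

In the real program the check and the file retrieval both go over the re-established WinSCP session rather than the local filesystem.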

ParseVCF class

  • Annotation struct
    • <TODO>
  • GenomicVariant struct
    • <TODO>
  • VCFRecordWithGenotype
    • <TODO>
  • ParseVCF()

Panel shell scripts

1_Nextera_ReadAlignment.sh

  • reads are aligned against a human genome reference using novoalign
  • novoalign is a commercial aligner
  • what advantages over free solutions like Bowtie?
  • options used are as follows:

Uses a .variables file, similar to the variables files used in the Southampton pipeline.
Contains the sample ID, FASTQ file locations, ROI BED files, run ID and panel name, and preferred transcripts

What does the preferred transcripts list look like and do?
It appears to be a list of transcripts for genes in the regions of interest, presumably used at the annotation stage to prevent the use of multiple transcripts or minor transcripts.
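Based on the fields listed above, a .variables file presumably looks something like the sketch below. Every key name and value here is hypothetical, not taken from the real pipeline; only the variable names that appear in the scripts later on this page ($RunID, $Sample_ID, $BEDFilename, $ExperimentName, $Platform, $PreferredTranscriptsFile) are grounded.

```shell
# Hypothetical <sampleID>.variables file contents; key names and values
# are illustrative only - the real pipeline's keys may differ.
cat > D0001.variables <<'EOF'
Sample_ID=D0001
RunID=150101_M00123_0001
R1Filename=D0001_S1_L001_R1_001.fastq.gz
R2Filename=D0001_S1_L001_R2_001.fastq.gz
BEDFilename=Panel_ROI.bed
ExperimentName=ExamplePanel
Platform=ILLUMINA
PreferredTranscriptsFile=PreferredTranscripts.txt
EOF
```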

$ novoalign \
-d /scratch/WRGL/bundle2.8/b37/human_g1k_v37.nix \     # human genome index file
-f "$R1 filename" "$R2 filename" \                     # <read file 1> [<read file 2 for paired end>]
-F ILM1.8 \                                            # read file format
--Q2Off \                                              # disables special treatment of Illumina Q2 bases (the read segment quality control indicator)
-o FullNW \                                            # disables soft-clipping in alignment
-r None \                                              # rejects reads with multiple alignments
-c 8 \                                                 # number of available cores for multithreading
-i PE 300,200 \                                        # sets paired-end mode. 300bp insert size, 200bp standard deviation
-o SAM <readgroup options> 1> <ID>.SAM                 # output settings, STDOUT piped to <ID>.SAM

2_Nextera_IndelRealignment.sh

  • This stage uses picardtools and GATK
    • Picard and GATK are Java utilities, so must be run using the Java VM
Picard:
$ java [java opts] -jar <$PICARD_PATH>/<Picard Tool> [Picard opts]
GATK:
$ java [java opts] -jar <$GATK_PATH>/GenomeAnalysisTK.jar -T <GATK Tool> [GATK opts]
  • Converts SAM to a sorted BAM using Picard SortSam (unclear whether this is faster than samtools)
java -Xmx4000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \     # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/picard-tools-1.118/SortSam.jar \          # location of picard SortSam jar
INPUT="$RunID"_"$Sample_ID".sam \                                # input filename - unsorted, uncompressed SAM file
OUTPUT="$RunID"_"$Sample_ID"_sorted.bam \                        # output BAM filename
SORT_ORDER=coordinate \                                          # sorts by chromosome and position rather than read ID
COMPRESSION_LEVEL=0                                              # no compression. Overrides default value of 5
  • marks duplicate reads - picard seems to be accepted as the best option
java -Xmx4000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \       # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/picard-tools-1.118/MarkDuplicates.jar \     # location of picard MarkDuplicates jar
INPUT="$RunID"_"$Sample_ID"_sorted.bam \                           # input filename - sorted, uncompressed BAM file
OUTPUT="$RunID"_"$Sample_ID"_rmdup.bam \                           # output BAM filename
METRICS_FILE="$RunID"_"$SampleID"_dupMetrics.txt \                 # required file to output duplicate metrics
CREATE_INDEX=true \                                                # creates an index for the BAM file
COMPRESSION_LEVEL=0                                                # no compression. Overrides default value of 5
  • GATK has a realigner specially designed to work around indels, which cause problems during first-pass alignment.
    • Regions requiring realignment are identified using GATK RealignerTargetCreator
      • Takes lists of known common indels to use as candidate realignment targets, so they need not be rediscovered.
  • Identifying regions to realign:
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \                        # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \                  # location of the GATK jar file
-T RealignerTargetCreator \                                                           # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                                  # location of reference genome file
-known /scratch/WRGL/bundle2.8/b37/1000G_phase1.indels.b37.vcf \                      # file containing known indels
-known /scratch/WRGL/bundle2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \     # more known indels
-I "$RunID"_"$Sample_ID"_rmdup.bam \                                                  # input file after removal of duplicates
-o "$RunID"_"$Sample_ID".intervals \                                                  # output filename
-L "$BEDFilename" \                                                                   # ignore regions outside of BED file
-ip 100 \                                                                             # interval padding - pads each -L interval by 100bp on each side
-dt NONE                                                                              # downsampling type
  • Perform realignment:
$ java -Xmx4000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \                        # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \                  # location of the GATK jar file
-T IndelRealigner \                                                                   # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                                  # location of reference genome file
-known /scratch/WRGL/bundle2.8/b37/1000G_phase1.indels.b37.vcf \                      # file containing known indels
-known /scratch/WRGL/bundle2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \     # more known indels
-targetIntervals "$RunID"_"$Sample_ID".intervals \                                    # target file from RealignerTargetCreator
-I "$RunID"_"$Sample_ID"_rmdup.bam \                                                  # input BAM file
-o "$RunID"_"$Sample_ID"_realigned.bam \                                              # output BAM file
--LODThresholdForCleaning 0.4 \                                                       # cutoff for cleaning - a lower threshold realigns smaller, lower-evidence events
--maxReadsForRealignment 60000 \                                                      # read depth above this threshold bypasses realignment
--maxConsensuses 90 \                                                                 # max number of alternate consensus alignments to try
-compress 0 \                                                                         # sets BAM compression level to 0
-dt NONE                                                                              # downsampling type
  • After running, all intermediary files (all but *_realigned.bam) are deleted.
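The cleanup step might look like this sketch (the filenames follow the steps above; the exact rm commands in the real script are not shown in the source):

```shell
# Hypothetical cleanup sketch: delete intermediary files, keeping only the
# realigned BAM. Example values stand in for the real $RunID / $Sample_ID.
RunID=run01; Sample_ID=S01
touch "${RunID}_${Sample_ID}.sam" \
      "${RunID}_${Sample_ID}_sorted.bam" \
      "${RunID}_${Sample_ID}_rmdup.bam" \
      "${RunID}_${Sample_ID}.intervals" \
      "${RunID}_${Sample_ID}_realigned.bam"   # stand-ins for the pipeline outputs
rm "${RunID}_${Sample_ID}.sam" \
   "${RunID}_${Sample_ID}_sorted.bam" \
   "${RunID}_${Sample_ID}_rmdup.bam" \
   "${RunID}_${Sample_ID}.intervals"          # everything but *_realigned.bam
```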

3_Nextera_AnalyseCovariation.sh

  • Analyses covariation in the realigned BAM files
  • Runs BaseRecalibrator twice
    • BaseRecalibrator looks at all bases not in dbSNP to find sequencing errors
    • This is not a variant calling step
    • Uses this data to modify QUAL scores
      • Probability of mismatching reference genome
      • the effect may be altered by actual variants, but these remain clearly distinguishable from systematic sequencing errors.
$ java -Xmx8000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \                             # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \                       # location of the GATK jar file
-T BaseRecalibrator \                                                                      # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                                       # location of reference genome file
-knownSites /scratch/WRGL/bundle2.8/b37/1000G_phase1.indels.b37.vcf \                      # file containing known indels
-knownSites /scratch/WRGL/bundle2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \     # more known indels
-knownSites /scratch/WRGL/bundle2.8/b37/dbsnp_138.b37.vcf \                                # list of sites in dbSNP
-I BAMSforBQSR.list \                                                                      # list of BAMs, generated in previous script
-L "$BEDFilename" \                                                                        # ignore regions outside of BED file
-o "$RunID"_recal_data.table \                                                             # output file name
-ip 100 \                                                                                  # interval padding - pads each -L interval by 100bp on each side
-nct 4 \                                                                                   # 'nanoschedulable' parallelisation option - number of cpu threads per data thread
-dt NONE                                                                                   # downsampling type
  • this is run twice, the only differences being:
-BQSR "$RunID"_recal_data.table \       # Base Quality Score Recalibration table, output of first run
-o "$RunID"_post_recal_data.table \     # output file name
  • The script also runs GATK AnalyzeCovariates to generate .pdf plots of the effect of recalibration
    • this appears to be a non-essential QC/visualisation step
$ java -Xmx8000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \                             # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \                       # location of the GATK jar file
-T AnalyzeCovariates \                                                                     # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                                       # location of reference genome file
-before "$RunID"_recal_data.table \                                                        # output of first pass
-after "$RunID"_post_recal_data.table \                                                    # output of second pass
-plots "$RunID"_recalibrationplots.pdf \                                                   # filename for plots
-L "$BEDFilename" \                                                                        # ignore regions outside of BED file
-ip 100 \                                                                                  # interval padding - pads each -L interval by 100bp on each side
-dt NONE                                                                                   # downsampling type

4_Nextera_ApplyCovariation_VariantCalling.sh

  • Applies the recalibration data to the BAM file
  • Then calls variants using GATK HaplotypeCaller
  • Recalibration:
$ java -Xmx4000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \                             # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \                       # location of the GATK jar file
-T PrintReads \                                                                            # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                                       # location of reference genome file
-I "$RunID"_"$Sample_ID"_realigned.bam \                                                   # input BAM file
-BQSR ../"$RunID"_recal_data.table \                                                       # recalibration data table (the first-pass table is the one applied)
-L "$BEDFilename" \                                                                        # ignore regions outside of BED file
-o "$RunID"_"$Sample_ID"_recal.bam \                                                       # output filename
-ip 100 \                                                                                  # interval padding - pads each -L interval by 100bp on each side
-dt NONE                                                                                   # downsampling type
  • This uses the recalibration table calculated in the previous script to adjust the quality scores to be more accurate, based on global error rates.
    • Using the first-pass table is correct: recalibration is always applied from the first-pass table, while *_post_recal_data.table only measures the residual error after recalibration, for the AnalyzeCovariates plots.
  • Then, read group IDs are checked and set using Picard
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \             # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/picard-tools-1.118/AddOrReplaceReadGroups.jar \     # location of picard AddOrReplaceReadGroups jar file
INPUT="$RunID"_"$Sample_ID"_recal.bam \                                    # recalibrated BAM file
OUTPUT="$RunID"_"$Sample_ID".bam \                                         # output filename
RGID="$RunID" \                                                            # sets readgroup ID
RGLB="$ExperimentName" \                                                   # sets readgroup library
RGPL="$Platform" \                                                         # sets readgroup platform
RGPU="$RunID" \                                                            # sets readgroup platform unit (e.g. run barcode)
RGSM="$Sample_ID"                                                          # sets readgroup sample name
  • Once the read groups have been set, variants can be called using the GATK HaplotypeCaller (an updated version of the UnifiedGenotyper)
$ java -Xmx8000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \           # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \     # location of the GATK jar file
-T HaplotypeCaller \                                                     # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                     # location of reference genome file
--dbsnp /scratch/WRGL/bundle2.8/b37/dbsnp_138.b37.vcf \                  # list of variant sites in dbSNP
-I "$RunID"_"$Sample_ID".bam \                                           # input file - read group corrected BAM
-L "$BEDFilename" \                                                      # ignore regions outside of BED file
-o "$RunID"_"$Sample_ID".vcf \                                           # output file name
--genotyping_mode DISCOVERY \                                            # DISCOVERY mode finds the most likely alternate allele
-stand_emit_conf 10 \                                                    # minimum PHRED score for emitting variants
-stand_call_conf 30 \                                                    # minimum PHRED score for calling variants. Variants between this and stand_emit_conf are marked LowQual
--emitRefConfidence GVCF \                                               # gVCF creates a whole-genome VCF, but condenses non-variant regions into single blocks
--variant_index_type LINEAR \                                            # parameter used when GATK is indexing the VCF
--variant_index_parameter 128000 \                                       # parameter used when GATK is indexing the VCF
-dt NONE                                                                 # downsampling type
  • After completion, a list of the created VCF files is produced, which is used in the next script.
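Producing that list could be as simple as the sketch below (the real script's mechanism isn't shown in the source; filenames are stand-ins):

```shell
# Hypothetical sketch of building the VCF list consumed by GenotypeGVCFs
# in the next script - one VCF path per line.
touch runX_S01.vcf runX_S02.vcf     # stand-ins for HaplotypeCaller gVCF output
ls *.vcf > VCFsforFiltering.list    # collect every per-sample VCF into the list
```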

5_Nextera_VariantProcessing.sh

  • The final script runs a selection of filtering stages, before flagging the run as complete and allowing download of the final files.
  • GenotypeGVCFs merges VCF files, creating a single multi-sample VCF for the project
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \           # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \     # location of the GATK jar file
-T GenotypeGVCFs \                                                       # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                     # location of reference genome file
--dbsnp /scratch/WRGL/bundle2.8/b37/dbsnp_138.b37.vcf \                  # list of variant sites in dbSNP
-V VCFsforFiltering.list \                                               # GenotypeGVCFs requires multiple VCF files - contained in this list generated in previous script
-L "$BEDFilename" \                                                      # ignore regions outside of BED file
-o "$RunID".vcf \                                                        # output filename
-dt NONE                                                                 # downsampling type
  • Extracts SNVs from the merged VCF file
    • code is commented as SNPs, but as there is no reference file for SNP data, I assume this actually means all SNVs not just polymorphisms.
    • SNVs can then be processed and annotated, while indels can be considered separately.
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \           # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \     # location of the GATK jar file
-T SelectVariants \                                                      # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                     # location of reference genome file
-V "$RunID".vcf \                                                        # merged VCF file from GenotypeGVCFs output
-L "$BEDFilename" \                                                      # ignore regions outside of BED file
-o "$RunID"_SNPs.vcf \                                                   # output filename
-selectType SNP \                                                        # extract single nucleotide changes
-dt NONE                                                                 # downsampling type
  • Extracted SNVs are then filtered to remove low quality results.
    • VariantFiltration filters VCFs based on user defined rules applied to the INFO and FORMAT fields of each variant
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \           # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \     # location of the GATK jar file
-T VariantFiltration \                                                   # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                     # location of reference genome file
-V "$RunID"_SNPs.vcf \                                                   # SNV-only VCF from SelectVariants output
-L "$BEDFilename" \                                                      # ignore regions outside of BED file
-o "$RunID"_SNPs_Filtered.vcf \                                          # output filename
--filterExpression "QD < 2.0" \                                          # filter on "quality by depth" score - QUAL normalised for coverage
--filterName "LowQD" \                                                   #
--filterExpression "FS > 60.0" \                                         # filter on "fisher strand" score - measure of strand bias, higher score is more likely biased
--filterName "SB" \                                                      #
--filterExpression "MQ < 40.0" \                                         # filter on "mapping quality" score - quality of read mapping at variant location
--filterName "LowMQ" \                                                   # 
--filterExpression "MQRankSum < -12.5" \                                 # nearer 0 is better, low score indicates poor mapping in non-reference allele
--filterName "MQRankSum" \                                               #
--filterExpression "ReadPosRankSum < -8.0" \                             # filters on bias in read position, low score suggests position within reads is constant - inconsistent with a real variant
--filterName "ReadPosRankSum" \                                          #
-dt NONE                                                                 # downsampling type
  • Next, a similar extraction and filtration step is performed for indels
  • Extraction is as above, except for
-o "$RunID"_INDELs.vcf \     # output filename
-selectType INDEL \          # extract insertions and deletions
  • Filtration is also similar, except for the input file and the rules, which are reproduced fully below (note MQRankSum is not used):
-V "$RunID"_INDELs.vcf \                         # indel-only VCF from SelectVariants output
-o "$RunID"_INDELs_Filtered.vcf \                # output filename
--filterExpression "QD < 2.0" \                  #
--filterName "LowQD" \                           #
--filterExpression "FS > 200.0" \                #
--filterName "SB" \                              #
--filterExpression "ReadPosRankSum < -20.0" \    #
--filterName "ReadPosRankSum" \                  #
  • The two filtered VCFs are then merged back together, and annotated with snpEff
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \           # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \     # location of the GATK jar file
-T CombineVariants \                                                     # specifies the tool to be used
-R /scratch/WRGL/bundle2.8/b37/human_g1k_v37.fasta \                     # location of reference genome file
-o "$RunID"_filtered.vcf \                                               # output filename
--variant "$RunID"_SNPs_Filtered.vcf \                                   # filtered SNVs
--variant "$RunID"_INDELs_Filtered.vcf \                                 # filtered indels
-dt NONE                                                                 # downsampling type
$ java -Xmx2000m -Djava.io.tmpdir=/scratch/WRGL/JavaTmp -jar \    # starts Java, sets memory usage and tmp directory
/scratch/WRGL/software/snpEff_4E/snpEff.jar eff \                 # run SnpEff
-v \                                                              # verbose mode
GRCh37.75 \                                                       # sets reference genome
-noStats \                                                        # doesn't create a stats summary file
-no-downstream \                                                  # don't show downstream changes
-no-intergenic \                                                  # don't show intergenic changes
-no-upstream \                                                    # don't show upstream changes
-spliceSiteSize 10 \                                              # number of bases at splice junction to consider as splice site
-onlyTr "$PreferredTranscriptsFile" \                             # only consider the transcripts in the given file
-noLog \                                                          # doesn't create a log file
"$RunID"_filtered.vcf > "$RunID"_Filtered_Annotated.vcf           # input file, output redirected to output filename.
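Once annotation finishes, the script presumably writes the "complete" flag file that the local PanelPipeline's GetData() polls for, and (as with the uploads) checksums can verify the transfer. A minimal sketch, with hypothetical names:

```shell
# Hypothetical sketch of the final flagging step; names are stand-ins for
# the real $RunID and output files.
RunID=runX
touch "${RunID}_Filtered_Annotated.vcf"     # stand-in for the snpEff output
md5sum "${RunID}_Filtered_Annotated.vcf" \
    > "${RunID}_Filtered_Annotated.vcf.md5" # checksum for transfer verification
touch complete                              # run-completion flag checked by GetData()
```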
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License