Mutant-Analysis-workflow - Documentation

PBGL-NGS-Analysis-workflow - Overview

This collection of Snakemake rules constitutes a workflow for the analysis of high-throughput sequencing projects. We use it mostly for genome re-sequencing data produced with Illumina-type sequencing instruments. For this particular workflow we distinguish two different use cases: denovo and varcall. The workflow is designed to run the entire analyses automatically.

The (brief) principle of Snakemake

In Snakemake, analysis workflows are defined as a series of rules. In our case, rules define the analysis steps. Each rule has input and output files. Calling a rule will invoke all necessary upstream rules. It is hence generally sufficient to request the desired output file(s) of the downstream most rule and the entire workflow will be executed. In practice, users will provide input files, configure project meta information, adjust configuration parameters, and execute snakemake. The workflow will then automatically perform all the (necessary) analysis steps to produce the defined output.

In general, a rule will be run if:

Snakemake realizes that the expected output files of a rule are not (yet) present, or
Snakemake realizes that the expected input files or configuration parameters of a rule have changed since the last run.

Hence, when analyses are re-run, existing and valid intermediate output files from a previous run are re-used, saving precious time and resources. For details, please consult the Snakemake documentation. For the differente use cases in this workflow, invoke the rule that produces the desired output.

Mutant-Analysis-workflow Use Cases

De-Novo Analysis of sample relatedness (“denovo”)

Choosing this option will conduct a reference-free comparison of samples based on the raw sequencing reads using k-mers. The workflow invokes the software tools kWIP and/or mash and outputs distance matrices and PCA plots. A flowchart illustrating the summarised workflow is depicted in Figure 1. The workflow for this use case consists of the following steps:

#	Task	Rule	Software
1	Prepare/clip the raw reads	readQC	AdapterRemoval
2	Calculate distances	denovo	kWIP and/or mash
3	Perform PCA and plot	denovo	Utils/R-Scripts

dna-proto-workflow flowchart — Flowchart for “denovo” (click to expand)

Invoke this option by running the rule: rules.denovo.input

Required input for rules.denovo.input are fastq files and the workflow will return distance matrices between samples produced by kWIP and/or mash.

> NOTE: We recommend performing such denovo analysis for every project. Clustering at the level of sequencing runs can be used to confirm metadata and detect mixups.

Variant Calling - Standard Re-Sequencing Analysis (“varcall”)

Choosing this option will run a full re-sequencing analysis ending in filtered bcf/vcf files. It detects variants and genotypes in sets of samples based on the alignments of the sequencing reads against one or several user-defined reference genome(s). Reads can be aligned with bwa and/or NextGenMap (ngm), and variants can be called with freebayes and/or mpileup. Between read alignments and variant calling, PCR duplicates are marked (with samtools markdup) and indels are realigned (using abra2). If reference genome annotation is provided, the effects of variants on gene integrity can also be predicted using the software snpEff. A flowchart illustrating the summarised workflow is depicted in Figure 2

The full workflow for this use case consists of the following steps:

#	Task	Rule	Software
1	Prepare/clip the raw reads	readQC	AdapterRemoval
2	Align the reads to the reference genome	align	bwa and/or ngm
3	Mark duplicates	align	samtools fixmate samtools sort samtools markdup
4	Realign indels	align	samtools merge abra2
5	Call variants	varcall	freebayes and/or mpileup
6	Filter variants	varcall	bcftools view
7	Annotate variants	annotate	snpEff

This option can be invoked in 2 ways:

selecting rules.varcall.input will result in one or several vcf files, one for each combination of sample-set, reference genome, read aligner, variant caller, and variant filter.
selecting the rule rules.annotate.input will result in one or several annotated vcf files and additional summaries; running rules.annotate.input is only meaningful when a genome annotations is available and provided in form of a snpEff database. Currently, only one reference genome annotation can be provided at a time. Hence, invoking the variant calling workflow through rules.annotate.input, will restrict the workflow upstream to only one reference genome.

Required input files are fastq files and a genome reference sequence(s) (fasta). The rule snpeff in addition depends on a genome annotation in form of a snpEff database matching the reference genome. For maximum flexibility and ease of troubleshooting we recommend to first run the re-sequencing analysis by invoking rules.varcall.input, and upon successful completion invoke the workflow again, this time selecting/uncommenting rules.annotate.input.

Variant Annotation - The Effects of Variants on Gene Function (“annotate”)

Choosing the option rules.annotate.input will annotate bcf/vcf files found in output/variants/final/ and write annotated vcf.gz files to output/variants_annotated/. This analysis has only one step:

#	Task	Rule	Software
1	Annotate variants	annotate	snpEff

Typically, rules.annotate.input is run after a completed run of rules.varcall.input. A snpEff run will complete within a matter of minutes. It will run on all files specified by entries in the snpeff: section of the config.yml file:

config[snpeff:][ref:] one chosen reference genome,
config[snpeff:][database:] one corresponding snpEff database,

and all combinations of specified samplesets, read aligners, variant callers, and variant filter settings.

Hardware Requirements

The workflow is parallelised and Snakemake will make efficient use of available resources on local machines as well as on compute clusters. It will run faster the more resources are available, but it performs fine on smaller machines. For routine applications we have used the workflow on a budged workstation, HP Z820 with 32 cores, 64 GB RAM running Ubuntu 16.04, and on a Virtual Machine in the cloud, AZURE with 16 cores and 512 GB RAM running Ubuntu 18.04.

Snakemake allows for fine-tuning resource allocation to the individual rules, i.e., number of processors and memory. We configured each rule with reasonable defaults, but they can be tailored to your particular size project and hardware. For details please consult the Snakemake documentation. In general though, if limited by memory, then do not parallelise too aggressively.

Software Dependencies

We recommend running the workflow in its own conda environment on a Linux Server. Dependencies are listed in .yml files in envs/. A brief explanation how to use these files to generate the conda environment is further below. For comprehensive explanation please consult the conda documentation. For software that is not available through conda on some important platforms we make the specific binaries available in envs/; currently mainly abra2.jar.

Workflow Use in a Nutshell - Checklist

Steps

Create a new github repository in your github account using this workflow as a template
Clone your newly created repository to your local system where you want to perform the analysis
Create and activate the conda environment
Provide reference genome(s) and annotation(s) in /genomes_and_annotations/
List the chromosome/contig regions (bed file format) for which variants should be called in metadata/contigs_of_interest.bed
Specify the locations of input files and their meta data in /metadata/sample2runlib.csv
Provide lists of samples as sets to analyse in metadata/samplesets/
Uncomment the respective workflow option for your use case in the Snakefile
Configure software parameters in config.yml
Adapt snpeff.config (Optional in case the snpeff rule will be called)
Run snakemake

For standard applications no additional edits are necessary. The rules reside in rules/*.smk. Most rules have explicit shell commands with transparent flag settings. Expert users can change these for additional control.

After the successful snakemake run, archive your workflow and the raw data to enable reproducing your work.

Workflow Use in Detail

We recommend using Linux, managing the software dependency trough conda/mamba and running the workflow in a dedicated conda virtual environment. The virtual environment is created once where then all required software is available in any shell/terminal in which the environment has been “activated”.

Creating the Virtual Environment (conda/mamba)

The preferred way to create the environment is passing the instructions as a yaml file to mamba env create. The code below will install mamba (if not already available), create an environment named “dna-proto” with all necessary software installed, and activate it. If you are new to conda, please consult the conda documentation to get started.

$ conda install mamba
$ mamba env create --file all-dependencies.yml
$ conda activate dna-proto

Alternatively, (and only if the above does not work as intended) conda install the individual programs manually. They are listed in all-dependencies.yml. Specify the correct channel and version where required. Below we give examples, but please consult the conda (and bioconda) documentation. In case of manual installs, it is convenient to add all required channels to the conda config.

Example code:

$ conda config --show channels
channels:
      - defaults
      - bioconda
      - conda-forge
      - r
$ conda config --add channels bioconda
$ conda install samtools=1.9

Configuring channels has the pitfall of rare ambiguities and collisions. Please consult the conda documentation for “managing channels”.

Once set up, activate your conda environment

$ conda activate dna-proto

Reference Genome(s)

The workflow will look for the reference genome(s) and associated files in genomes_and_annotations/. We provide an example directory tree in genomes_annotations/readme. We recommend creating a separate directory for each reference genome. They can be soft links. Each reference genome directory must contain the necessary assembly file (.fa or .fna) and the derived files required by the aligners. Generate those files in this directory from the assembly file (fasta: .fa or .fna) like so:

Preparing a Reference Genome (Ensure your conda environment is activated!):

$ samtools faidx <reference-genome.fa>
$ bwa index -a bwtsw <reference-genome.fa>
$ ngm -r <reference-genome.fa>

# directory tree of an example reference genome directory when fully prepared

~/genomes_and_annotations/
├── GCF_004118075.1_ASM411807v1
   ├── GCF_004118075.1_ASM411807v1_genomic.fna
   ├── GCF_004118075.1_ASM411807v1_genomic.fna.amb
   ├── GCF_004118075.1_ASM411807v1_genomic.fna.ann
   ├── GCF_004118075.1_ASM411807v1_genomic.fna.bwt
   ├── GCF_004118075.1_ASM411807v1_genomic.fna-enc.2.ngm
   ├── GCF_004118075.1_ASM411807v1_genomic.fna.fai
   ├── GCF_004118075.1_ASM411807v1_genomic.fna-ht-13-2.3.ngm
   ├── GCF_004118075.1_ASM411807v1_genomic.fna.pac
   └── GCF_004118075.1_ASM411807v1_genomic.fna.sa

Reference Genome Annotation

After variants have been called, they can be annotated for their effects on genomic features such as genes. The workflow provides the annotate rule (rules.annotate.input). It will invoke the software snpEff on vcf files specified in config.yml.

Software snpEff needs a database (snpEffectPredictor.bin) for each relevant reference genome. Their location can be specified in snpeff.config. Our workflow expects these databases to reside in genomes_and_annotations/snpeffdata/ in sub-directories for each reference genome. To avoid confusion, we recommend to name these the same as the reference genome sub-directories and append “_snpeff” to the directory names.

Databases for snpEff are built from the reference genome fasta file, a genome annotation file (gbk, gff3, gtf), and other files specifying genomic features of interest, e.g., a protein fasta file. For a detailed explanation of the snpeff build process please consult the snpEff documentation.

For building a snpeff database, we recommend providing the reference genome, the genome annotation, a protein fasta file, etc., in the <reference_genome>_snpeff directory, adjust the snpEff.config file, and run the snpEff build from the workflow’s top level directory (where the snpeff.config resides). snpEff build requires particular filenames:

Required file	Filename required by snpEff build
Reference Genome (fasta file)	sequences.fa
Genome Annotation (GenBank, gff, gtf)	genes.gbk, genes.gff, or genes.gtf
Proteins (fasta file)	protein.fa

Below is an example directory tree of genomes_and_annotations/ for a cowpea reference genome downloaded from NCBI. We store the reference genome and associated files elsewhere and use soft links. That way we need to store the genomes only in one place and satisfying snpEff’s filename requirements is transparent.

genomes_and_annotations/
├── GCF_004118075.1_ASM411807v1 -> ~/genomes/GCF_004118075.1_ASM411807v1/
├── readme
└── snpeffdata
    └── GCF_004118075.1_ASM411807v1_snpeff
        ├── genes.gbk -> ~/genomes/GCF_004118075.1_ASM411807v1/GCF_004118075.1_ASM411807v1_genomic.gbff
        ├── genes.gff -> ~/genomes/GCF_004118075.1_ASM411807v1/GCF_004118075.1_ASM411807v1_genomic.gff
        ├── genes.gtf -> ~/genomes/GCF_004118075.1_ASM411807v1/GCF_004118075.1_ASM411807v1_genomic.gtf
        ├── protein.fa -> ~/genomes/GCF_004118075.1_ASM411807v1/GCF_004118075.1_ASM411807v1_protein.faa
        ├── sequences.fa -> ~/genomes/GCF_004118075.1_ASM411807v1/GCF_004118075.1_ASM411807v1_genomic.fna
        └── snpEffectPredictor.bin

Edit the file snpeff.config following our example entry for cowpea under “Non-standard Databases”. Replicate accordingly for your situation.

Once the files are in place and the respective edits have been made to snpeff.config the database snpEffectPredictor.bin is generated by executing snpeff build. Note that only one of genes.gbk, genes.gff, or genes.gtf is required. Depending of what gene annotation file you have, use either of the below commands:

$ snpEff build -v -c <path to snpEff.config> -genbank <reference-genome>_snpeff
$ snpEff build -v -c <path to snpEff.config> -gff3 <reference-genome>_snpeff
$ snpEff build -v -c <path to snpEff.config> -gtf <reference-genome>_snpeff

The target for snpEff build is the directory name only. The path to this directory (genomes_and_annotations/snpeffdata/) is specified in snpEff.config. If executed from the workflow’s top level directory (the directory that contains the snpeff.config file), specifying the location of snpEff.config (-c) is not necessary.

Important: When using reference genomes and annotations downloaded from NCBI we find that the genome annotation must be provided to snpeff build as GenBank file (genes.gbk), otherwise snpEff build fails to properly use the proteins.fa file.

Providing Required Meta Data

Sample sets

We use the term sample as the entity of interest. Our workflow calls variants on samples within sets. Sets are defined by a list of sample names in text files (.txt) in directory /metadata/samplesets/. Each such file defines one set. The samplesets in file config.yml refer to those .txt files and must match the basename of the textfiles in /metadata/samplesets/. The workflow will produce an alignment file (<sample>.bam) for each sample as well as a joint alignment file (<set>.bam) for each set. Variants will be called on the <set>.bam files.

Note: We implemented a glob: all_samples, which will have the effect of concatenating all *.txt files in /metadata/samplesets/ into an additional set comprising all samples across all sets. The intention is to enable easy addition or removal of samples to/from an existing analysis.

Sample Definitions

In practice, the unit of interest oftentimes is the individual. DNA is extracted from an individual and turned into one or several sequencing libraries that are then sequenced in one or several sequencing runs. In order to compare individuals, all respective fastq files need to be assigned to the same sample. For our workflow, the run–library–sample relationships are defined in /metadata/sample2runlib.csv making explicit which fastq files constitute the samples. The entries in columns run and library are together used as the primary key for the sequencing run and thus the fastq file(s). The user must make sure that any run–library combination is unique. Sequencing runs are assigned to the same sample through an identical entry in the sample column. fastq files can be provided either as separate forward and reverse read files (fq1, fq2) or as interleaved fastq (il_fq), with the respective other column(s) empty. Within one sample2runlib.csv file interleaved and two-file input can be mixed.

Example sample2runlib.csv file:

run	library	sample	fq1	fq2	il_fq
Run1	A-500bp	A	<path to file>	<path to file>
Run1	A-300bp	A	<path to file>	<path to file>
Run2	A-500bp	A	<path to file>	<path to file>
Run2	B-300bp	B			<path to file>
Run1	B-500bp	B			<path to file>
Run1	C-500bp	C	<path to file>	<path to file>

Regions of Interest for Variant Calling

Variant calling will happen on the set aligment files and can be restricted to particular regions of interest through metadata/contigs_of_interest.bed: A file in bed file format listing identifier, start-, and end-positions (tab delimited, no header) for all chromosomes/contigs for all provided reference genomes. This is helpful for exome capture data but also to restrict the analysis to specific chromosomes. In addition to the main chromosomes, genome assemblies often comprise many orphan fragments that congest the output and their variants are often not of interest. While the workflow will attempt to create the file if not present, we recommend to always define a contigs_of_interest.bed file. This one files serves all reference genomes in use and hence can/must contain all chromosome/contig names of interest. Obviously, it must be the exact names as in the assembly. The correct chromosome/contig names and their lengths can be conveniently extracted from the corresponding .fai file for the reference genome assembly. Note that the analysis restriction to regions in metadata/contigs_of_interest.bed only concerns the variant calling, not the read mapping. Below example will restrict variant calling to the 11 chromosomes of cowpea. The hundreds of additional contigs that are present in the cowpea reference genome assembly are available for read mapping, but variants will not be called for them and their variants will subsequently not be present in the bcf/vcf files. Note that lines starting with ‘#’ will be disregarded. So in this case we skip variant calling in the chloroplast (NC_018051.1).

Example contigs_of_interest.bed file:

NC_040279.1 0 42129361
NC_040280.1 0 33908088
NC_040281.1 0 65292630
NC_040282.1 0 42731077
NC_040283.1 0 48746289
NC_040284.1 0 34463471
NC_040285.1 0 40876636
NC_040286.1 0 38363498
NC_040287.1 0 43933251
NC_040288.1 0 41327797
NC_040289.1 0 41684185
#NC_018051.1 0 152415

Workflow Configuration

config.yml

Central place for the user to configure the workflow behaviour and software parameters is config.yml. There are comments in the file that explain the configuration parameters and options.

In Snakemake, calling a rule will trigger running of the upstream rules. When editing (/config.yml) it is important to only configure the intended most downstream rule (varcall:, annotate:, or denovo:). These settings will be propagated upstream. qc: is independent and will require editing: particularly the _DEFAULT_ adaptors used. The standard adaptors for Truseq- and Nextera-Libraries are given for reference, paste under _DEFAULT_ as required.

An important configuration parameter is the location of a suitable temporary directory (“tmp/”). Several rules make extensive use of the “tmp/” directory to temporarily store large files. Oftentimes, standard home directories on compute servers or cluster nodes are too small. Central place for the configuration of the “tmp/” directory is currently under the abra2 configuration options (abra2:temp:).

Additional Configuration

Expert users can change the rules themselves by editing rules/*.rules.smk. Use caution! We have chosen reasonable default settings and recommend modifying rules only when you know what you are doing. When allocating more cores to rules, pay attention that some rules are very memory intensive and some shell commands are piped and are in fact using more than 1 core per process.

Running the Workflow - different Workflow Use Cases

To run the workflow, un-comment the respective rule for the desired use case in the Snakefile and run snakemake.

$ snakemake -npr
$ snakemake -j 6 --notemp -kpr

The first command will initiate a dry run (-n) and snakemake will return what it intents to do (-p -r). The second command will invoke snakemake on 6 threads (-j), will disregard the temp declaration (retain temporary files), and will keep going with independent tasks, should a task fail (-k). For details on commandline options for snakemake please consult the Snakemake documentation. Un-comment only the one most downstream rule for your use case. Currently, these use case rules are:

# USER OPTIONS
#     rules.denovo.input,
#     rules.varcall.input,
#     rules.annotate.input,

A rule that encounters missing input files will invoke the respective upstream rule(s). E.g., when rules.annotate.input is uncommented and snakemake is run for the first time, the entire workflow from readqc (= adapter and quality clipping), align (= read alignments, duplicate marking, indel realignment), varcall (= variant calling), and annotate (= variant functional annotation) will run in one go. In case no snpeff database is supplied, then rule rules.annotate.input cannot be run. Use rules.varcall.input instead.

When configuring config.yml, keep in mind that configuration parameters of a downstream rule take precedence because parameters have to propagate upstream. I.e., when running varcall, the user must set alignment parameters in the varcall: section; same for annotate: parameters. They must be set under snpeff:.

>NOTE: Adjusting presumed upstream parameters, e.g., under align: will not have the intended effect. Only in special circumstances will the rules.align.input be run by itself and only then will you have to adjust parameters in the align: section.

The workflow runs on samples in sets as listed in samplesets/*.txt and defined in /metadata/sample2runlib.csv. Sample names listed in samplesets/*.txt must correspond to the entries in the sample column in /metadata/sample2runlib.csv.

De-Novo Analysis of sample relatedness (“denovo”)

Option denovo can be used to check the relatedness of sequencing runs and/or samples without the use of any reference genome. It will perform a comparative analysis based on k-mers on the raw data and output distance matrices. We recommend performing a de-novo analysis at the very start of every project at the “sequencing run”-level prior to any merging of runs into samples. This can help to detect mix-ups and mislabels. Run-level clustering is achieved by providing unique names for each sequencing run in the sample column of metadata/sample2runlib.csv. De-novo analysis is invoked by calling rules.denovo.input (Uncomment the respective line in the Snakefile, and only this line). Pay attention to maintain the indentation.

rule all:
    input:
# USER OPTIONS
            rules.denovo.input,
#            rules.varcall.input,
#            rules.annotate.input,
# EXPERT OPTIONS
#            rules.readqc.input,
#            rules.align.input,
#            rules.stats.input,

Variant Calling - Standard Re-Sequencing Analysis (“varcall”)

Running rules.varcall.input will call variants and genotype samples with respect to one or several reference genomes. varcall will compare samples listed in metadata/samplesets/ with one another, as defined in the sample column. Invoke this analysis by uncommenting rules.varcall.input (and only this line). Pay attention to maintain the indentation.

rule all:
    input:
# USER OPTIONS
#        rules.denovo.input,
        rules.varcall.input,
#        rules.annotate.input,
# EXPERT OPTIONS
#        rules.readqc.input,
#        rules.align.input,
#        rules.stats.input,

Variant Annotation - The Effects of Variants on Gene Function (“annotate”)

If a snpEff library for this (exact) reference genome is provided then the entire workflow can in principle be invoke by uncommenting rules.annotate.input (and only this line). Pay attention to maintain the correct indentation.

rule all:
    input:
# USER OPTIONS
#       rules.denovo.input,
#       rules.varcall.input,
       rules.annotate.input,
# EXPERT OPTIONS
#       rules.readqc.input,
#       rules.align.input,
#       rules.stats.input,

Currently, rules.annotate.input will operate only on one reference genome at a time. The rule will hence propagate upstream only one reference genome as requirement. If variants detected against several different reference genomes need annotating, then first run the varcall rule specifying all reference genomes and subsequently invoke the annotate rule separately for each reference genome annotation.

Expert Options

In cases where only adapter clipping or read alignment is desired, those processes can be invoked separately by the respective rules, rules.readqc.input or rules.align.input. There are separate configurations sections for those in config.yml which must be used. Alignment will trigger prior adaptor clipping. Invoking rules.stats.input will produce summary statistics. They will require read alignments and are meaningful for generic varcall runs.

Workflow Outputs

After completion, all output of the workflow, including logs and stats, will be in output/.

Clipped reads (in interleaved fastq format) are in output/reads/
Alignment files (.bam) for each sample are in output/alignments/samples and then in respective sub-directories for respective aligners and reference genomes.
Joint alignment files (.bam) for each set are in output/alignments/sets and then in respective sub-directories for respective aligners and reference genomes.
If abra2 was run, then BAM files with In/Del-realigned alignments are in output/abra/
BCF/VCF files of the filtered variants including and respective index files are in output/variants/final/
The snpEff output for each annotation including the snpEff-annotated variant file is in output/annotated_variants/snpeff/

Note: BCF/VCF files in output/variants/final/ have been filtered with bcftools view according to the user defined filter settings in config.yml. For loading into IGV, use the In/Del realigned BAM file in output/abra/ and the *.vcf.gz files of the filtered variants. Note that IGV requires the vcf.gz.tbi index.

Support

Contributors

This workflow was developed by Norman Warthmann of the Plant Breeding and Genetics Laboratory of the FAO/IAEA Joint Centre (PBGL), with important contributions from Kevin D Murray (Australian National University) and Marcos Conde, PBGL. The documentation was written by Norman Warthmann with contributions from Anibal Morales, PBGL.

Reproducibility

Workflows help addressing the reproducibility issue in science. Consider making your version of the workflow, configured for your data, available upon publication of your results. Check out the archive options of Snakemake.

Getting Help

Feel free to let us know if you are using our workflow and don’t hesitate to contact us with questions: email n.warthmann@iaea.org