Introduction

Processing DNA samples consumes a significant amount of time, and for a beginner the task looks even more daunting (speaking from personal experience). In this post, I have summarized what I learned from various sources about the common pre-processing steps to carry out in preparation for further analysis.

Paired-end sequencing

Paired-end sequencing produces reads from both ends of a DNA fragment (one starting from the 3’ end and the other from the 5’ end). Since the DNA fragment is usually longer than the read length, the two reads will cover different segments of the fragment, although for short fragments the two reads may overlap.

The main benefit of paired-end sequencing over single-end sequencing is increased confidence when mapping to the reference genome. Paired-end sequencing usually produces two fastq.gz files, xxx_1.fastq.gz and xxx_2.fastq.gz, where xxx is the file identifier. These two files should have the same number of lines, which can be checked via the following commands:

gunzip -c xxx_1.fastq.gz | wc -l
gunzip -c xxx_2.fastq.gz | wc -l
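
Each FASTQ record spans exactly four lines (header, sequence, separator, quality string), so the number of reads in a file is the line count divided by four. For example:

# Count reads (not lines) in a gzipped FASTQ file.
echo $(( $(gunzip -c xxx_1.fastq.gz | wc -l) / 4 ))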

Note that this may take some time depending on the size of the files. Make sure that the line counts match between the two files. At times there will be a third file, xxx.fastq.gz; this file seems to contain unpaired reads, though some claim it may contain barcodes. [Source: this discussion on biostars.org]

TODO: write a post on sequencing technology and refer to that post here. Discuss RNA-seq technology. Is paired-end sequencing possible for RNA?

Checking the quality of data

The first step is to check the quality of the data, which can be carried out using fastqc. This program generates an HTML report to help the user identify poor data quality. Examples of such reports are available here and here. Here are some notes on how to run fastqc:

  • Run a separate fastqc for each file: fastqc xxx_1.fastq.gz -o output_dir. This writes the HTML report to output_dir.
  • It took me about 1 hour for a 3 GB fastq.gz file on the Uppmax cluster (Rackham) and 3 hours for a 30 GB file on the Uppmax cluster (Bianca). Make sure to allocate about 4-6 hours of computing time (see the job script sketch after this list).
  • fastqc's multi-threading option (-t) only parallelizes across files, so for a single file a single core suffices. Requesting a single core will also allow the job to be scheduled faster.
  • For details on how to read the generated report, refer to the documentation.
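
Here is a minimal SLURM job script sketch for running fastqc on Rackham. The project name is a placeholder, and the module names follow the Uppmax convention; check module avail on your own system:

#!/bin/bash
#SBATCH -A my_project              # placeholder: replace with your Uppmax project ID
#SBATCH -p core
#SBATCH -n 1                       # a single core suffices for one file
#SBATCH -t 06:00:00                # 4-6 hours, as noted above
module load bioinfo-tools FastQC   # Uppmax module names; adjust for your cluster
fastqc xxx_1.fastq.gz -o output_dir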

TODO: write a separate post on how to read/interpret the report.

Trimming adapters

The next step is to trim the adapters. Adapters are artefacts of sequencing: short synthetic sequences that are attached to the ends of each fragment during library preparation. They need to be removed before analysis.

TODO: explain how the adapters arise in another post.

The tool we will use for this task is cutadapt. This tool does more than just remove adapters; some of its capabilities include (a combined example follows the list):

  1. Adapter removal
    • This step requires the user to know the adapter sequence. The adapters are platform dependent and can be looked up in the manual.
  2. Quality trimming
    • Example: -q 20 trims bases with quality score below 20 from the 3’ end of each read. Similarly, -q 20,15 trims with cutoff 20 at the 5’ end and cutoff 15 at the 3’ end of each read. If only one value is specified, only the 3’ end is trimmed, according to the documentation.
  3. Filtering by read length:
    • Q: When filtering reads on length via -m l, if one read of a pair is shorter than l, is its mate discarded as well even though the mate is longer than l? The --pair-filter option appears to control this; the default (any) discards the pair if either read fails the filter.
    • TODO: run it with a very small file and see how it behaves.
  4. The full list of operations is available here.
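
Putting these options together, here is a minimal sketch of a paired-end run. The adapter sequence shown is the common Illumina TruSeq prefix and is only an assumption; look up the correct sequences for your kit. The output file names are also placeholders.

# -a / -A: adapter for read 1 / read 2 (assumed TruSeq prefix; check your kit's manual)
# -q 20,15: quality-trim the 5' end at cutoff 20 and the 3' end at cutoff 15
# -m 30: discard pairs where a read ends up shorter than 30 bp
# -j 4: use 4 cores
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20,15 -m 30 -j 4 \
  -o trimmed_1.fastq.gz -p trimmed_2.fastq.gz \
  xxx_1.fastq.gz xxx_2.fastq.gz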

Notes:

  1. Many of these operations are performed on both reads; therefore, the two files need to be processed (specified to cutadapt) together.
  2. Pipelining the output of cutadapt to the next step is possible and may be a good idea when storage is a concern. However, it may be of interest to run fastqc again on the trimmed data just to ensure that everything looks good after trimming.
  3. Q: cutadapt is written in Python. Are there better choices for this task?
  4. There is some discussion on whether this step is even necessary; for example, see this discussion and this one.
  5. This step can be parallelized, so request a large number of cores. Using 4 cores, it took me about 11 hours to process a 20 GB file.

Mapping to reference and sorting the BAM file

For the reference-mapping task there seem to be two popular choices: BWA-MEM and Bowtie2. I will discuss Bowtie2 here. The output of these tools is in the SAM (sequence alignment map) file format. A potential problem is that the output file can get rather big; hence, the recommended practice is to pipe the output of bowtie2 to samtools to produce a BAM (binary alignment map) file.

Note: the BAM file format is a compressed binary version of the SAM file format.

Running bowtie2 can be accomplished by issuing the following command:

bowtie2 --maxins 2000 --threads 16 -1 xxx_1.fastq.gz -2 xxx_2.fastq.gz -x /path/to/reference -S output.sam

Here is a description of each option:

  • -1: the file containing the first read of each pair
  • -2: the file containing the second read of each pair
  • -x: the basename of the reference genome index. Note that these are platform specific. On Uppmax, use the following path: /sw/data/uppnex/reference/Homo_sapiens/GRCh37/program_files/bowtie2/concat
  • --minins or -I: the minimum fragment length (insert size) for a valid paired-end alignment. The default value is 0.
  • --maxins or -X: the maximum fragment length (insert size) for a valid paired-end alignment. The default value is 500.
  • --threads or -p: the number of threads to use
  • -S: the output SAM file (writing a plain SAM file is not recommended; see below)
  • Refer to the manual for the complete list of options.

Note that the larger the gap between -I and -X, the slower the mapping process. If you have a good understanding of the sequencing technology used, specifying accurate values for -I and -X will shorten the mapping time.
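
As an aside, if a prebuilt index is not available for your reference, one can be built with bowtie2-build; the paths below are placeholders:

# Build a Bowtie2 index; the second argument is the basename later passed to -x.
bowtie2-build /path/to/reference.fasta /path/to/reference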

The recommended tool for compressing to BAM is samtools, a collection of tools frequently used for analyzing SAM/BAM files. Mapping and piping can be accomplished as follows:

bowtie2 --maxins 2000 --threads 16 -1 xxx_1.fastq.gz -2 xxx_2.fastq.gz -x /path/to/reference | samtools view -h -b - > mapped.bam

  • view: the samtools subcommand for viewing and converting SAM/BAM files
  • -b: convert to the BAM format
  • -h: include the header in the output
  • -: the dash tells samtools that the input comes from standard input (i.e., from the pipe).
  • Multi-threading seems to be available via the -@ num_threads option, but some people have reported that it can be unstable.
  • The manual is available here.

Finally, we need to sort the BAM file:

samtools sort -o sorted.bam mapped.bam
rm mapped.bam # remove the intermediate BAM file

Question: why do we need to sort the BAM file?

  • Basically, for faster downstream processing. See this slide.

Some people suggest that writing the raw (unsorted) BAM file to disk is also a waste of space and disk I/O, so only the sorted file should be stored. Others point out that piping too many commands together can itself create a bottleneck.

TODO: I have not tried this myself but at some point, try:

bowtie2 --maxins 2000 --threads 16 -1 xxx_1.fastq.gz -2 xxx_2.fastq.gz -x /path/to/reference | samtools view -h -b - | samtools sort -o sorted.bam -

Also, try with -@ 16 to see if multi-threading makes a difference.

Better yet, it is possible to bypass the step of converting SAM to BAM and jump straight to a sorted BAM:

bowtie2 --maxins 2000 --threads 16 -1 xxx_1.fastq.gz -2 xxx_2.fastq.gz -x /path/to/reference | samtools sort -@8 -o output.bam -

Time comparison

Compare the running times and output files of the following scenarios on a pair of 3 GB files:

  1. Mapping to a SAM output file (16 cores). Took about 2 hours. Produced a 70 GB file.
  2. Converting the SAM file from step 1 to BAM (single-threaded). TODO
  3. Sorting the BAM file from step 2 (16 cores). TODO
  4. Piping the output of the mapper directly to sort (16 cores for both map and sort). Took about 2 hours. Produced an 8 GB BAM file.
  5. Piping the output of the mapper to view for an unsorted BAM file (16 cores for map, single-threaded view). Took about 2 hours. Produced an 8 GB BAM file.

Marking duplicates

The exact reason why duplicates arise is not known to me yet. It appears to be an artefact of the sequencing technology (commonly attributed to PCR amplification during library preparation).

TODO: read up on the sequencing technologies to try and understand why this happens.

The tool that we will use for marking duplicates is Picard.

# Requesting 5GB of memory; on Rackham each core comes with 6.4GB, so we should be OK without requesting additional memory.
java -Xmx5g -jar ${PICARD}/picard.jar \
MarkDuplicates VALIDATION_STRINGENCY=LENIENT \
INPUT=sorted.bam \
OUTPUT=marked.bam \
METRICS_FILE=duplicate_metric.txt 2> log_fix_mark.txt

This command requires the location of picard.jar and the input BAM file. METRICS_FILE points to the file that will contain the duplication statistics. The final 2> redirects Picard's log output (which is written to stderr) to a log file.
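
As a quick sketch (assuming the standard Picard metrics layout), the duplication statistics can be peeked at on the command line; the two lines following the "## METRICS CLASS" header hold the column names and their values:

# Show the duplication statistics as an aligned table.
grep -A 2 '^## METRICS' duplicate_metric.txt | column -t -s $'\t'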

TODO: add a bit more on how to read and interpret the metrics file.

The above command will mark but not remove the duplicates, which is the recommended practice according to GATK (source):

By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the resulting file, set the REMOVE_DUPLICATES parameter to true. However, given you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, a better option for value-added storage efficiency is to retain the resulting marked file over the input file.

The source above states that we can create the index file at this point:

To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.

Note that Java is given 5 GB to work with; since one Rackham core comes with 6.4 GB, we do not need to request additional memory.

It took about 1.5 hours to complete this task on Rackham, run on the sorted BAM file.

Adding read group

This step is needed because GATK does not accept BAM files without read groups. The command is as follows:

java -Xmx5g -jar ${PICARD}/picard.jar \
AddOrReplaceReadGroups \
INPUT=marked.bam \
OUTPUT=fixed.bam \
CREATE_INDEX=true \
RGID=$1 \
RGSM=$2 \
RGLB=$3 \
RGPL=ILLUMINA \
RGPU="NA" 1>&2 2> log_fix_addreplace.txt

  • CREATE_INDEX: create the index file while adding the read group.
  • RGID: read group ID. The default value is 1, and it is recommended to change this to something other than the default: for example, a patient ID if there are multiple patients, or a cohort label such as TNBC for triple-negative breast cancer data.
  • RGSM: read group sample name. This can be a patient ID and/or information about the sample (e.g., blood or stomach) if available.
  • RGLB: read group library (the library preparation method?). When unknown, just fill in some default value like “NA”.
  • RGPL: read group (sequencing) platform. In my case, ILLUMINA.
  • RGPU: read group platform unit. It is a required field, but I’m not sure what it is supposed to refer to, so I filled in NA.
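
Assuming the commands above are saved in a script, say add_read_groups.sh (a hypothetical name), it could be invoked with the three positional arguments as:

# $1 = RGID, $2 = RGSM, $3 = RGLB (all example values)
bash add_read_groups.sh patient_007 blood lib1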

Creating index file

Alternatively, we can create the index file using samtools index marked.bam marked.bai, which produces marked.bai (on the order of a few MB for an 8 GB BAM file).
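
Once the index exists, a quick sanity check is samtools idxstats, which prints one line per contig (name, length, mapped read count, unmapped read count):

# Requires the index file next to the BAM; truncate the output with head.
samtools idxstats marked.bam | head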

Realignment

First, some notes:

  • The Genome Analysis Toolkit (GATK) can be used, but this step seems to be optional when using MuTect2 or HaplotypeCaller.
  • Do not do this step for amplicon sequencing (i.e., targeted sequencing data).
  • I think it’s case-by-case. If the producer of the data did this step, then do it. Otherwise, it is optional and depends on the variant calling tool that you are using.

The prerequisites for running indel realignment are files containing the reference sequence, the target intervals, and the known indel locations. Performing realignment over the entire genome would be very time consuming, so the recommended practice is to carry out local realignment around regions known to harbor insertions and deletions.

The reference sequence and known indel locations are usually available on large compute clusters like Uppmax, but the target interval file may not be available, in which case we need to create one:

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

Then, we call indel realignment as follows:

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I fixed.bam \
  -R /path/to/reference.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realigned.bam \
  -known /path/to/indel_calls.vcf

Note 1: If you don’t add the read groups, you will likely get the following error:

Error details: SAM file doesn’t have any read groups defined in the header. The GATK no longer supports SAM files without read groups.

Note 2: Error regarding incompatible contigs

This can happen when the chromosome names used in the reads differ from those used in the reference file. For example, in my case the reads referred to the chromosomes as 1, 2, …, 22, but in the reference they were labelled chr1, chr2, …, chr22.

This problem and the possible solutions are laid out in this post.
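
For the specific mismatch above, one common sketch is to rewrite the sequence names in the BAM header with samtools reheader. The sed expressions are assumptions about how your files are named; verify them against your own header first (unplaced contigs and MT vs chrM need care):

# Prefix numeric/X/Y sequence names with "chr" and map MT to chrM.
samtools view -H fixed.bam \
  | sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/' \
  | samtools reheader - fixed.bam > fixed_chr.bam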

Chances are, people in your lab have experienced this issue before and a compatible file format is available somewhere in Uppmax. This is how I solved this issue.

It took about 9.5 hours (single core) on an 8 GB BAM file.

Additional steps for cleaning

These steps seemed less important to me, so I skipped them, but for a careful and accurate analysis they should perhaps not be skipped:

  • samtools flagstat, to summarize how the reads were mapped
  • Additional filtering to remove low-quality reads, unmapped reads, and marked duplicates (see the sketch after this list)
  • Base quality score recalibration
    • Uses studies on the biases of the sequencing platform to correct the quality scores; certain platforms are known to produce more G’s than A’s, for example.
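
For the filtering item, here is a minimal sketch with samtools; the quality cutoff and flag mask are assumptions, so tune them to your analysis:

# Summarize the mapping results first.
samtools flagstat realigned.bam
# -q 20: drop reads with mapping quality below 20
# -F 0x504: drop unmapped (0x4), secondary (0x100), and duplicate (0x400) reads
samtools view -b -q 20 -F 0x504 realigned.bam > filtered.bam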

At this point, the realigned BAM file is more or less ready for analyses such as variant calling, and perhaps even copy-number analysis. These next steps will be described in future posts.
