Single cell RNA-seq

We tried our best to follow the supplementary materials towards analyzing scRNA-seq.

Softwares

You can get bowtie using conda: conda install -c bioconda bowtie.
- You can also download a pre-built package from here. Make sure to update PATH to point to install location.
- Make sure to get the latest version, 1.2.3. I tried using 1.0.0 but it doesn’t seem to work with fastq.gz.
We also need samtools to generate BAM files as well as for sorting and indexing: conda install -c bioconda samtools.
We will use RSEM, which is available here.

Aligning scRNA-seq

In the supp, the authors indcitate that they “mapped to the UCSC hg19 human transcriptome using Bowtie.” It turns out that we don’t need to run bowtie as RSEM can align and output BAM files.

Using RSEM to align and calculate TPM

It turns out that RSEM will do quite a few things for us. First, RSEM requires you to build references. The instructions are available on their Github page. The bulk data that I received from the authors came with the reference hg19 that they used for processing. Using the same reference file turned out to be important for running Strelka to call variants from single cell RNA-seq using matching normal bulk. We still need to download the GTF file. I chose GRCh37.75:

wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

I have the GTF file as well as the hg19 in a common directory, call it, <REF_LOC>. Using gunzip to decompress the files and excute a script containing the following command:

rsem-prepare-reference --gtf <REF_LOC>/Homo_sapiens.GRCh37.75.gtf \
               --bowtie \
               <REF_LOC>/Homo_sapiens_assembly19.fasta \
               <REF_LOC>/bowtie/human_ensembl

Note:

You will need to create a destination directory <REF_LOC>/bowtie.
This command builds Bowtie index, which we are going to use for aligning single cells.
I ran this command in the interactive mode, i.e., did not submit it as a job. It didn’t take too long, roughly about 10-15 minutes. Depending on your compute cluster rules, you may need to submit it as job.
You may need to specify path to bowtie using the option --bowtie-path if bowtie is not in your search path.
Option -p can be used if you desire to use multiple cores.
Full set of options are available here.

To calculate the expression values, we use the command rsem-calculate-expression (documentation):

rsem-calculate-expression --output-genome-bam \
                          --phred33-quals \
                          --bowtie-n 1 \
                          --bowtie-e 99999999 \
                          --bowtie-m 15 \
                          --seed-length 25 \
                          --fragment-length-min 1 \
                          --fragment-length-max 2000 \
                          -p 1 \
                          --paired-end \
                          <fastq1> \
                          <fastq2> \
                          <REF_LOC>/human_ensembl \
                          <OUTPUT_NAME>

Compare the above to the options used to run bowtie in Tirosh et al:

-q --phred33-quals -n 1 -e 99999999 -m 15 -l 25 -I 1 -X 2000 -a -S -p 6

-q: This option indicates that the input files are in fastq. The single cells are in fastq.gz, so I thought we need to specify --gzipped-read-file but it seems to work without it.
--phred33-quals, -n, -e, -m are matched in both commands straightforwardly.
-l: corresponds to --seed-length. The default value is 25 so we actually do not need to specify it but we do so anyways for completeness.
-I: corresponds to --fragment-length-min. The default value is 1 for rsem-calculate-expression so we actually don’t need to specify it (do it for completeness).
-X: corresponds to --fragment-length-max. The default value is 1,000 so we definitely need to specify its value as 2,000 to match Tirosh et al.
-a: The default value is off for Bowtie. I wasn’t able to find the corresponding option for rsem-calculate-expression.
-S: Indicates that the output will be in SAM format. This option is irrelevant to us since rsem-calculate-expression will output BAM files directly specified using --output-genome-bam option.
-p: Indicates the number of threads to use. As I’m submitting one job, each allocated 1 core for each cell, I chose to specify 1 thread.

Because the single cells are small in size, the processing time was rather fast. Even for a cell with size about 200MB, I think it took only about 15 minutes.

Bulk WES-seq

We were able to acquire the bulk WES and scRNA data for a handful of melanoma patients analyzed in Tirosh et al: mel78, mel79, mel81, and mel82. The bulk WES for each patient was shared as BAM file and each tumor sample was accompanied by matching normal sample. By checking the header file, the BAM files appeared ready to be analyzed. Because we are given already aligned BAM files, it is important that we use hg19 (GRCh37) for aligning single cells (later).

We are going to call somatic and germline variants and copy numbers.

Softwares

MuTect2.
- You can get GATK here, which includes MuTect2 among other useful tools.
Strelka.
HaplotypeCaller.
FREEC.
- sambamba.
- bedtools.

MuTect2

The manual for MuTect2 is found here. Note: we did not use panel of normal.

gatk --java-options "-Xmx16g" Mutect2 \
  -R <REF_LOC>/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa  \
  -I ${SAMPLE_DIR}/${TUMOR_SAMPLE}.bam \
  -tumor ${TUMOR_SAMPLE} \
  -I ${SAMPLE_DIR}/${NORMAL_SAMPLE}.bam \
  -normal ${NORMAL_SAMPLE} \
  -O ${SAMPLE_DIR}/somatic.vcf.gz \
  --f1r2-tar-gz ${SAMPLE_DIR}/f1r2.mutect2.tar.gz

Strelka

strelka-2.9.2.centos6_x86_64/bin/configureStrelkaSomaticWorkflow.py \
    --normalBam <NORMAL_BAM> \
    --tumorBam <TUMOR_BAM> \
    --referenceFasta <REF_LOC>/Homo_sapiens_assembly19.fasta \
    --runDir <RUN_DIR> \
    --exome

HaplotypeCaller

gatk --java-options "-Xmx16g" HaplotypeCaller \
  -R <REF_LOC>/Homo_sapiens_assembly19.fasta \
  -I ${SAMPLE_DIR}/${SAMPLE_NAME}.bam \
  -O ${SAMPLE_DIR}/${SAMPLE_NAME}.germline.vcf.gz \
  -ERC GVCF

FREEC

Although FREEC supports inferring copy number per exon, I was not able to get it to run. The error messages were rather cryptic, for example,

out of range error: Terminate called after throwing an instance of ‘std::out_of_range` what(): basic_string::substr: __pos (which is 43690) > this->size() (which is 0).

My previous experience was FREEC was rather smooth when analyzing WGS so I decided to try to use it without exon option, which basically amounts to specifying suitable window and step size. After a bit of a trial-and-error, I settled on using window size of 100 and step size of 50. I did a quick visualization check using IGV to verify that the results seemed reasonable.

Preparation

FREEC requires various files to be available:

GRCh37 reference split by chromosomes and
file containing lengths for each chromosomes.

We already have access to GRCh37 from RSEM step. We can find GRCh37 split by chromosome on Ensembl FTP. For example release-75 is available here. We can simply use wget to get the files we need. For example,

wget -r -A 'Homo_sapiens.GRCh37.75.dna.chromosome.*.fa.gz' ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/

will retrieve all chromosome files we need.

Chromsome length can be obtained using samtools:

samtools faidx Homo_sapiens.GRCh37.75.dna.primary_assembly.fa

Run FREEC

The configuration file to run FREEC is available here, which mostly uses default values. The following fields need to be modified the sample being analyzed:

general
- outputDir
- sex: {78: M, 79: M, 81: F, 82: M}.
sample: mateFile
control: mateFile

Identify somatic SNVs for analysis

A commonly adopted practice is to consider SNVs that are in both of these variant callers. Furthermore, we will focus on somatic SNVs that fall on an exon. To that end, I downloaded refGene.txt and wrote an R script that finds intersection of the MuTect2, Strelka, and exons found in refGene.txt.