Processing Melanoma Data
Single cell RNA-seq
We tried our best to follow the supplementary materials towards analyzing scRNA-seq.
Softwares
- You can get bowtie using conda:
conda install -c bioconda bowtie
.- You can also download a pre-built package from here. Make sure to update
PATH
to point to install location. - Make sure to get the latest version, 1.2.3. I tried using 1.0.0 but it doesn’t seem to work with
fastq.gz
.
- You can also download a pre-built package from here. Make sure to update
- We also need
samtools
to generate BAM files as well as for sorting and indexing:conda install -c bioconda samtools
. - We will use RSEM, which is available here.
Aligning scRNA-seq
In the supp, the authors indcitate that they “mapped to the UCSC hg19 human transcriptome using Bowtie.” It turns out that we don’t need to run bowtie as RSEM can align and output BAM files.
Using RSEM to align and calculate TPM
It turns out that RSEM will do quite a few things for us. First, RSEM requires you to build references. The instructions are available on their Github page. The bulk data that I received from the authors came with the reference hg19 that they used for processing. Using the same reference file turned out to be important for running Strelka to call variants from single cell RNA-seq using matching normal bulk. We still need to download the GTF file. I chose GRCh37.75:
wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
I have the GTF file as well as the hg19 in a common directory, call it, <REF_LOC>
.
Using gunzip
to decompress the files and excute a script containing the following command:
rsem-prepare-reference --gtf <REF_LOC>/Homo_sapiens.GRCh37.75.gtf \
--bowtie \
<REF_LOC>/Homo_sapiens_assembly19.fasta \
<REF_LOC>/bowtie/human_ensembl
Note:
- You will need to create a destination directory
<REF_LOC>/bowtie
. - This command builds Bowtie index, which we are going to use for aligning single cells.
- I ran this command in the interactive mode, i.e., did not submit it as a job. It didn’t take too long, roughly about 10-15 minutes. Depending on your compute cluster rules, you may need to submit it as job.
- You may need to specify path to
bowtie
using the option--bowtie-path
ifbowtie
is not in your search path. - Option
-p
can be used if you desire to use multiple cores. - Full set of options are available here.
To calculate the expression values, we use the command rsem-calculate-expression
(documentation):
rsem-calculate-expression --output-genome-bam \
--phred33-quals \
--bowtie-n 1 \
--bowtie-e 99999999 \
--bowtie-m 15 \
--seed-length 25 \
--fragment-length-min 1 \
--fragment-length-max 2000 \
-p 1 \
--paired-end \
<fastq1> \
<fastq2> \
<REF_LOC>/human_ensembl \
<OUTPUT_NAME>
Compare the above to the options used to run bowtie in Tirosh et al:
-q --phred33-quals -n 1 -e 99999999 -m 15 -l 25 -I 1 -X 2000 -a -S -p 6
-q
: This option indicates that the input files are in fastq. The single cells are infastq.gz
, so I thought we need to specify--gzipped-read-file
but it seems to work without it.--phred33-quals, -n, -e, -m
are matched in both commands straightforwardly.-l
: corresponds to--seed-length
. The default value is 25 so we actually do not need to specify it but we do so anyways for completeness.-I
: corresponds to--fragment-length-min
. The default value is 1 forrsem-calculate-expression
so we actually don’t need to specify it (do it for completeness).-X
: corresponds to--fragment-length-max
. The default value is 1,000 so we definitely need to specify its value as 2,000 to match Tirosh et al.-a
: The default value is off for Bowtie. I wasn’t able to find the corresponding option forrsem-calculate-expression
.-S
: Indicates that the output will be in SAM format. This option is irrelevant to us sincersem-calculate-expression
will output BAM files directly specified using--output-genome-bam
option.-p
: Indicates the number of threads to use. As I’m submitting one job, each allocated 1 core for each cell, I chose to specify 1 thread.
Because the single cells are small in size, the processing time was rather fast. Even for a cell with size about 200MB, I think it took only about 15 minutes.
Bulk WES-seq
We were able to acquire the bulk WES and scRNA data for a handful of melanoma patients analyzed in Tirosh et al: mel78, mel79, mel81, and mel82. The bulk WES for each patient was shared as BAM file and each tumor sample was accompanied by matching normal sample. By checking the header file, the BAM files appeared ready to be analyzed. Because we are given already aligned BAM files, it is important that we use hg19 (GRCh37) for aligning single cells (later).
We are going to call somatic and germline variants and copy numbers.
Softwares
- MuTect2.
- You can get GATK here, which includes MuTect2 among other useful tools.
- Strelka.
- HaplotypeCaller.
- FREEC.
MuTect2
The manual for MuTect2 is found here. Note: we did not use panel of normal.
gatk --java-options "-Xmx16g" Mutect2 \
-R <REF_LOC>/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
-I ${SAMPLE_DIR}/${TUMOR_SAMPLE}.bam \
-tumor ${TUMOR_SAMPLE} \
-I ${SAMPLE_DIR}/${NORMAL_SAMPLE}.bam \
-normal ${NORMAL_SAMPLE} \
-O ${SAMPLE_DIR}/somatic.vcf.gz \
--f1r2-tar-gz ${SAMPLE_DIR}/f1r2.mutect2.tar.gz
Strelka
strelka-2.9.2.centos6_x86_64/bin/configureStrelkaSomaticWorkflow.py \
--normalBam <NORMAL_BAM> \
--tumorBam <TUMOR_BAM> \
--referenceFasta <REF_LOC>/Homo_sapiens_assembly19.fasta \
--runDir <RUN_DIR> \
--exome
HaplotypeCaller
gatk --java-options "-Xmx16g" HaplotypeCaller \
-R <REF_LOC>/Homo_sapiens_assembly19.fasta \
-I ${SAMPLE_DIR}/${SAMPLE_NAME}.bam \
-O ${SAMPLE_DIR}/${SAMPLE_NAME}.germline.vcf.gz \
-ERC GVCF
FREEC
Although FREEC supports inferring copy number per exon, I was not able to get it to run. The error messages were rather cryptic, for example,
out of range error: Terminate called after throwing an instance of ‘std::out_of_range` what(): basic_string::substr: __pos (which is 43690) > this->size() (which is 0).
My previous experience was FREEC was rather smooth when analyzing WGS so I decided to try to use it without exon option, which basically amounts to specifying suitable window and step size. After a bit of a trial-and-error, I settled on using window size of 100 and step size of 50. I did a quick visualization check using IGV to verify that the results seemed reasonable.
Preparation
FREEC requires various files to be available:
- GRCh37 reference split by chromosomes and
- file containing lengths for each chromosomes.
We already have access to GRCh37 from RSEM step. We can find GRCh37 split by chromosome on Ensembl FTP.
For example release-75 is available here.
We can simply use wget
to get the files we need.
For example,
wget -r -A 'Homo_sapiens.GRCh37.75.dna.chromosome.*.fa.gz' ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/
will retrieve all chromosome files we need.
Chromsome length can be obtained using samtools
:
samtools faidx Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
Run FREEC
The configuration file to run FREEC is available here, which mostly uses default values. The following fields need to be modified the sample being analyzed:
- general
- outputDir
- sex: {78: M, 79: M, 81: F, 82: M}.
- sample: mateFile
- control: mateFile
Identify somatic SNVs for analysis
A commonly adopted practice is to consider SNVs that are in both of these variant callers. Furthermore, we will focus on somatic SNVs that fall on an exon. To that end, I downloaded refGene.txt and wrote an R script that finds intersection of the MuTect2, Strelka, and exons found in refGene.txt.