I use Sarek, a pipeline developed at SciLifeLab that is built on Nextflow and performs the standard processing of NGS data as recommended by the GATK best practices. It produces BAM files for the tumor and normal samples.

One of the important steps in analyzing tumor NGS data is copy number calling. Sarek ships with ascat and provides details on how to use it on NGS data here. This post describes the steps I took to run ascat on the ALL data published with this paper.

Identifying SNP positions

It appears that ascat was originally developed for SNP array data. Ascat analyzes logR (the log ratio of total signal) and BAF (B allele frequency) values at the positions probed by the SNP array. Our data, however, is NGS data from tumor and normal samples, so we need to specify the positions to analyze and compute these values ourselves.

Fortunately, this step has already been done for us by the SciLifeLab team. A position is chosen if it is bi-allelic and if its minor allele frequency is greater than 0.3 (see here).
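The selection rule is simple enough to write down as a small filter. This is only a sketch of the rule as stated above, not the script the SciLifeLab team used; the function name `keep_locus` and the record layout are made up for illustration.

```python
def keep_locus(alts, minor_allele_freq, min_maf=0.3):
    """Return True if a SNP passes the rule described above.

    alts: list of alternate alleles observed at the position.
    minor_allele_freq: population minor allele frequency (hypothetical input).
    """
    is_biallelic = len(alts) == 1          # exactly one alternate allele
    return is_biallelic and minor_allele_freq > min_maf


# Hypothetical records: (chrom, pos, ref, alts, maf)
candidates = [
    ("1", 1000, "A", ["G"], 0.42),       # kept
    ("1", 2000, "C", ["T", "G"], 0.35),  # dropped: not bi-allelic
    ("1", 3000, "G", ["A"], 0.05),       # dropped: MAF too low
]

loci = [(chrom, pos) for chrom, pos, ref, alts, maf in candidates
        if keep_locus(alts, maf)]
```

A high minor allele frequency matters because only positions that are heterozygous in the patient carry BAF information, and common SNPs are the ones most likely to be heterozygous.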

Counting A and B alleles

The next step is to count the number of major and minor alleles at each of the identified positions. For this, we can use alleleCount, which is already installed on Uppmax:

module load bioinfo-tools alleleCount
alleleCounter -l [path_to_loci_file] -r [path_to_human_ref_genome] -b [path_to_tumor.bam] -o output.allelecount

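The inputs are the loci file (the SNP positions identified in the previous step), the reference human genome, and the tumor BAM file. Running the same command on the normal BAM file gives the matching counts for the normal sample.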
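To connect these counts to the BAF values that ascat works with, here is a minimal sketch in Python. It assumes the output is a tab-separated file with a header line starting with `#CHR` followed by per-base counts (`Count_A`, `Count_C`, `Count_G`, `Count_T`) and a depth column; check this against your own output file. The helper names and the choice of which base is the A or B allele are hypothetical, and in practice the alleles come from the loci definition.

```python
import csv
import io


def read_allele_counts(handle):
    """Parse alleleCounter-style rows into {(chrom, pos): {base: count}}."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the header line and blank lines
        chrom, pos = row[0], int(row[1])
        a, c, g, t = (int(x) for x in row[2:6])
        counts[(chrom, pos)] = {"A": a, "C": c, "G": g, "T": t}
    return counts


def baf(base_counts, allele_a, allele_b):
    """Observed BAF: reads with allele B over reads with allele A or B."""
    total = base_counts[allele_a] + base_counts[allele_b]
    return base_counts[allele_b] / total if total else None


# Made-up example line in the assumed format, not real data.
example = (
    "#CHR\tPOS\tCount_A\tCount_C\tCount_G\tCount_T\tGood_depth\n"
    "1\t1000\t12\t0\t28\t0\t40\n"
)
counts = read_allele_counts(io.StringIO(example))
tumor_baf = baf(counts[("1", 1000)], allele_a="A", allele_b="G")  # 28 / 40
```

Positions with no reads for either allele return `None` rather than raising, so they can be filtered out before anything is passed on.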

Ascat SI

I had to stare at Eq. S5 in the SI for quite some time: \(b_i = \frac{1 - \rho + \rho n_B}{2 (1 - \rho) + \rho (n_A + n_B)}\). Here \(\rho\) refers to the tumor fraction and \(n_A, n_B\) denote the true number of copies of the major and minor alleles in the tumor cells. The observed BAF \(\hat{b}_i\), the fraction of reads at position \(i\) that carry the minor allele, is a noisy version of the true value \(b_i\).

This expression makes sense once we realize that the position is a heterozygous SNP (i.e., a germline variant). The \((1-\rho)\) term in the numerator is the fraction of cells that are healthy, and each healthy cell carries one copy of the major allele and one copy of the minor allele. The other term, \(\rho n_B\), is more straightforward, as it refers to the number of copies of the minor allele in the tumor population. The denominator counts all copies in the same way: two per healthy cell and \(n_A + n_B\) per tumor cell.
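Plugging in a few numbers helped me convince myself. The snippet below is a direct transcription of Eq. S5 as written above; the function name `expected_baf` is my own and is not part of ascat.

```python
def expected_baf(rho, n_a, n_b):
    """True BAF b_i from Eq. S5 for tumor fraction rho and tumor copies n_a, n_b."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    numerator = (1 - rho) + rho * n_b                # minor allele copies
    denominator = 2 * (1 - rho) + rho * (n_a + n_b)  # all copies
    if denominator == 0:
        raise ValueError("no copies present (pure tumor with n_a = n_b = 0)")
    return numerator / denominator


pure_normal = expected_baf(0.0, 2, 1)  # 0.5: only healthy heterozygous cells
pure_tumor = expected_baf(1.0, 2, 1)   # 1/3: one minor copy out of three
half_loh = expected_baf(0.5, 2, 0)     # 0.25: tumor has lost the minor allele
```

The last case shows why the tumor fraction matters: even when the tumor has lost the minor allele completely, the healthy cells in the sample keep the BAF well above zero.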