• Required Python libraries: pandas, numpy

This page contains exploratory code written in a Jupyter Notebook. For a tutorial on Jupyter Notebook, go here.

The file is available for download as a notebook here.

import pandas as pd
import numpy as np

file_name = "table.txt"

df = pd.read_table(file_name, sep='\t')
print(df.columns) # check the column names, we want library_name to be there!

Index(['study_accession', 'sample_accession', 'secondary_sample_accession',
       'experiment_accession', 'run_accession', 'tax_id', 'scientific_name',
       'instrument_model', 'library_name', 'library_layout', 'fastq_bytes',
       'fastq_ftp'],
      dtype='object')
patient_ids = df['library_name'].astype(str).str[0:6]
patients = set(patient_ids)
print(sorted(patients)) 
# there are 21 patients, but there are supposed to be only 20 patients according to the paper
# KTN609 does not appear in the appendix of the paper: https://www.cell.com/cms/attachment/2119295259/2091819478/mmc1.pdf
['KTN102', 'KTN115', 'KTN126', 'KTN129', 'KTN132', 'KTN134', 'KTN147', 'KTN152', 'KTN155', 'KTN206', 'KTN210', 'KTN215', 'KTN302', 'KTN304', 'KTN310', 'KTN316', 'KTN317', 'KTN501', 'KTN609', 'KTN612', 'KTN615']
# how many records (rows) are associated with patient KTN102?
patient_id = 'KTN102'
print(np.sum(patient_ids == patient_id)) # 511 records; each record can list several fastq files

511
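
The same count can be taken for every patient at once. The following is a minimal sketch on a small made-up table: the `toy` frame and most of its library names are illustrative and do not come from `table.txt`. It relies only on the observation above that the first six characters of `library_name` are the patient ID.

```python
import pandas as pd

# Toy stand-in for the real table; only the library_name column matters here.
# The names are illustrative and are not guaranteed to exist in table.txt.
toy = pd.DataFrame({
    "library_name": ["KTN102Blood", "KTN102OP", "KTN102_0_10",
                     "KTN115Blood", "KTN115_0_1", "KTN609_OP_3"]
})

# The first six characters of library_name are the patient ID.
toy_patient_ids = toy["library_name"].astype(str).str[:6]

# Number of records (rows) per patient, sorted by patient ID.
records_per_patient = toy_patient_ids.value_counts().sort_index()
print(records_per_patient)
```

On the real table, replacing `toy` with `df` should give one row count per patient, which makes it easy to spot patients with unusually few or many records.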

For this dataset, there are

  • bulk data: BLOOD, PRE, MID, POST,
  • single cell data: scDNA and scRNA.

Let’s extract the library name patterns.

df2 = df.loc[patient_ids == patient_id]
sorted(list(df2['library_name']))
['KTN1020',
 'KTN1020cells1',
 'KTN1020cells10',
 'KTN1020cells11',
 'KTN1020cells12',
 'KTN1020cells13',
 'KTN1020cells14',
 'KTN1020cells15',
 'KTN1020cells16',
 'KTN1020cells17',
 'KTN1020cells18',
 'KTN1020cells19',
 'KTN1020cells2',
 'KTN1020cells20',
 'KTN1020cells21',
 'KTN1020cells22',
 'KTN1020cells23',
 'KTN1020cells24',
 'KTN1020cells25',
 'KTN1020cells26',
 'KTN1020cells27',
 'KTN1020cells28',
 'KTN1020cells29',
 'KTN1020cells3',
 'KTN1020cells30',
 'KTN1020cells31',
 'KTN1020cells32',
 'KTN1020cells33',
 'KTN1020cells34',
 'KTN1020cells35',
 'KTN1020cells36',
 'KTN1020cells37',
 'KTN1020cells38',
 'KTN1020cells39',
 'KTN1020cells4',
 'KTN1020cells40',
 'KTN1020cells41',
 'KTN1020cells42',
 'KTN1020cells43',
 'KTN1020cells44',
 'KTN1020cells45',
 'KTN1020cells46',
 'KTN1020cells47',
 'KTN1020cells48',
 'KTN1020cells49',
 'KTN1020cells5',
 'KTN1020cells50',
 'KTN1020cells51',
 'KTN1020cells52',
 'KTN1020cells53',
 'KTN1020cells54',
 'KTN1020cells55',
 'KTN1020cells56',
 'KTN1020cells57',
 'KTN1020cells58',
 'KTN1020cells59',
 'KTN1020cells6',
 'KTN1020cells60',
 'KTN1020cells61',
 'KTN1020cells62',
 'KTN1020cells63',
 'KTN1020cells64',
 'KTN1020cells65',
 'KTN1020cells66',
 'KTN1020cells67',
 'KTN1020cells68',
 'KTN1020cells69',
 'KTN1020cells7',
 'KTN1020cells70',
 'KTN1020cells71',
 'KTN1020cells72',
 'KTN1020cells73',
 'KTN1020cells74',
 'KTN1020cells75',
 'KTN1020cells76',
 'KTN1020cells77',
 'KTN1020cells78',
 'KTN1020cells79',
 'KTN1020cells8',
 'KTN1020cells80',
 'KTN1020cells81',
 'KTN1020cells82',
 'KTN1020cells83',
 'KTN1020cells84',
 'KTN1020cells85',
 'KTN1020cells86',
 'KTN1020cells87',
 'KTN1020cells88',
 'KTN1020cells89',
 'KTN1020cells9',
 'KTN1020cells90',
 'KTN1020cells91',
 'KTN1022',
 'KTN102Blood',
 'KTN102OP',
 'KTN102OPcells1',
 'KTN102OPcells10',
 'KTN102OPcells100',
 'KTN102OPcells101',
 'KTN102OPcells102',
 'KTN102OPcells103',
 'KTN102OPcells104',
 'KTN102OPcells105',
 'KTN102OPcells106',
 'KTN102OPcells107',
 'KTN102OPcells108',
 'KTN102OPcells109',
 'KTN102OPcells11',
 'KTN102OPcells110',
 'KTN102OPcells111',
 'KTN102OPcells112',
 'KTN102OPcells113',
 'KTN102OPcells114',
 'KTN102OPcells115',
 'KTN102OPcells116',
 'KTN102OPcells117',
 'KTN102OPcells118',
 'KTN102OPcells119',
 'KTN102OPcells12',
 'KTN102OPcells120',
 'KTN102OPcells121',
 'KTN102OPcells122',
 'KTN102OPcells123',
 'KTN102OPcells124',
 'KTN102OPcells125',
 'KTN102OPcells126',
 'KTN102OPcells127',
 'KTN102OPcells128',
 'KTN102OPcells129',
 'KTN102OPcells13',
 'KTN102OPcells130',
 'KTN102OPcells131',
 'KTN102OPcells132',
 'KTN102OPcells133',
 'KTN102OPcells134',
 'KTN102OPcells135',
 'KTN102OPcells136',
 'KTN102OPcells137',
 'KTN102OPcells138',
 'KTN102OPcells139',
 'KTN102OPcells14',
 'KTN102OPcells140',
 'KTN102OPcells141',
 'KTN102OPcells142',
 'KTN102OPcells143',
 'KTN102OPcells144',
 'KTN102OPcells145',
 'KTN102OPcells146',
 'KTN102OPcells147',
 'KTN102OPcells148',
 'KTN102OPcells149',
 'KTN102OPcells15',
 'KTN102OPcells150',
 'KTN102OPcells151',
 'KTN102OPcells152',
 'KTN102OPcells153',
 'KTN102OPcells154',
 'KTN102OPcells155',
 'KTN102OPcells156',
 'KTN102OPcells157',
 'KTN102OPcells158',
 'KTN102OPcells159',
 'KTN102OPcells16',
 'KTN102OPcells160',
 'KTN102OPcells161',
 'KTN102OPcells162',
 'KTN102OPcells163',
 'KTN102OPcells164',
 'KTN102OPcells165',
 'KTN102OPcells166',
 'KTN102OPcells167',
 'KTN102OPcells168',
 'KTN102OPcells169',
 'KTN102OPcells17',
 'KTN102OPcells170',
 'KTN102OPcells171',
 'KTN102OPcells172',
 'KTN102OPcells173',
 'KTN102OPcells174',
 'KTN102OPcells175',
 'KTN102OPcells176',
 'KTN102OPcells177',
 'KTN102OPcells178',
 'KTN102OPcells179',
 'KTN102OPcells18',
 'KTN102OPcells180',
 'KTN102OPcells181',
 'KTN102OPcells182',
 'KTN102OPcells183',
 'KTN102OPcells184',
 'KTN102OPcells19',
 'KTN102OPcells2',
 'KTN102OPcells20',
 'KTN102OPcells21',
 'KTN102OPcells22',
 'KTN102OPcells23',
 'KTN102OPcells24',
 'KTN102OPcells25',
 'KTN102OPcells26',
 'KTN102OPcells27',
 'KTN102OPcells28',
 'KTN102OPcells29',
 'KTN102OPcells3',
 'KTN102OPcells30',
 'KTN102OPcells31',
 'KTN102OPcells32',
 'KTN102OPcells33',
 'KTN102OPcells34',
 'KTN102OPcells35',
 'KTN102OPcells36',
 'KTN102OPcells37',
 'KTN102OPcells38',
 'KTN102OPcells39',
 'KTN102OPcells4',
 'KTN102OPcells40',
 'KTN102OPcells41',
 'KTN102OPcells42',
 'KTN102OPcells43',
 'KTN102OPcells44',
 'KTN102OPcells45',
 'KTN102OPcells46',
 'KTN102OPcells47',
 'KTN102OPcells48',
 'KTN102OPcells49',
 'KTN102OPcells5',
 'KTN102OPcells50',
 'KTN102OPcells51',
 'KTN102OPcells52',
 'KTN102OPcells53',
 'KTN102OPcells54',
 'KTN102OPcells55',
 'KTN102OPcells56',
 'KTN102OPcells57',
 'KTN102OPcells58',
 'KTN102OPcells59',
 'KTN102OPcells6',
 'KTN102OPcells60',
 'KTN102OPcells61',
 'KTN102OPcells62',
 'KTN102OPcells63',
 'KTN102OPcells64',
 'KTN102OPcells65',
 'KTN102OPcells66',
 'KTN102OPcells67',
 'KTN102OPcells68',
 'KTN102OPcells69',
 'KTN102OPcells7',
 'KTN102OPcells70',
 'KTN102OPcells71',
 'KTN102OPcells72',
 'KTN102OPcells73',
 'KTN102OPcells74',
 'KTN102OPcells75',
 'KTN102OPcells76',
 'KTN102OPcells77',
 'KTN102OPcells78',
 'KTN102OPcells79',
 'KTN102OPcells8',
 'KTN102OPcells80',
 'KTN102OPcells81',
 'KTN102OPcells82',
 'KTN102OPcells83',
 'KTN102OPcells84',
 'KTN102OPcells85',
 'KTN102OPcells86',
 'KTN102OPcells87',
 'KTN102OPcells88',
 'KTN102OPcells89',
 'KTN102OPcells9',
 'KTN102OPcells90',
 'KTN102OPcells91',
 'KTN102OPcells92',
 'KTN102OPcells93',
 'KTN102OPcells94',
 'KTN102OPcells95',
 'KTN102OPcells96',
 'KTN102OPcells97',
 'KTN102OPcells98',
 'KTN102OPcells99',
 'KTN102_0_10',
 'KTN102_0_10_B2',
 'KTN102_0_11',
 'KTN102_0_11_B2',
 'KTN102_0_12',
 'KTN102_0_12_B2',
 'KTN102_0_13',
 'KTN102_0_14',
 'KTN102_0_14_B2',
 'KTN102_0_15_B2',
 'KTN102_0_16',
 'KTN102_0_16_B2',
 'KTN102_0_17',
 'KTN102_0_17_B2',
 'KTN102_0_18',
 'KTN102_0_18_B2',
 'KTN102_0_19',
 'KTN102_0_19_B2',
 'KTN102_0_1_B2',
 'KTN102_0_2',
 'KTN102_0_20',
 'KTN102_0_20_B2',
 'KTN102_0_21',
 'KTN102_0_22',
 'KTN102_0_22_B2',
 'KTN102_0_23',
 'KTN102_0_23_B2',
 'KTN102_0_24',
 'KTN102_0_24_B2',
 'KTN102_0_25',
 'KTN102_0_25_B2',
 'KTN102_0_26',
 'KTN102_0_26_B2',
 'KTN102_0_27',
 'KTN102_0_27_B2',
 'KTN102_0_28',
 'KTN102_0_28_B2',
 'KTN102_0_29',
 'KTN102_0_29_B2',
 'KTN102_0_2_B2',
 'KTN102_0_3',
 'KTN102_0_30',
 'KTN102_0_30_B2',
 'KTN102_0_31',
 'KTN102_0_31_B2',
 'KTN102_0_32',
 'KTN102_0_32_B2',
 'KTN102_0_33',
 'KTN102_0_33_B2',
 'KTN102_0_34',
 'KTN102_0_34_B2',
 'KTN102_0_35',
 'KTN102_0_35_B2',
 'KTN102_0_36',
 'KTN102_0_36_B2',
 'KTN102_0_37',
 'KTN102_0_37_B2',
 'KTN102_0_38',
 'KTN102_0_38_B2',
 'KTN102_0_39',
 'KTN102_0_39_B2',
 'KTN102_0_4',
 'KTN102_0_40',
 'KTN102_0_40_B2',
 'KTN102_0_41',
 'KTN102_0_41_B2',
 'KTN102_0_42',
 'KTN102_0_43',
 'KTN102_0_43_B2',
 'KTN102_0_44',
 'KTN102_0_44_B2',
 'KTN102_0_45',
 'KTN102_0_45_B2',
 'KTN102_0_46',
 'KTN102_0_46_B2',
 'KTN102_0_47',
 'KTN102_0_47_B2',
 'KTN102_0_48',
 'KTN102_0_48_B2',
 'KTN102_0_4_B2',
 'KTN102_0_5',
 'KTN102_0_5_B2',
 'KTN102_0_6',
 'KTN102_0_6_B2',
 'KTN102_0_7',
 'KTN102_0_7_B2',
 'KTN102_0_8',
 'KTN102_0_8_B2',
 'KTN102_0_9',
 'KTN102_0_9_B2',
 'KTN102_0_Pop',
 'KTN102_2_01',
 'KTN102_2_02',
 'KTN102_2_03',
 'KTN102_2_04',
 'KTN102_2_05',
 'KTN102_2_06',
 'KTN102_2_07',
 'KTN102_2_08',
 'KTN102_2_09',
 'KTN102_2_10',
 'KTN102_2_10_B2',
 'KTN102_2_11',
 'KTN102_2_11_B2',
 'KTN102_2_12',
 'KTN102_2_12_B2',
 'KTN102_2_13',
 'KTN102_2_13_B2',
 'KTN102_2_14',
 'KTN102_2_14_B2',
 'KTN102_2_15',
 'KTN102_2_15_B2',
 'KTN102_2_16',
 'KTN102_2_16_B2',
 'KTN102_2_17',
 'KTN102_2_17_B2',
 'KTN102_2_18',
 'KTN102_2_18_B2',
 'KTN102_2_19',
 'KTN102_2_19_B2',
 'KTN102_2_1_B2',
 'KTN102_2_20',
 'KTN102_2_20_B2',
 'KTN102_2_21',
 'KTN102_2_22',
 'KTN102_2_22_B2',
 'KTN102_2_23',
 'KTN102_2_23_B2',
 'KTN102_2_24',
 'KTN102_2_24_B2',
 'KTN102_2_25',
 'KTN102_2_25_B2',
 'KTN102_2_26',
 'KTN102_2_27',
 'KTN102_2_27_B2',
 'KTN102_2_28',
 'KTN102_2_28_B2',
 'KTN102_2_29',
 'KTN102_2_2_B2',
 'KTN102_2_30',
 'KTN102_2_30_B2',
 'KTN102_2_31',
 'KTN102_2_31_B2',
 'KTN102_2_32',
 'KTN102_2_32_B2',
 'KTN102_2_33',
 'KTN102_2_33_B2',
 'KTN102_2_34',
 'KTN102_2_34_B2',
 'KTN102_2_35',
 'KTN102_2_35_B2',
 'KTN102_2_36',
 'KTN102_2_36_B2',
 'KTN102_2_37',
 'KTN102_2_37_B2',
 'KTN102_2_38',
 'KTN102_2_38_B2',
 'KTN102_2_39',
 'KTN102_2_39_B2',
 'KTN102_2_3_B2',
 'KTN102_2_40',
 'KTN102_2_40_B2',
 'KTN102_2_41',
 'KTN102_2_41_B2',
 'KTN102_2_42',
 'KTN102_2_42_B2',
 'KTN102_2_43',
 'KTN102_2_43_B2',
 'KTN102_2_44',
 'KTN102_2_44_B2',
 'KTN102_2_45',
 'KTN102_2_45_B2',
 'KTN102_2_46',
 'KTN102_2_46_B2',
 'KTN102_2_47',
 'KTN102_2_47_B2',
 'KTN102_2_48_B2',
 'KTN102_2_4_B2',
 'KTN102_2_5_B2',
 'KTN102_2_6_B2',
 'KTN102_2_7_B2',
 'KTN102_2_8_B2',
 'KTN102_2_9_B2',
 'KTN102_2_Pop',
 'KTN102_OP_1',
 'KTN102_OP_10',
 'KTN102_OP_11',
 'KTN102_OP_12',
 'KTN102_OP_13',
 'KTN102_OP_14',
 'KTN102_OP_15',
 'KTN102_OP_16',
 'KTN102_OP_17',
 'KTN102_OP_18',
 'KTN102_OP_19',
 'KTN102_OP_2',
 'KTN102_OP_20',
 'KTN102_OP_21',
 'KTN102_OP_22',
 'KTN102_OP_23',
 'KTN102_OP_24',
 'KTN102_OP_25',
 'KTN102_OP_26',
 'KTN102_OP_27',
 'KTN102_OP_28',
 'KTN102_OP_29',
 'KTN102_OP_30',
 'KTN102_OP_31',
 'KTN102_OP_32',
 'KTN102_OP_33',
 'KTN102_OP_34',
 'KTN102_OP_35',
 'KTN102_OP_36',
 'KTN102_OP_37',
 'KTN102_OP_38',
 'KTN102_OP_39',
 'KTN102_OP_4',
 'KTN102_OP_40',
 'KTN102_OP_41',
 'KTN102_OP_42',
 'KTN102_OP_43',
 'KTN102_OP_44',
 'KTN102_OP_45',
 'KTN102_OP_46',
 'KTN102_OP_47',
 'KTN102_OP_48',
 'KTN102_OP_5',
 'KTN102_OP_6',
 'KTN102_OP_7',
 'KTN102_OP_8',
 'KTN102_OP_9',
 'KTN102_OP_Pop']

Based on quick browsing of the values, it seems like the library names have the following patterns:

  • KTN102Blood, KTN1020, KTN1022, KTN102OP seem to refer to bulk samples
    • For KTN1020 and KTN102OP, there are also library names with the suffix ‘cells[0-9]+’. These are either DNA or RNA data for populations KTN1020 and KTN102OP.
  • The other library names have the following pattern: KTN102_(0|2|OP)_[0-9A-Za-z_]+
    • Some or all of these files are likely to be single cell data, but we do not know which correspond to RNA and which to DNA samples
    • We may be able to use the number of fastq.gz files attached to each library to help distinguish RNA from DNA samples, since the RNA libraries may have been sequenced single-end (one file per record) rather than paired-end.
    • We may also use the file sizes to our advantage.
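
The naming patterns listed above can be checked mechanically with regular expressions. This is a rough sketch: the category labels (`bulk`, `cells`, `underscore`), the `PATTERNS` list, and the `classify_library` helper are made up for illustration, and the patterns encode only what was observed for patient KTN102, so other patients may need adjustments.

```python
import re

# Hypothetical category labels; the patterns follow the naming conventions
# observed above for patient KTN102 and are not taken from the paper.
PATTERNS = [
    ("bulk", re.compile(r"^KTN\d{3}(Blood|0|2|OP)$")),
    ("cells", re.compile(r"^KTN\d{3}(0|OP)cells\d+$")),
    ("underscore", re.compile(r"^KTN\d{3}_(0|2|OP)_[0-9A-Za-z_]+$")),
]

def classify_library(name):
    """Return the first category whose pattern matches, else 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"

# Library names taken from the sorted list above.
examples = ["KTN102Blood", "KTN1020", "KTN1022", "KTN102OP",
            "KTN1020cells17", "KTN102OPcells184",
            "KTN102_0_10_B2", "KTN102_2_Pop", "KTN102_OP_48"]
for name in examples:
    print(name, "->", classify_library(name))
```

Any name that comes back as `unknown` points to a pattern that was missed during the quick browse.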
# let's check the data sizes for the files to find the bulk samples
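# fastq_bytes holds ";"-separated byte counts, one per fastq file in the row.
# Rows list 1, 2, or 3 files, so keep a plain list of arrays: recent NumPy versions
# refuse to build an array from ragged input. int64 avoids overflow on sizes above 2 GB.
ret = [np.asarray(str(row).split(";"), dtype=np.int64) for row in df2['fastq_bytes']]
ret2 = np.asarray([row.sum() / 1e9 for row in ret]) # total size per row, in GB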
df2[ret2 > 1]
   study_accession  sample_accession  secondary_sample_accession  experiment_accession  run_accession  tax_id  scientific_name  instrument_model     library_name  library_layout  fastq_bytes                     fastq_ftp
0  PRJNA396019      SAMN07457099      SRS2412441                  SRX3067795            SRR5906250     9606    Homo sapiens     Illumina HiSeq 4000  KTN102Blood   PAIRED          46088894;3107017681;3884543372  ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/000/SRR590...
1  PRJNA396019      SAMN07457098      SRS2412440                  SRX3067794            SRR5906251     9606    Homo sapiens     Illumina HiSeq 4000  KTN102OP      PAIRED          55533991;3032757154;3772164832  ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/001/SRR590...
2  PRJNA396019      SAMN07457097      SRS2412442                  SRX3067793            SRR5906252     9606    Homo sapiens     Illumina HiSeq 4000  KTN1022       PAIRED          43518505;2498303748;3015988613  ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/002/SRR590...
3  PRJNA396019      SAMN07457096      SRS2412443                  SRX3067792            SRR5906253     9606    Homo sapiens     Illumina HiSeq 4000  KTN1020       PAIRED          41991292;3179486152;3911231851  ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/003/SRR590...
ret2[ret2 > 1] # large files probably correspond to bulk, matching our intuition
array([ 7.03764995,  6.86045598,  5.55781087,  7.13270929])
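
As a sanity check on the sizes above, the totals can be recomputed by hand from the `fastq_bytes` strings in the four rows shown. This is only a sketch; the `total_gb` helper name is made up here, and GB is taken to mean 10**9 bytes, as in the code above.

```python
# fastq_bytes values copied from the four rows shown above.
fastq_bytes = {
    "KTN102Blood": "46088894;3107017681;3884543372",
    "KTN102OP": "55533991;3032757154;3772164832",
    "KTN1022": "43518505;2498303748;3015988613",
    "KTN1020": "41991292;3179486152;3911231851",
}

def total_gb(field):
    """Sum the ';'-separated byte counts and convert to GB (10**9 bytes)."""
    return sum(int(x) for x in field.split(";")) / 1e9

for name, field in fastq_bytes.items():
    print(name, round(total_gb(field), 3))
```

In each of these rows the first file is tens of megabytes while the other two are several gigabytes each, so the total is dominated by the two large files.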
# let's check the number of files associated with each row
num_files = np.array(list(map(lambda row: row.shape[0], ret)))
s1 = sorted(df2[num_files == 1]["library_name"])
print(len(s1)) # 143
s1
143

['KTN102_0_10',
 'KTN102_0_11',
 'KTN102_0_12',
 'KTN102_0_13',
 'KTN102_0_14',
 'KTN102_0_16',
 'KTN102_0_17',
 'KTN102_0_18',
 'KTN102_0_19',
 'KTN102_0_2',
 'KTN102_0_20',
 'KTN102_0_21',
 'KTN102_0_22',
 'KTN102_0_23',
 'KTN102_0_24',
 'KTN102_0_25',
 'KTN102_0_26',
 'KTN102_0_27',
 'KTN102_0_28',
 'KTN102_0_29',
 'KTN102_0_3',
 'KTN102_0_30',
 'KTN102_0_31',
 'KTN102_0_32',
 'KTN102_0_33',
 'KTN102_0_34',
 'KTN102_0_35',
 'KTN102_0_36',
 'KTN102_0_37',
 'KTN102_0_38',
 'KTN102_0_39',
 'KTN102_0_4',
 'KTN102_0_40',
 'KTN102_0_41',
 'KTN102_0_42',
 'KTN102_0_43',
 'KTN102_0_44',
 'KTN102_0_45',
 'KTN102_0_46',
 'KTN102_0_47',
 'KTN102_0_48',
 'KTN102_0_5',
 'KTN102_0_6',
 'KTN102_0_7',
 'KTN102_0_8',
 'KTN102_0_9',
 'KTN102_0_Pop',
 'KTN102_2_01',
 'KTN102_2_02',
 'KTN102_2_03',
 'KTN102_2_04',
 'KTN102_2_05',
 'KTN102_2_06',
 'KTN102_2_07',
 'KTN102_2_08',
 'KTN102_2_09',
 'KTN102_2_10',
 'KTN102_2_11',
 'KTN102_2_12',
 'KTN102_2_13',
 'KTN102_2_14',
 'KTN102_2_15',
 'KTN102_2_16',
 'KTN102_2_17',
 'KTN102_2_18',
 'KTN102_2_19',
 'KTN102_2_20',
 'KTN102_2_21',
 'KTN102_2_22',
 'KTN102_2_23',
 'KTN102_2_24',
 'KTN102_2_25',
 'KTN102_2_26',
 'KTN102_2_27',
 'KTN102_2_28',
 'KTN102_2_29',
 'KTN102_2_30',
 'KTN102_2_31',
 'KTN102_2_32',
 'KTN102_2_33',
 'KTN102_2_34',
 'KTN102_2_35',
 'KTN102_2_36',
 'KTN102_2_37',
 'KTN102_2_38',
 'KTN102_2_39',
 'KTN102_2_40',
 'KTN102_2_41',
 'KTN102_2_42',
 'KTN102_2_43',
 'KTN102_2_44',
 'KTN102_2_45',
 'KTN102_2_46',
 'KTN102_2_47',
 'KTN102_2_Pop',
 'KTN102_OP_1',
 'KTN102_OP_10',
 'KTN102_OP_11',
 'KTN102_OP_12',
 'KTN102_OP_13',
 'KTN102_OP_14',
 'KTN102_OP_15',
 'KTN102_OP_16',
 'KTN102_OP_17',
 'KTN102_OP_18',
 'KTN102_OP_19',
 'KTN102_OP_2',
 'KTN102_OP_20',
 'KTN102_OP_21',
 'KTN102_OP_22',
 'KTN102_OP_23',
 'KTN102_OP_24',
 'KTN102_OP_25',
 'KTN102_OP_26',
 'KTN102_OP_27',
 'KTN102_OP_28',
 'KTN102_OP_29',
 'KTN102_OP_30',
 'KTN102_OP_31',
 'KTN102_OP_32',
 'KTN102_OP_33',
 'KTN102_OP_34',
 'KTN102_OP_35',
 'KTN102_OP_36',
 'KTN102_OP_37',
 'KTN102_OP_38',
 'KTN102_OP_39',
 'KTN102_OP_4',
 'KTN102_OP_40',
 'KTN102_OP_41',
 'KTN102_OP_42',
 'KTN102_OP_43',
 'KTN102_OP_44',
 'KTN102_OP_45',
 'KTN102_OP_46',
 'KTN102_OP_47',
 'KTN102_OP_48',
 'KTN102_OP_5',
 'KTN102_OP_6',
 'KTN102_OP_7',
 'KTN102_OP_8',
 'KTN102_OP_9',
 'KTN102_OP_Pop']
s2 = sorted(df2[num_files == 2]["library_name"])
print(len(s2)) # 275
s2
275

['KTN1020cells1',
 'KTN1020cells10',
 'KTN1020cells11',
 'KTN1020cells12',
 'KTN1020cells13',
 'KTN1020cells14',
 'KTN1020cells15',
 'KTN1020cells16',
 'KTN1020cells17',
 'KTN1020cells18',
 'KTN1020cells19',
 'KTN1020cells2',
 'KTN1020cells20',
 'KTN1020cells21',
 'KTN1020cells22',
 'KTN1020cells23',
 'KTN1020cells24',
 'KTN1020cells25',
 'KTN1020cells26',
 'KTN1020cells27',
 'KTN1020cells28',
 'KTN1020cells29',
 'KTN1020cells3',
 'KTN1020cells30',
 'KTN1020cells31',
 'KTN1020cells32',
 'KTN1020cells33',
 'KTN1020cells34',
 'KTN1020cells35',
 'KTN1020cells36',
 'KTN1020cells37',
 'KTN1020cells38',
 'KTN1020cells39',
 'KTN1020cells4',
 'KTN1020cells40',
 'KTN1020cells41',
 'KTN1020cells42',
 'KTN1020cells43',
 'KTN1020cells44',
 'KTN1020cells45',
 'KTN1020cells46',
 'KTN1020cells47',
 'KTN1020cells48',
 'KTN1020cells49',
 'KTN1020cells5',
 'KTN1020cells50',
 'KTN1020cells51',
 'KTN1020cells52',
 'KTN1020cells53',
 'KTN1020cells54',
 'KTN1020cells55',
 'KTN1020cells56',
 'KTN1020cells57',
 'KTN1020cells58',
 'KTN1020cells59',
 'KTN1020cells6',
 'KTN1020cells60',
 'KTN1020cells61',
 'KTN1020cells62',
 'KTN1020cells63',
 'KTN1020cells64',
 'KTN1020cells65',
 'KTN1020cells66',
 'KTN1020cells67',
 'KTN1020cells68',
 'KTN1020cells69',
 'KTN1020cells7',
 'KTN1020cells70',
 'KTN1020cells71',
 'KTN1020cells72',
 'KTN1020cells73',
 'KTN1020cells74',
 'KTN1020cells75',
 'KTN1020cells76',
 'KTN1020cells77',
 'KTN1020cells78',
 'KTN1020cells79',
 'KTN1020cells8',
 'KTN1020cells80',
 'KTN1020cells81',
 'KTN1020cells82',
 'KTN1020cells83',
 'KTN1020cells84',
 'KTN1020cells85',
 'KTN1020cells86',
 'KTN1020cells87',
 'KTN1020cells88',
 'KTN1020cells89',
 'KTN1020cells9',
 'KTN1020cells90',
 'KTN1020cells91',
 'KTN102OPcells1',
 'KTN102OPcells10',
 'KTN102OPcells100',
 'KTN102OPcells101',
 'KTN102OPcells102',
 'KTN102OPcells103',
 'KTN102OPcells104',
 'KTN102OPcells105',
 'KTN102OPcells106',
 'KTN102OPcells107',
 'KTN102OPcells108',
 'KTN102OPcells109',
 'KTN102OPcells11',
 'KTN102OPcells110',
 'KTN102OPcells111',
 'KTN102OPcells112',
 'KTN102OPcells113',
 'KTN102OPcells114',
 'KTN102OPcells115',
 'KTN102OPcells116',
 'KTN102OPcells117',
 'KTN102OPcells118',
 'KTN102OPcells119',
 'KTN102OPcells12',
 'KTN102OPcells120',
 'KTN102OPcells121',
 'KTN102OPcells122',
 'KTN102OPcells123',
 'KTN102OPcells124',
 'KTN102OPcells125',
 'KTN102OPcells126',
 'KTN102OPcells127',
 'KTN102OPcells128',
 'KTN102OPcells129',
 'KTN102OPcells13',
 'KTN102OPcells130',
 'KTN102OPcells131',
 'KTN102OPcells132',
 'KTN102OPcells133',
 'KTN102OPcells134',
 'KTN102OPcells135',
 'KTN102OPcells136',
 'KTN102OPcells137',
 'KTN102OPcells138',
 'KTN102OPcells139',
 'KTN102OPcells14',
 'KTN102OPcells140',
 'KTN102OPcells141',
 'KTN102OPcells142',
 'KTN102OPcells143',
 'KTN102OPcells144',
 'KTN102OPcells145',
 'KTN102OPcells146',
 'KTN102OPcells147',
 'KTN102OPcells148',
 'KTN102OPcells149',
 'KTN102OPcells15',
 'KTN102OPcells150',
 'KTN102OPcells151',
 'KTN102OPcells152',
 'KTN102OPcells153',
 'KTN102OPcells154',
 'KTN102OPcells155',
 'KTN102OPcells156',
 'KTN102OPcells157',
 'KTN102OPcells158',
 'KTN102OPcells159',
 'KTN102OPcells16',
 'KTN102OPcells160',
 'KTN102OPcells161',
 'KTN102OPcells162',
 'KTN102OPcells163',
 'KTN102OPcells164',
 'KTN102OPcells165',
 'KTN102OPcells166',
 'KTN102OPcells167',
 'KTN102OPcells168',
 'KTN102OPcells169',
 'KTN102OPcells17',
 'KTN102OPcells170',
 'KTN102OPcells171',
 'KTN102OPcells172',
 'KTN102OPcells173',
 'KTN102OPcells174',
 'KTN102OPcells175',
 'KTN102OPcells176',
 'KTN102OPcells177',
 'KTN102OPcells178',
 'KTN102OPcells179',
 'KTN102OPcells18',
 'KTN102OPcells180',
 'KTN102OPcells181',
 'KTN102OPcells182',
 'KTN102OPcells183',
 'KTN102OPcells184',
 'KTN102OPcells19',
 'KTN102OPcells2',
 'KTN102OPcells20',
 'KTN102OPcells21',
 'KTN102OPcells22',
 'KTN102OPcells23',
 'KTN102OPcells24',
 'KTN102OPcells25',
 'KTN102OPcells26',
 'KTN102OPcells27',
 'KTN102OPcells28',
 'KTN102OPcells29',
 'KTN102OPcells3',
 'KTN102OPcells30',
 'KTN102OPcells31',
 'KTN102OPcells32',
 'KTN102OPcells33',
 'KTN102OPcells34',
 'KTN102OPcells35',
 'KTN102OPcells36',
 'KTN102OPcells37',
 'KTN102OPcells38',
 'KTN102OPcells39',
 'KTN102OPcells4',
 'KTN102OPcells40',
 'KTN102OPcells41',
 'KTN102OPcells42',
 'KTN102OPcells43',
 'KTN102OPcells44',
 'KTN102OPcells45',
 'KTN102OPcells46',
 'KTN102OPcells47',
 'KTN102OPcells48',
 'KTN102OPcells49',
 'KTN102OPcells5',
 'KTN102OPcells50',
 'KTN102OPcells51',
 'KTN102OPcells52',
 'KTN102OPcells53',
 'KTN102OPcells54',
 'KTN102OPcells55',
 'KTN102OPcells56',
 'KTN102OPcells57',
 'KTN102OPcells58',
 'KTN102OPcells59',
 'KTN102OPcells6',
 'KTN102OPcells60',
 'KTN102OPcells61',
 'KTN102OPcells62',
 'KTN102OPcells63',
 'KTN102OPcells64',
 'KTN102OPcells65',
 'KTN102OPcells66',
 'KTN102OPcells67',
 'KTN102OPcells68',
 'KTN102OPcells69',
 'KTN102OPcells7',
 'KTN102OPcells70',
 'KTN102OPcells71',
 'KTN102OPcells72',
 'KTN102OPcells73',
 'KTN102OPcells74',
 'KTN102OPcells75',
 'KTN102OPcells76',
 'KTN102OPcells77',
 'KTN102OPcells78',
 'KTN102OPcells79',
 'KTN102OPcells8',
 'KTN102OPcells80',
 'KTN102OPcells81',
 'KTN102OPcells82',
 'KTN102OPcells83',
 'KTN102OPcells84',
 'KTN102OPcells85',
 'KTN102OPcells86',
 'KTN102OPcells87',
 'KTN102OPcells88',
 'KTN102OPcells89',
 'KTN102OPcells9',
 'KTN102OPcells90',
 'KTN102OPcells91',
 'KTN102OPcells92',
 'KTN102OPcells93',
 'KTN102OPcells94',
 'KTN102OPcells95',
 'KTN102OPcells96',
 'KTN102OPcells97',
 'KTN102OPcells98',
 'KTN102OPcells99']
s3 = sorted(df2[num_files == 3]["library_name"])
print(len(s3)) # 93
s3
93

['KTN1020',
 'KTN1022',
 'KTN102Blood',
 'KTN102OP',
 'KTN102_0_10_B2',
 'KTN102_0_11_B2',
 'KTN102_0_12_B2',
 'KTN102_0_14_B2',
 'KTN102_0_15_B2',
 'KTN102_0_16_B2',
 'KTN102_0_17_B2',
 'KTN102_0_18_B2',
 'KTN102_0_19_B2',
 'KTN102_0_1_B2',
 'KTN102_0_20_B2',
 'KTN102_0_22_B2',
 'KTN102_0_23_B2',
 'KTN102_0_24_B2',
 'KTN102_0_25_B2',
 'KTN102_0_26_B2',
 'KTN102_0_27_B2',
 'KTN102_0_28_B2',
 'KTN102_0_29_B2',
 'KTN102_0_2_B2',
 'KTN102_0_30_B2',
 'KTN102_0_31_B2',
 'KTN102_0_32_B2',
 'KTN102_0_33_B2',
 'KTN102_0_34_B2',
 'KTN102_0_35_B2',
 'KTN102_0_36_B2',
 'KTN102_0_37_B2',
 'KTN102_0_38_B2',
 'KTN102_0_39_B2',
 'KTN102_0_40_B2',
 'KTN102_0_41_B2',
 'KTN102_0_43_B2',
 'KTN102_0_44_B2',
 'KTN102_0_45_B2',
 'KTN102_0_46_B2',
 'KTN102_0_47_B2',
 'KTN102_0_48_B2',
 'KTN102_0_4_B2',
 'KTN102_0_5_B2',
 'KTN102_0_6_B2',
 'KTN102_0_7_B2',
 'KTN102_0_8_B2',
 'KTN102_0_9_B2',
 'KTN102_2_10_B2',
 'KTN102_2_11_B2',
 'KTN102_2_12_B2',
 'KTN102_2_13_B2',
 'KTN102_2_14_B2',
 'KTN102_2_15_B2',
 'KTN102_2_16_B2',
 'KTN102_2_17_B2',
 'KTN102_2_18_B2',
 'KTN102_2_19_B2',
 'KTN102_2_1_B2',
 'KTN102_2_20_B2',
 'KTN102_2_22_B2',
 'KTN102_2_23_B2',
 'KTN102_2_24_B2',
 'KTN102_2_25_B2',
 'KTN102_2_27_B2',
 'KTN102_2_28_B2',
 'KTN102_2_2_B2',
 'KTN102_2_30_B2',
 'KTN102_2_31_B2',
 'KTN102_2_32_B2',
 'KTN102_2_33_B2',
 'KTN102_2_34_B2',
 'KTN102_2_35_B2',
 'KTN102_2_36_B2',
 'KTN102_2_37_B2',
 'KTN102_2_38_B2',
 'KTN102_2_39_B2',
 'KTN102_2_3_B2',
 'KTN102_2_40_B2',
 'KTN102_2_41_B2',
 'KTN102_2_42_B2',
 'KTN102_2_43_B2',
 'KTN102_2_44_B2',
 'KTN102_2_45_B2',
 'KTN102_2_46_B2',
 'KTN102_2_47_B2',
 'KTN102_2_48_B2',
 'KTN102_2_4_B2',
 'KTN102_2_5_B2',
 'KTN102_2_6_B2',
 'KTN102_2_7_B2',
 'KTN102_2_8_B2',
 'KTN102_2_9_B2']
  • Rows associated with a single file seem to be scRNA data
  • Rows associated with two files seem to be scDNA data
  • Rows associated with three files seem to be bulk and/or single cell data (which one will have to be determined by reading the paper), but the file sizes make it clear which rows correspond to bulk data.
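
These observations can be folded into a single labelling rule. The sketch below is a guess based only on the heuristics on this page, not something confirmed by the paper: the `tentative_type` function, its labels, and the 1 GB threshold (borrowed from the size check above) are all assumptions.

```python
def tentative_type(fastq_bytes_field, bulk_threshold_gb=1.0):
    """Guess the data type of one row from its fastq_bytes field.

    Rules (heuristics from this page, not confirmed by the paper):
      1 file  -> scRNA
      2 files -> scDNA
      3 files -> bulk if the total size exceeds the threshold,
                 otherwise undetermined
    """
    sizes = [int(x) for x in str(fastq_bytes_field).split(";")]
    total_gb = sum(sizes) / 1e9
    if len(sizes) == 1:
        return "scRNA?"
    if len(sizes) == 2:
        return "scDNA?"
    if len(sizes) == 3:
        return "bulk?" if total_gb > bulk_threshold_gb else "undetermined"
    return "unknown"

# The first value is from the KTN102Blood row above; the others are made up.
print(tentative_type("46088894;3107017681;3884543372"))
print(tentative_type("12000000"))
print(tentative_type("9000000;9500000"))
print(tentative_type("500000;20000000;21000000"))
```

The trailing question marks in the labels are deliberate: they mark each label as a hypothesis to be checked against the paper.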

Details from the paper

The goal of the paper is to study resistance to chemotherapy in triple-negative breast cancer (TNBC). To that end, the authors applied scDNA, scRNA, and bulk exome sequencing to profile longitudinal samples from 20 TNBC patients during chemotherapy.

Deep exome sequencing revealed that chemotherapy led to clonal extinction (removal of the cancer clones) in 10 patients, while clones persisted after chemotherapy in the other 10. In 8 patients, a more detailed study was carried out using scDNA-seq to analyze 900 cells and scRNA-seq to analyze 6,862 cells.

An excerpt from the paper:

… single-nucleus RNA-sequencing (SNRS) method. SNRS performs automated imaging and selection of up to 1,800 single nuclei in parallel for 3’ mRNA profiling. We profiled the transcriptomes of 3,370 single nuclei isolated from two matched longitudinal samples per patient from the four clonal extinction patients (P1, P2, P6, P9).

From this excerpt, it appears that the RNA libraries were sequenced from the 3’ end only, which supports the suspicion that records associated with a single file are RNA-seq data.