Explore master file downloaded from European Nucleotide Archive
- Required Python libraries:
pandas, numpy
This page contains exploratory code written in a Jupyter Notebook. For a tutorial on Jupyter Notebook, go here.
The file is available for download as a notebook here.
import pandas as pd
import numpy as np
file_name = "table.txt"
df = pd.read_table(file_name, sep='\t')
print(df.columns) # check the column names, we want library_name to be there!
Index(['study_accession', 'sample_accession', 'secondary_sample_accession',
'experiment_accession', 'run_accession', 'tax_id', 'scientific_name',
'instrument_model', 'library_name', 'library_layout', 'fastq_bytes',
'fastq_ftp'],
dtype='object')
patient_ids = df['library_name'].astype(str).str[0:6]
patients = set(patient_ids)
print(sorted(patients))
# there are 21 patients, but there are supposed to be only 20 patients according to the paper
# KTN609 does not appear in the appendix of the paper: https://www.cell.com/cms/attachment/2119295259/2091819478/mmc1.pdf
['KTN102', 'KTN115', 'KTN126', 'KTN129', 'KTN132', 'KTN134', 'KTN147', 'KTN152', 'KTN155', 'KTN206', 'KTN210', 'KTN215', 'KTN302', 'KTN304', 'KTN310', 'KTN316', 'KTN317', 'KTN501', 'KTN609', 'KTN612', 'KTN615']
# how many rows are associated with patient KTN102?
patient_id = 'KTN102'
print(np.sum(patient_ids == patient_id)) # 511 rows match this patient; each row can list several FASTQ files
511
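The same count can be obtained for every patient at once. The following is a minimal sketch, assuming (as above) that the first six characters of library_name identify the patient. The toy DataFrame and its library names are made up for illustration and stand in for df.

```python
import pandas as pd

# Toy stand-in for the ENA table; these library names are illustrative only.
toy = pd.DataFrame({
    "library_name": [
        "KTN102Blood", "KTN102OP", "KTN102_0_10",
        "KTN115Blood", "KTN115OP",
    ]
})

# Assumption: the first six characters of library_name identify the patient.
toy_patient_ids = toy["library_name"].astype(str).str[0:6]

# Number of rows per patient, sorted by patient ID.
rows_per_patient = toy_patient_ids.value_counts().sort_index()
print(rows_per_patient)
```

Running the same two lines on df instead of toy would show whether some patients (for example KTN609) have unusually few rows.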
For this dataset, there are
- bulk data: BLOOD, PRE, MID, POST,
- single cell data: scDNA and scRNA.
Let’s extract the library name patterns.
df2 = df.loc[patient_ids == patient_id]
sorted(list(df2['library_name']))
['KTN1020',
'KTN1020cells1',
'KTN1020cells10',
'KTN1020cells11',
'KTN1020cells12',
'KTN1020cells13',
'KTN1020cells14',
'KTN1020cells15',
'KTN1020cells16',
'KTN1020cells17',
'KTN1020cells18',
'KTN1020cells19',
'KTN1020cells2',
'KTN1020cells20',
'KTN1020cells21',
'KTN1020cells22',
'KTN1020cells23',
'KTN1020cells24',
'KTN1020cells25',
'KTN1020cells26',
'KTN1020cells27',
'KTN1020cells28',
'KTN1020cells29',
'KTN1020cells3',
'KTN1020cells30',
'KTN1020cells31',
'KTN1020cells32',
'KTN1020cells33',
'KTN1020cells34',
'KTN1020cells35',
'KTN1020cells36',
'KTN1020cells37',
'KTN1020cells38',
'KTN1020cells39',
'KTN1020cells4',
'KTN1020cells40',
'KTN1020cells41',
'KTN1020cells42',
'KTN1020cells43',
'KTN1020cells44',
'KTN1020cells45',
'KTN1020cells46',
'KTN1020cells47',
'KTN1020cells48',
'KTN1020cells49',
'KTN1020cells5',
'KTN1020cells50',
'KTN1020cells51',
'KTN1020cells52',
'KTN1020cells53',
'KTN1020cells54',
'KTN1020cells55',
'KTN1020cells56',
'KTN1020cells57',
'KTN1020cells58',
'KTN1020cells59',
'KTN1020cells6',
'KTN1020cells60',
'KTN1020cells61',
'KTN1020cells62',
'KTN1020cells63',
'KTN1020cells64',
'KTN1020cells65',
'KTN1020cells66',
'KTN1020cells67',
'KTN1020cells68',
'KTN1020cells69',
'KTN1020cells7',
'KTN1020cells70',
'KTN1020cells71',
'KTN1020cells72',
'KTN1020cells73',
'KTN1020cells74',
'KTN1020cells75',
'KTN1020cells76',
'KTN1020cells77',
'KTN1020cells78',
'KTN1020cells79',
'KTN1020cells8',
'KTN1020cells80',
'KTN1020cells81',
'KTN1020cells82',
'KTN1020cells83',
'KTN1020cells84',
'KTN1020cells85',
'KTN1020cells86',
'KTN1020cells87',
'KTN1020cells88',
'KTN1020cells89',
'KTN1020cells9',
'KTN1020cells90',
'KTN1020cells91',
'KTN1022',
'KTN102Blood',
'KTN102OP',
'KTN102OPcells1',
'KTN102OPcells10',
'KTN102OPcells100',
'KTN102OPcells101',
'KTN102OPcells102',
'KTN102OPcells103',
'KTN102OPcells104',
'KTN102OPcells105',
'KTN102OPcells106',
'KTN102OPcells107',
'KTN102OPcells108',
'KTN102OPcells109',
'KTN102OPcells11',
'KTN102OPcells110',
'KTN102OPcells111',
'KTN102OPcells112',
'KTN102OPcells113',
'KTN102OPcells114',
'KTN102OPcells115',
'KTN102OPcells116',
'KTN102OPcells117',
'KTN102OPcells118',
'KTN102OPcells119',
'KTN102OPcells12',
'KTN102OPcells120',
'KTN102OPcells121',
'KTN102OPcells122',
'KTN102OPcells123',
'KTN102OPcells124',
'KTN102OPcells125',
'KTN102OPcells126',
'KTN102OPcells127',
'KTN102OPcells128',
'KTN102OPcells129',
'KTN102OPcells13',
'KTN102OPcells130',
'KTN102OPcells131',
'KTN102OPcells132',
'KTN102OPcells133',
'KTN102OPcells134',
'KTN102OPcells135',
'KTN102OPcells136',
'KTN102OPcells137',
'KTN102OPcells138',
'KTN102OPcells139',
'KTN102OPcells14',
'KTN102OPcells140',
'KTN102OPcells141',
'KTN102OPcells142',
'KTN102OPcells143',
'KTN102OPcells144',
'KTN102OPcells145',
'KTN102OPcells146',
'KTN102OPcells147',
'KTN102OPcells148',
'KTN102OPcells149',
'KTN102OPcells15',
'KTN102OPcells150',
'KTN102OPcells151',
'KTN102OPcells152',
'KTN102OPcells153',
'KTN102OPcells154',
'KTN102OPcells155',
'KTN102OPcells156',
'KTN102OPcells157',
'KTN102OPcells158',
'KTN102OPcells159',
'KTN102OPcells16',
'KTN102OPcells160',
'KTN102OPcells161',
'KTN102OPcells162',
'KTN102OPcells163',
'KTN102OPcells164',
'KTN102OPcells165',
'KTN102OPcells166',
'KTN102OPcells167',
'KTN102OPcells168',
'KTN102OPcells169',
'KTN102OPcells17',
'KTN102OPcells170',
'KTN102OPcells171',
'KTN102OPcells172',
'KTN102OPcells173',
'KTN102OPcells174',
'KTN102OPcells175',
'KTN102OPcells176',
'KTN102OPcells177',
'KTN102OPcells178',
'KTN102OPcells179',
'KTN102OPcells18',
'KTN102OPcells180',
'KTN102OPcells181',
'KTN102OPcells182',
'KTN102OPcells183',
'KTN102OPcells184',
'KTN102OPcells19',
'KTN102OPcells2',
'KTN102OPcells20',
'KTN102OPcells21',
'KTN102OPcells22',
'KTN102OPcells23',
'KTN102OPcells24',
'KTN102OPcells25',
'KTN102OPcells26',
'KTN102OPcells27',
'KTN102OPcells28',
'KTN102OPcells29',
'KTN102OPcells3',
'KTN102OPcells30',
'KTN102OPcells31',
'KTN102OPcells32',
'KTN102OPcells33',
'KTN102OPcells34',
'KTN102OPcells35',
'KTN102OPcells36',
'KTN102OPcells37',
'KTN102OPcells38',
'KTN102OPcells39',
'KTN102OPcells4',
'KTN102OPcells40',
'KTN102OPcells41',
'KTN102OPcells42',
'KTN102OPcells43',
'KTN102OPcells44',
'KTN102OPcells45',
'KTN102OPcells46',
'KTN102OPcells47',
'KTN102OPcells48',
'KTN102OPcells49',
'KTN102OPcells5',
'KTN102OPcells50',
'KTN102OPcells51',
'KTN102OPcells52',
'KTN102OPcells53',
'KTN102OPcells54',
'KTN102OPcells55',
'KTN102OPcells56',
'KTN102OPcells57',
'KTN102OPcells58',
'KTN102OPcells59',
'KTN102OPcells6',
'KTN102OPcells60',
'KTN102OPcells61',
'KTN102OPcells62',
'KTN102OPcells63',
'KTN102OPcells64',
'KTN102OPcells65',
'KTN102OPcells66',
'KTN102OPcells67',
'KTN102OPcells68',
'KTN102OPcells69',
'KTN102OPcells7',
'KTN102OPcells70',
'KTN102OPcells71',
'KTN102OPcells72',
'KTN102OPcells73',
'KTN102OPcells74',
'KTN102OPcells75',
'KTN102OPcells76',
'KTN102OPcells77',
'KTN102OPcells78',
'KTN102OPcells79',
'KTN102OPcells8',
'KTN102OPcells80',
'KTN102OPcells81',
'KTN102OPcells82',
'KTN102OPcells83',
'KTN102OPcells84',
'KTN102OPcells85',
'KTN102OPcells86',
'KTN102OPcells87',
'KTN102OPcells88',
'KTN102OPcells89',
'KTN102OPcells9',
'KTN102OPcells90',
'KTN102OPcells91',
'KTN102OPcells92',
'KTN102OPcells93',
'KTN102OPcells94',
'KTN102OPcells95',
'KTN102OPcells96',
'KTN102OPcells97',
'KTN102OPcells98',
'KTN102OPcells99',
'KTN102_0_10',
'KTN102_0_10_B2',
'KTN102_0_11',
'KTN102_0_11_B2',
'KTN102_0_12',
'KTN102_0_12_B2',
'KTN102_0_13',
'KTN102_0_14',
'KTN102_0_14_B2',
'KTN102_0_15_B2',
'KTN102_0_16',
'KTN102_0_16_B2',
'KTN102_0_17',
'KTN102_0_17_B2',
'KTN102_0_18',
'KTN102_0_18_B2',
'KTN102_0_19',
'KTN102_0_19_B2',
'KTN102_0_1_B2',
'KTN102_0_2',
'KTN102_0_20',
'KTN102_0_20_B2',
'KTN102_0_21',
'KTN102_0_22',
'KTN102_0_22_B2',
'KTN102_0_23',
'KTN102_0_23_B2',
'KTN102_0_24',
'KTN102_0_24_B2',
'KTN102_0_25',
'KTN102_0_25_B2',
'KTN102_0_26',
'KTN102_0_26_B2',
'KTN102_0_27',
'KTN102_0_27_B2',
'KTN102_0_28',
'KTN102_0_28_B2',
'KTN102_0_29',
'KTN102_0_29_B2',
'KTN102_0_2_B2',
'KTN102_0_3',
'KTN102_0_30',
'KTN102_0_30_B2',
'KTN102_0_31',
'KTN102_0_31_B2',
'KTN102_0_32',
'KTN102_0_32_B2',
'KTN102_0_33',
'KTN102_0_33_B2',
'KTN102_0_34',
'KTN102_0_34_B2',
'KTN102_0_35',
'KTN102_0_35_B2',
'KTN102_0_36',
'KTN102_0_36_B2',
'KTN102_0_37',
'KTN102_0_37_B2',
'KTN102_0_38',
'KTN102_0_38_B2',
'KTN102_0_39',
'KTN102_0_39_B2',
'KTN102_0_4',
'KTN102_0_40',
'KTN102_0_40_B2',
'KTN102_0_41',
'KTN102_0_41_B2',
'KTN102_0_42',
'KTN102_0_43',
'KTN102_0_43_B2',
'KTN102_0_44',
'KTN102_0_44_B2',
'KTN102_0_45',
'KTN102_0_45_B2',
'KTN102_0_46',
'KTN102_0_46_B2',
'KTN102_0_47',
'KTN102_0_47_B2',
'KTN102_0_48',
'KTN102_0_48_B2',
'KTN102_0_4_B2',
'KTN102_0_5',
'KTN102_0_5_B2',
'KTN102_0_6',
'KTN102_0_6_B2',
'KTN102_0_7',
'KTN102_0_7_B2',
'KTN102_0_8',
'KTN102_0_8_B2',
'KTN102_0_9',
'KTN102_0_9_B2',
'KTN102_0_Pop',
'KTN102_2_01',
'KTN102_2_02',
'KTN102_2_03',
'KTN102_2_04',
'KTN102_2_05',
'KTN102_2_06',
'KTN102_2_07',
'KTN102_2_08',
'KTN102_2_09',
'KTN102_2_10',
'KTN102_2_10_B2',
'KTN102_2_11',
'KTN102_2_11_B2',
'KTN102_2_12',
'KTN102_2_12_B2',
'KTN102_2_13',
'KTN102_2_13_B2',
'KTN102_2_14',
'KTN102_2_14_B2',
'KTN102_2_15',
'KTN102_2_15_B2',
'KTN102_2_16',
'KTN102_2_16_B2',
'KTN102_2_17',
'KTN102_2_17_B2',
'KTN102_2_18',
'KTN102_2_18_B2',
'KTN102_2_19',
'KTN102_2_19_B2',
'KTN102_2_1_B2',
'KTN102_2_20',
'KTN102_2_20_B2',
'KTN102_2_21',
'KTN102_2_22',
'KTN102_2_22_B2',
'KTN102_2_23',
'KTN102_2_23_B2',
'KTN102_2_24',
'KTN102_2_24_B2',
'KTN102_2_25',
'KTN102_2_25_B2',
'KTN102_2_26',
'KTN102_2_27',
'KTN102_2_27_B2',
'KTN102_2_28',
'KTN102_2_28_B2',
'KTN102_2_29',
'KTN102_2_2_B2',
'KTN102_2_30',
'KTN102_2_30_B2',
'KTN102_2_31',
'KTN102_2_31_B2',
'KTN102_2_32',
'KTN102_2_32_B2',
'KTN102_2_33',
'KTN102_2_33_B2',
'KTN102_2_34',
'KTN102_2_34_B2',
'KTN102_2_35',
'KTN102_2_35_B2',
'KTN102_2_36',
'KTN102_2_36_B2',
'KTN102_2_37',
'KTN102_2_37_B2',
'KTN102_2_38',
'KTN102_2_38_B2',
'KTN102_2_39',
'KTN102_2_39_B2',
'KTN102_2_3_B2',
'KTN102_2_40',
'KTN102_2_40_B2',
'KTN102_2_41',
'KTN102_2_41_B2',
'KTN102_2_42',
'KTN102_2_42_B2',
'KTN102_2_43',
'KTN102_2_43_B2',
'KTN102_2_44',
'KTN102_2_44_B2',
'KTN102_2_45',
'KTN102_2_45_B2',
'KTN102_2_46',
'KTN102_2_46_B2',
'KTN102_2_47',
'KTN102_2_47_B2',
'KTN102_2_48_B2',
'KTN102_2_4_B2',
'KTN102_2_5_B2',
'KTN102_2_6_B2',
'KTN102_2_7_B2',
'KTN102_2_8_B2',
'KTN102_2_9_B2',
'KTN102_2_Pop',
'KTN102_OP_1',
'KTN102_OP_10',
'KTN102_OP_11',
'KTN102_OP_12',
'KTN102_OP_13',
'KTN102_OP_14',
'KTN102_OP_15',
'KTN102_OP_16',
'KTN102_OP_17',
'KTN102_OP_18',
'KTN102_OP_19',
'KTN102_OP_2',
'KTN102_OP_20',
'KTN102_OP_21',
'KTN102_OP_22',
'KTN102_OP_23',
'KTN102_OP_24',
'KTN102_OP_25',
'KTN102_OP_26',
'KTN102_OP_27',
'KTN102_OP_28',
'KTN102_OP_29',
'KTN102_OP_30',
'KTN102_OP_31',
'KTN102_OP_32',
'KTN102_OP_33',
'KTN102_OP_34',
'KTN102_OP_35',
'KTN102_OP_36',
'KTN102_OP_37',
'KTN102_OP_38',
'KTN102_OP_39',
'KTN102_OP_4',
'KTN102_OP_40',
'KTN102_OP_41',
'KTN102_OP_42',
'KTN102_OP_43',
'KTN102_OP_44',
'KTN102_OP_45',
'KTN102_OP_46',
'KTN102_OP_47',
'KTN102_OP_48',
'KTN102_OP_5',
'KTN102_OP_6',
'KTN102_OP_7',
'KTN102_OP_8',
'KTN102_OP_9',
'KTN102_OP_Pop']
From a quick look at the values, the library names seem to follow these patterns:
- KTN102Blood, KTN1020, KTN1022, and KTN102OP seem to refer to bulk samples.
- For KTN1020 and KTN102OP, there are library names with the suffix ‘cells[0-9]+’. These are either DNA or RNA data for populations KTN1020 and KTN102OP.
- The other library names match the pattern KTN102_(0|2|OP)_[0-9A-Za-z]+, optionally followed by _B2.
- Some or all of these files are likely to be single cell data, but we do not know which correspond to RNA and which to DNA samples.
- We may be able to use the number of fastq.gz files attached to each library to tell RNA from DNA samples, since the file count reflects the read layout (one file for single-end, two or more for paired-end) and the RNA libraries may have been sequenced single-end.
- We may also use the file sizes to our advantage.
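The naming patterns listed above can be checked mechanically with regular expressions. This is only a sketch: the three patterns and the labels "bulk", "cells", and "underscore" are assumptions derived from the KTN102 names, not from the paper, and other patients may use different time point codes.

```python
import re
from collections import Counter

# Hypothetical patterns based on the KTN102 library names; not taken from the paper.
PATTERNS = [
    ("bulk", re.compile(r"^KTN\d{3}(Blood|OP|\d)$")),
    ("cells", re.compile(r"^KTN\d{3}(OP|\d)cells\d+$")),
    ("underscore", re.compile(r"^KTN\d{3}_(0|2|OP)_[0-9A-Za-z]+(_B2)?$")),
]

def classify_library_name(name):
    """Return the label of the first pattern that matches, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"

# A few names copied from the listing above.
examples = [
    "KTN102Blood", "KTN1020", "KTN1022", "KTN102OP",
    "KTN1020cells17", "KTN102OPcells184",
    "KTN102_0_10_B2", "KTN102_2_Pop", "KTN102_OP_4",
]
counts = Counter(classify_library_name(name) for name in examples)
print(counts)
```

Applying classify_library_name to every value of library_name and counting the "unknown" label would show how well these patterns generalize beyond KTN102.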
# let's check the total data size of each row to find the bulk samples
# fastq_bytes holds one size per FASTQ file, separated by ";"
# keep a list of arrays: rows have 1 to 3 files, and np.asarray on a ragged list fails on recent NumPy
ret = [np.asarray(str(row).split(";"), dtype=np.int64) for row in df2['fastq_bytes']]
ret2 = np.asarray([np.sum(row) / 1e9 for row in ret]) # total size per row in GB
df2[ret2 > 1]
| | study_accession | sample_accession | secondary_sample_accession | experiment_accession | run_accession | tax_id | scientific_name | instrument_model | library_name | library_layout | fastq_bytes | fastq_ftp |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PRJNA396019 | SAMN07457099 | SRS2412441 | SRX3067795 | SRR5906250 | 9606 | Homo sapiens | Illumina HiSeq 4000 | KTN102Blood | PAIRED | 46088894;3107017681;3884543372 | ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/000/SRR590... |
1 | PRJNA396019 | SAMN07457098 | SRS2412440 | SRX3067794 | SRR5906251 | 9606 | Homo sapiens | Illumina HiSeq 4000 | KTN102OP | PAIRED | 55533991;3032757154;3772164832 | ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/001/SRR590... |
2 | PRJNA396019 | SAMN07457097 | SRS2412442 | SRX3067793 | SRR5906252 | 9606 | Homo sapiens | Illumina HiSeq 4000 | KTN1022 | PAIRED | 43518505;2498303748;3015988613 | ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/002/SRR590... |
3 | PRJNA396019 | SAMN07457096 | SRS2412443 | SRX3067792 | SRR5906253 | 9606 | Homo sapiens | Illumina HiSeq 4000 | KTN1020 | PAIRED | 41991292;3179486152;3911231851 | ftp.sra.ebi.ac.uk/vol1/fastq/SRR590/003/SRR590... |
ret2[ret2 > 1] # large files probably correspond to bulk, matching our intuition
array([ 7.03764995, 6.86045598, 5.55781087, 7.13270929])
# let's check the number of files associated with each row
num_files = np.array(list(map(lambda row: row.shape[0], ret)))
s1 = sorted(df2[num_files == 1]["library_name"])
print(len(s1)) # 143
s1
143
['KTN102_0_10',
'KTN102_0_11',
'KTN102_0_12',
'KTN102_0_13',
'KTN102_0_14',
'KTN102_0_16',
'KTN102_0_17',
'KTN102_0_18',
'KTN102_0_19',
'KTN102_0_2',
'KTN102_0_20',
'KTN102_0_21',
'KTN102_0_22',
'KTN102_0_23',
'KTN102_0_24',
'KTN102_0_25',
'KTN102_0_26',
'KTN102_0_27',
'KTN102_0_28',
'KTN102_0_29',
'KTN102_0_3',
'KTN102_0_30',
'KTN102_0_31',
'KTN102_0_32',
'KTN102_0_33',
'KTN102_0_34',
'KTN102_0_35',
'KTN102_0_36',
'KTN102_0_37',
'KTN102_0_38',
'KTN102_0_39',
'KTN102_0_4',
'KTN102_0_40',
'KTN102_0_41',
'KTN102_0_42',
'KTN102_0_43',
'KTN102_0_44',
'KTN102_0_45',
'KTN102_0_46',
'KTN102_0_47',
'KTN102_0_48',
'KTN102_0_5',
'KTN102_0_6',
'KTN102_0_7',
'KTN102_0_8',
'KTN102_0_9',
'KTN102_0_Pop',
'KTN102_2_01',
'KTN102_2_02',
'KTN102_2_03',
'KTN102_2_04',
'KTN102_2_05',
'KTN102_2_06',
'KTN102_2_07',
'KTN102_2_08',
'KTN102_2_09',
'KTN102_2_10',
'KTN102_2_11',
'KTN102_2_12',
'KTN102_2_13',
'KTN102_2_14',
'KTN102_2_15',
'KTN102_2_16',
'KTN102_2_17',
'KTN102_2_18',
'KTN102_2_19',
'KTN102_2_20',
'KTN102_2_21',
'KTN102_2_22',
'KTN102_2_23',
'KTN102_2_24',
'KTN102_2_25',
'KTN102_2_26',
'KTN102_2_27',
'KTN102_2_28',
'KTN102_2_29',
'KTN102_2_30',
'KTN102_2_31',
'KTN102_2_32',
'KTN102_2_33',
'KTN102_2_34',
'KTN102_2_35',
'KTN102_2_36',
'KTN102_2_37',
'KTN102_2_38',
'KTN102_2_39',
'KTN102_2_40',
'KTN102_2_41',
'KTN102_2_42',
'KTN102_2_43',
'KTN102_2_44',
'KTN102_2_45',
'KTN102_2_46',
'KTN102_2_47',
'KTN102_2_Pop',
'KTN102_OP_1',
'KTN102_OP_10',
'KTN102_OP_11',
'KTN102_OP_12',
'KTN102_OP_13',
'KTN102_OP_14',
'KTN102_OP_15',
'KTN102_OP_16',
'KTN102_OP_17',
'KTN102_OP_18',
'KTN102_OP_19',
'KTN102_OP_2',
'KTN102_OP_20',
'KTN102_OP_21',
'KTN102_OP_22',
'KTN102_OP_23',
'KTN102_OP_24',
'KTN102_OP_25',
'KTN102_OP_26',
'KTN102_OP_27',
'KTN102_OP_28',
'KTN102_OP_29',
'KTN102_OP_30',
'KTN102_OP_31',
'KTN102_OP_32',
'KTN102_OP_33',
'KTN102_OP_34',
'KTN102_OP_35',
'KTN102_OP_36',
'KTN102_OP_37',
'KTN102_OP_38',
'KTN102_OP_39',
'KTN102_OP_4',
'KTN102_OP_40',
'KTN102_OP_41',
'KTN102_OP_42',
'KTN102_OP_43',
'KTN102_OP_44',
'KTN102_OP_45',
'KTN102_OP_46',
'KTN102_OP_47',
'KTN102_OP_48',
'KTN102_OP_5',
'KTN102_OP_6',
'KTN102_OP_7',
'KTN102_OP_8',
'KTN102_OP_9',
'KTN102_OP_Pop']
s2 = sorted(df2[num_files == 2]["library_name"])
print(len(s2)) # 275
s2
275
['KTN1020cells1',
'KTN1020cells10',
'KTN1020cells11',
'KTN1020cells12',
'KTN1020cells13',
'KTN1020cells14',
'KTN1020cells15',
'KTN1020cells16',
'KTN1020cells17',
'KTN1020cells18',
'KTN1020cells19',
'KTN1020cells2',
'KTN1020cells20',
'KTN1020cells21',
'KTN1020cells22',
'KTN1020cells23',
'KTN1020cells24',
'KTN1020cells25',
'KTN1020cells26',
'KTN1020cells27',
'KTN1020cells28',
'KTN1020cells29',
'KTN1020cells3',
'KTN1020cells30',
'KTN1020cells31',
'KTN1020cells32',
'KTN1020cells33',
'KTN1020cells34',
'KTN1020cells35',
'KTN1020cells36',
'KTN1020cells37',
'KTN1020cells38',
'KTN1020cells39',
'KTN1020cells4',
'KTN1020cells40',
'KTN1020cells41',
'KTN1020cells42',
'KTN1020cells43',
'KTN1020cells44',
'KTN1020cells45',
'KTN1020cells46',
'KTN1020cells47',
'KTN1020cells48',
'KTN1020cells49',
'KTN1020cells5',
'KTN1020cells50',
'KTN1020cells51',
'KTN1020cells52',
'KTN1020cells53',
'KTN1020cells54',
'KTN1020cells55',
'KTN1020cells56',
'KTN1020cells57',
'KTN1020cells58',
'KTN1020cells59',
'KTN1020cells6',
'KTN1020cells60',
'KTN1020cells61',
'KTN1020cells62',
'KTN1020cells63',
'KTN1020cells64',
'KTN1020cells65',
'KTN1020cells66',
'KTN1020cells67',
'KTN1020cells68',
'KTN1020cells69',
'KTN1020cells7',
'KTN1020cells70',
'KTN1020cells71',
'KTN1020cells72',
'KTN1020cells73',
'KTN1020cells74',
'KTN1020cells75',
'KTN1020cells76',
'KTN1020cells77',
'KTN1020cells78',
'KTN1020cells79',
'KTN1020cells8',
'KTN1020cells80',
'KTN1020cells81',
'KTN1020cells82',
'KTN1020cells83',
'KTN1020cells84',
'KTN1020cells85',
'KTN1020cells86',
'KTN1020cells87',
'KTN1020cells88',
'KTN1020cells89',
'KTN1020cells9',
'KTN1020cells90',
'KTN1020cells91',
'KTN102OPcells1',
'KTN102OPcells10',
'KTN102OPcells100',
'KTN102OPcells101',
'KTN102OPcells102',
'KTN102OPcells103',
'KTN102OPcells104',
'KTN102OPcells105',
'KTN102OPcells106',
'KTN102OPcells107',
'KTN102OPcells108',
'KTN102OPcells109',
'KTN102OPcells11',
'KTN102OPcells110',
'KTN102OPcells111',
'KTN102OPcells112',
'KTN102OPcells113',
'KTN102OPcells114',
'KTN102OPcells115',
'KTN102OPcells116',
'KTN102OPcells117',
'KTN102OPcells118',
'KTN102OPcells119',
'KTN102OPcells12',
'KTN102OPcells120',
'KTN102OPcells121',
'KTN102OPcells122',
'KTN102OPcells123',
'KTN102OPcells124',
'KTN102OPcells125',
'KTN102OPcells126',
'KTN102OPcells127',
'KTN102OPcells128',
'KTN102OPcells129',
'KTN102OPcells13',
'KTN102OPcells130',
'KTN102OPcells131',
'KTN102OPcells132',
'KTN102OPcells133',
'KTN102OPcells134',
'KTN102OPcells135',
'KTN102OPcells136',
'KTN102OPcells137',
'KTN102OPcells138',
'KTN102OPcells139',
'KTN102OPcells14',
'KTN102OPcells140',
'KTN102OPcells141',
'KTN102OPcells142',
'KTN102OPcells143',
'KTN102OPcells144',
'KTN102OPcells145',
'KTN102OPcells146',
'KTN102OPcells147',
'KTN102OPcells148',
'KTN102OPcells149',
'KTN102OPcells15',
'KTN102OPcells150',
'KTN102OPcells151',
'KTN102OPcells152',
'KTN102OPcells153',
'KTN102OPcells154',
'KTN102OPcells155',
'KTN102OPcells156',
'KTN102OPcells157',
'KTN102OPcells158',
'KTN102OPcells159',
'KTN102OPcells16',
'KTN102OPcells160',
'KTN102OPcells161',
'KTN102OPcells162',
'KTN102OPcells163',
'KTN102OPcells164',
'KTN102OPcells165',
'KTN102OPcells166',
'KTN102OPcells167',
'KTN102OPcells168',
'KTN102OPcells169',
'KTN102OPcells17',
'KTN102OPcells170',
'KTN102OPcells171',
'KTN102OPcells172',
'KTN102OPcells173',
'KTN102OPcells174',
'KTN102OPcells175',
'KTN102OPcells176',
'KTN102OPcells177',
'KTN102OPcells178',
'KTN102OPcells179',
'KTN102OPcells18',
'KTN102OPcells180',
'KTN102OPcells181',
'KTN102OPcells182',
'KTN102OPcells183',
'KTN102OPcells184',
'KTN102OPcells19',
'KTN102OPcells2',
'KTN102OPcells20',
'KTN102OPcells21',
'KTN102OPcells22',
'KTN102OPcells23',
'KTN102OPcells24',
'KTN102OPcells25',
'KTN102OPcells26',
'KTN102OPcells27',
'KTN102OPcells28',
'KTN102OPcells29',
'KTN102OPcells3',
'KTN102OPcells30',
'KTN102OPcells31',
'KTN102OPcells32',
'KTN102OPcells33',
'KTN102OPcells34',
'KTN102OPcells35',
'KTN102OPcells36',
'KTN102OPcells37',
'KTN102OPcells38',
'KTN102OPcells39',
'KTN102OPcells4',
'KTN102OPcells40',
'KTN102OPcells41',
'KTN102OPcells42',
'KTN102OPcells43',
'KTN102OPcells44',
'KTN102OPcells45',
'KTN102OPcells46',
'KTN102OPcells47',
'KTN102OPcells48',
'KTN102OPcells49',
'KTN102OPcells5',
'KTN102OPcells50',
'KTN102OPcells51',
'KTN102OPcells52',
'KTN102OPcells53',
'KTN102OPcells54',
'KTN102OPcells55',
'KTN102OPcells56',
'KTN102OPcells57',
'KTN102OPcells58',
'KTN102OPcells59',
'KTN102OPcells6',
'KTN102OPcells60',
'KTN102OPcells61',
'KTN102OPcells62',
'KTN102OPcells63',
'KTN102OPcells64',
'KTN102OPcells65',
'KTN102OPcells66',
'KTN102OPcells67',
'KTN102OPcells68',
'KTN102OPcells69',
'KTN102OPcells7',
'KTN102OPcells70',
'KTN102OPcells71',
'KTN102OPcells72',
'KTN102OPcells73',
'KTN102OPcells74',
'KTN102OPcells75',
'KTN102OPcells76',
'KTN102OPcells77',
'KTN102OPcells78',
'KTN102OPcells79',
'KTN102OPcells8',
'KTN102OPcells80',
'KTN102OPcells81',
'KTN102OPcells82',
'KTN102OPcells83',
'KTN102OPcells84',
'KTN102OPcells85',
'KTN102OPcells86',
'KTN102OPcells87',
'KTN102OPcells88',
'KTN102OPcells89',
'KTN102OPcells9',
'KTN102OPcells90',
'KTN102OPcells91',
'KTN102OPcells92',
'KTN102OPcells93',
'KTN102OPcells94',
'KTN102OPcells95',
'KTN102OPcells96',
'KTN102OPcells97',
'KTN102OPcells98',
'KTN102OPcells99']
s3 = sorted(df2[num_files == 3]["library_name"])
print(len(s3)) # 93
s3
93
['KTN1020',
'KTN1022',
'KTN102Blood',
'KTN102OP',
'KTN102_0_10_B2',
'KTN102_0_11_B2',
'KTN102_0_12_B2',
'KTN102_0_14_B2',
'KTN102_0_15_B2',
'KTN102_0_16_B2',
'KTN102_0_17_B2',
'KTN102_0_18_B2',
'KTN102_0_19_B2',
'KTN102_0_1_B2',
'KTN102_0_20_B2',
'KTN102_0_22_B2',
'KTN102_0_23_B2',
'KTN102_0_24_B2',
'KTN102_0_25_B2',
'KTN102_0_26_B2',
'KTN102_0_27_B2',
'KTN102_0_28_B2',
'KTN102_0_29_B2',
'KTN102_0_2_B2',
'KTN102_0_30_B2',
'KTN102_0_31_B2',
'KTN102_0_32_B2',
'KTN102_0_33_B2',
'KTN102_0_34_B2',
'KTN102_0_35_B2',
'KTN102_0_36_B2',
'KTN102_0_37_B2',
'KTN102_0_38_B2',
'KTN102_0_39_B2',
'KTN102_0_40_B2',
'KTN102_0_41_B2',
'KTN102_0_43_B2',
'KTN102_0_44_B2',
'KTN102_0_45_B2',
'KTN102_0_46_B2',
'KTN102_0_47_B2',
'KTN102_0_48_B2',
'KTN102_0_4_B2',
'KTN102_0_5_B2',
'KTN102_0_6_B2',
'KTN102_0_7_B2',
'KTN102_0_8_B2',
'KTN102_0_9_B2',
'KTN102_2_10_B2',
'KTN102_2_11_B2',
'KTN102_2_12_B2',
'KTN102_2_13_B2',
'KTN102_2_14_B2',
'KTN102_2_15_B2',
'KTN102_2_16_B2',
'KTN102_2_17_B2',
'KTN102_2_18_B2',
'KTN102_2_19_B2',
'KTN102_2_1_B2',
'KTN102_2_20_B2',
'KTN102_2_22_B2',
'KTN102_2_23_B2',
'KTN102_2_24_B2',
'KTN102_2_25_B2',
'KTN102_2_27_B2',
'KTN102_2_28_B2',
'KTN102_2_2_B2',
'KTN102_2_30_B2',
'KTN102_2_31_B2',
'KTN102_2_32_B2',
'KTN102_2_33_B2',
'KTN102_2_34_B2',
'KTN102_2_35_B2',
'KTN102_2_36_B2',
'KTN102_2_37_B2',
'KTN102_2_38_B2',
'KTN102_2_39_B2',
'KTN102_2_3_B2',
'KTN102_2_40_B2',
'KTN102_2_41_B2',
'KTN102_2_42_B2',
'KTN102_2_43_B2',
'KTN102_2_44_B2',
'KTN102_2_45_B2',
'KTN102_2_46_B2',
'KTN102_2_47_B2',
'KTN102_2_48_B2',
'KTN102_2_4_B2',
'KTN102_2_5_B2',
'KTN102_2_6_B2',
'KTN102_2_7_B2',
'KTN102_2_8_B2',
'KTN102_2_9_B2']
- Rows associated with a single file seem to be scRNA data.
- Rows associated with two files seem to be scDNA data.
- Rows associated with three files seem to be bulk and/or single cell data (this will have to be determined by reading the paper), but the file sizes make it clear which ones correspond to bulk data.
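The three observations above can be combined into a rough labelling rule. The following is a minimal sketch, assuming the 1 GB bulk cutoff used earlier and treating the scRNA and scDNA labels as guesses rather than confirmed facts. The helper label_row is hypothetical, and only the first example string is taken from the table above; the small byte values are made up.

```python
def label_row(fastq_bytes, bulk_threshold_gb=1.0):
    """Guess the data type of one row from its ';'-separated FASTQ file sizes."""
    sizes = [int(s) for s in str(fastq_bytes).split(";") if s]
    total_gb = sum(sizes) / 1e9
    if total_gb > bulk_threshold_gb:
        return "bulk"            # large total size: bulk sample
    if len(sizes) == 1:
        return "scRNA (guess)"   # single file
    if len(sizes) == 2:
        return "scDNA (guess)"   # two files
    return "undetermined"        # small three-file rows need the paper to resolve

examples = {
    "46088894;3107017681;3884543372": None,  # KTN102Blood, from the table above
    "1200000": None,                         # made-up single-file row
    "50000000;52000000": None,               # made-up two-file row
    "100;200;300": None,                     # made-up small three-file row
}
for value in examples:
    examples[value] = label_row(value)
    print(value, "->", examples[value])
```

On the real table this could be applied with df2['fastq_bytes'].map(label_row), keeping in mind that the labels are only as reliable as the heuristics behind them.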
Details from the paper
The goal of the paper is to study resistance to chemotherapy in triple-negative breast cancer (TNBC). To that end, the authors applied scDNA, scRNA, and bulk exome sequencing to profile longitudinal samples from 20 TNBC patients during chemotherapy.
Deep exome sequencing revealed that chemotherapy led to clonal extinction (removal of the cancer clone) in 10 patients, while the clone persisted after chemotherapy in the other 10. In 8 patients, a more detailed study was carried out using scDNA-seq to analyze 900 cells and scRNA-seq to analyze 6,862 cells.
An excerpt from the paper:
… single-nucleus RNA-sequencing (SNRS) method. SNRS performs automated imaging and selection of up to 1,800 single nuclei in parallel for 3’ mRNA profiling. We profiled the transcriptomes of 3,370 single nuclei isolated from two matched longitudinal samples per patient from the four clonal extinction patients (P1, P2, P6, P9).
From this excerpt, it appears that the RNA sequencing covers only the 3’ end of each transcript, which supports the suspicion that records associated with a single file are RNA-seq data.
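One further check that the table itself allows is to compare the library_layout column with the number of files per row. This sketch uses a made-up toy table, since the layout values of the single-file rows are not shown on this page; whether they are actually marked SINGLE is an assumption to verify on the real data.

```python
import pandas as pd

# Toy stand-in for df2; the layouts and byte values are made up for illustration.
toy = pd.DataFrame({
    "library_layout": ["SINGLE", "SINGLE", "PAIRED", "PAIRED", "PAIRED"],
    "fastq_bytes": ["100", "200", "300;400", "10;500;600", "700;800"],
})

# Number of FASTQ files per row: count the ';'-separated entries.
num_files = toy["fastq_bytes"].astype(str).str.split(";").str.len()

# Cross-tabulate the declared layout against the observed file count.
table = pd.crosstab(toy["library_layout"], num_files.rename("num_files"))
print(table)
```

If the single-file rows in the real table turn out to be labelled SINGLE, that would be consistent with the single-file records being the 3’ RNA-seq data.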