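I have two DataFrames and am looking for a way to select (and count) the rows of df1 based on the rows in df2, i.e. to select rows according to the rows of a second DataFrame.
This is my df1: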
Chromosome Start position End position Reference Variant reads \
0 chr1 109419841 109419841 C T 1
1 chr1 197008365 197008365 C T 1
variation reads % variation gDNA nomencl \
0 1 100 Chr1(GRCh37):g.109419841C>T
1 1 100 Chr1(GRCh37):g.197008365C>T
cDNA nomencl ... exon transcript ID inheritance \
0 NM_013296.4:c.-258C>T ... 2 NM_013296.4 Autosomal recessive
1 NM_001994.2:c.*143G>A ... UTR NM_001994.2 Autosomal recessive
test type Phenotype male coverage male ratio covered \
0 Unknown Deafness, autosomal recessief 0 0
1 Unknown Factor 13 deficientie 0 0
female coverage female ratio covered ratio M:F
0 1 1 0.0
1 1 1 0.0
df1 has these columns:
Chromosome 10561 non-null object
Start position 10561 non-null int64
End position 10561 non-null int64
Reference 10415 non-null object
Variant 10536 non-null object
reads 10561 non-null int64
variation reads 10561 non-null int64
% variation 10561 non-null int64
gDNA nomencl 10561 non-null object
cDNA nomencl 10446 non-null object
protein nomencl 9997 non-null object
classification 10561 non-null object
status 10561 non-null object
gene 10560 non-null object
Sanger sequencing list 10561 non-null object
exon 10502 non-null object
transcript ID 10460 non-null object
inheritance 8259 non-null object
test type 10561 non-null object
Phenotype 10380 non-null object
male coverage 10561 non-null int64
male ratio covered 10561 non-null int64
female coverage 10561 non-null int64
female ratio covered 10561 non-null int64
And this is df2:
Chromosome Startposition Endposition Bases Meancoverage \
0 chr1 11073785 11074022 27831.0 117.927966
1 chr1 11076901 11077064 11803.0 72.411043
Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \
0 97.0 1.0 1.0 TARDBP 1
1 76.0 1.0 1.0 TARDBP 2
PositionGenes PositionGenome Position
0 TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022
1 TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064
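I want to select all rows of df1 that fall within a row of df2, meaning:
- the same value in 'Chromosome'
- df1['Start position'] >= df2.Startposition
- df1['End position'] <= df2.Endposition
If all three conditions are met by the same row of df2, I want to select the corresponding row of df1.
I have already merged the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to build a lambda function, but I could not come up with anything that works.
So I hope you can help me...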
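Please check this [answer](http://stackoverflow.com/a/34953669/2901002) – jezrael
@jezrael. If I try the answer you suggested, I get a memory error on pd.merge(df1, df2, on=['Chromosome']). df1 has > 10,000 rows and df2 has > 6 million rows. I have already reduced the DataFrames to the few columns needed for the task, but I still get the same error. – SGeuer
Indeed, that is a problem with large DataFrames... unfortunately. – jezrael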
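Since merging on 'Chromosome' alone runs out of memory, one possible way around it is to avoid the merge entirely and do a sorted lookup per chromosome. The following is only a sketch, not a tested solution for the real data: it assumes the column names shown in the printouts above, the function name `select_contained_rows` is made up for illustration, and the two small DataFrames are toy data. For each chromosome it sorts the df2 regions by 'Startposition', keeps a running maximum of 'Endposition', and uses `np.searchsorted` to find the last region that starts at or before each df1 start. A df1 row is kept if that running maximum reaches its end position, which means at least one region on the same chromosome contains it.

```python
import numpy as np
import pandas as pd


def select_contained_rows(df1, df2):
    """Return the df1 rows whose [Start position, End position] range lies
    inside at least one df2 region on the same chromosome.
    (Hypothetical helper; column names are taken from the question.)"""
    keep = np.zeros(len(df1), dtype=bool)
    # Positional row indices of df1, grouped by chromosome.
    df1_positions = df1.groupby("Chromosome").indices
    q_start_all = df1["Start position"].to_numpy()
    q_end_all = df1["End position"].to_numpy()

    for chrom, regions in df2.groupby("Chromosome"):
        pos = df1_positions.get(chrom)
        if pos is None:
            continue  # no df1 rows on this chromosome

        # Sort regions by start and track the largest end seen so far.
        order = np.argsort(regions["Startposition"].to_numpy(), kind="stable")
        starts = regions["Startposition"].to_numpy()[order]
        max_end = np.maximum.accumulate(regions["Endposition"].to_numpy()[order])

        q_start = q_start_all[pos]
        q_end = q_end_all[pos]

        # Index of the last region with Startposition <= query start (-1 if none).
        last = np.searchsorted(starts, q_start, side="right") - 1
        has_candidate = last >= 0

        hit = np.zeros(len(pos), dtype=bool)
        # Some region starting at or before q_start ends at or after q_end.
        hit[has_candidate] = max_end[last[has_candidate]] >= q_end[has_candidate]
        keep[pos] = hit

    return df1[keep]


# Toy data with the same column names as in the question.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2", "chr1"],
    "Start position": [11073800, 109419841, 11074020, 100, 5],
    "End position": [11073800, 109419841, 11074030, 100, 5],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Startposition": [11073785, 11076901, 50],
    "Endposition": [11074022, 11077064, 60],
})

selected = select_contained_rows(df1, df2)
n_selected = len(selected)  # the count asked for in the question
```

Memory use stays roughly proportional to the size of the largest chromosome group, because no pairwise combination of df1 and df2 rows is ever built. Note that this only answers whether a df1 row is covered by some df2 region; it does not report which df2 row matched, so that would need an extra step if the region details are required.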