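Conditionally removing rows not working as expected in pandas

I have a DataFrame with a Sample column that contains duplicate samples (ending in _2) and a column, same, that states which sample is the original. The New Category column holds a mutation type, with Pathogenic/Likely Pathogenic being the most damaging and Likely Benign the least damaging. A simplified version of my DataFrame is shown below.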
import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS']
                  ])
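I would like to conditionally remove either the sample or its duplicate, depending on which one has the least damaging New Category. For example, if a sample is Likely Benign and its duplicate has a Pathogenic/Likely Pathogenic variant, I want to drop the sample row.
I tried to do this by passing the DataFrame to a function that returns a list of the indices of the rows to delete, and then dropping those rows.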
def get_unwanted_duplicates_ix(df):
    # filter df for samples that have a duplicate
    same_only = df.groupby("same").filter(lambda x: len(x) > 1)
    list_index_to_delete = []
    for num in range(0, same_only.shape[0] - 1):
        # .iloc[] is the positional lookup (replaces the removed .irow())
        row1 = same_only.iloc[num]
        row2 = same_only.iloc[num + 1]
        index = list(same_only.index.values)[num]
        if row1['Sample'] + "_2" == row2['Sample'] or \
                row1['Sample'] == row2['Sample'] + "_2":
            if row1['New Category'] == row2['New Category']:
                list_index_to_delete.append(index + 1)
            elif row1['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row2['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index + 1)
            elif row2['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row1['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index)
            elif row1['New Category'] == "VUS" \
                    and row2['New Category'] != "VUS":
                list_index_to_delete.append(index + 1)
            elif row2['New Category'] == "VUS" \
                    and row1['New Category'] != "VUS":
                list_index_to_delete.append(index)
            elif row1['New Category'] == 'Likely Benign' \
                    and row2['New Category'] == 'Likely Benign':
                list_index_to_delete.append(index + 1)
            else:
                list_index_to_delete.append(index + 1)
    return list_index_to_delete

unwanted = get_unwanted_duplicates_ix(df)
df = df.drop(df.index[unwanted])
The function above is a mess and, unsurprisingly, does not work the way I hoped. A pointer in the right direction would be much appreciated.
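Is this what you want, or do you want to group by something other than the 'same' column? If not, please add the desired output to the question.
Rather than converting and comparing against the max (which will return multiple samples for groups that have more than one maximum), I'd suggest sorting by the new category codes in descending order and then applying `groupby('same').first()` instead (or sorting in ascending order and then applying `.last()`, whichever you prefer).
@JonClements Thanks, I've updated the answer.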
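For anyone reading later, here is a minimal sketch of the sort-then-take-first idea from the comments, not necessarily what the updated answer does. It makes two assumptions: VUS is ranked between the two categories the question names (inferred from the branches of the question's own function), and the `severity` mapping and `_rank` helper column are illustrative names that appear nowhere in the thread. It uses `drop_duplicates(subset='same', keep='first')` in place of `groupby('same').first()`, because that keeps whole rows and the original index.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["Sample", "same", "New Category"],
    data=[
        ["HG_12_34", "HG_12_34", "Pathogenic/Likely Pathogenic"],
        ["HG_12_34_2", "HG_12_34", "Likely Benign"],
        ["KD_89_9", "KD_89_9", "Likely Benign"],
        ["KD_98_9_2", "KD_89_9", "Likely Benign"],
        ["LG_3_45", "LG_3_45", "Likely Benign"],
        ["LG_3_45_2", "LG_3_45", "VUS"],
    ],
)

# Assumed ranking, most damaging first (0 = keep in preference to the others).
# Hypothetical mapping: extend it if the real data has more categories.
severity = {"Pathogenic/Likely Pathogenic": 0, "VUS": 1, "Likely Benign": 2}

result = (
    df.assign(_rank=df["New Category"].map(severity))
      # mergesort is stable, so rows with equal rank keep their original order
      .sort_values("_rank", kind="mergesort")
      # the first row per 'same' group is now the most damaging one
      .drop_duplicates(subset="same", keep="first")
      .drop(columns="_rank")
      .sort_index()
)

print(result)
```

Because the sort is stable, a tie (such as the two Likely Benign rows in the KD_89_9 group) keeps whichever row came first in the original frame, which here is the original sample rather than the _2 duplicate. Grouping on 'same' instead of comparing Sample names also means a mismatch such as KD_98_9_2 versus KD_89_9 does not matter.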