Keep the rows with the maximum value of a specific column

I'm new to Python and I'd like to do the following. I have a csv file (input.csv) that contains a header row and 4 columns. Part of this csv file looks like this:

gene-name p-value stepup(p-value) fold-change 
IFIT1 6.79175E-005 0.0874312 96.0464 
IFITM1 0.00304362 0.290752 86.3192 
IFIT1 0.000439152 0.145488 81.499 
IFIT3 5.87135E-005 0.0838258 77.1737 
RSAD2 6.7615E-006 0.0685623 141.898 
RSAD2 3.98875E-005 0.0760279 136.772 
IFITM1 0.00176673 0.230063 72.0445

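I want to keep only the rows with the highest fold-change value, and delete every other row that has the same gene name and a lower fold-change. For example, in this case I need a CSV output file in the following format: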

gene-name p-value stepup(p-value) fold-change 
IFIT1 6.79175E-005 0.0874312 96.0464 
IFITM1 0.00304362 0.290752 86.3192 
RSAD2 6.7615E-006 0.0685623 141.898 
IFIT3 5.87135E-005 0.0838258 77.1737

I would be grateful if you could give me a solution to this problem.
Thank you very much.

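Source

2017-07-21 Python kindergarten developer

What have you tried so far? –

Have you tried anything? Post your code.... – Dadep

I tried sorting by name first and then using df.sort to keep the first, highest fold-change value for each gene, but it didn't work. –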
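The sort-then-keep-first idea from the comment above can be sketched in pandas roughly as follows. This is a minimal sketch, not the asker's code: the helper name `keep_max_by_sort`, the in-memory `SAMPLE` text (copied from the question) and the `sep=r"\s+"` whitespace delimiter are all assumptions. In current pandas the sorting method is `sort_values`.

```python
import io
import pandas as pd

# Sample rows copied from the question. A real script would pass the file
# path instead of a StringIO. sep=r"\s+" ASSUMES whitespace-delimited columns.
SAMPLE = """gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
"""


def keep_max_by_sort(df, key="gene-name", value="fold-change"):
    """Sort by `value` descending, then keep the first row seen for each `key`."""
    # mergesort is stable, so tied values keep their original relative order
    ordered = df.sort_values(value, ascending=False, kind="mergesort")
    return ordered.drop_duplicates(subset=key, keep="first")


df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
result = keep_max_by_sort(df)
print(result.to_string(index=False))
```

The result comes back ordered by fold-change, highest first. Calling `result.sort_index()` would restore the input order, and `result.to_csv(...)` would write it out.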

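Naive solution: walk through every line of the file and do the comparison by hand. Assumptions:

Columns are separated by a space.
The expected number of result rows fits in memory, because we have to finish the whole search before flushing the results to the file.
It is not optimized, so it scales poorly in speed: it scans the full result list for every input line.
If the same fold-change shows up again later for a gene, you want to keep that gene's first row.

The code: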

fi = open('inputfile.csv', 'r')  # open the input file for reading

header = fi.readline()
# capture the header line ("gene-name p-value stepup(p-value) fold-change")

out_a = []  # we will store the results in here

for line in fi:  # iterate over the remaining lines
    temp_a = line.split()
    # split on any run of whitespace (spaces or tabs); this also drops
    # the newline and tolerates trailing spaces
    if not temp_a:
        continue  # skip blank lines, which would raise IndexError below

    try:
        pos = [gene[0] for gene in out_a].index(temp_a[0])
        # check whether the gene has been seen before
        # [0] is the first column (gene-name)
        # index() returns the position in out_a of the existing gene
    except ValueError:  # Python raises this if the value is not found
        out_a.append(temp_a)
        # first time we see this gene: add it to the list
    else:  # we found an existing gene
        if float(temp_a[3]) > float(out_a[pos][3]):
            # the new line has a higher fold-change (column 4)
            out_a[pos] = temp_a
            # so we replace the stored row

fi.close()  # we're done with our input file
fo = open('outfile.csv', 'w')  # prepare to write the output
fo.write(header)  # don't forget about our header
for result in out_a:
    # iterate through out_a and write each row to fo
    fo.write(' '.join(result) + '\n')
    # result is a list of fields; ' '.join(result) turns it back into a
    # line, and '\n' puts each result on its own line

fo.close()

One advantage is that it preserves the order in which each gene was first encountered in the input file.
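The per-line scan of the result list is what makes the approach above scale poorly. A minimal sketch of the same idea with a dict lookup instead, assuming Python 3.7+ (where dicts keep insertion order) and whitespace-separated columns; the function name `keep_max_rows` and the inline `sample` text are illustrative and not part of the answer:

```python
def keep_max_rows(lines, key_col=0, value_col=3):
    """Return the header plus one row per key: the row with the largest value."""
    rows = iter(lines)
    header = next(rows).split()
    best = {}  # gene name -> list of fields; dicts preserve insertion order
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[key_col]
        # Strict ">" keeps the earlier row when two fold-changes tie.
        if key not in best or float(fields[value_col]) > float(best[key][value_col]):
            best[key] = fields  # updating a key does not change its position
    return [header] + list(best.values())


sample = """gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
"""

result = keep_max_rows(sample.splitlines())
for row in result:
    print(' '.join(row))
```

Each gene lookup is then a single hash probe, and the first-encountered order of the genes is still preserved.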

Source

2017-07-21 18:24:15 cowbert

Unfortunately I get an error: temp_a[0].append(temp_a) AttributeError: 'str' object has no attribute 'append'. Why do we get this error, @cowbert? –

Reload the page; that was due to a typo. – cowbert

Unfortunately I get a new error: if float(temp_a[3]) > float(out_a[pos][3]): IndexError: list index out of range. How can we fix it? –

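Try using pandas:

The code is long because I chose to do it in one line, but here is what it does: I grab the gene-name with the highest fold-change, then I use the != operator to say "grab me everything where gene-name is not the same as the gene-name we just computed".

Breakdown: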

# gets the max value in fold-change 
max_value = df['fold-change'].max() 

# gets the gene name of that max value 
gene_name_max = df.loc[df['fold-change'] == max_value]['gene-name'] 

# reassigning so you see the progression of grabbing the name 
gene_name_max = gene_name_max.values[0] 

# the final output 
df.loc[(df['gene-name'] != gene_name_max)]

Output:

gene-name p-value stepup(p-value) fold-change 
0 IFIT1 0.000068 0.087431 96.0464 
1 IFITM1 0.003044 0.290752 86.3192 
2 IFIT1 0.000439 0.145488 81.4990 
3 IFIT3 0.000059 0.083826 77.1737 
6 IFITM1 0.001767 0.230063 72.0445

Edit:

To get the expected output, use groupby:

import pandas as pd

df = pd.read_csv('YOUR_PATH_HERE')
df.groupby(['gene-name'], sort=False)['fold-change'].max()

# output below
# gene-name
# IFIT1      96.0464
# IFITM1     86.3192
# IFIT3      77.1737
# RSAD2     141.8980
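The groupby call above returns only the maximum fold-change per gene, not the whole row. One way to keep the full rows is to ask each group for the index label of its maximum with `idxmax` and select those labels. This is a minimal sketch: the inline `SAMPLE` text (copied from the question) and the `sep=r"\s+"` whitespace delimiter are assumptions, so adjust `sep` to match the real file.

```python
import io
import pandas as pd

# Sample rows copied from the question; replace the StringIO with a file path.
# sep=r"\s+" ASSUMES whitespace-delimited columns.
SAMPLE = """gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# For every gene, find the index label of the row with the largest fold-change.
# sort=False keeps the genes in order of first appearance.
idx = df.groupby("gene-name", sort=False)["fold-change"].idxmax()

# Select those rows; all four columns are kept.
result = df.loc[idx]
print(result.to_string(index=False))
```

When two rows of the same gene tie on fold-change, `idxmax` returns the first of them, which matches the tie rule of the line-by-line solution.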

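Source

2017-07-21 17:46:44 MattR

I'm sorry, but this is not what I want. I need each gene name to appear in a single row, the one with the highest fold-change value. Your script does not delete all the rows that have the same gene name and a lower fold-change value. Is that clear, or do you need more information? –

A bit confused... you need the max value for each gene name? – MattR

@ManolisSemidalas Updated based on your expected output. – MattR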
