Sorting by Specimen ID with a Python script

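I have a Python script that merges data files sharing the same format. It drops only the duplicate headers and inserts two blank lines after every three rows (the first block is four lines long, because it includes the header).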

import glob

read_files = glob.glob("*.txt")

header_saved = False
linecnt = 0
with open("merged_data.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            header = next(infile)
            # Keep only the header from the first file
            if not header_saved:
                outfile.write(header)
                header_saved = True
            for line in infile:
                outfile.write(line)
                linecnt = linecnt + 1
                # Two blank lines after every third data row
                # (bytes literal, because the file is opened in binary mode)
                if (linecnt % 3) == 0:
                    outfile.write(b"\n\n")

Sample input file text (infile 1):

Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes 
a 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
b 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 beak bent 
c 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
d 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
e 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 pronotum deformed 
f 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Sample input file text (infile 2):

Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes 
a 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
b 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
c 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
d 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
e 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
f 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1

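Now I would like to modify the script so that it sorts the output by Specimen_ID while still keeping two blank lines after every three rows (that is, there should be two blank lines after each unique Specimen_ID). Any suggestions on sorting the rows? I have seen a lot about sorting multidimensional data or Python lists, but not much about 2D tables.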

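Also, I have run into some odd behavior: if I export the data from Excel to tab-delimited txt files, this script produces output that contains the contents of the first infile but none of the others. However, if I copy and paste the sample data from this site into txt files and use those as infiles, I have no problems. Does anyone know why I am running into this?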
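One possible explanation, offered as an assumption that nothing in this thread confirms: some Excel exports (older Mac versions in particular) end each row with a bare carriage return instead of a newline. A file opened with "rb" is split on newlines only, so `next(infile)` would consume the whole file as the "header", and only the first file's header would ever be written. The sketch below uses made-up bytes and a hypothetical `read_rows` helper to show the effect.

```python
# Hypothetical bytes standing in for a file whose rows end in a bare "\r".
cr_only = b"Specimen_ID\tSex\ra\tM\rb\tF\r"

# Iterating over a binary file splits on b"\n" only, so with no "\n" present
# the entire file counts as a single line.
pieces = cr_only.split(b"\n")
print(len(pieces))


def read_rows(raw):
    # bytes.splitlines() accepts "\n", "\r\n", and bare "\r" as row endings.
    return [row.decode("utf-8") for row in raw.splitlines() if row.strip()]


rows = read_rows(cr_only)
print(rows)
```

If this is the cause, opening the files in text mode in Python 3, where all three endings are translated by default, should also avoid the problem.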

Source

2017-01-04 Mike F

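Do you need to stick to the Python standard library? Usually when people work with tabular data they use [pandas](http://pandas.pydata.org/). What you are asking for is not hard in pure Python (just use `sorted` with a custom `key` argument), but it would probably be faster and cleaner with pandas. – Paul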
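A minimal sketch of the pure-Python route mentioned in the comment above, using made-up rows and a hypothetical `specimen_id` helper. It assumes the ID is the first whitespace-separated field of each row.

```python
# Made-up rows in the question's space-separated layout (columns truncated).
rows = [
    "c 1 30-Dec-16 M",
    "a 2 30-Dec-16 M",
    "b 1 30-Dec-16 F",
    "a 1 30-Dec-16 M",
]


def specimen_id(row):
    # Hypothetical helper: the ID is assumed to be the first field.
    return row.split()[0]


# sorted() is stable, so rows sharing an ID keep their original relative order.
ordered = sorted(rows, key=specimen_id)
for row in ordered:
    print(row)
```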

Is pandas a Python module? –

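pandas is a Python library; you can follow the link. You install it with 'pip install pandas'. While trying to work out how to do this, I realized that your text input format is somewhat ambiguous. It looks like you are using a space-delimited tabular format, but some entries contain spaces, and the data does not line up when values are missing - for example, "beak bent" appears to fall under "Right_fore_femur_length" rather than "Notes". If possible, it would probably be better to generate these inputs as CSV. – Paul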
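To illustrate the ambiguity described in the comment above, here is a small sketch with a shortened, made-up row. Splitting on whitespace breaks a two-word note into two fields, while a tab-delimited version of the same row keeps it intact.

```python
import csv
import io

# Shortened, made-up row: four ID columns, two measurements, and a note.
space_row = "b 1 30-Dec-16 F 4 4 beak bent"
space_fields = space_row.split()
print(len(space_fields), space_fields[-2:])

# The same row with tabs between fields, parsed with the csv module.
tab_row = "b\t1\t30-Dec-16\tF\t4\t4\tbeak bent\n"
tab_fields = next(csv.reader(io.StringIO(tab_row), delimiter="\t"))
print(len(tab_fields), tab_fields[-1])
```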

I changed your test data into lists of lines. This is roughly equivalent to what the readlines() method would return:

data_1 = """ 
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes 
a 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
b 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 beak bent 
c 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
d 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
e 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 pronotum deformed 
f 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
""".split('\n')[1:-1] 

data_2 = """ 
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes 
a 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
b 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
c 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
d 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
e 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
f 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
""".split('\n')[1:-1]

This program no longer needs to count lines; all of the data is read in before anything is written back out:

headers = []
data = {}

# Go through the data for each file and group the lines by specimen id
for file_data in (data_1, data_2):
    headers.append(file_data[0])
    for line in file_data[1:]:
        # specimen id is the first column of the space-separated data
        specimen_id = line.split(' ', 1)[0].strip()

        # store each line in a list per specimen id
        if specimen_id not in data:
            data[specimen_id] = []
        data[specimen_id].append(line)

# output the merged data, one block per specimen id in sorted order
# (text mode, because the lines are str rather than bytes)
with open("merged_data.txt", "w") as outfile:
    for specimen_id in sorted(data):
        # the header is repeated at the top of each block
        outfile.write(headers[0] + '\n')
        for line in data[specimen_id]:
            outfile.write(line + '\n')
        outfile.write("\n\n")
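A variation on the grouping idea, sketched under a few assumptions: the header is written once (as the question describes), the ID is the first whitespace-separated field, and `group_by_specimen` and `render` are hypothetical helper names. `collections.defaultdict` replaces the explicit membership check.

```python
from collections import defaultdict


def group_by_specimen(files):
    """Return the first header seen and a dict mapping ID to its rows."""
    header = None
    groups = defaultdict(list)
    for file_lines in files:
        if header is None:
            header = file_lines[0]
        for line in file_lines[1:]:
            if line.strip():  # skip empty lines
                groups[line.split(None, 1)[0]].append(line)
    return header, groups


def render(header, groups):
    """Build the output text: header, then each ID block plus two blank lines."""
    out = [header]
    for specimen_id in sorted(groups):
        out.extend(groups[specimen_id])
        out.extend(["", ""])
    return "\n".join(out) + "\n"


# Made-up miniature inputs standing in for data_1 and data_2.
file_a = ["ID N", "b 1", "a 1"]
file_b = ["ID N", "a 2", "b 2"]
header, groups = group_by_specimen([file_a, file_b])
text = render(header, groups)
print(text)
```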

Source

2017-01-04 20:50:20

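This doesn't seem to sort the data... – Paul

It groups the rows, then sorts by specimen ID (a-f in this case) –

Ah, I didn't see the `sorted(data)` there; I was expecting it elsewhere. Fair enough. – Paul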

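I would recommend pandas for working with tabular data, since you can easily read the data in with read_csv(), call sort_values(by='Specimen_ID'), and then iterate over the output to write out the blank lines.

Assuming these input files are tab-delimited, here is how you would read them in and sort them with pandas: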

import glob
try:
    from io import StringIO
except ImportError:
    from StringIO import StringIO

import pandas as pd

dfs = []
for infile in glob.glob('*.txt'):
    # infile can be a file path or an open file object
    df = pd.read_csv(infile, delimiter='\t')
    dfs.append(df)

df = pd.concat(dfs)  # Combine all the dataframes you loaded in.

# sort_values returns a new DataFrame, so keep the result
df = df.sort_values(by='Specimen_ID')

# Write this to an intermediate StringIO object before the next step.
o_s = StringIO()
df.to_csv(o_s, sep='\t', index=False)
o_s.seek(0)
lines = o_s.readlines()  # Get the CSV as a list of lines.

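At this point you want to dump it out. Without your requirement for blank lines after every three rows, you could just call df.to_csv('merged_text.csv', sep='\t', index=False) and be done (sep makes it tab-delimited; index=False is there because pandas adds a numeric index when you read the data in, and there is no point in writing it out). Instead, we read the output into a list of lines so that we can iterate over them and write the extra lines as needed: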

# Write the header, then the data rows three at a time, adding two blank
# lines after each group of three.
with open('merged_data.txt', 'w') as f:
    f.write(lines[0])  # Write the header line
    body = lines[1:]
    for start in range(0, len(body), 3):
        # Each line already ends with a newline, so '\n\n' leaves two blank lines
        f.writelines(body[start:start + 3])
        f.write('\n\n')
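Writing in fixed groups of three only works when every Specimen_ID has exactly three rows. As a hedged alternative, the sketch below breaks whenever the ID changes instead. It assumes the lines are already sorted, tab-delimited, and newline-terminated, with the ID in the first column; `with_group_breaks` is a hypothetical helper name.

```python
from itertools import groupby


def with_group_breaks(lines, blank_lines=2):
    """Return the lines with blank lines added after each run of equal IDs."""
    header, body = lines[0], lines[1:]
    out = [header]
    # groupby only merges adjacent rows, so the input must already be sorted.
    for _, group in groupby(body, key=lambda line: line.split("\t", 1)[0]):
        out.extend(group)
        out.extend(["\n"] * blank_lines)
    return out


# Made-up sorted lines; specimen "a" has two rows and "b" has one.
sample = ["Specimen_ID\tSex\n", "a\tM\n", "a\tF\n", "b\tM\n"]
result = with_group_breaks(sample)
print("".join(result))
```

The returned list can be passed straight to `f.writelines(...)` in place of the fixed-size loop.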

If you don't want to use pandas for this, you could try the csv module:

import csv
import glob
from io import StringIO
from operator import itemgetter

lines_in = []
header_line = None
for infile in glob.glob('*.txt'):
    # newline='' lets the csv module handle the line endings itself
    with open(infile, 'r', newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        first_line = next(reader)
        if header_line is None:
            header_line = first_line

        # Append all the remaining rows
        lines_in += list(reader)

# Making the assumption that Specimen_ID is always the first column
rows = sorted(lines_in, key=itemgetter(0))

# Write this out as a well-formatted, tab-delimited CSV
o_s = StringIO()
writer = csv.writer(o_s, delimiter='\t', lineterminator='\n')
writer.writerow(header_line)
writer.writerows(rows)

o_s.seek(0)  # Rewind before reading the text back
lines = o_s.readlines()

Once you have lines, you can write it to the output file with the same code I used above.

Source

2017-01-05 01:10:17 Paul
