Split a sorted file with Python on change of value

I am new to Python. My requirement, which would be simple if I had to do it with awk, is as follows.

The file (test.txt) shown below is tab-separated:

1 a b c 
1 a d e 
1 b d e 
2 a b c 
2 a d e 
3 x y z

I want the output to look like this.

File 1.txt should contain the following values:

a b c 
a d e 
b d e

File 2.txt should contain the following values:

a b c 
a d e

File 3.txt should contain the following values:

x y z

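The original file is sorted on the first column. I don't know the line numbers at which I have to split; the split has to happen wherever the value changes. With awk, I would write it like this: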

awk -F"\t" 'BEGIN {OFS="\t";} {print $2","$3","$4 > $1}' test.txt

(Performance-wise, would Python be better?)

Source

2013-08-28 user2726640

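Is performance really a problem? How long does this take? – abarnert

If this is a tab-separated file, it has only one column, because it contains no tabs. – abarnert

Also, what is your awk script supposed to create? It definitely won't create 'file1.txt' and so on. – abarnert

awk is perfect for this and should be a lot faster. Is speed really a concern? How large is your input?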

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file

Demo:

$ ls 
file 

$ cat file 
1 a b c 
1 a d e 
1 b d e 
2 a b c 
2 a d e 
3 x y z 

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file 

$ ls 
file file1 file2 file3 

$ cat file1 
a b c 
a d e 
b d e 

$ cat file2 
a b c 
a d e 

$ cat file3 
x y z

Source

2013-08-28 18:53:55

Something like this should do what you want.

import itertools as it

with open('test.txt') as in_file:
    splitted_lines = (line.split(None, 1) for line in in_file)
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        with open(num + '.txt', 'w') as out_file:
            out_file.writelines(line for _, line in group)

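The with statements make it safe to use resources. In this case they automatically close the files.
The splitted_lines = (...) line creates an iterable over the lines of the file that turns each line into a pair: the first field and the rest of the line.
The itertools.groupby function does most of the work. It iterates over the lines of the file and groups consecutive lines by their first element.
(line for _, line in group) iterates over the "split lines". It simply drops the first element and writes only the rest of the line. (The _ is an identifier like any other. I could have used x or first, but _ is commonly used for something you have to assign but never use.)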
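To make the two key steps concrete, here is a minimal sketch that runs split(None, 1) and itertools.groupby on a few in-memory lines instead of a file. The sample lines and the names pairs and grouped are made up for illustration; they only mirror the layout of test.txt from the question.

```python
import itertools as it

# Sample lines mirroring the question's tab-separated test.txt (assumed data).
lines = ["1\ta\tb\tc\n", "1\ta\td\te\n", "2\ta\tb\tc\n", "3\tx\ty\tz\n"]

# split(None, 1) splits once on the first run of whitespace:
# element 0 is the key, element 1 is the untouched rest of the line.
pairs = [line.split(None, 1) for line in lines]

# groupby only merges *adjacent* items with equal keys,
# which is why the input has to be sorted on the first column.
grouped = {key: [rest for _, rest in group]
           for key, group in it.groupby(pairs, key=lambda pair: pair[0])}

for key, rows in grouped.items():
    print(key, rows)
```

Because groupby only looks at neighbouring items, an unsorted file would produce several groups with the same key, and each later group would overwrite the output file written by the earlier one.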

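We could probably simplify the code. For example, the outermost with is unlikely to be useful, since we only open the file in read mode and do not modify it. Removing it lets us drop one level of indentation: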

import itertools as it

splitted_lines = (line.split(None, 1) for line in open('test.txt'))
for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    with open(num + '.txt', 'w') as out_file:
        out_file.writelines(line for _, line in group)

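I ran a very simple benchmark of the Python solution against the awk solution. Performance is about the same, with Python slightly faster, on a file where each line has 10 fields and there are 100 "line groups", each of a random size between 2 and 30 lines.

Timing of the Python code: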

In [22]: from random import randint
    ...:
    ...: with open('test.txt', 'w') as f:
    ...:     for count in range(1, 101):
    ...:         num_nums = randint(2, 30)
    ...:         for time in range(num_nums):
    ...:             numbers = (str(randint(-1000, 1000)) for _ in range(10))
    ...:             f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))
    ...:

In [23]: %%timeit
    ...: splitted_lines = (line.split(None, 1) for line in open('test.txt'))
    ...: for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    ...:     with open(num + '.txt', 'w') as out_file:
    ...:         out_file.writelines(line for _, line in group)
    ...:
10 loops, best of 3: 11.3 ms per loop

Timing of awk:

$ time awk '{print $2,$3,$4 > ("test"$1)}' OFS='\t' test.txt

real 0m0.014s 
user 0m0.004s 
sys  0m0.008s

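Note that 0.014 s is 14 ms.

In any case, the timings may vary with the load on the operating system, and the two solutions are effectively equally fast. In practice, almost all of the time is spent reading from and writing to files, which both Python and awk do efficiently. I believe you would not see a huge speed gain even using C.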
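For anyone who wants to repeat the measurement outside IPython, the following is a rough sketch of a standalone script. It is not the benchmark used above: the function name split_file, the temporary directory and the single time.perf_counter measurement are my own assumptions, and one run is noisier than the best-of-3 figure that %%timeit reports.

```python
import itertools as it
import os
import tempfile
import time
from random import randint


def split_file(path, out_dir):
    """Write each run of lines sharing a first field to <out_dir>/<key>.txt."""
    with open(path) as in_file:
        pairs = (line.split(None, 1) for line in in_file)
        for num, group in it.groupby(pairs, key=lambda pair: pair[0]):
            with open(os.path.join(out_dir, num + '.txt'), 'w') as out_file:
                out_file.writelines(rest for _, rest in group)


with tempfile.TemporaryDirectory() as work_dir:
    # Same data shape as the benchmark: 100 groups of 2 to 30 lines, 10 fields each.
    src = os.path.join(work_dir, 'test.txt')
    with open(src, 'w') as f:
        for count in range(1, 101):
            for _ in range(randint(2, 30)):
                numbers = (str(randint(-1000, 1000)) for _ in range(10))
                f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))

    out_dir = os.path.join(work_dir, 'out')
    os.mkdir(out_dir)

    # Time only the split itself, not data generation or interpreter start-up.
    start = time.perf_counter()
    split_file(src, out_dir)
    elapsed = time.perf_counter() - start

    n_outputs = len(os.listdir(out_dir))
    print('split into {} files in {:.1f} ms'.format(n_outputs, elapsed * 1000))
```

Running the whole script under the shell's time command would additionally include Python's start-up cost, which the in-process measurement leaves out.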

Source

2013-08-28 19:01:04 Bakuriu

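Since the OP is specifically interested in performance, there is an obvious speedup here: use 'x.split(None, 1)', so you don't need to re-'join'. Adding another genexpr that splits each line once, so that you don't have to do it twice, might also help. – abarnert

@abarnert You are right. Updated. – Bakuriu

Could you put the Python into a file and post the result of 'time python test.py'? –

My version: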

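# Append the remaining fields of each line to the file named after its first field.
for line in open('text.txt', 'r'):
    line = line.split(' ')
    doc_name = line[0]
    content = ' '.join(line[1:])

    # 'with' closes the output file after every write.
    with open('file' + doc_name, 'a') as f:
        f.write(content)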

Source

2013-08-28 19:06:06 badc0re

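If you have a really big file in mind, awk will be opening and closing files on every line to do the append, won't it? If that is a problem, then C++ has the speed and the container classes to handle an arbitrary number of open output files nicely, so that each file is opened and closed only once. This is tagged Python though, which should be about as fast, assuming I/O time dominates.

A version that avoids the extra open/close overhead in Python: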

# iosplit.py

def iosplit(ifile, ifname="", prefix=""):
    ofiles = {}
    try:
        for iline in ifile:
            tokens = [s.strip() for s in iline.split('\t')]
            if tokens and tokens[0]:
                ofname = prefix + str(tokens[0]) + ".txt"
                if ofname in ofiles:
                    ofile = ofiles[ofname]
                else:
                    ofile = open(ofname, "w+")
                    ofiles[ofname] = ofile
                ofile.write('\t'.join(tokens[1:]) + '\n')
    finally:
        for ofname in ofiles:
            ofiles[ofname].close()

if __name__ == "__main__":
    import sys
    ifname = (sys.argv + ["test.txt"])[1]
    prefix = (sys.argv + ["", ""])[2]
    iosplit(open(ifname), ifname, prefix)

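Command-line usage is python iosplit.py, optionally followed by the input file name (test.txt if omitted) and a prefix.

The prefix defaults to empty and is prepended to each output file name. The calling program supplies a file (or file-like object), so you can drive this with a StringIO object or even a list/tuple of strings.

Warning: this strips whitespace before and after the tabs in a line. Interior spaces are not touched. So "1 \ta b\t c\t d" will be converted to "a b\tc\td" when written to 1.txt.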
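The warning can be checked without writing any files. The sketch below applies the same tokenising expression that iosplit uses to the sample string; the names sample, key, fields and written are introduced here only for illustration.

```python
# The same per-line tokenisation that iosplit applies, run on the sample
# string from the warning (a tab-separated line with stray spaces).
sample = "1 \ta b\t c\t d\n"

# Split on tabs only, then strip whitespace around every field.
tokens = [s.strip() for s in sample.split('\t')]
key, fields = tokens[0], tokens[1:]
written = '\t'.join(fields) + '\n'

print(repr(key))      # the key decides the output file name
print(repr(written))  # the line that would be written to that file
```

The space inside "a b" survives because only the tab is treated as a separator, while the spaces next to the tabs are removed by strip().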

Source

2013-08-28 20:20:52
