Python: comparing strings located in the columns of two different text files


I have two text files, "animals.txt" and "colors.txt", shown below, where the 2 strings on each line are separated by a tab.

"animals.txt"

12345 dog 

23456 sheep 

34567 pig

"colors.txt"

34567 pink 

12345 black 

23456 white

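I want to write Python code that:

For each line of "animals.txt", takes the string in the first column (12345, then 23456, then 34567)
Compares this string with the strings in the first column of "colors.txt"
If a match is found (12345 == 12345, etc.), writes it to two output files:

OUTPUT1, containing the line of animals.txt plus the value in the second column of colors.txt that corresponds to the query value (12345):

12345 dog black
23456 sheep white
34567 pig pink

OUTPUT2, containing a list of the values in the second column of colors.txt that correspond to the query values (12345, then 23456, then 34567):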

black 
white 
pink

Source

2012-07-17 user1532389

What have you tried? – Dhara 2012-07-17 16:57:06

Do you need to use Python? If you are using bash, you can sort the inputs on the fly and join them:

$ join -t $'\t' <(sort animals.txt) <(sort colors.txt) > output1 
$ cut -f 3 output1 > output2

If your shell does not support process substitution, sort the input files first and then run:

$ join -t '<tab>' animals.txt colors.txt > output1 
$ cut -f 3 output1 > output2

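Here <tab> is an actual tab character. Depending on your shell, you may be able to enter it with Ctrl-V followed by the Tab key. (Or use a different delimiter for cut.)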

Source

2012-07-17 17:03:23

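Your sorting is off: "animals.txt" is already sorted, only "colors.txt" needs sorting. Note that in bash you can write $'\t' to denote a tab. Since only one file needs sorting, you can run: sort colors.txt | join -t $'\t' animals.txt - – 2012-07-17 17:09:58

@sven Thanks for pointing out $'\t'. – 2012-07-17 17:21:52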

If the order doesn't matter, this becomes a pretty simple problem:

with open('animals.txt') as f1, open('colors.txt') as f2:
    animals = {}
    for line in f1:
        # strip() removes the trailing newline so it does not end up in the value.
        animal_id, animal_type = line.strip().split('\t')
        animals[animal_id] = animal_type

    # animals = dict(map(str.split, f1)) would work instead of the above loop if there are no multi-word entries.

    colors = {}
    for line in f2:
        color_id, color_name = line.strip().split('\t')
        colors[color_id] = color_name

    # colors = dict(map(str.split, f2)) would work instead of the above loop if there are no multi-word entries.
    # Thanks @Sven for pointing this out.

common = set(animals) & set(colors)  # set intersection: ids present in both files
with open('output1.txt', 'w') as f1, open('output2.txt', 'w') as f2:
    for i in common:  # iterate over sorted(common, key=int) instead to get sorted output
        f1.write("%s\t%s\t%s\n" % (i, animals[i], colors[i]))
        f2.write("%s\n" % colors[i])

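You could perhaps do this a bit more elegantly with a defaultdict, appending to a list whenever you encounter a given key and then checking that the list has length 2 before writing the output, but I'm not convinced that approach is any better.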
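For reference, the defaultdict idea could look roughly like the sketch below. It is only an illustration, not the answerer's code: the helper name join_by_id is made up, the input lines are inlined as lists instead of being read from the files, and it assumes each id appears at most once per file (a duplicate id within one file would throw off the length check).

```python
from collections import defaultdict


def join_by_id(*line_sources):
    """Group the second column of each tab-separated source by its first column.

    join_by_id is a hypothetical helper name used only for this sketch.
    """
    groups = defaultdict(list)
    for lines in line_sources:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, value = line.split('\t', 1)
            groups[key].append(value)
    # Keep only ids that collected one value per source (length 2 for two files).
    return {k: v for k, v in groups.items() if len(v) == len(line_sources)}


# Inlined sample data; in practice these would be the open file objects.
animal_lines = ["12345\tdog\n", "23456\tsheep\n", "34567\tpig\n"]
color_lines = ["34567\tpink\n", "12345\tblack\n", "23456\twhite\n"]

joined = join_by_id(animal_lines, color_lines)
for key in sorted(joined, key=int):
    print("\t".join([key] + joined[key]))
```

Passing the animals first means each list holds the animal followed by the color, which matches the column order wanted in OUTPUT1.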

Source

2012-07-17 17:21:39 mgilson

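You could also do 'animals = dict(map(str.split, f1))'. – 2012-07-17 17:38:57

@SvenMarnach - Good point. For some reason I don't tend to create dictionaries that way very often. One caveat is that it's a bit fragile when it comes to animals with spaces in their names (e.g. "brown spotted lizard"). My original version (which used a bare 'split') had a similar problem. I've updated. – mgilson 2012-07-17 17:46:31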
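To make the caveat concrete, here is a small illustration of how a bare split behaves on a multi-word entry. The id 45678 is invented for the example; only the animal name comes from the comment above.

```python
# Hypothetical input line: the id 45678 is made up for illustration.
line = "45678\tbrown spotted lizard\n"

# A bare split() breaks on every run of whitespace, including the spaces in the name,
# so dict(map(str.split, f)) would see four fields here instead of two.
bare_fields = line.split()

# Splitting on the tab keeps the name intact once the newline is stripped.
tab_fields = line.rstrip("\n").split("\t")

print(bare_fields)
print(tab_fields)
```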

The following assumes that every line of the input files is structured exactly as in the example:

with open("c:\\python27\\output1.txt", "w") as out1, \
        open("c:\\python27\\output2.txt", "w") as out2:

    # Nested comprehension: pair every animal row with every color row that has the same id.
    for outline in [animal[0] + "\t" + animal[1] + "\t" + color[1]
                    for animal in [line.strip('\n').split("\t")
                                   for line in open("c:\\python27\\animals.txt", "r").readlines()]
                    for color in [line.strip('\n').split("\t")
                                  for line in open("c:\\python27\\colors.txt", "r").readlines()]
                    if animal[0] == color[0]]:

        out1.write(outline + '\n')
        # Everything after the last tab is the color.
        out2.write(outline[outline.rfind('\t') + 1:] + '\n')

I think this will do it for you.

Perhaps not the most elegant/fast/clear approach, but it is short. Technically, I believe it is 4 lines.
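If speed or clarity matters more than brevity, one alternative (a sketch, not the answerer's code) is to load the colors into a dict once and then walk the animals in file order, which avoids comparing every animal row against every color row. The function name match_rows is made up for this example, and the input is inlined rather than read from disk. Unlike the nested comprehension, a repeated id in the colors keeps only its last color.

```python
def match_rows(animal_lines, color_lines):
    """Yield (id, animal, color) in animals order for ids present in both inputs.

    match_rows is a hypothetical helper name used only for this sketch.
    """
    colors = {}
    for line in color_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            colors[fields[0]] = fields[1].strip()
    for line in animal_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] in colors:
            yield fields[0], fields[1].strip(), colors[fields[0]]


# Inlined sample data; in practice these would be the open file objects.
animal_lines = ["12345\tdog\n", "23456\tsheep\n", "34567\tpig\n"]
color_lines = ["34567\tpink\n", "12345\tblack\n", "23456\twhite\n"]

rows = list(match_rows(animal_lines, color_lines))
output1 = "".join("%s\t%s\t%s\n" % row for row in rows)  # contents for OUTPUT1
output2 = "".join("%s\n" % row[2] for row in rows)       # contents for OUTPUT2
print(output1, end="")
print(output2, end="")
```

Because the animals are read in order, the output follows the order of animals.txt without any sorting step.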

Source

2012-07-17 18:08:37 selllikesybok

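I would use pandas

from pandas import read_table

# Assumes each file starts with a header row (id/animal and id/color).
animals, colors = read_table('animals.txt', index_col=0), read_table('colors.txt', index_col=0)
df = animals.join(colors)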

Result:

animals.join(colors)
Out[73]:
      animal  color
id
12345    dog  black
23456  sheep  white
34567    pig   pink

Then write the colors to a file, in ID order:

df.color.to_csv(r'out.csv', index=False)

If you can't add column headers to the text files, you can add them on import:

animals = read_table('animals.txt', index_col=0, names=['id','animal'])

Source

2012-07-23 02:04:26 mrjoh3
