Building a dictionary mapping from multiple text files

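I have multiple *.txt files containing IDs and values, and I want to build a single dictionary from them. However, some IDs are repeated across files, and for those IDs I want the values CONCATENATED. Here is an example with two files (I have a whole bunch of files, so I think I need glob.glob). Note that all the 'values' within a given file have the same length, so when a value is missing I can add '-' repeated len(value) times.

File 1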

ID01 
Hi 
ID02 
my 
ID03 
ni

File 2

ID02 
name 
ID04 
meet 
ID05 
your

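Desired output (note that when an ID does not appear in a file, I want to pad with 'Na' or '-' to the same length as len(value)). This is the output I want: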

ID01 
Hi---- 
ID02 
myname 
ID03 
ni---- 
ID04 
--meet 
ID05 
--your

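I just want to store the output in a dictionary. Also, I guess that if I print each file name as I open it, I can tell the order in which the files are opened, right?

This is what I have (so far I cannot concatenate my values):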

output={} 
list = [] 
for file in glob.glob('*.txt'):   
    FI = open(file,'r') 
    for line in FI.readlines(): 
     if (line[0]=='I'):  #I am interested in storing only the ones that start with I, for a future analysis. I know this can be done separating key and value with '\t'. Also, I am sure the next lines (values) does not start with 'I' 
      ID = line.rstrip() 
      output[ID] = '' 
      if ID not in list: 
       list.append(ID)  
     else: 
      output[ID] = output[ID] + line.rstrip() 

    if seqs_name in list: 
     seqs[seqs_name] += seqs[seqs_name] 

    print (file) 
    FI.close() 


print ('This is your final list: ') 
print (list) #so far, I am getting the right final list, with no repetitive ID 
print (output) #PROBLEM: the repetitive ID, is being concatenated twice the 'value' in the last file read.

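Also, how can I add '-' when an ID is not repeated? I would really appreciate your help.

Summary: I cannot concatenate values when a key is repeated in another file. If a key is not repeated, I want to add '-', so that later I can print the file names and know in which file a given ID has no value.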

Source

2017-08-10 gusa10

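A few problems with your existing code:

line[0] == 'ID': line[0] returns a single character, so this comparison is always false. Use str.startswith(xxx) instead to check whether a string starts with xxx.
You are not retrieving the text that follows the ID correctly. The simplest way is to call next(f).
You don't need the second list. Also, don't name a variable list, since that shadows the built-in.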

import collections
import glob

output = collections.defaultdict(str)
for file in glob.glob('*.txt'):
    with open(file, 'r') as f:
        for line in f:
            if line.startswith('ID'):
                try:
                    # the value is on the line right after the ID
                    text = next(f)
                    output[line.strip()] += text.strip() + ' '
                except StopIteration:
                    # an ID on the last line with no value after it
                    pass

print(output)

It never hurts to catch the odd exception, hence the try-except.

Source
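As a small illustration of the first point in the list above, and of the end-of-file case the try-except guards against, here is a minimal sketch. The helper name read_pairs is made up for this example, and io.StringIO only stands in for a real file. It relies on the optional default argument of next(), which is returned when the iterator is exhausted; this is an alternative to catching StopIteration, not what the code above does.

```python
import io

def read_pairs(f):
    """Yield (id_line, value_line) pairs from an open text stream.

    Hypothetical helper for illustration only.
    """
    for line in f:
        if line.startswith("ID"):
            # next() with a default returns "" at end of file
            # instead of raising StopIteration
            yield line.strip(), next(f, "").strip()

# line[0] is a single character, so it can never equal the two-character 'ID'
first_char_check = "ID01\n"[0] == "ID"
prefix_check = "ID01\n".startswith("ID")

# The last ID has no value line after it
pairs = list(read_pairs(io.StringIO("ID01\nHi\nID02\n")))
print(first_char_check, prefix_check, pairs)
```

Whether an empty string is the right stand-in for a missing value depends on what you do with the dictionary afterwards.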

2017-08-10 07:08:36

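OK, your new version works :). Thank you very much! – gusa10

What if I want to add '-' or 'Na' when no value is concatenated? – gusa10

@gusa10 One question per thread, please ;) You might consider marking this as accepted if it helped. As for adding Na, you would have to fetch the text and then check whether that text also starts with ID. If it does, the actual text is missing. –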
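For readers who land here with the same follow-up, one possible way to get the padded output from the question is sketched below. It is not from the answer above. The function names read_file and merge are made up for this example, and it assumes, as the question states, that all values within one file have the same length. The filler must be a single character because str.ljust requires that, so '-' works but 'Na' would need different handling.

```python
def read_file(path):
    """Return {ID: value} for one file of alternating ID / value lines."""
    values = {}
    current_id = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("ID"):
                # An ID directly followed by another ID keeps an empty value
                current_id = line
                values[current_id] = ""
            elif current_id is not None and line:
                values[current_id] += line
    return values

def merge(paths, fill="-"):
    """Concatenate values per ID in the order of paths, padding gaps with fill."""
    per_file = [read_file(p) for p in paths]
    all_ids = sorted(set().union(*per_file))
    merged = {}
    for values in per_file:
        # Width of this file's column: the longest value found in the file
        width = max((len(v) for v in values.values()), default=0)
        for key in all_ids:
            merged[key] = merged.get(key, "") + values.get(key, "").ljust(width, fill)
    return merged
```

A call such as merge(sorted(glob.glob('*.txt'))) would process the files in sorted name order. That also covers the ordering question, since glob.glob itself does not promise any particular order.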
