Reading a file and replacing text based on data from a second file

I have a file that contains sentences, delimited by BOS (beginning of sentence) and EOS (end of sentence) markers:

BOS 1 
1 word \t\t word \t word \t\t word \t 123 
1 word \t\t word \t word \t\t word \t 234 
1 word \t\t word \t word \t\t word \t 567 
EOS 1 

BOS 2 
2 word \t\t word \t word \t\t word \t 456 
2 word \t\t word \t word \t\t word \t 789 
EOS 2

And a second file, in which the first number on each line is the sentence number:

1, 123, 567 
2, 789

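I want to read both files and check whether the number at the end of each line of the first file appears in the second file. If it does, I want to change only the fourth word of that line in the first file. So the expected output is: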

BOS 1 
1 word \t\t word \t word \t\t NEW_WORD \t 123 
1 word \t\t word \t word \t\t word \t 234 
1 word \t\t word \t word \t\t NEW_WORD \t 567 
EOS 1 

BOS 2 
2 word \t\t word \t word \t\t word \t 456 
2 word \t\t word \t word \t\t NEW_WORD \t 789 
EOS 2

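First, I don't know how to read the two files, since they have different numbers of lines. Second, I don't know how to iterate over the lines of, say, the first sentence in the first file while at the same time iterating over the values in the first line of the second file to compare them. This is what I have so far: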

import itertools
import re
import sys

def readText(filename1, filename2):
    data1 = open(filename1).readlines() # the first file
    data2 = open(filename2).readlines() # the second one

    list2 = [] # a list to store the values of the second file

    for line1, line2 in itertools.izip(data1, data2):
        l1 = line1.split()
        l2 = line2.split(', ')

        find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number

        for l in l2:
            list2.append(l)

        for match in find:
            m = match.split() # split the lines of the first file

            if (m[0] == list2[0]): # for the same sentence number in the two files
                result = re.sub(r'(.*)word\t%s' %m[5], r'\1NEW_WORD\t%s' %m[5],line1)

if len(sys.argv)==3:
    lines = readText(sys.argv[1], sys.argv[2])
else:
    print("file.py inputfile1 inputfile2")

Thanks in advance for your help!

Asked 2017-01-04 by isa

Please fix your indentation. Also, are the tabs in the input file actual tab characters, or the literal two-character sequence '\t'?

\t is an actual tab character. – isa

What counts as a line and what counts as a sentence? Is it the lines that end with '\n', or the sentences?

For reference, I named the first file source.txt, the second file control.txt, and the output result.txt.
Here is the skeleton of the program.

[modify_line(line) if line[0].isdigit() else line for line in source]

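This code passes each line through either unchanged or modified. If a line starts with a digit, it is passed to modify_line, which returns either the modified line or the original line, depending on the line it receives and on input taken from control.txt.
modify_line has to get data from control.txt in order to check and modify each line passed to it. That data is the line's starting number and its ending numbers, e.g. [1, (123, 567)]. If the starting number matches and one of the ending numbers matches, the line is changed. If the starting number doesn't match, the next entry is read from the control file; this is safe because modify_line is only ever passed lines that start with a digit.
To keep state between calls, I used a closure here.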
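The matching rule described above can be sketched in isolation. This is only an illustration under assumptions: the helper names parse_control_line and should_change are hypothetical and are not part of the program, and the check compares the first and last whitespace-separated fields of a source line.

```python
def parse_control_line(line):
    """Split a control line such as '1, 123, 567' into ('1', ['123', '567']).

    Hypothetical helper, assuming comma-separated values with the
    sentence number first.
    """
    line_number, rest = line.split(',', 1)
    return line_number.strip(), [end.strip() for end in rest.split(',')]


def should_change(source_line, line_number, ends):
    """Return True if the line belongs to the sentence and ends with a listed number."""
    fields = source_line.split()
    return bool(fields) and fields[0] == line_number and fields[-1] in ends
```

Values are kept as strings throughout, so no integer conversion is needed before comparing them with the fields of a source line.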

import re

def create_line_modification_function(fp, replacement_word):

    def get_line_number_and_end_numbers():
        # yield (line_number, ends) for every non-blank line of the control file
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here; \s* tolerates trailing whitespace
    pattern = re.compile(r'(.*)word(.*\d{3}\s*$)')

    def modify_line(line):
        while True:
            # for convenience unpack the list
            line_number, ends = line_number_and_ends
            fields = line.split()
            # compare whole fields, so that '1' does not match a line starting with '10'
            if fields[0] == line_number:
                if fields[-1] in ends:
                    return pattern.sub(r'\g<1>{}\2'.format(replacement_word), line)
                return line
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1] = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line

if __name__ == '__main__':

    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
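For comparison, here is a sketch of a different technique, not the closure approach above: if the control file is small enough to hold in memory, it can be loaded into a dictionary first, so the two files never need to be read in step and their different line counts stop mattering. The names load_control, replace_words and FOURTH_WORD are hypothetical, and the sketch assumes that the word to replace is always the last field before the trailing number.

```python
import re


def load_control(lines):
    """Map sentence number -> set of end numbers, e.g. {'1': {'123', '567'}}."""
    control = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(',')]
        control.setdefault(fields[0], set()).update(fields[1:])
    return control


# The last word of a line, followed only by whitespace and the final number.
FOURTH_WORD = re.compile(r'\S+(\s+\d+\s*)$')


def replace_words(source_lines, control, replacement):
    """Yield each source line, with the word before a listed end number replaced."""
    for line in source_lines:
        fields = line.split()
        if fields and fields[0].isdigit() and fields[-1] in control.get(fields[0], ()):
            # A function replacement avoids backslash handling in the new word.
            line = FOURTH_WORD.sub(lambda m: replacement + m.group(1), line)
        yield line
```

With the file names used above, this could be driven by target.writelines(replace_words(source, load_control(ctl), 'NEW_WORD')) inside the same with open(...) block. Unlike the closure version, sentences that have no entry in control.txt simply pass through unchanged instead of raising ValueError.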

Answered 2017-01-04 21:18:41

Awesome! Thank you! – isa
