2017-01-04 19 views
0

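Read and replace in a file, taking into account data from a second file

I have a file like this, containing sentences marked with BOS (beginning of sentence) and EOS (end of sentence):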

BOS 1 
1 word \t\t word \t word \t\t word \t 123 
1 word \t\t word \t word \t\t word \t 234 
1 word \t\t word \t word \t\t word \t 567 
EOS 1 

BOS 2 
2 word \t\t word \t word \t\t word \t 456 
2 word \t\t word \t word \t\t word \t 789 
EOS 2 

And a second file, in which the first number on each line is the sentence number:

1, 123, 567 
2, 789 

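What I want is to read both files and check whether the number at the end of each line of the first file appears in the second file. If it does, I want to change only the fourth word in that line of the first file. So the expected output is: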

BOS 1 
1 word \t\t word \t word \t\t NEW_WORD \t 123 
1 word \t\t word \t word \t\t word \t 234 
1 word \t\t word \t word \t\t NEW_WORD \t 567 
EOS 1 

BOS 2 
2 word \t\t word \t word \t\t word \t 456 
2 word \t\t word \t word \t\t NEW_WORD \t 789 
EOS 2 

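First, I don't know how to read the two files together, since they have different numbers of lines. Second, I don't know how to iterate over the lines of, say, the first sentence in the first file while at the same time iterating over the values on the first line of the second file to compare them. This is what I have so far: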

import re
import sys

def readText(filename1, filename2):
    data1 = open(filename1).readlines() # the first file
    data2 = open(filename2).readlines() # the second one
    list2 = [] # a list to store the values of the second file

    # zip pairs the lines one to one and stops at the shorter file
    for line1, line2 in zip(data1, data2):
        l1 = line1.split()
        l2 = line2.split(', ')

        find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number

        for l in l2:
            list2.append(l)

        for match in find:
            m = match.split() # split the lines of the first file

            if (m[0] == list2[0]): # for the same sentence number in the two files
                result = re.sub(r'(.*)word\t%s' %m[5], r'\1NEW_WORD\t%s' %m[5], line1)

if len(sys.argv)==3:
    lines = readText(sys.argv[1], sys.argv[2])
else:
    print("file.py inputfile1 inputfile2")

Thanks in advance for your help!

+0

Please fix your indentation. Also, are the '\t' in the input file actual tab characters, or just the literal text '\t'? –

+0

The \t are actual tab characters – isa

+0

What counts as a line and what counts as a sentence? Is it the lines that end with '\n', or the sentences? –

Answers

0

For reference, I named the first file source.txt, the second file control.txt, and the output result.txt.
This is the skeleton of the program.

[modify_line(line) if line[0].isdigit() else line for line in source] 

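This code passes each line through either unchanged or modified. If a line starts with a digit, it is passed to modify_line, which returns either the modified line or the original line, depending on the line it receives and on some input taken from control.txt.
modify_line has to get data from control.txt in order to check and modify each line passed to it. The data consists of a line's starting number and its ending numbers, e.g. [1, (123, 567)]. If the starting number matches and one of the ending numbers matches, the line is changed. If the starting number does not match, the next starting number is read from the control file, because modify_line is only ever passed lines that start with a digit.
To keep that state between calls, I used a closure here.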
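To make the control data concrete, here is a minimal sketch of how one line of control.txt could be turned into a starting number plus a list of ending numbers, and how a source line could be checked against them. The names parse_control_line and should_modify are hypothetical, chosen only for this illustration, and the check compares whole whitespace-separated fields, which is an assumption that is slightly stricter than a prefix/suffix test.

```python
def parse_control_line(line):
    """Split a control line such as '1, 123, 567' into (sentence_number, end_numbers)."""
    sentence_number, rest = line.split(',', 1)
    end_numbers = [end.strip() for end in rest.split(',')]
    return sentence_number.strip(), end_numbers


def should_modify(source_line, sentence_number, end_numbers):
    """Return True when the first and last fields of source_line match the control data."""
    fields = source_line.split()
    return bool(fields) and fields[0] == sentence_number and fields[-1] in end_numbers


number, ends = parse_control_line('1, 123, 567\n')
print(number, ends)
# The line ending in 123 is listed in the control data, the one ending in 234 is not.
print(should_modify('1 word \t\t word \t word \t\t word \t 123\n', number, ends))
print(should_modify('1 word \t\t word \t word \t\t word \t 234\n', number, ends))
```

Values are kept as strings throughout, so no integer conversion is needed when comparing against the text of the source line.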

import re

def create_line_modification_function(fp, replacement_word):

    def get_line_number_and_end_numbers():
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here
    pattern = re.compile(r'(.*)word(.*\d{3}$)')

    def modify_line(line):
        while True:
            # for convenience unpack the list
            line_number, ends = line_number_and_ends
            if line.startswith(line_number):
                for end in ends:
                    if line.rstrip().endswith(end):
                        return pattern.sub(r'\1{}\2'.format(replacement_word), line)
                return line
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1] = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line

if __name__ == '__main__':

    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
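
If the control file is small enough to hold in memory, an alternative sketch (not the closure approach above) is to load it into a dictionary first, so the two files no longer have to be read in step and the control file may skip sentence numbers. The names load_control, replace_marked and LAST_WORD are hypothetical, the code assumes Python 3, and it assumes the word to replace is the last field before the trailing number.

```python
import re

# Matches the last non-whitespace field that sits right before the trailing number.
LAST_WORD = re.compile(r'\S+(?=\s+\d+\s*$)')


def load_control(lines):
    """Map each sentence number to the set of end numbers listed for it."""
    control = {}
    for line in lines:
        if line.strip():
            number, *ends = [part.strip() for part in line.split(',')]
            control.setdefault(number, set()).update(ends)
    return control


def replace_marked(lines, control, replacement):
    """Yield every line, replacing the word before the final number when the control data lists it."""
    for line in lines:
        fields = line.split()
        if len(fields) > 2 and fields[-1] in control.get(fields[0], ()):
            # A function is used as the replacement so backslashes in it are taken literally.
            line = LAST_WORD.sub(lambda match: replacement, line, count=1)
        yield line


# Small in-memory example; with real files, pass the open file objects instead.
source = [
    'BOS 1\n',
    '1 word \t\t word \t word \t\t word \t 123\n',
    '1 word \t\t word \t word \t\t word \t 234\n',
    'EOS 1\n',
]
control = load_control(['1, 123, 567\n', '2, 789\n'])
result = list(replace_marked(source, control, 'NEW_WORD'))
print(''.join(result), end='')
```

Lines such as BOS 1 and EOS 1 pass through unchanged because their first field is never a key in the control dictionary.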
+0

Awesome! Thank you! – isa