2012-05-23 88 views

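Python: comparing files with similar names (recursively)

With the help of these guys, I was able to produce the following code, which reads in two files (i.e. SA1.WRD and SA1.PHN), merges them, and compares the result against a sublist of words culled from a dictionary: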

import sys
import os
import re
import itertools

#generator function to merge sound and word files
def takeuntil(iterable, stop):
    for x in iterable:
        yield x
        if x[1] == stop:
            break

#open a dictionary file and create subset of words
class_defintion = re.compile('([1-2] [lnr] t en|[1-2] t en)')
with open('TIMITDIC.TXT') as w_list:
    entries = (line.split(' ', 1) for line in w_list)
    comp_set = [ x[0] for x in entries if class_defintion.search(x[1]) ]

#open word and sound files
total_words = 0
with open(sys.argv[1]) as unsplit_words, open(sys.argv[2]) as unsplit_sounds:
    sounds = (line.split() for line in unsplit_sounds)
    words = (line.split() for line in unsplit_words)
    output = [
        (word, " ".join(sound for _, _, sound in
                        takeuntil(sounds, stop)))
        for start, stop, word in words
    ]
for x in output:
    total_words += 1

#extract words from above into list of words in dictionary set
glottal_environments = [ x for x in output if x[0] in comp_set ]
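
To make the merge step easier to follow, here is a minimal sketch of how takeuntil groups phone rows under each word. The sample rows are made up for illustration and only assume the "start end label" layout that the script's line.split() calls imply; they are not taken from the real files.

```python
def takeuntil(iterable, stop):
    """Yield items up to and including the first whose end time equals stop."""
    for x in iterable:
        yield x
        if x[1] == stop:
            break

# Made-up sample rows in the assumed "start end label" layout.
word_lines = ["0 10 she", "10 25 had"]
phone_lines = ["0 4 sh", "4 10 iy", "10 14 hv", "14 20 ae", "20 25 d"]

words = (line.split() for line in word_lines)
sounds = (line.split() for line in phone_lines)

# The sounds generator is shared across words, so each call to takeuntil
# resumes where the previous word's phones ended.
merged = [
    (word, " ".join(sound for _, _, sound in takeuntil(sounds, stop)))
    for start, stop, word in words
]
print(merged)
```

Because sounds is a single generator, the approach relies on each word's end time matching the end time of its last phone.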

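I am trying to modify the part after #open a dictionary files so that it runs over a large directory with several subdirectories. Each subdirectory contains .txt, .wav, .wrd and .phn files. I only want to open the .wrd and .phn files, and I want to open them two at a time, and only when their base filenames match, i.e. SA1.WRD and SA1.PHN, not SA1.WRD and SI997.PHN.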

My immediate guess is to do something like this:

for root, dir, files in os.walk(sys.argv[1]): 
    words = [f for f in files if f.endswith('.WRD')] 
    phones = [f for f in files if f.endswith('.PHN')] 
    phones.sort() 
    words.sort() 
    files = zip(words, phones) 

which would return: [('SA1.WRD', 'SA1.PHN'), ('SA2.WRD', 'SA2.PHN'), ('SI997.WRD', 'SI997.PHN')]
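
Sorting and zipping the two lists pairs files correctly only when every .WRD file has a matching .PHN file. As a hedged alternative, the sketch below pairs files explicitly by base name using os.path.splitext. The helper name pair_by_stem and the sample filenames are hypothetical, introduced only for illustration.

```python
import os

def pair_by_stem(filenames):
    """Return sorted (word_file, phone_file) pairs that share a base name."""
    words, phones = {}, {}
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext.upper() == ".WRD":
            words[stem] = name
        elif ext.upper() == ".PHN":
            phones[stem] = name
    # Keep only stems that have both a word file and a phone file.
    return [(words[s], phones[s]) for s in sorted(words) if s in phones]

# Illustrative names: SA2 has no .PHN partner, so it is skipped.
names = ["SA1.WRD", "SI997.PHN", "SA1.PHN", "SA2.WRD", "SA1.WAV", "SI997.WRD"]
pairs = pair_by_stem(names)
print(pairs)
```

Keying on the base name means an orphaned file is dropped instead of silently shifting every later pair out of alignment.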

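My first question is whether I am on the right track, and if so, my second question is how I can go about treating each item in these tuples as a filename to read.

Thanks for any help you can offer.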

EDIT

I figured I could put the code block into a for loop:

for f in files:
    #OPEN THE WORD AND PHONE FILES, COMPARE THEM (TAKE A WORD COUNT)
    total_words = 0
    with open(f[0]) as unsplit_words, open(f[1]) as unsplit_sounds:

    ...

However, this results in an IOError, presumably because of the single quotes around each item in each tuple.
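
The single quotes are only how Python displays strings inside a list; they are not part of the filenames. A more likely cause (an assumption, since the full traceback is not shown) is that os.walk yields bare filenames, so open() looks for them in the current working directory instead of in root. The sketch below builds a throwaway directory with illustrative names to show the difference.

```python
import os
import tempfile

# Build a throwaway tree: <top>/sub/SA1.WRD (names are illustrative only).
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, "sub"))
with open(os.path.join(top, "sub", "SA1.WRD"), "w") as fh:
    fh.write("0 10 she\n")

for root, dirs, files in os.walk(top):
    for name in files:
        # name is just "SA1.WRD"; open(name) would search the current
        # working directory, so join it with root first.
        full_path = os.path.join(root, name)
        with open(full_path) as fh:
            first_line = fh.readline().strip()
        print(name, "->", full_path, "->", first_line)
```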

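UPDATE: I modified my original script to include os.path.join(root, f), as described below. The script now walks all the files in the directory tree, but it only processes the last two files it finds. Here is the output of print files: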

[] 
[('test/test1/SI997.WRD', 'test/test1/SI997.PHN')] 
[('test/test2/SI997.WRD', 'test/test2/SI997.PHN')] 
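
The output above suggests that files is rebound on every pass of the os.walk loop, so any code that runs after the loop only sees the pairs from the last directory visited. One possible fix, sketched here with a hypothetical helper named collect_pairs and an illustrative temporary tree, is to accumulate the pairs from every directory before processing them.

```python
import os
import tempfile

def collect_pairs(top):
    """Gather (word_path, phone_path) pairs from every directory under top."""
    pairs = []
    for root, dirs, files in os.walk(top):
        words = sorted(os.path.join(root, f) for f in files if f.endswith(".WRD"))
        phones = sorted(os.path.join(root, f) for f in files if f.endswith(".PHN"))
        # extend rather than reassign, so earlier directories are kept
        pairs.extend(zip(words, phones))
    return pairs

# Illustrative tree with two subdirectories, mirroring the output above.
top = tempfile.mkdtemp()
for sub in ("test1", "test2"):
    os.mkdir(os.path.join(top, sub))
    for ext in (".WRD", ".PHN"):
        open(os.path.join(top, sub, "SI997" + ext), "w").close()

pairs = collect_pairs(top)
print(len(pairs))
```

Moving the per-pair processing inside the os.walk loop would work equally well; collecting first simply keeps the walking and the processing separate.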
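Your solution only works if there is a perfect match between the .wrd files and the .phn files. Is it possible that there will be .wrd files with no corresponding .phn file, or vice versa? If so, you need to rethink your approach. – happydave

No, in each directory there is a .wrd file and a corresponding .phn file. No orphans. – UWLinguist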

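Answers

I have tested different parts of this on my file system, but it is easier for you to run it on the actual files to confirm that it works on your data.

Edited to allow path names to be included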

import sys 
import os 
import os.path 
import re 
import itertools 

#generator function to merge sound and word files
def takeuntil(iterable, stop):
    for x in iterable:
        yield x
        if x[1] == stop:
            break

def process_words_and_sounds(word_file, sound_file):
    #open word and sound files
    total_words = 0
    with open(word_file) as unsplit_words, open(sound_file) as unsplit_sounds:
        sounds = (line.split() for line in unsplit_sounds)
        words = (line.split() for line in unsplit_words)
        output = [
            (word, " ".join(sound for _, _, sound in
                            takeuntil(sounds, stop)))
            for start, stop, word in words
        ]
        for x in output:
            total_words += 1
    return total_words, output

#collect (word_file, sound_file) pairs from every directory in the tree
file_pairs = []
for root, dirs, files in os.walk(sys.argv[1]):
    words = [ os.path.join(root, f) for f in files if f.endswith('.WRD')]
    phones = [ os.path.join(root, f) for f in files if f.endswith('.PHN')]
    phones.sort()
    words.sort()
    #extend instead of rebinding files, so earlier directories are not lost
    file_pairs.extend(zip(words, phones))

output = []
total_words = 0
for word_file, sound_file in file_pairs:
    word_count, output_subset = process_words_and_sounds(word_file, sound_file)
    total_words += word_count
    output.extend(output_subset)

#open a dictionary file and create subset of words 
class_defintion = re.compile('([1-2] [lnr] t en|[1-2] t en)') 
with open('TIMITDIC.TXT') as w_list: 
    entries = (line.split(' ', 1) for line in w_list) 
    comp_set = [ x[0] for x in entries if class_defintion.search(x[1]) ] 

#extract words from above into list of words in dictionary set 
glottal_environments = [ x for x in output if x[0] in comp_set ] 
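
A small optional refinement, offered as a sketch: comp_set is built as a list, so each membership test scans it from the start. Converting it to a set gives the same filtered result with hash-based lookups. The words and phone strings below are made-up stand-ins, not real dictionary entries.

```python
# Made-up stand-ins for the dictionary subset and the merged output.
comp_list = ["button", "mountain", "curtain"]
output = [("she", "sh iy"), ("button", "b ah q en"), ("curtain", "k er q en")]

comp_set = set(comp_list)  # hash-based lookup instead of a linear scan
glottal_environments = [pair for pair in output if pair[0] in comp_set]
print(glottal_environments)
```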
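
Thanks! However, as per my edit above, I still get the IOError; it cannot find the file in the directory I am trying to run the script from. (IOError: [Errno 2] No such file or directory: 'SA1.PHN') – UWLinguist

Thanks! I will keep working on it in the morning. – UWLinguist

Have you tried the new version? I added code to keep the full path name with the word/sound tuples. That was the cause of the IOError. – gauden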
