Bigrams from a large number of txt files in Python

I have a corpus of 70,429 files (296.5 MB). I am trying to find bigrams using the whole corpus. I wrote the following code:

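import codecs
import os

import nltk
from nltk.collocations import BigramCollocationFinder

allFiles = ""
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subPath):
            # os.listdir returns bare names, so build the full path before opening
            with codecs.open(os.path.join(subPath, fileN), encoding="iso8859-9") as FText:
                allFiles += FText.read()
tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)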

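There is a root directory. It contains subdirectories, and each subdirectory contains a large number of files. What I do is:

I read all the files and append their contents to a string named allFiles. Finally, I split the string into tokens and call the relevant bigram functions. The problem is:

I ran the program for a day and did not get any results. Is there a more efficient way to find bigrams in a corpus that contains a large number of files?

Any comments and suggestions would be greatly appreciated. Thanks in advance.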

Source

2016-03-13 yns

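One thing to try is to process each file during the directory traversal in the loop and store the output of BigramCollocationFinder as you go. It might be pretty intensive, but could it be faster? – avip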
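The per-file idea in this comment can be sketched without NLTK. This is only a minimal sketch under assumptions: it tokenizes on whitespace (as the question's code does), counts plain adjacent pairs rather than using BigramCollocationFinder, and the names iter_file_paths and count_bigrams_streaming are hypothetical helpers made up for illustration.

```python
import os
from collections import Counter


def iter_file_paths(root_directory):
    """Yield the path of every file below root_directory, one at a time."""
    for dir_path, _dir_names, file_names in os.walk(root_directory):
        for file_name in sorted(file_names):
            yield os.path.join(dir_path, file_name)


def count_bigrams_streaming(root_directory, encoding="iso8859-9"):
    """Count adjacent word pairs file by file; only one file's text is held at a time."""
    counts = Counter()
    for path in iter_file_paths(root_directory):
        with open(path, encoding=encoding) as handle:
            tokens = handle.read().split()
        # zip pairs each token with its successor inside this file only
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Because the counter is updated after each file, memory use grows with the number of distinct pairs, not with the total size of the corpus text.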

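By trying to read a huge corpus into memory all at once, you are blowing out your memory, forcing a lot of swap use, and slowing everything down.

The NLTK provides various "corpus readers" that can return your words one by one, so that the complete corpus is never stored in memory at the same time.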

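from nltk.collocations import BigramCollocationFinder
from nltk.corpus.reader import PlaintextCorpusReader

# fileids is a regular expression (not a glob), matched against paths relative to the root
reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
finder.apply_freq_filter(2)  # Continue processing as before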
...

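Addendum: If I understand your corpus layout right, the reader above might work, but your approach has a flaw: you are collecting word pairs that span from the end of one document to the beginning of the next... that's rubbish you want to get rid of. I recommend the following variant, which collects them from each document separately.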

document_streams = (reader.words(fname) for fname in reader.fileids()) 
BigramCollocationFinder.default_ws = 3 
finder = BigramCollocationFinder.from_documents(document_streams)
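To make the document-boundary point concrete, here is a small pure-Python illustration. It is a sketch, not NLTK's implementation: it assumes that a window of 3 means each word is paired with each of the two words that follow it, and windowed_pairs and count_pairs_per_document are hypothetical names.

```python
from collections import Counter


def windowed_pairs(tokens, window_size=3):
    """Yield (w1, w2) for every w2 among the window_size - 1 tokens after w1."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window_size]:
            yield (first, second)


def count_pairs_per_document(documents, window_size=3):
    """Sum windowed pair counts over documents without crossing boundaries."""
    counts = Counter()
    for tokens in documents:
        counts.update(windowed_pairs(list(tokens), window_size))
    return counts


docs = [["a", "b", "c"], ["d", "e"]]
# Each document counted on its own
separate = count_pairs_per_document(docs)
# The same words concatenated into one stream, as in the question's code
joined = count_pairs_per_document([[w for doc in docs for w in doc]])
```

Here joined contains pairs such as ("c", "d") that straddle the two documents, while separate does not.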

Source

2016-03-13 22:30:27 alexis

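Consider parallelizing your process with a worker pool from Python's "multiprocessing" module (https://docs.python.org/2/library/multiprocessing.html), emitting a {word: count} dictionary for each file in the corpus into some shared list. After the worker pool completes, merge the dictionaries, then filter by the number of times each word appears.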
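A rough sketch of this suggestion might look like the following. Several things here are assumptions rather than part of the answer: each worker returns counts of adjacent word pairs instead of {word: count}, since bigrams are what the question asks for; results are returned from the workers instead of being appended to a shared list; tokenization is a plain whitespace split; and all function names are hypothetical.

```python
from collections import Counter
from multiprocessing import Pool


def count_file_bigrams(path, encoding="iso8859-9"):
    """Worker: return a Counter of adjacent word pairs for one file."""
    with open(path, encoding=encoding) as handle:
        tokens = handle.read().split()
    return Counter(zip(tokens, tokens[1:]))


def merge_counts(partial_counts):
    """Sum per-file Counters into one corpus-wide Counter."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def count_corpus_bigrams(paths, processes=None, min_freq=2):
    """Count bigrams over paths in a worker pool, then drop rare pairs."""
    with Pool(processes) as pool:
        # imap_unordered streams results back as workers finish;
        # merging happens inside the with block, while the pool is alive
        partials = pool.imap_unordered(count_file_bigrams, paths, chunksize=64)
        total = merge_counts(partials)
    return Counter({pair: n for pair, n in total.items() if n >= min_freq})
```

On platforms that start workers by spawning a fresh interpreter, count_corpus_bigrams should be called from under an if __name__ == "__main__": guard. Whether this beats a single process depends on how much of the time goes to disk reads versus tokenizing.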

Source

2016-03-13 20:12:08 manglano
