How to loop through the files in a corpus: Python

I have other methods that need to work with each individual txt file in my corpus. How can I loop over them?

import nltk 
from nltk.corpus import PlaintextCorpusReader as pcr 

def main(): 
    cor = corpus() 
    # for every text file in the corpus:
    #     do this method
    pass

def corpus(): 
    corpus_root='corpus/' 
    corp = pcr(corpus_root,'.*\.txt') 
    corp = corp.raw() 
    return corp 

main()

2015-04-28 ahmed

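Can you post the file structure inside 'corpus'? Also, what do you plan to do with the files? – rickcnagy

This is an nltk question; the structure is clear from the arguments to 'pcr'. – alexis

You can use glob: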

import glob 
glob.glob("corpus/*")
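
As a rough sketch of how the glob approach could be turned into a loop: the helper name read_text_files and the demo file names below are made up for illustration, and the pattern is narrowed to "*.txt" on the assumption that only text files are wanted.

```python
import glob
import os
import tempfile


def read_text_files(corpus_root):
    """Return {file name: contents} for every .txt file directly under corpus_root."""
    texts = {}
    pattern = os.path.join(corpus_root, "*.txt")
    # glob.glob returns paths in arbitrary order, so sort for a stable loop.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            texts[os.path.basename(path)] = handle.read()
    return texts


# Demo on a throwaway directory (hypothetical files, not the asker's corpus).
with tempfile.TemporaryDirectory() as demo_root:
    samples = [("a.txt", "first file"), ("b.txt", "second file"), ("notes.md", "ignored")]
    for name, body in samples:
        with open(os.path.join(demo_root, name), "w", encoding="utf-8") as handle:
            handle.write(body)

    demo_texts = read_text_files(demo_root)
    for name, text in demo_texts.items():
        print(name, len(text))
```

This bypasses NLTK entirely, so it only gives you the raw text of each file; tokenizing is left to you.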

2015-04-28 23:46:44 CristiFati

Unless I'm mistaken, I think this has a very simple answer:

# for every text file in the corpus 
for text_file in cor: 
    # Do this method 
    my_method(text_file)

2015-04-29 19:50:18 jksnw

NLTK corpus readers have a method, fileids(), that you should use:

mycorpus = pcr(corpus_root, r'.*\.txt') 

for fname in mycorpus.fileids(): 
    text = mycorpus.raw(fname) 
    sents = mycorpus.sents(fname) 
    # or whatever

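When you call raw(), sents(), words(), tagged_words(), etc. with a file name, you get only the contents of the specified file. If you want a multi-file subset of your corpus, you can also pass a list of file names.

PS. It makes no difference here, but you should use raw strings for regular expressions (see above).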
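
To make the single-file versus multi-file behaviour concrete, here is a small sketch. It builds a throwaway corpus in a temporary directory, so the file names and contents are invented for the demo; it uses words() rather than sents() because, as far as I know, sents() needs the punkt tokenizer data to be downloaded first.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical demo files; replace with your own corpus directory.
samples = [
    ("a.txt", "alpha beta"),
    ("b.txt", "gamma delta epsilon"),
    ("c.txt", "zeta"),
]

with tempfile.TemporaryDirectory() as demo_root:
    for name, body in samples:
        with open(os.path.join(demo_root, name), "w", encoding="utf-8") as handle:
            handle.write(body)

    reader = PlaintextCorpusReader(demo_root, r".*\.txt")

    # One file at a time: each call sees only the named file.
    per_file = {fid: list(reader.words(fid)) for fid in reader.fileids()}

    # A list of file names selects a multi-file subset of the corpus.
    subset_words = list(reader.words(["a.txt", "c.txt"]))
    subset_raw = reader.raw(["a.txt", "c.txt"])

for fid, words in per_file.items():
    print(fid, words)
print(subset_words)
```

The lists are materialized inside the with block because the reader loads file contents lazily and the temporary directory is deleted on exit.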
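
A quick illustration of why the raw-string habit is worth having, using a pattern of my own choosing (\b is not part of the question's code): in an ordinary string "\b" is a backspace character, while in a raw string it reaches the regex engine as a word boundary.

```python
import re

# For the pattern in the question, an explicitly escaped string and a raw
# string spell exactly the same regex.
escaped = ".*\\.txt"
raw = r".*\.txt"
print(escaped == raw)

# With \b the two spellings differ: the ordinary string holds two backspace
# characters, the raw string holds backslash + 'b' (a regex word boundary).
plain_b = "\bcat\b"
raw_b = r"\bcat\b"

plain_match = re.search(plain_b, "a cat sat")
raw_match = re.search(raw_b, "a cat sat")

print(len(plain_b), len(raw_b))
print(plain_match, raw_match.group())
```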

2015-04-29 23:16:50 alexis
