Can NLTK's XMLCorpusReader be used on a multi-file corpus?

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).

I can parse a single file without any problem, like this:

from nltk.corpus.reader import XMLCorpusReader 
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus, though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For example,

len(reader.words())

fails with:

raise TypeError('Expected a single file identifier string') 
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK, so any help is greatly appreciated.

Source

2011-07-26 NAD

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest using Python's glob module. It supports Unix-style pathname pattern expansion.

from glob import glob 
texts = glob('nltk_data/corpora/nytimes/*')

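This gives you, as a list, the names of the files that match the specified expression. Then, depending on how many of them you want or need to have open at once, you could do: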

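import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes/'
for item_path in texts:
    # File IDs are relative to root; a list avoids the string being read as a regex.
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])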

As @waffle paradox suggested, you can also pare down this list of texts to fit your specific needs.
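If the corpus really is laid out as year/month/day folders, as the single-file path in the question suggests, one glob pattern with a wildcard per directory level could collect every article in one call. This is only a sketch: the helper name list_articles is made up for illustration, and the three-level layout is an assumption.

```python
from glob import glob
import os


def list_articles(root):
    """Return the sorted paths of all .xml files three directory levels below root.

    Assumes a root/year/month/day/*.xml layout (an assumption, not verified
    against the actual corpus). list_articles is a hypothetical helper.
    """
    pattern = os.path.join(root, '*', '*', '*', '*.xml')
    return sorted(glob(pattern))
```

Calling list_articles('nltk_data/corpora/nytimes') would then play the role of the texts list above, except that it holds article files rather than the top-level folders.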

Source

2011-07-26 23:59:55

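Yes, you can specify multiple files. (From: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.xmldocs.XMLCorpusReader-class.html)

The problem here, I suspect, is that all of your files sit in a directory structure along the lines of corpora/nytimes/year/month/date. XMLCorpusReader does not traverse your directories recursively. That is, with your code above, XMLCorpusReader('corpora/nytimes', r'.*'), XMLCorpusReader only sees the XML files directly in corpora/nytimes/ (i.e., none, since there are only folders there), and not those in any subfolders that corpora/nytimes may contain. Also, you probably meant to use *.xml as the second argument.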
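A note on that second argument: as far as I know, NLTK treats a string fileids argument as a regular expression rather than a shell-style glob, so '*.xml' and r'.*\.xml' are not interchangeable. The check below uses only Python's re module to show the difference; the sample file ID is made up.

```python
import re

sample_id = '1987/01/01/0000000.xml'  # made-up file ID for illustration

# As a regular expression, '.*\.xml' matches any path ending in .xml.
xml_pattern = re.compile(r'.*\.xml')
matches_xml = xml_pattern.fullmatch(sample_id) is not None

# The shell glob '*.xml' is not a valid regular expression:
# a leading '*' has nothing to repeat.
try:
    re.compile('*.xml')
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
```

So if a pattern is passed as a string, the regex form is probably the safer choice.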

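I would suggest either traversing the folders yourself to build absolute paths (the documentation above states that explicit paths work for the fileids argument), or, if you have a list of the available year/month/date combinations, using that to your advantage.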
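Traversing the folders yourself could look roughly like the sketch below, which uses os.walk. Note that it is a variation on the suggestion: instead of absolute paths it returns paths relative to the corpus root, on the assumption that NLTK joins each file ID onto the root. The helper name collect_file_ids and its suffix parameter are made up for illustration.

```python
import os


def collect_file_ids(root, suffix='.xml'):
    """Walk root recursively and return matching file paths relative to root.

    collect_file_ids is a hypothetical helper. Forward slashes are used
    because that is how NLTK file IDs are usually written (an assumption).
    """
    file_ids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                file_ids.append(rel.replace(os.sep, '/'))
    return sorted(file_ids)
```

The resulting list could then be passed as the second argument, for example XMLCorpusReader(root, collect_file_ids(root)).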

Source

2011-07-26 23:55:22

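Thanks, waffle paradox. That was very helpful. – NAD

Here is a solution based on the comments from machine yearning and waffle paradox. It builds a list of articles using glob and passes them to XMLCorpusReader as a list: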

from glob import glob
import os
from nltk.corpus.reader import XMLCorpusReader

root = 'nltk_data/corpora/nytimes_test'
# Descend one directory level at a time: year, then month, then day.
years = glob(root + '/*')
year_months = []
for year in years:
    year_months += glob(year + '/*')
print(year_months)  # debug: show the month directories found
days = []
for year_month in year_months:
    days += glob(year_month + '/*')
articles = []
for day in days:
    articles += glob(day + '/*.xml')
# File IDs are resolved relative to the corpus root, so strip the root prefix.
file_ids = []
for article in articles:
    file_ids.append(os.path.relpath(article, root))
reader = XMLCorpusReader(root, file_ids)
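With a multi-file reader like this, calling reader.words() with no argument may still raise the TypeError from the question, since the message asks for a single file identifier. One possible workaround, sketched below, is to loop over reader.fileids() and query one file at a time. count_words is a made-up helper name; it only assumes that the reader exposes fileids() and words(fileid), which NLTK's XMLCorpusReader does as far as I know.

```python
def count_words(reader):
    """Total number of word tokens across all files, queried one file at a time.

    Hypothetical helper: relies only on reader.fileids() and
    reader.words(fileid) being available.
    """
    total = 0
    for fileid in reader.fileids():
        total += len(reader.words(fileid))
    return total
```

count_words(reader) would then stand in for the len(reader.words()) call from the question.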

Source

2011-07-27 15:12:39 NAD
