Using the British National Corpus in NLTK

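I'm new to NLTK (http://www.nltk.org/) and to Python. I want to use the NLTK Python library, but with the BNC as the corpus. I don't believe this corpus is distributed through the NLTK data download. Is there a way to import the BNC corpus for use with NLTK? If so, how? I did find a function called BNCCorpusReader, but I don't know how to use it. Also, I was able to download the corpus from the BNC site (http://ota.ox.ac.uk/desc/2554).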

http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word

Update

I have tried entrophy's suggestion, but I get the following error:

raise IOError('No such file or directory: %r' % _path) 
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'

My code for reading the corpus:

bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')

And the corpus is located in: C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\

2017-04-19 jason

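What is your goal? Do you have to use NLTK? I'm not very familiar with Python and have never used NLTK, but I have processed the BNC in Java with Stanford Core NLP. My aim was to build a clean corpus to parse in order to get the dependencies between word pairs. So, starting from the BNC XML files, I rebuilt each sentence with an XML parser and then processed each sentence with Core NLP. If your goal is only to import the corpus, I honestly can't answer that, but as a last resort you could still convert the XML texts to plain txt, pass that to Python, and process it as strings.

@s.dallapalma Hello. I don't have to use NLTK, but I do need some library that can find word "collocations". I looked at Stanford Core NLP but was told it doesn't have a collocations feature. – jason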

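For an example of using NLTK for collocation extraction, take a look at the following guide: A how-to guide by nltk on collocations extraction

As for the BNC corpus reader, all the information is in the documentation.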

from nltk.corpus.reader.bnc import BNCCorpusReader 
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder 

# Instantiate the reader like this 
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml') 

#And say you wanted to extract all bigram collocations and 
#then later wanted to sort them just by their frequency, this is what you would do. 
#Again, take a look at the link to the nltk guide on collocations for more examples. 

list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml'] 
bigram_measures = BigramAssocMeasures() 
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids)) 
scored = finder.score_ngrams(bigram_measures.raw_freq) 

print(scored)

The output would look something like this:

[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699), 
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894), 
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
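To make those scores a little less opaque, here is a minimal pure-Python sketch of what a raw-frequency score amounts to: count each adjacent word pair and divide by a total. It does not use NLTK, the function name raw_bigram_freq and the toy sentence are made up for illustration, and dividing by the number of tokens is my assumption about how the normalisation works, so check the NLTK source if the exact values matter.

```python
from collections import Counter


def raw_bigram_freq(words):
    """Score each adjacent word pair as count / number of tokens.

    Hypothetical helper for illustration only; the choice of
    denominator (token count) is an assumption, not taken from NLTK.
    """
    words = list(words)
    counts = Counter(zip(words, words[1:]))
    total = len(words)
    scored = [(bigram, count / total) for bigram, count in counts.items()]
    # Highest score first; ties are broken alphabetically by bigram
    return sorted(scored, key=lambda item: (-item[1], item[0]))


toy_words = "the cat sat on the mat and the cat slept".split()
toy_scored = raw_bigram_freq(toy_words)
print(toy_scored[:3])
```

On real BNC text you would pass bnc_reader.words(...) instead of the toy list, but for actual work the NLTK finder shown above is the better tool since it also offers other association measures.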

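Note that score_ngrams already returns the pairs ordered by score, highest first, so the list above needs no further sorting by score. If you want the bigrams on their own in alphabetical order, without the scores, you can try something like this: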

sorted_bigrams = sorted(bigram for bigram, score in scored) 

print(sorted_bigrams)

Which results in:

[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'), 
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'), 
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
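The difference between ordering by score and ordering the bigrams themselves is easy to see on a small hand-made list. This is only a sketch: the four pairs are copied from the sample output earlier in this answer, and the variable names are my own.

```python
# A few (bigram, score) pairs copied from the sample output earlier
scored = [
    (('of', 'the'), 0.004902261167963723),
    (('in', 'the'), 0.003554139346773699),
    (('.', 'The'), 0.0034315828175746064),
    (('Gift', 'Aid'), 0.0019609044671854894),
]

# Order by the numeric score, highest first, keeping the scores
by_score = sorted(scored, key=lambda item: item[1], reverse=True)

# Order the bigrams alphabetically and drop the scores
alphabetical = sorted(bigram for bigram, _ in scored)

top_two = [bigram for bigram, _ in by_score[:2]]
print(top_two)
print(alphabetical)
```

Sorting with key=lambda item: item[1] compares the scores, whereas sorting the bare tuples compares the words, which is why punctuation ends up at the front of the alphabetical list.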

2017-04-29 02:02:14 entrophy

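Thanks for the reply. I tried the code you provided, but I'm having trouble loading the corpus. I believe this may be down to my lack of Python experience. I'll edit my question and add the error details; I'd appreciate it if you could help. – jason

The root directory is /Texts. So you should change the code to instantiate the reader like this: bnc_reader = BNCCorpusReader(root="Corpora/bnc/Texts", fileids=r'[A-K]/\w*/\w*\.xml') – entrophy

Ah, that did it. I think the difference is that the folder structure of my download was different. I used: bnc_reader = BNCCorpusReader(root="Corpora/bnc/2554/download/Texts", fileids=r'[A-K]/\w*/\w*\.xml') – jason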
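For anyone hitting the same "No such file or directory" error: since the location of the Texts folder seems to depend on how the download was unpacked, it may help to search for it instead of guessing. Below is a rough sketch using only the standard library. The helper name find_bnc_texts_root is hypothetical, and probing for A/A0/A00.xml assumes your copy contains that file, as the error message in the question suggests.

```python
from pathlib import Path


def find_bnc_texts_root(start, probe="A/A0/A00.xml"):
    """Return the first directory at or below `start` that contains `probe`.

    Hypothetical helper: returns None if `start` is not a directory
    or if no matching directory is found.
    """
    start = Path(start)
    if not start.is_dir():
        return None
    # Check `start` itself first, then every subdirectory in sorted order
    candidates = [start] + sorted(p for p in start.rglob("*") if p.is_dir())
    for candidate in candidates:
        if (candidate / probe).is_file():
            return candidate
    return None


root = find_bnc_texts_root("Corpora/bnc")
if root is None:
    print("Could not find the BNC Texts directory under Corpora/bnc")
else:
    print("Use this as the reader root:", root)
```

The returned path, converted with str(root), can then be passed as the root argument to BNCCorpusReader. Keep in mind that a relative path like "Corpora/bnc" is resolved against the current working directory, which may not be the folder your script lives in.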
